BFV FHE Performance

Version: 1.0.0
Status: Production
Last Updated: 2026-05-23
Hardware: AWS c8g.metal-48xl (Graviton4, 192 vCPU, 371 GiB)
FHE Scheme: Brakerski/Fan-Vercauteren (BFV)
Canonical URL: https://h33.ai/benchmarks/bfv/

1. Scope

This document presents production benchmarks for the BFV fully homomorphic encryption scheme as implemented in the H33 cryptographic pipeline. Measurements span five security tiers (H0 through H-256) and cover encrypt, decrypt, multiply, and inner-product operations. All numbers are from Graviton4 bare metal under sustained load.

2. Definitions

BFV (Brakerski/Fan-Vercauteren): An integer-arithmetic FHE scheme supporting SIMD-style batching via the Chinese Remainder Theorem. Operations are performed on vectors of plaintext integers packed into a single ciphertext.
SIMD Batch: The technique of encoding multiple independent plaintext values into a single ciphertext polynomial using CRT decomposition. For N=4096, up to 4096 independent values can be packed; H33 uses 32-user batches for biometric authentication, with each user's 128-element feature vector occupying 128 slots (32 x 128 = 4096).
Noise Budget: The remaining capacity for homomorphic operations before decryption fails. Each multiplication consumes noise budget. When the budget reaches zero, the ciphertext can no longer be decrypted correctly.
Inner Product: A batched homomorphic operation that multiplies two ciphertexts element-wise and then applies a rotation-and-sum to produce one scalar result per packed vector. Used in biometric matching on encrypted feature vectors; for unit-norm vectors the inner product equals the cosine similarity.
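The SIMD Batch and Inner Product definitions can be illustrated on plain integers, with no encryption involved. The sketch below is a minimal simulation under stated assumptions: slots are modeled as one flat cyclic array (real BFV batching arranges slots in two rows), each user occupies 128 contiguous slots, and all function names are hypothetical rather than taken from the H33 codebase. Input values are assumed to be smaller than t so that products fit in a u64.

```rust
const T: u64 = 65_537; // plaintext modulus t from the parameter table
const SLOTS: usize = 4096; // N for the H33 tier
const DIM: usize = 128; // feature-vector length per user

/// Pack per-user vectors so that user u occupies slots [u*DIM, (u+1)*DIM).
fn pack(vectors: &[[u64; DIM]]) -> Vec<u64> {
    assert!(vectors.len() * DIM <= SLOTS, "too many users for one ciphertext");
    let mut slots = vec![0u64; SLOTS];
    for (u, v) in vectors.iter().enumerate() {
        slots[u * DIM..(u + 1) * DIM].copy_from_slice(v);
    }
    slots
}

/// Slot-wise product mod t; stands in for a CT x CT multiply.
fn mul_slots(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x * y % T).collect()
}

/// Rotate-and-sum: log2(DIM) rotate+add steps leave each user's
/// inner product in the first slot of that user's block.
fn rotate_and_sum(slots: &[u64]) -> Vec<u64> {
    let mut acc = slots.to_vec();
    let mut shift = 1;
    while shift < DIM {
        // Cyclic left rotation by `shift` slots, then slot-wise add.
        let rotated: Vec<u64> = (0..SLOTS).map(|i| acc[(i + shift) % SLOTS]).collect();
        for i in 0..SLOTS {
            acc[i] = (acc[i] + rotated[i]) % T;
        }
        shift *= 2;
    }
    acc
}

/// Inner products (mod t) for every user in the batch.
fn batched_inner_products(a: &[[u64; DIM]], b: &[[u64; DIM]]) -> Vec<u64> {
    let summed = rotate_and_sum(&mul_slots(&pack(a), &pack(b)));
    (0..a.len()).map(|u| summed[u * DIM]).collect()
}
```

Calling batched_inner_products with 32 vectors per side returns 32 results from a single multiply and 7 rotate-and-add steps, which is why the cost of one inner product is shared across the whole batch.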

3. Parameter Table

Tier	N	Q (bits)	t	Security (classical)	Multiplicative Depth	Use Case
`H0`	2,048	40	65,537	112-bit	1	Development / testing
`H1`	4,096	56	65,537	128-bit	1	Biometric auth (production)
`H33`	4,096	56	65,537	128-bit	1	H33-128 (production default)
`H2`	8,192	109	65,537	128-bit	3	Multi-hop computation
`H-256`	16,384	218	65,537	192-bit	7	H33-256 deep circuits

The production authentication pipeline uses the H33 tier (N=4096, single 56-bit modulus, t=65537). This is designated biometric_fast() in the codebase. The H-256 tier is reserved for deep computation workflows requiring up to 7 multiplicative levels, the depth listed for that tier in the table above.

4. Per-Tier Benchmarks

4.1. Encrypt / Decrypt Latency

Tier	Encrypt (median)	Decrypt (median)	CT Size
`H0`	48 us	18 us	10 KiB
`H1`	102 us	38 us	32 KiB
`H33`	102 us	38 us	32 KiB
`H2`	310 us	105 us	142 KiB
`H-256`	1,180 us	390 us	570 KiB

4.2. Multiply Latency

Tier	CT x CT (median)	CT x PT (median)	Relinearize
`H0`	62 us	22 us	41 us
`H1`	198 us	64 us	128 us
`H33`	198 us	64 us	128 us
`H2`	810 us	242 us	520 us
`H-256`	3,400 us	980 us	2,100 us

4.3. Inner Product (Biometric Match)

Tier	Vector Length	Inner Product (median)	Users per CT
`H0`	128	320 us	16
`H1`	128	943 us	32
`H33`	128	943 us	32
`H2`	256	4,200 us	32
`H-256`	512	18,600 us	32
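The Users per CT column is consistent with a simple slot count: N divided by the per-user vector length. The sketch below checks that relation for each row. It assumes every user occupies a contiguous block of vector-length slots; the struct and function names are illustrative only and are not part of the H33 API.

```rust
/// Ring dimension N and per-user vector length for one tier (values from the tables).
struct TierLayout {
    name: &'static str,
    n: usize,
    vector_len: usize,
}

const TIERS: [TierLayout; 5] = [
    TierLayout { name: "H0", n: 2048, vector_len: 128 },
    TierLayout { name: "H1", n: 4096, vector_len: 128 },
    TierLayout { name: "H33", n: 4096, vector_len: 128 },
    TierLayout { name: "H2", n: 8192, vector_len: 256 },
    TierLayout { name: "H-256", n: 16384, vector_len: 512 },
];

/// Assumed layout: one user per block of `vector_len` slots.
fn users_per_ct(tier: &TierLayout) -> usize {
    tier.n / tier.vector_len
}

/// Look up a tier by name and return its users-per-ciphertext count.
fn users_for(name: &str) -> Option<usize> {
    TIERS.iter().find(|t| t.name == name).map(users_per_ct)
}
```

Under this assumption H0 holds 16 users because its ring is half the size of H1 at the same vector length, while H2 and H-256 stay at 32 because their vector length grows in step with N.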

5. SIMD Batch Throughput

The H33 tier packs 32 independent user authentications into a single ciphertext. The following table shows the end-to-end FHE stage (Stage 1) throughput for batch biometric authentication.

Metric	Value
Batch size	32 users per ciphertext
Batch FHE latency	943 us
Per-user FHE latency	29.5 us
Single-core throughput	33,934 users/sec
192-core throughput (sustained 30s)	1,667,875 auth/sec (full pipeline)

The 1,667,875 auth/sec figure includes the complete pipeline (FHE + attestation + ZKP), not FHE alone. FHE (Stage 1) accounts for 70% of the pipeline latency. See the Agent Infrastructure page for the full pipeline breakdown.
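The per-user latency and single-core throughput rows follow directly from the 943 us batch latency and the batch size of 32. A short sketch of that arithmetic, with hypothetical helper names:

```rust
/// Amortized latency per user when one batch serves `batch_size` users.
fn per_user_latency_us(batch_latency_us: f64, batch_size: u32) -> f64 {
    batch_latency_us / batch_size as f64
}

/// Users per second on one core processing batches back to back.
fn users_per_sec(batch_latency_us: f64, batch_size: u32) -> f64 {
    let batches_per_sec = 1_000_000.0 / batch_latency_us;
    batches_per_sec * batch_size as f64
}
```

With the table values, per_user_latency_us(943.0, 32) gives about 29.5 us and users_per_sec(943.0, 32) gives about 33,934 users/sec, matching the rows above.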

6. Graviton4 Scaling

BFV inner-product scaling across cores (H33 tier, batch of 32):

Cores	Batches/sec	Users/sec	Efficiency
1	1,060	33,934	100%
16	16,800	537,600	99.1%
48	50,100	1,603,200	98.6%
96	99,400	3,180,800	97.7%
192	195,600	6,259,200	96.2%

Near-linear scaling is achieved because each FHE batch is independent (no shared ciphertext state). The slight efficiency drop at 192 cores (96.2%) is attributed to L3 cache contention during the NTT butterfly operations on the 56-bit modulus.
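The Efficiency column can be read as measured throughput divided by ideal linear throughput. The sketch below assumes that definition (it is not stated explicitly in the table) and uses a hypothetical function name. Recomputing from the rounded Batches/sec values reproduces the published column to within roughly 0.2 percentage points; the small gap presumably comes from rounding in the published figures.

```rust
/// Scaling efficiency in percent: measured batches/sec at `cores`
/// relative to `cores` times the single-core rate.
fn scaling_efficiency_pct(single_core_batches: f64, cores: u32, measured_batches: f64) -> f64 {
    let ideal = single_core_batches * cores as f64;
    measured_batches / ideal * 100.0
}
```

For example, scaling_efficiency_pct(1060.0, 96, 99400.0) evaluates to about 97.7, in line with the 96-core row.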

7. Montgomery Radix-4 Optimization

The H33 BFV implementation uses a Montgomery radix-4 NTT with Harvey lazy reduction. This optimization reduces the number of modular reductions per butterfly from 2 to approximately 0.5, yielding a 1.8x speedup over the textbook radix-2 NTT at N=4096.

NTT Variant	N=4096 Forward (median)	Reductions per Butterfly
Radix-2 (textbook)	14.2 us	2.0
Radix-4 (Harvey lazy)	7.9 us	~0.5
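To make the lazy-reduction idea concrete, here is a minimal sketch of a single radix-2 butterfly in the style of Harvey's NTT, using a precomputed quotient for the twiddle factor. It is not the H33 radix-4 Montgomery kernel, and the function names are hypothetical. It assumes a modulus p below 2^62 and a 64-bit word size. Inputs and outputs are allowed to range over [0, 4p), so no full reduction to [0, p) happens inside the butterfly.

```rust
/// Precompute w' = floor(w * 2^64 / p) for a fixed twiddle w < p.
fn shoup_precompute(w: u64, p: u64) -> u64 {
    (((w as u128) << 64) / (p as u128)) as u64
}

/// One lazy butterfly: returns (x + w*y, x - w*y) modulo p,
/// with both outputs kept in [0, 4p) instead of [0, p).
/// Requires x, y in [0, 4p), w in [0, p), and p < 2^62.
fn lazy_butterfly(x: u64, y: u64, w: u64, w_shoup: u64, p: u64) -> (u64, u64) {
    let two_p = 2 * p;
    // Single conditional subtraction brings x into [0, 2p).
    let x = if x >= two_p { x - two_p } else { x };
    // Quotient estimate from the high 64 bits of w' * y.
    let q = (((w_shoup as u128) * (y as u128)) >> 64) as u64;
    // t = w*y - q*p computed modulo 2^64; the true value lies in [0, 2p).
    let t = w.wrapping_mul(y).wrapping_sub(q.wrapping_mul(p));
    (x + t, x + two_p - t)
}
```

A full transform would apply this butterfly across all stages and reduce the coefficients to [0, p) once at the end, which is where the saving over a reduce-after-every-operation butterfly comes from.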

8. Reproducibility

# Build with production flags
$ cargo build --release --examples --features bfv_bench

# Run per-tier benchmarks
$ ./target/release/examples/bfv_bench --tier h33 --iterations 100000

# Run SIMD batch throughput
$ ./target/release/examples/graviton4_bench --batch-size 32 --duration 30

# Run scaling test
$ ./target/release/examples/bfv_bench --tier h33 --threads 192 --iterations 10000

Do NOT use --features parallel for BiometricFast builds. It enables Rayon work-stealing, which causes a 37% throughput regression from contention at 96+ workers. Use OS-level thread pinning instead.

Related Benchmarks & Specifications

Benchmarks Index, ML-DSA-65, TFHE Gates, STARK Proofs, Agent Infrastructure, H33-128, H33-256, FHE Overview