FHE · 27 min read

BFV vs CKKS:
Choosing the Right FHE Scheme for Your Application

A comprehensive engineering guide to the two dominant FHE schemes. When exact integer arithmetic matters, when approximate is good enough, and the production parameters behind H33's 1.2M auth/sec pipeline.

~50µs per auth (BFV) · 1.2M/s throughput · 128-bit security · 32 users/batch

Fully Homomorphic Encryption lets you compute on encrypted data without ever decrypting it. That single sentence sounds simple, but the moment you sit down to build something real, the first fork in the road stops you cold: which FHE scheme should I use?

The answer is not academic. Choosing the wrong scheme means either sacrificing correctness (using approximate arithmetic where exact results are required) or leaving performance on the table (using exact arithmetic for workloads that are naturally approximate). For H33, the choice directly determines whether we can authenticate 1.2 million users per second or not.

The modern FHE landscape includes four major scheme families: BGV and BFV (exact integer arithmetic), CKKS (approximate real arithmetic), and TFHE (Boolean circuits with fast programmable bootstrapping).

This guide focuses on BFV and CKKS because they are the two most widely deployed schemes for SIMD-batched workloads. They share the same underlying Ring Learning With Errors (RLWE) hardness assumption and similar parameter structures, but they encode data in fundamentally different ways, which makes them optimal for fundamentally different tasks.

The One-Sentence Decision

If your computation requires exact results — biometric matching, database equality checks, vote counting, financial ledgers — use BFV. If your computation tolerates controlled approximation — ML inference, statistical aggregation, signal processing — use CKKS.

The Fundamental Distinction: Exact vs. Approximate Arithmetic

Every FHE scheme encrypts data by adding carefully calibrated noise to a mathematical structure. This noise is what provides security — without it, the ciphertext would be trivially invertible. The defining question for any FHE scheme is: what happens to this noise during computation, and how does it affect the result?

BFV: Noise is an Obstacle to Remove

In BFV, the plaintext lives in a discrete integer ring Z_t[x] / (x^N + 1), where t is the plaintext modulus and N is the polynomial degree. The encryption process embeds the plaintext into a much larger ciphertext modulus Q, with noise added on top. During decryption, the noise is stripped away and the exact plaintext is recovered — provided the noise has not grown beyond the capacity of the scheme.

This means BFV computations are exact. If you encrypt the integer 42, perform a series of additions and multiplications, and decrypt the result, you get the mathematically precise answer modulo t. No rounding. No truncation. No precision loss. The noise is entirely internal to the scheme and invisible to the application, as long as you stay within the noise budget.

CKKS: Noise is Part of the Answer

CKKS takes a radically different approach. Instead of treating noise as something to be eliminated, CKKS treats noise as controlled approximation error that is tolerable for the application. Real numbers are encoded by scaling them up to integers, encrypting those integers, and accepting that each homomorphic operation introduces a small amount of additional imprecision — analogous to floating-point rounding error in conventional computing.

After a CKKS computation, the decrypted result is an approximation of the true answer. If you encrypt 3.14159 and multiply it by 2.0, you might get back 6.28318 plus or minus some small epsilon (say, 6.283180000000003). For ML inference, statistical aggregation, and signal processing, this is perfectly acceptable. For biometric matching where a single bit flip could change a match verdict, it is not.

Why This Matters in Practice

Consider a biometric cosine similarity threshold of 0.85. With BFV, the encrypted inner product yields an exact integer result that maps deterministically to a match or no-match decision. With CKKS, the result might be 0.8499997 or 0.8500003 — and the difference between those two values is the difference between granting and denying access. For authentication, exact arithmetic is not a luxury. It is a correctness requirement.

BFV Deep Dive

Mathematical Foundation

BFV is built on the Ring Learning With Errors (RLWE) problem, which operates in the polynomial ring R_Q = Z_Q[x] / (x^N + 1), where N is a power of two and Q is the ciphertext modulus. The security of the scheme reduces to the hardness of distinguishing RLWE samples from uniform random samples — a problem that is believed to be hard even for quantum computers, since it reduces to worst-case lattice problems (specifically, the Shortest Vector Problem on ideal lattices).

The key generation, encryption, and decryption procedures are as follows:

Key Generation

Sample a secret key s from a ternary distribution over R_Q (coefficients in {-1, 0, 1}). Sample a uniformly random polynomial a from R_Q and a small error polynomial e from a discrete Gaussian (or centered binomial) distribution. The public key is the pair pk = (pk0, pk1) = (-a*s + e, a). The secret key is s.

Encryption

To encrypt a plaintext polynomial m in R_t, sample a random polynomial u from the ternary distribution and two small error polynomials e1, e2. Compute:

ct = (pk0*u + e1 + floor(Q/t)*m, pk1*u + e2)

The term floor(Q/t)*m is the scaled plaintext. The errors e1 and e2 provide the RLWE security guarantee. The result is a pair of polynomials (c0, c1) in R_Q.

Decryption

Compute c0 + c1*s (mod Q), then scale by t/Q and round to the nearest integer modulo t. If the accumulated noise is below the threshold Q/(2t), the rounding recovers the exact plaintext m. This is the invariant that makes BFV exact: the noise occupies a fraction of the ciphertext space, and the scaling plus rounding eliminates it completely.
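The exactness invariant can be seen in a toy scalar model. This sketch drops the polynomial ring and secret key entirely and keeps only the floor(Q/t) scaling plus rounding; the modulus, noise, and plaintext values are illustrative, not real parameters.

```rust
// Toy scalar model of BFV's decode step: decryption is exact as long as
// the accumulated noise stays below Q/(2t). No polynomials, no security —
// just the scaling-and-rounding arithmetic.
fn main() {
    let q: u128 = 1 << 56;      // stand-in ciphertext modulus (power of two for simplicity)
    let t: u128 = 65537;        // plaintext modulus
    let delta = q / t;          // floor(Q/t) scaling factor
    let m: u128 = 42;           // plaintext integer

    // Noise comfortably below the Q/(2t) threshold (~2^39 here).
    let noise: u128 = 1 << 30;
    let payload = delta * m + noise; // what c0 + c1*s looks like after computation

    // Decryption: scale by t/Q and round to the nearest integer, mod t.
    let recovered = ((t * payload + q / 2) / q) % t;
    assert_eq!(recovered, m);   // exact, despite the noise
    println!("recovered = {recovered}");
}
```

Pushing `noise` above roughly 2^39 in this model makes the rounding land on the wrong integer, which is exactly the "noise budget exhausted" failure mode.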

Noise Growth Analysis

Understanding noise growth is critical for parameter selection. In BFV, the noise budget is consumed at different rates depending on the operation:

  1. Additions grow noise only additively — hundreds of additions cost roughly as much budget as a single multiplication.
  2. Plaintext-ciphertext multiplications grow noise moderately and require no relinearization.
  3. Ciphertext-ciphertext multiplications are the dominant consumer: each one uses roughly a full level of the budget and requires relinearization.

H33 Production Insight

For biometric matching, H33 needs exactly one level of plaintext-ciphertext multiplication (the inner product between the encrypted probe template and the stored enrolled template) followed by accumulation (additions). This single multiplicative depth means we can use a minimal ciphertext modulus — a single 56-bit prime Q — which dramatically reduces computation time. Deeper circuits would require a larger Q or modulus chain, increasing latency proportionally.

Plaintext Modulus Selection: Why t = 65537

The plaintext modulus t determines the range of integers you can represent and the structure of the SIMD batching slots. For SIMD batching to work, t must satisfy a specific algebraic condition: t must be a prime such that t ≡ 1 (mod 2N). This ensures that the polynomial x^N + 1 splits completely modulo t, giving you N independent plaintext slots (or N divided by the order of the Galois group, depending on the splitting pattern).

For N = 4096, we need t ≡ 1 (mod 8192). The smallest such prime is t = 65537 = 2^16 + 1, which is also a Fermat prime. This choice gives us:

  1. The full 4,096 SIMD slots, since x^N + 1 splits completely modulo t.
  2. 16 bits of dynamic range per slot — more than enough for quantized biometric features.
  3. A minimal t, which keeps the Q/(2t) noise threshold, and therefore the noise budget, as large as possible.

SIMD Batching via CRT

SIMD (Single Instruction, Multiple Data) batching is what makes BFV practical at scale. Rather than encrypting one integer per ciphertext, we exploit the Chinese Remainder Theorem (CRT) isomorphism to pack N independent values into a single ciphertext, and every homomorphic operation applies simultaneously to all N slots.

When x^N + 1 splits modulo t into N linear factors, the plaintext ring Z_t[x]/(x^N + 1) is isomorphic to Z_t^N via the CRT map. In concrete terms: a single polynomial encodes 4096 independent integers, and a single BFV addition or multiplication operates on all 4096 in parallel.
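The algebraic condition above is easy to sanity-check in code. The trial-division primality test here is illustrative (fine for 65537, not for production-size primes):

```rust
// Check the batching condition from the text: t = 65537 is prime and
// satisfies t ≡ 1 (mod 2N) for N = 4096, so x^N + 1 splits into N linear
// factors mod t and every ciphertext carries N independent slots.
fn is_prime(x: u64) -> bool {
    x > 1 && (2..).take_while(|d| d * d <= x).all(|d| x % d != 0)
}

fn main() {
    let n: u64 = 4096;
    let t: u64 = 65537;
    assert!(is_prime(t));
    assert_eq!(t % (2 * n), 1); // 65537 = 8 * 8192 + 1
    println!("t = {t} yields {n} SIMD slots");
}
```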

For biometric authentication, each user's template is a 128-dimensional vector. We pack 4096 / 128 = 32 user templates into a single ciphertext. One encrypted inner-product operation processes all 32 users simultaneously, which is why H33 measures performance in batches of 32:

SIMD Batching Architecture

Polynomial degree (N): 4,096
SIMD slots: 4,096
Template dimensions: 128
Users per ciphertext: 32
Template storage reduction: 128x

The key insight is that SIMD batching is constant-time with respect to batch size. Processing 1 user or 32 users takes essentially the same ~1,375 microseconds for the FHE batch operation. Below 32 users, some slots are simply unused. This is why throughput scales linearly with worker count but is constant per batch — and why H33 achieves 1.2 million authentications per second on a 96-core Graviton4 instance.
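The packing arithmetic above can be sketched as a simple index mapping. The contiguous layout here is illustrative — the production pipeline may interleave slots differently — but the capacity math is the same:

```rust
// Illustrative slot layout: 32 user templates of 128 dimensions each fill
// the 4096 SIMD slots of one ciphertext exactly.
const DIMS: usize = 128;
const SLOTS: usize = 4096;
const USERS_PER_CT: usize = SLOTS / DIMS; // = 32

fn slot_index(user: usize, dim: usize) -> usize {
    assert!(user < USERS_PER_CT && dim < DIMS);
    user * DIMS + dim
}

fn main() {
    assert_eq!(USERS_PER_CT, 32);
    assert_eq!(slot_index(0, 0), 0);
    assert_eq!(slot_index(31, 127), SLOTS - 1); // last user, last dim is slot 4095
    println!("{USERS_PER_CT} users per ciphertext");
}
```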

CKKS Deep Dive

The Approximate Arithmetic Model

CKKS, introduced by Cheon, Kim, Kim, and Song in 2017, was designed from the ground up for approximate computation on real and complex numbers. Instead of the integer encoding used by BFV, CKKS uses a scaling factor (often called Δ) to map floating-point values into the integer polynomial ring.

To encode a vector of real numbers (r_1, r_2, ..., r_{N/2}), CKKS first maps them to complex numbers on the canonical embedding (using the inverse DFT), then scales the result by Δ and rounds to the nearest integer polynomial. The scaling factor Δ determines the precision: a larger Δ means more bits of precision but also a larger ciphertext modulus requirement.

The encoding is:

  1. Start with a vector of N/2 complex values (real values are embedded as complex with zero imaginary part)
  2. Apply the inverse canonical embedding (essentially an inverse DFT specialized to the cyclotomic structure)
  3. Scale by Δ and round to the nearest integer polynomial in R_Q
  4. Encrypt using standard RLWE encryption (same as BFV structurally)

The Rescaling Mechanism

The critical innovation in CKKS is rescaling. When you multiply two CKKS ciphertexts, the scaling factor squares: if both inputs have scale Δ, the product has scale Δ^2. Left unchecked, the scale would grow exponentially with multiplicative depth, quickly exceeding the ciphertext modulus.

Rescaling divides the ciphertext by one factor of Δ (by dividing both polynomials by the smallest prime in the modulus chain and dropping that prime). This brings the scale back down to Δ and simultaneously reduces the ciphertext modulus from Q = q_1 * q_2 * ... * q_L to Q' = q_1 * q_2 * ... * q_{L-1}.

Each rescaling operation consumes one level of the modulus chain. The total number of available levels L determines the maximum multiplicative depth of the computation. This is fundamentally similar to how BFV manages noise through modulus switching, but in CKKS the rescaling is mandatory after every multiplication to maintain numerical stability.

Precision vs. Depth Trade-Off

Every CKKS rescaling step introduces a small additional approximation error, roughly 1/Δ. After L levels of multiplication, the accumulated error is approximately L/Δ. For deep circuits (many sequential multiplications), you need a larger Δ to maintain precision, which requires a larger modulus Q, which increases ciphertext sizes and computation time. This is the fundamental precision-depth trade-off in CKKS.
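To make the trade-off concrete, here is back-of-envelope modulus sizing for a hypothetical depth-4 circuit. The depth, scale, and special-prime sizes are illustrative, not a validated parameter set:

```rust
// Back-of-envelope CKKS modulus sizing: one ~log2(Δ)-bit prime per
// computation level plus a larger special prime for fresh encryptions.
// Real parameters must be validated against the security tables and
// the lattice estimator.
fn main() {
    let depth = 4u32;          // multiplicative depth L
    let log_delta = 40u32;     // bits of scale per level
    let special_prime = 60u32; // first/special prime
    let log_q = depth * log_delta + special_prime;
    assert_eq!(log_q, 220);
    // A ~220-bit Q is too large for N = 4096 at 128-bit security, which is
    // why moderate-depth CKKS typically needs N = 8192 or larger.
    println!("total log2(Q) ≈ {log_q} bits");
}
```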

Encoding Real Numbers to Polynomial Rings

The CKKS encoding leverages a beautiful algebraic structure. The ring Z[x]/(x^N+1) has N complex roots of unity (the primitive 2N-th roots of unity), and the canonical embedding maps a polynomial to its evaluation at these roots. For CKKS, we use only N/2 of these roots (the others are conjugates), giving us N/2 independent complex-valued slots.

In practice, this means:

  1. A single ciphertext with N = 4096 carries 2,048 independent complex-valued slots.
  2. Purely real data can occupy both real and imaginary parts, effectively packing 4,096 real values.
  3. Encoding and decoding are themselves DFT-like transforms, adding a small fixed cost per ciphertext.

When Approximate Is Good Enough

CKKS shines in domains where the input data itself is inherently approximate. Neural network weights are typically stored as float32 or float16. Sensor measurements have noise floors. Statistical aggregations are reported to a fixed number of significant digits. In all these cases, the controlled approximation error introduced by CKKS is smaller than the inherent uncertainty in the data — making exact arithmetic unnecessarily expensive.

Specific scenarios where CKKS's approximate model is a natural fit:

  1. ML inference — weights and activations are already float32/float16 approximations.
  2. Statistical aggregation — means, variances, and correlations reported to a few significant digits.
  3. Signal processing — sampled signals with inherent noise floors.
  4. Genomic analysis — GWAS statistics that are meaningful only to a few significant figures.

Head-to-Head Comparison

The following table compares BFV and CKKS across the dimensions that matter for production deployment. Parameters are normalized to N = 4096 and 128-bit security for a direct comparison.

Dimension | BFV | CKKS
Arithmetic type | Exact integers mod t | Approximate real/complex
Plaintext encoding | CRT (integer slots) | Canonical embedding (complex slots)
SIMD slots (N=4096) | 4,096 | 2,048 (complex) or 4,096 (real)
Noise management | Modulus switching (optional) | Rescaling (mandatory after multiply)
Noise model | Noise removed at decrypt | Noise is approximation error
Addition cost | ~1µs (trivial) | ~1µs (trivial)
Multiplication cost | ~50-200µs (depends on relin) | ~40-180µs + rescale
Rotation cost | ~100-300µs (key-switch) | ~100-300µs (key-switch)
Ciphertext size (1 level) | ~32 KB (N=4096, Q=56-bit) | ~64 KB (N=4096, Q=109-bit)
Bootstrapping | Supported but rarely needed | Supported, more practical
Bootstrap overhead | ~100ms+ (high cost) | ~10-50ms (more efficient)
Comparison operations | Native (integer comparison) | Requires polynomial approximation
Quantum security | Yes (RLWE / lattice) | Yes (RLWE / lattice)

Key Takeaway

The performance differences between BFV and CKKS for basic operations (add, multiply, rotate) are relatively small. The major differences are in what those operations mean — exact vs. approximate — and in the parameter efficiency for different workload types. BFV with a single-level modulus is extremely compact. CKKS with deep modulus chains can grow large.

Use Case Decision Matrix

The choice between BFV and CKKS should be driven by the nature of the computation, not by performance benchmarks alone. Here is a detailed decision matrix with rationale for each domain.

Biometric Matching → BFV

Biometric authentication computes cosine similarity or Euclidean distance between encrypted templates. The match/no-match decision is binary and depends on an exact threshold comparison. Even a tiny approximation error could flip the decision. H33 uses BFV for this exact reason.

ML Inference → CKKS

Neural network inference involves matrix multiplications and polynomial-approximated activation functions. Model weights are inherently approximate (float32 or quantized int8), and inference outputs are probabilities that tolerate small errors. CKKS's native floating-point semantics map directly to this workload.

Database Queries → BFV

Private database queries (equality checks, range queries, keyword search) require exact matching. A query for "age = 25" must return exactly the records where age equals 25, not records where age is approximately 25. BFV's exact integer arithmetic handles this naturally.

Statistical Analysis → CKKS

Computing mean, variance, standard deviation, or correlation coefficients across encrypted datasets. These are inherently approximate computations — reporting a mean to 8 significant digits is more than sufficient. CKKS avoids the overhead of exact arithmetic for this class of workload.

Electronic Voting → BFV

Vote tallying must be exact. A vote is either cast or not cast. There is no "approximately 1 vote." BFV's integer arithmetic maps perfectly to binary or multi-candidate ballot encoding, and the exact addition guarantees a correct tally.

Signal Processing → CKKS

FFT, filtering, convolution, and spectral analysis operate on sampled real-valued signals that are inherently band-limited and noisy. CKKS's canonical embedding is structurally related to the DFT, making the encoding and computation particularly natural and efficient.

Financial Ledgers → BFV

Account balances, transaction amounts, and audit trails must be exact to the cent. Approximate arithmetic on financial data would introduce rounding discrepancies that compound across millions of transactions. BFV's modular arithmetic guarantees exact accounting.

Genomic Analysis → CKKS

Genome-wide association studies (GWAS) compute statistical correlations across hundreds of thousands of SNPs. The input data has inherent measurement noise, and the results are p-values and effect sizes that are meaningful only to a few significant figures. CKKS is the natural choice.

H33's Production BFV Configuration

H33's entire authentication pipeline runs on a single BFV configuration, tuned for minimum latency at 128-bit security. Every parameter choice has a specific rationale, and the configuration has been locked since February 2026 after extensive benchmarking on AWS Graviton4 (c8g.metal-48xl, 96 Neoverse V2 cores).

Why BFV for Authentication

The core operation in biometric template matching is an inner product between an encrypted probe template and a stored enrolled template. This inner product is then compared against a threshold to produce a match/no-match decision. The requirements are:

  1. Exact inner product — the distance computation must be bit-perfect to ensure deterministic match verdicts
  2. Integer-domain computation — biometric feature vectors are quantized to 8-10 bit integers during enrollment
  3. Single multiplicative depth — the inner product is a series of plaintext-ciphertext multiplications followed by additions, requiring only depth 1
  4. Maximum SIMD throughput — batching as many users as possible per ciphertext to amortize the NTT and key-switching overhead

CKKS could theoretically compute an inner product, but the approximation error would require a larger security margin in the threshold comparison, effectively wasting bits of precision to compensate for noise that BFV simply does not have. When the computation is naturally integer-valued and shallow, BFV is strictly superior.
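A plaintext model of the depth-1 computation makes the point. The feature values and threshold below are made up for illustration; in production the probe vector is encrypted and the arithmetic happens slot-wise mod t, but the integer math is identical and the result is bit-exact:

```rust
// Plaintext model of the matching computation: one plaintext-ciphertext
// multiply per dimension, then additions, then a deterministic compare.
fn main() {
    let probe: Vec<i64> = vec![100; 128];    // quantized 8-10 bit features (illustrative)
    let enrolled: Vec<i64> = vec![101; 128];
    let score: i64 = probe.iter().zip(&enrolled).map(|(a, b)| a * b).sum();
    let threshold: i64 = 1_000_000;          // illustrative integer threshold
    let is_match = score >= threshold;       // bit-perfect: no epsilon, no margin
    assert_eq!(score, 1_292_800);
    assert!(is_match);
    println!("score = {score}, match = {is_match}");
}
```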

Parameter Choices and Rationale

Production Parameters

Polynomial degree (N): 4,096
Ciphertext modulus (Q): 56-bit prime
Plaintext modulus (t): 65,537
Security level: 128-bit
Multiplicative depth: 1
Number of moduli: 1 (single Q)
SIMD slots: 4,096
Users per ciphertext: 32

Why N = 4096? This is the minimum polynomial degree that provides 128-bit security with a 56-bit modulus. Increasing to N = 8192 would give us more slots and a deeper noise budget, but would double the NTT cost for no benefit — we only need depth 1.

Why a single 56-bit modulus? With multiplicative depth 1, we do not need a modulus chain. A single prime Q means no modulus switching overhead, no RNS (Residue Number System) decomposition, and ciphertext sizes that fit comfortably in L2 cache. This is the smallest Q that provides 128-bit security at N = 4096 while leaving enough noise budget for one multiplication plus a handful of additions.

Why t = 65537? As discussed above, this is the smallest prime satisfying t ≡ 1 (mod 8192) for full SIMD batching with N = 4096. It provides 16 bits of dynamic range per slot, more than enough for quantized biometric features.
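A rough budget calculation shows why these parameters leave ample room for depth 1. This is a simplified model — real budgets come from the library's exact noise tracking:

```rust
// Simplified noise-budget arithmetic for the production parameters
// (56-bit Q, t = 65537 ≈ 2^16). Decryption succeeds while noise < Q/(2t),
// so the initial budget is roughly log2(Q) - log2(t) - 1 bits.
fn main() {
    let log_q = 56i32;
    let log_t = 16i32;
    let budget_bits = log_q - log_t - 1;
    assert_eq!(budget_bits, 39);
    println!("initial noise budget ≈ {budget_bits} bits");
}
```

Roughly 39 bits of budget against a circuit that consumes one multiplication plus a handful of additions is a comfortable margin, which is what lets the single-prime configuration work.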

Montgomery NTT Optimization

The Number Theoretic Transform is the computational bottleneck of any BFV implementation. Every encryption, decryption, and multiplication requires forward and inverse NTTs on degree-4096 polynomials. H33's NTT implementation uses several techniques that collectively eliminate all division from the hot path:

Rust ntt.rs (simplified)
/// Forward NTT with Montgomery arithmetic and Harvey lazy reduction.
/// Twiddles are pre-stored in Montgomery form. No division in the hot path.
pub fn forward_ntt_mont(
    data: &mut [u64],
    twiddles_mont: &[u64],
    q: u64,
    q_inv: u64,   // Montgomery inverse: -q^{-1} mod 2^64
    two_q: u64,   // 2*q for lazy reduction bound checks
) {
    let n = data.len();
    let mut t = n;
    let mut tw_idx = 0;

    while t > 1 {
        t >>= 1;
        for i in (0..n).step_by(2 * t) {
            let w = twiddles_mont[tw_idx];
            tw_idx += 1;
            for j in i..i + t {
                // Harvey butterfly: inputs and outputs stay in [0, 2q).
                // mont_mul and mask_if_ge are branchless helpers defined
                // elsewhere in the module.
                let u = data[j];
                let v = mont_mul(data[j + t], w, q, q_inv);
                let s = u + v;                 // in [0, 4q)
                data[j] = s - (two_q & mask_if_ge(s, two_q));
                let d = u + two_q - v;         // stays non-negative: no u64 underflow
                data[j + t] = d - (two_q & mask_if_ge(d, two_q));
            }
        }
    }
    // Final reduction: bring all values from [0, 2q) to [0, q)
    for x in data.iter_mut() {
        *x -= q & mask_if_ge(*x, q);
    }
}
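The snippet above leans on two helpers, `mont_mul` and `mask_if_ge`, that it does not define. A plausible implementation looks like the following — my sketch, not H33's actual code:

```rust
/// All-ones mask when a >= b, else zero — branchless.
#[inline]
fn mask_if_ge(a: u64, b: u64) -> u64 {
    ((a >= b) as u64).wrapping_neg()
}

/// Montgomery multiplication (REDC): returns a*b*R^{-1} mod q for R = 2^64.
/// `q_inv` must be -q^{-1} mod 2^64; q must be odd and well below 2^63.
#[inline]
fn mont_mul(a: u64, b: u64, q: u64, q_inv: u64) -> u64 {
    let t = (a as u128) * (b as u128);
    let m = (t as u64).wrapping_mul(q_inv);                 // m = (t mod R) * (-q^{-1}) mod R
    let u = ((t + (m as u128) * (q as u128)) >> 64) as u64; // exact division by R
    if u >= q { u - q } else { u }                          // reduce from [0, 2q) to [0, q)
}

/// Newton's iteration for -q^{-1} mod 2^64 (q odd): each step doubles the
/// number of correct low bits, so six steps reach 64 bits.
fn neg_inv_mod_r(q: u64) -> u64 {
    let mut inv: u64 = 1;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

fn main() {
    let q: u64 = 65537; // small prime for a quick sanity check
    let q_inv = neg_inv_mod_r(q);
    let r = ((1u128 << 64) % q as u128) as u64;             // R mod q
    let r2 = ((r as u128 * r as u128) % q as u128) as u64;  // R^2 mod q
    let to_mont = |x: u64| mont_mul(x, r2, q, q_inv);       // x  -> xR mod q
    let from_mont = |x: u64| mont_mul(x, 1, q, q_inv);      // xR -> x  mod q
    let (a, b) = (12345u64, 6789u64);
    let prod = from_mont(mont_mul(to_mont(a), to_mont(b), q, q_inv));
    assert_eq!(prod, (a * b) % q);
    assert_eq!(mask_if_ge(5, 5), u64::MAX);
    assert_eq!(mask_if_ge(4, 5), 0);
    println!("mont_mul round-trip OK");
}
```

For the production transform, `q_inv` and the Montgomery-form twiddles would be precomputed once at key setup, so the hot path contains only multiplies, shifts, and masked subtracts.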

Benchmark Data

All measurements on c8g.metal-48xl (96 cores, AWS Graviton4, Neoverse V2), system allocator, Criterion.rs v0.5, February 2026.

Production Throughput

FHE batch (32 users): ~1,375µs
ZKP STARK lookup: ~0.067µs
Dilithium attestation: ~240µs
Total 32-user batch: ~1,615µs
Per authentication: ~50µs
Sustained throughput (96 workers): ~1.2M auth/sec

The FHE batch dominates the pipeline at ~85% of total latency. The ZKP and Dilithium attestation stages are amortized across the batch (one proof and one signature per 32 users), making them negligible per-authentication. This is why BFV parameter optimization — and specifically NTT optimization — is the single most important performance lever in the entire H33 stack. See FHE Performance Optimization for the full optimization journey.

Parameter Selection Guidelines

Choosing FHE parameters is an exercise in balancing four competing constraints: security level, computation depth, performance, and ciphertext size. The FHE Parameter Selection Guide covers this in exhaustive detail. Here we summarize the key considerations for BFV and CKKS.

BFV Parameter Selection

  1. Determine your multiplicative depth. Count the maximum number of sequential ciphertext-ciphertext multiplications in your circuit. For inner products with plaintext operands, the depth is 1. For matrix multiplications, it may be 2-4.
  2. Choose N. Start with the smallest power of two that provides your target security level (typically 128 bits) given the modulus size you will need. Use the Homomorphic Encryption Standard security tables.
  3. Choose Q. The ciphertext modulus must be large enough to accommodate the initial noise plus the noise growth from your computation. For depth 1 with N=4096, a single 56-bit prime suffices. For depth 4+, you will need a modulus chain of 3-8 primes totaling 200-400 bits.
  4. Choose t. For SIMD batching, pick the smallest prime satisfying t ≡ 1 (mod 2N) that provides sufficient dynamic range for your plaintext values.
  5. Validate security. Use the lattice-estimator tool to confirm that your (N, Q) pair achieves the target security level against known lattice attacks (primal uSVP, dual, hybrid).

CKKS Parameter Selection

  1. Determine your precision requirement. How many bits of precision do you need in the final result? This determines the minimum scale Δ, typically 2^{30} to 2^{60}.
  2. Determine your multiplicative depth. Same as BFV, but remember that each multiplication requires a mandatory rescaling step that consumes one modulus level.
  3. Build the modulus chain. You need L + 1 primes: L primes of size ~log2(Δ) bits for the computation levels, plus one special prime for the initial encryption level. The total modulus Q is the product of all primes.
  4. Choose N. The polynomial degree must provide 128-bit security for the total modulus size. Since CKKS modulus chains tend to be larger than BFV single-level moduli, CKKS typically requires N = 8192 or 16384 for moderate computation depths.
  5. Validate precision. Run your computation on test data and measure the actual output error against the expected result. Adjust Δ upward if precision is insufficient.

Common Mistake

Do not over-provision parameters. A common error is choosing N = 16384 "just to be safe" when N = 4096 would suffice. The NTT cost scales as O(N log N), so doubling N roughly doubles the latency. Similarly, using a modulus chain of 8 primes when your circuit only needs depth 2 wastes memory and bandwidth on unused ciphertext capacity. Profile your actual computation depth, then choose the minimum parameters that satisfy it.

Performance Optimization Techniques

Both BFV and CKKS share the same computational bottlenecks: NTT, polynomial multiplication, and key-switching. The FHE Performance Optimization guide covers these in detail. Here we highlight the techniques most relevant to scheme selection.

NTT-Domain Operations

The Number Theoretic Transform converts polynomial multiplication from O(N^2) to O(N log N). But beyond this asymptotic improvement, keeping operands in the NTT domain between operations avoids redundant forward/inverse transforms. In H33's BFV pipeline:

  1. Enrolled templates are stored pre-transformed in the NTT domain, so they never need a forward transform at query time.
  2. The encrypted probe is transformed once on arrival.
  3. The inner product is computed entirely in the NTT domain as a pointwise multiply-accumulate.
  4. A single inverse NTT recovers the final result.

This NTT-domain persistence strategy reduces the number of NTT operations per authentication from dozens to the absolute minimum: one forward NTT for the encrypted probe, pointwise multiply-accumulate in NTT domain, one inverse NTT for the final result.
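Once everything lives in the NTT domain, the accumulation step is just a coefficient-wise multiply-accumulate. A minimal sketch of that kernel, using plain u128 reduction rather than the Montgomery arithmetic a production kernel would use:

```rust
// Pointwise multiply-accumulate in the NTT domain: polynomial multiplication
// becomes coefficient-wise multiplication, so accumulating many products
// never leaves the transform domain.
fn pointwise_mac(acc: &mut [u64], a: &[u64], b: &[u64], q: u64) {
    for ((acc_i, &a_i), &b_i) in acc.iter_mut().zip(a).zip(b) {
        let prod = (a_i as u128 * b_i as u128 % q as u128) as u64;
        *acc_i = ((*acc_i as u128 + prod as u128) % q as u128) as u64;
    }
}

fn main() {
    let q: u64 = 65537;
    let mut acc = vec![0u64; 4];
    // Two operand pairs already in the NTT domain (values illustrative).
    pointwise_mac(&mut acc, &[1, 2, 3, 4], &[10, 20, 30, 40], q);
    pointwise_mac(&mut acc, &[5, 6, 7, 8], &[50, 60, 70, 80], q);
    assert_eq!(acc, vec![260, 400, 580, 800]);
    println!("{acc:?}");
}
```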

Lazy Reduction (Harvey Method)

Standard modular arithmetic reduces every intermediate result to the range [0, Q). The Harvey lazy reduction technique relaxes this constraint, allowing intermediate values to remain in [0, 2Q) between NTT butterfly stages. The benefits are twofold:

  1. Fewer reductions — the full reduction to [0, Q) is deferred to a single pass at the end of the transform instead of happening inside every butterfly.
  2. Branchless inner loops — the relaxed bound check compiles to masked add/subtract operations with no data-dependent branches, keeping the CPU pipeline full.

Batch Processing

SIMD batching (packing 32 users per ciphertext) is the single largest throughput multiplier. But there is a second level of batching: pipeline batching, where multiple ciphertexts are processed concurrently across worker threads. H33 runs 96 parallel workers on Graviton4, each processing independent 32-user batches. The workers share no mutable state — all key material is read-only and shared via Arc — so there is zero contention.

Hardware-Specific Tuning

Different CPU architectures favor different optimization strategies:

ARM NEON (Graviton4)

128-bit SIMD registers. Excellent for branchless permutations (Galois rotation), vectorized key-switching, and add/subtract/compare operations. NEON lacks native 64x64-bit multiply-to-128-bit, so Montgomery multiplication stays scalar. H33 uses NEON for Galois operations and batch CBD sampling.

x86 AVX-512

512-bit SIMD registers with 52-bit integer multiply (IFMA). AVX-512 uses Shoup's method (precomputed quotient) rather than Montgomery REDC for modular multiplication. H33's Montgomery-based NTT is ARM-optimized; an AVX-512 port would require a structural rewrite to Shoup's method for maximum throughput.

Quantum Security

Both BFV and CKKS are built on the Ring Learning With Errors (RLWE) problem, which is a specific instance of the general lattice-based hardness assumption. RLWE's security reduces to the worst-case hardness of finding short vectors in ideal lattices — a problem for which no efficient quantum algorithm is known.

This is a critical advantage of FHE over classical encryption schemes. RSA and ECC will be broken by Shor's algorithm on a sufficiently large quantum computer. AES will have its effective security halved by Grover's algorithm. But RLWE-based schemes — including both BFV and CKKS — are inherently post-quantum.

Post-Quantum by Construction

You do not need to add a separate post-quantum layer to an FHE computation. The data is already encrypted under a lattice-based scheme that resists both classical and quantum attacks. This is why H33's full-stack pipeline (FHE + ZKP + Dilithium attestation) achieves end-to-end post-quantum security: the FHE layer protects the biometric data, and the Dilithium attestation protects the authentication verdict.

The best known quantum attack against RLWE is based on quantum lattice sieving, which provides at most a polynomial speedup over classical lattice sieving — far less than the exponential speedup Shor's algorithm provides against RSA. The Homomorphic Encryption Standard committee has published security estimates that account for quantum attacks, and H33's parameter choice (N = 4096, Q = 56-bit) meets the 128-bit post-quantum security threshold according to these estimates.

Other FHE Schemes Worth Knowing

BFV and CKKS dominate SIMD-batched workloads, but two other schemes are important for completeness.

BGV (Brakerski-Gentry-Vaikuntanathan)

BGV is the intellectual predecessor to BFV and shares the same exact-integer-arithmetic model. The key difference is noise management: BGV uses modulus switching as a mandatory step after every multiplication to reduce noise, while BFV uses a scale-invariant approach where modulus switching is optional.

In practice, BFV and BGV perform similarly for shallow circuits. BGV can be slightly more efficient for deep circuits because its modulus switching is more tightly integrated with the noise growth analysis. However, BFV's simpler noise model makes it easier to reason about and implement correctly. Most modern FHE libraries (Microsoft SEAL, OpenFHE, Lattigo) support both, and the choice between them is often an implementation detail rather than an architectural decision.

Dimension | BGV | BFV | CKKS | TFHE
Arithmetic | Exact integer | Exact integer | Approximate real | Boolean / small integer
Batching | SIMD (CRT) | SIMD (CRT) | SIMD (canonical emb.) | No native SIMD
Noise mgmt | Modulus switching | Scale-invariant | Rescaling | Bootstrapping (fast)
Best for | Deep integer circuits | Shallow integer circuits | ML, statistics | Comparisons, Boolean
Bootstrap cost | High (~100ms+) | High (~100ms+) | Moderate (~10-50ms) | Low (~10-100µs)
Quantum-safe | Yes (RLWE) | Yes (RLWE) | Yes (RLWE) | Yes (LWE)

TFHE (Torus FHE)

TFHE operates on individual bits or small integers (typically 2-8 bits) and evaluates arbitrary Boolean circuits gate by gate. Its distinguishing feature is programmable bootstrapping: a bootstrapping operation that not only refreshes the noise but also evaluates a lookup table in the process. This makes TFHE uniquely suited for:

  1. Comparisons and threshold functions, evaluated exactly via lookup tables.
  2. Branching and control-flow-heavy logic expressed as Boolean circuits.
  3. Encrypted database filters with complex predicates.
  4. Arbitrary non-polynomial functions on small integers.

The trade-off is throughput. TFHE processes one bit at a time (or small integers), while BFV and CKKS process thousands of values in parallel via SIMD batching. For data-parallel workloads like biometric matching or ML inference, BFV and CKKS are orders of magnitude faster. For serial logic-heavy workloads like encrypted database filters with complex predicates, TFHE can be the better choice.

Decision Flowchart

Use this flowchart to determine the right FHE scheme for your workload:

Does your computation require exact integer results?
YES → BFV
↓ NO
Does your computation operate on real/floating-point numbers?
YES → CKKS
↓ NO
Does your computation involve complex comparisons or branching?
YES → TFHE
↓ NO
Is your circuit depth > 10 with exact integer arithmetic?
YES → BGV
↓ NO
Default for shallow integer circuits with SIMD batching
BFV

Conclusion

BFV and CKKS are not competing schemes — they are complementary tools designed for fundamentally different computational models. BFV provides exact integer arithmetic with zero precision loss, making it the only correct choice for applications where a single bit of error can change the outcome: biometric authentication, database equality checks, financial accounting, and electronic voting. CKKS provides efficient approximate arithmetic on real numbers, making it the natural choice for machine learning inference, statistical analysis, signal processing, and any workload where the input data is itself approximate.

H33 chose BFV for its production authentication pipeline because biometric matching is inherently an exact computation. The encrypted inner product between a probe template and an enrolled template must yield a bit-perfect result to ensure deterministic match/no-match verdicts. With BFV parameters tuned to the minimum necessary for this workload (N = 4096, single 56-bit Q, t = 65537), the entire pipeline runs at ~50 microseconds per authentication — fast enough to sustain 1.2 million authentications per second on a single Graviton4 instance.

For workloads that require approximate arithmetic, H33 also supports CKKS through the same API. The parameter selection, encoding, and computation mechanics differ, but the underlying RLWE security guarantee is identical. Both schemes are post-quantum by construction, and both benefit from the same NTT and key-switching optimizations.

The right scheme is the one that matches your computation's precision requirements. Everything else — performance, parameter sizes, implementation complexity — follows from that fundamental choice.

Further Reading

What Is Fully Homomorphic Encryption? — start here if you are new to FHE.
FHE Parameter Selection Guide — detailed walkthrough of choosing N, Q, and t.
FHE Performance Optimization — the full optimization journey from milliseconds to microseconds.
NTT: Number Theoretic Transform — deep dive into the computational engine behind all FHE schemes.
Biometric Authentication Guide — how H33 uses BFV for encrypted biometric matching.
Biometric Template Protection — why encrypted templates outperform hashed templates.
Introduction to Lattice Cryptography — the mathematical foundation underlying both BFV and CKKS.
What Is Post-Quantum Cryptography? — the quantum threat and how lattice-based schemes address it.

Ready to Go Quantum-Secure?

Start protecting your users with post-quantum authentication today. 1,000 free auths, no credit card required.

Get Free API Key →
