Engineering · 6 min read

Why We Ship 1.67 Million Auth/Sec, Not 2.2 Million

We measured 2.2 million authentications per second on bare metal. We ship 1.67 million. The difference is three contended-lock bugs, a production cache layer, and a decision to publish the number that actually matters.

Eric Beans, CEO, H33.ai

The two numbers

If you've been following H33's benchmark history, you've seen two throughput figures for the substrate pipeline on AWS c8g.metal-48xl (Graviton4, 192 vCPU, ARM Neoverse V2):

2,209,429 auth/sec — measured in March 2026, sustained over 120 seconds, 96 concurrent workers.

Approximately 1.67 million auth/sec: measured on April 11, 2026, sustained over 30 seconds, 96 concurrent workers, full production cache layer active. This is the number in our whitepaper.

Same hardware. Same pipeline stages: 32-user BFV FHE batch → three-family PQ signing (ML-DSA-65 + FALCON-512 + SLH-DSA-SHA2-128f) → ZKP lookup. The second number is 24.5% lower. This post explains exactly why, and why we chose to publish the lower one.

What changed between March and April

The March run measured the raw DashMap path. No CacheeLFU admission sketch. No instrumentation. No pattern detector. No metrics histograms. A bare concurrent hash map sitting in front of the FHE + signing pipeline, doing nothing but lookup and insert. That configuration produces the highest possible throughput because it has the least overhead — and it is also the configuration that no customer will ever deploy, because it has no observability, no admission control, and no cache intelligence.

The April run measured the full production stack: CacheeLFU with a Count-Min Sketch admission filter, atomic statistics collectors, lock-free histograms, and a sampled pattern detector. This is the configuration that ships. It is the configuration customers run. And it is slower.
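To make the admission filter concrete, here is a minimal, illustrative Count-Min Sketch of the kind used for frequency-based admission. The structure, names, and parameters below are ours for illustration, not the shipped CacheeLFU code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Minimal Count-Min Sketch: `depth` hash rows of `width` counters each.
// The frequency estimate is the minimum across rows, so it can over-count
// (hash collisions) but never under-count. Memory is constant regardless
// of how many distinct keys pass through.
struct CountMinSketch {
    depth: usize,
    width: usize,
    rows: Vec<Vec<u32>>,
}

impl CountMinSketch {
    fn new(depth: usize, width: usize) -> Self {
        Self { depth, width, rows: vec![vec![0; width]; depth] }
    }

    fn index<K: Hash>(&self, key: &K, row: usize) -> usize {
        let mut h = DefaultHasher::new();
        row.hash(&mut h); // per-row seed so rows hash independently
        key.hash(&mut h);
        (h.finish() as usize) % self.width
    }

    fn increment<K: Hash>(&mut self, key: &K) {
        for r in 0..self.depth {
            let i = self.index(key, r);
            self.rows[r][i] = self.rows[r][i].saturating_add(1);
        }
    }

    fn estimate<K: Hash>(&self, key: &K) -> u32 {
        (0..self.depth)
            .map(|r| self.rows[r][self.index(key, r)])
            .min()
            .unwrap_or(0)
    }

    // Admission decision: admit a candidate only if it is estimated to be
    // at least as frequent as the eviction victim it would displace.
    fn admit<K: Hash>(&self, candidate: &K, victim: &K) -> bool {
        self.estimate(candidate) >= self.estimate(victim)
    }
}

fn main() {
    let mut sketch = CountMinSketch::new(4, 4096);
    for _ in 0..10 {
        sketch.increment(&"hot_key");
    }
    sketch.increment(&"cold_key");
    assert!(sketch.estimate(&"hot_key") >= 10);
    assert!(sketch.admit(&"hot_key", &"cold_key"));
    println!("hot estimate = {}", sketch.estimate(&"hot_key"));
}
```

Note that the sketch itself is read-mostly and cheap; as the rest of this post shows, the cost problem was never in this structure but in the instrumentation around it.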

The 9.3× regression that started everything

When we first integrated the CacheeLFU layer on April 11 — same hardware, same 96-worker load that produced the 1,708,400 auth/sec raw-DashMap baseline on that day's run — the throughput dropped to 183,828 auth/sec. A 9.3× regression.

On its face, that would suggest the cache layer was fundamentally broken. Nine times slower than a bare hash map is not "overhead." It's a different product.

We did not ship that number. We investigated. And what we found was that all three root causes were contended write locks in the cache's instrumentation code — not in the CacheeLFU admission sketch, not in the main data structures, not in the cache engine itself. The cache was fine. The monitoring around it was serializing the entire hot path.

Three bugs, three fixes

Bug 1: Arc<RwLock<InternalStats>>

The cache's internal statistics structure — hit counters, miss counters, admission counters, eviction counters — was wrapped in Arc<RwLock<InternalStats>> and modified on every cache operation. At 96 concurrent workers, each performing millions of operations per second, the write lock was serializing the entire hot path. Every worker waited for every other worker to finish incrementing a counter.

The fix was straightforward: replace the RwLock with Arc<InternalStats> where every counter field is an AtomicU64, incremented via fetch_add(1, Ordering::Relaxed). No lock. No contention. The counter is eventually consistent, which is fine for observability metrics — you don't need a globally ordered view of hit counts, you need an approximate total.
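The shape of that fix, as a minimal sketch (the struct and field names here are illustrative, not the shipped code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Illustrative stand-in for the cache's stats struct: every counter is an
// AtomicU64, so concurrent workers increment without taking any lock.
#[derive(Default)]
struct InternalStats {
    hits: AtomicU64,
    misses: AtomicU64,
    admissions: AtomicU64,
    evictions: AtomicU64,
}

impl InternalStats {
    // Relaxed ordering is enough: we want an approximate total for
    // observability, not a globally ordered view of events.
    fn record_hit(&self) {
        self.hits.fetch_add(1, Ordering::Relaxed);
    }
    fn hit_count(&self) -> u64 {
        self.hits.load(Ordering::Relaxed)
    }
}

fn main() {
    let stats = Arc::new(InternalStats::default());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let s = Arc::clone(&stats);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    s.record_hit(); // lock-free hot path
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // fetch_add is atomic: no increments are lost even without a lock.
    assert_eq!(stats.hit_count(), 80_000);
    println!("hits = {}", stats.hit_count());
}
```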

Result: 183,828 → 395,816 auth/sec (+115%). Better, but still 4.3× below the baseline.

Bug 2: RwLock<SimpleHistogram>

The metrics collector held latency histograms as RwLock<SimpleHistogram>. Here's the irony: the SimpleHistogram type was already implemented with atomic counters internally. The outer RwLock was pure overhead. Every histogram update took an exclusive write lock just to call into a function that was already atomic underneath: a lock wrapping a lock-free structure.

The fix: remove the RwLock entirely. Change SimpleHistogram::record() from &mut self to &self. For the rolling-window bookkeeping that genuinely needed mutation, switch to try_write() with skip-on-contention — the hot path never stalls on bookkeeping.
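A minimal sketch of the resulting shape (the histogram internals here are ours for illustration; the shipped SimpleHistogram differs):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

// Illustrative histogram: the buckets are atomic, so record() needs only
// &self. No outer RwLock, no exclusive lock on the hot path.
struct SimpleHistogram {
    buckets: Vec<AtomicU64>,  // one bucket per power-of-two value band
    window: RwLock<Vec<u64>>, // rolling-window bookkeeping (cold path)
}

impl SimpleHistogram {
    fn new(n: usize) -> Self {
        Self {
            buckets: (0..n).map(|_| AtomicU64::new(0)).collect(),
            window: RwLock::new(Vec::new()),
        }
    }

    // Hot path: &self, purely atomic, never blocks.
    fn record(&self, value_us: u64) {
        let idx = (64 - value_us.leading_zeros() as usize)
            .min(self.buckets.len() - 1);
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);

        // Bookkeeping that genuinely needs mutation uses try_write with
        // skip-on-contention: if another thread holds the lock, we drop
        // the sample rather than stall the hot path.
        if let Ok(mut w) = self.window.try_write() {
            w.push(value_us);
        }
    }

    fn count(&self) -> u64 {
        self.buckets.iter().map(|b| b.load(Ordering::Relaxed)).sum()
    }
}

fn main() {
    let h = SimpleHistogram::new(16);
    for v in [3, 40, 900, 12_000] {
        h.record(v);
    }
    assert_eq!(h.count(), 4);
    println!("recorded {} samples", h.count());
}
```

The try_write pattern trades occasional dropped window samples for a guarantee that recording a latency never waits on another thread, which is the right trade for metrics.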

Result: 395,816 → ~1,400,000 auth/sec. Now we're within striking distance of the baseline.

Bug 3: RwLock<VecDeque> on every cache get

The cache included a workload pattern detector that maintained a rolling window of recent access records in a RwLock<VecDeque<AccessRecord>>. Every cache get took a write lock to push a record onto this deque. The pattern detector was only consulted periodically for correlation analysis, but the write side ran on every single hot-path operation.

The fix: add a sample rate parameter (default 64). Maintain an AtomicU64 counter in the detector. Only take the full write lock on every 64th call. Pattern detection is approximate by design — it feeds a correlation heuristic, not a correctness check — so 64× under-sampling costs essentially nothing in detector quality while reducing lock contention by 64×.
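A minimal sketch of the sampling pattern (names and the window bound are illustrative, not the shipped detector):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

// Illustrative pattern detector: only every Nth access pays for the
// write lock; the rest do a single relaxed atomic increment.
struct PatternDetector {
    sample_rate: u64,              // e.g. 64: record 1 in 64 accesses
    calls: AtomicU64,              // lock-free call counter
    window: RwLock<VecDeque<u64>>, // rolling window of sampled records
}

impl PatternDetector {
    fn new(sample_rate: u64) -> Self {
        Self {
            sample_rate,
            calls: AtomicU64::new(0),
            window: RwLock::new(VecDeque::new()),
        }
    }

    // Called on every cache get; takes the lock only 1/sample_rate of the time.
    fn observe(&self, key_hash: u64) {
        let n = self.calls.fetch_add(1, Ordering::Relaxed);
        if n % self.sample_rate == 0 {
            let mut w = self.window.write().unwrap();
            w.push_back(key_hash);
            if w.len() > 1024 {
                w.pop_front(); // bound the rolling window
            }
        }
    }

    fn sampled(&self) -> usize {
        self.window.read().unwrap().len()
    }
}

fn main() {
    let d = PatternDetector::new(64);
    for k in 0..6_400u64 {
        d.observe(k);
    }
    // 6,400 calls at a 1-in-64 sample rate → 100 sampled records.
    assert_eq!(d.sampled(), 100);
    println!("sampled {} of 6400 accesses", d.sampled());
}
```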

Result: ~1,400,000 → approximately 1.67 million auth/sec. That's 2.37% below that day's raw-DashMap baseline. Inside normal run-to-run variance. Effectively free.

The recovery

Stage                                  Auth/sec      vs. baseline
Raw DashMap baseline (Apr 11)          1,708,400     (baseline)
First integration (all three bugs)       183,828     9.3× slower
After Bug 1 fix                          395,816     4.3× slower
After Bug 2 fix                       ~1,400,000     1.22× slower
After Bug 3 fix (production)          ~1,670,000     2.37% slower

9.1× recovery. From 183K to 1.67M. Every fix was removing a write lock from instrumentation code that didn't need one.

Why not ship the 2.2M number?

Because nobody runs it.

The 2,209,429 figure from March was measured on a configuration with no cache layer, no admission sketch, no instrumentation, and no pattern detector. It is the throughput of a bare concurrent hash map plus the FHE + signing pipeline. It is real. We measured it. It is reproducible. And it is the number you get when you strip out every feature that makes the cache useful in production.

A customer deploying the substrate doesn't run a bare DashMap. They run CacheeLFU with admission control, operational metrics, and workload pattern detection. The number that matters to them is the 1.67 million figure, because that's what their deployment actually does.

Publishing the 2.2M number as our headline figure would be technically accurate and practically misleading. We chose not to do that.

The separate regression

There is one more number worth disclosing. The raw DashMap baseline itself dropped between March and April:

  • March 2026 (v11): 2,209,429 auth/sec raw DashMap
  • April 2026: 1,708,400 auth/sec raw DashMap

That's a 22.7% regression on the same instance type with the same binary running the same pipeline, and it has nothing to do with the cache layer. The cache layer's 2.37% overhead is measured against whatever the raw baseline produces on a given run.

We have not yet root-caused the raw regression. We suspect a change in the Dilithium signing path between the v11 build and the April build, but this requires dedicated profiling work and we have not completed it. We mention it here because the alternative — not mentioning it — would leave readers to discover the discrepancy themselves and draw their own conclusions. We prefer to state the gap, acknowledge we don't fully understand it, and commit to investigating.

The lesson

An RwLock on the hot path is almost always wrong at scale, even when the code inside the lock is trivial.

All three bugs shared the same pattern: a lock wrapping a cheap operation that ran on every cache access. At low concurrency, the lock overhead is invisible — nanoseconds of contention, swallowed by the noise. At 96 concurrent workers on bare-metal ARM hardware, the same lock becomes the dominant cost in the pipeline. The FHE computation takes 943 microseconds per batch. The three-family signing takes 391 microseconds. A single contended RwLock increment? It can take longer than both of those combined when 96 threads are fighting for it.

None of these bugs was visible in single-threaded or low-concurrency benchmarks. They all required sustained contention from 96 workers on metal before the lock overhead rose above the baseline noise. If we had benchmarked at 8 workers on a laptop, we would have shipped with a 9.3× regression and never known.

Load testing at the target concurrency, on the target hardware, is not optional. We learned this the expensive way.

What we ship

Approximately 1.67 million authentications per second. Sustained. Production cache layer active. All instrumentation running. All three bug fixes applied. CacheeLFU admission sketch contributing approximately 12 nanoseconds of per-operation latency and 512 KiB of constant memory. Per-authentication latency: 42 microseconds. Per-authentication hardware cost: approximately $3.8 × 10⁻¹⁰.

The prior 2.2M figure is superseded. We will continue to investigate the raw-DashMap regression and will publish an update if we recover it. But the production number is approximately 1.67 million auth/sec. That is the number that matters to customers, to the Bitcoin community, and to anyone evaluating the substrate as infrastructure, and it is the number we stand behind.

Read the Whitepaper

The full benchmark methodology, pipeline breakdown, and contended-lock bug analysis are in Section 7 of the H33 Substrate whitepaper.
