We load-tested our cache honestly. Here's what broke.
Earlier in this series I wrote a post walking through the three contended-lock bugs we found in our cache's observability code during production load testing. That post was about the bugs. This post is about the discipline that surfaced the bugs — the specific choices we made when setting up the benchmark that let us catch the regression before a customer did, and the broader principles about load testing that I think are underrated in production engineering discussions.
The short version: we benchmarked our cache at the target concurrency on the target hardware with the target workload shape, and the first result was a 9.3× throughput regression that we did not expect. If we had benchmarked at lower concurrency, different hardware, or a simpler workload, we would have shipped the regression into production. The fact that we caught it was not an accident. It was the result of specific decisions about how to run the benchmark.
I want to write about those decisions, because the pattern I see across production systems is that load testing is either skipped entirely (because it is expensive and teams are under time pressure) or done in a way that does not match production conditions closely enough to find the real bugs. This post is about what it looks like to do it honestly.
The thing we almost shipped
Let me start with what the benchmark run was supposed to be. The substrate cache layer was being integrated into our production signing pipeline. The integration path was: a customer request arrives, the pipeline looks up some metadata in a cache, if the metadata is cached the signing proceeds, if not the signing pipeline fetches from a slower backend and populates the cache. Standard cache-assisted request handling. The cache we were integrating was our own implementation — a two-tier concurrent hash map with a Count-Min Sketch admission filter, Prometheus-style metrics, and a workload pattern detector.
We had spent several weeks writing and testing the cache in isolation. The isolation benchmarks looked good. Single-threaded performance was in the expected range. The admission sketch was adding roughly 12 nanoseconds per operation, well within what we thought the pipeline could afford. The memory footprint was as designed — constant in the keyspace, roughly half a megabyte of sketch memory regardless of how many keys we tracked. All the unit tests passed. All the property tests passed. The coverage numbers were respectable.
The next step was to integrate the cache into the signing pipeline and benchmark the full pipeline against the baseline, which was the existing raw-DashMap path without the new cache layer. We expected the integrated cache to add somewhere between 0.5% and 2% overhead in the full pipeline — the kind of overhead you see when you instrument an existing data structure with additional observability. We built an integration benchmark, provisioned a bare-metal Graviton4 instance, and kicked off the run.
The baseline number came back at 1,708,400 auth/sec on the raw-DashMap path. That was in line with what we expected from the hardware and the pipeline. Good.
The integrated cache number came back at 183,828 auth/sec. That was 9.3× lower than the baseline. That was catastrophic.
My first reaction was "this can't be right, the benchmark must be broken." I re-ran it with different random seeds and different worker counts to see if the number was stable. It was stable. I ran it twice more just to make sure. 183,828, 184,200, 183,500. The range was tight. The number was real.
We had a 9.3× throughput regression that our unit tests had not caught, our property tests had not caught, our isolation benchmarks had not caught, and our code review had not caught. The regression existed because the integration conditions were different from any of the test conditions, and the bugs only manifested under those specific integration conditions. The bugs were in the observability code, as I described in the earlier post, and they only became visible when the cache was being called from 96 concurrent worker threads all competing for the same instrumentation locks.
If the benchmark had been done at lower concurrency — say, 8 worker threads — the regression would have been much smaller. The instrumentation locks would still have been a bottleneck in principle, but at 8 threads the lock contention would have cost maybe 10% of throughput, not 90%. A 10% regression would have been easy to miss. It would have looked like "the integrated cache is a bit slower than the baseline, but we can optimize later." We would have shipped it.
At 32 threads, the regression would have been maybe 40%, which is alarming but still within the range where someone might say "that's because of the admission sketch, let's investigate later." We would have filed a follow-up and shipped.
At 96 threads, the regression was 89.2%, which is not within any acceptable range. It was so clearly broken that we had no option but to debug it immediately, and the debugging path led us to the real bugs in the observability code.
The lesson I want to draw out is that the benchmark concurrency determined whether we caught the bugs. At a lower concurrency, we would have missed them. At the target concurrency, we could not miss them. The benchmark discipline of running at the target concurrency was the thing that made the difference between shipping broken software and shipping working software.
Why lower-concurrency benchmarks give false confidence
The specific reason lower-concurrency benchmarks do not catch lock-contention bugs is that lock contention is nonlinear in the number of contenders. At one worker, a lock costs essentially nothing, because it is never contested. At two workers, each operation has some chance of finding the lock held and paying a wait. At eight workers, both the chance of a collision and the depth of the queue behind the lock grow. At 96 workers, the lock is effectively a serialization point for the entire operation, because the probability that any given thread finds the lock uncontested is very low.
The relationship between thread count and lock-contention cost is not even quasi-linear. It is steeply superlinear in the regime where the lock's service time is a small fraction of the operation's total time, and then it asymptotes to "all operations serialized through the lock" at high thread counts. This means that a lock that is "almost free" at 8 threads can be "completely broken" at 96 threads, and you cannot see the asymptotic behavior from benchmarks that only go to 8 threads.
A team that benchmarks at 8 threads, sees the "almost free" number, and extrapolates linearly to 96 threads will predict a throughput that is much higher than the actual throughput. The extrapolation is wrong because the underlying phenomenon is not linear. The team will be surprised in production when the 96-thread throughput does not match their prediction. The surprise can be catastrophic if the team has made commitments based on the predicted throughput.
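The shape of this failure can be made concrete with a toy model. Assume each operation does a fixed amount of lock-free work plus a short critical section in one shared lock: aggregate throughput scales linearly with threads until the lock saturates, then flatlines at the lock's service rate. The numbers below (900 ns of work, 100 ns under the lock) are hypothetical, chosen only to illustrate the shape, not measurements from our pipeline:

```rust
// Toy model: each op does `work_ns` of lock-free work plus `lock_ns`
// inside one shared lock. Aggregate throughput is the smaller of
// "linear scaling" and "the lock's serialization ceiling".
fn model_ops_per_sec(threads: f64, work_ns: f64, lock_ns: f64) -> f64 {
    let linear = threads / (work_ns + lock_ns); // ops/ns with no contention
    let ceiling = 1.0 / lock_ns;                // ops/ns when fully serialized
    linear.min(ceiling) * 1e9
}

fn main() {
    for threads in [1.0, 8.0, 32.0, 96.0] {
        let actual = model_ops_per_sec(threads, 900.0, 100.0);
        let ideal = threads / 1000.0 * 1e9; // naive linear extrapolation
        let loss = 100.0 * (1.0 - actual / ideal);
        println!("{threads:>3} threads: {actual:>11.0} ops/s, {loss:4.1}% below linear");
    }
}
```

In this hard-min idealization the lock is invisible right up to saturation (0% loss at 8 threads, ~90% at 96); real locks degrade more gradually on the way there, via cache-line bouncing and wake-up latency, which is why intermediate thread counts show partial regressions. The point survives either way: extrapolating linearly from the pre-saturation regime predicts 96-thread throughput that is roughly an order of magnitude too high.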
The defense against this kind of surprise is to benchmark at the target concurrency, not at a scaled-down proxy. If production is going to run at 96 threads on bare-metal hardware with a specific memory architecture, the benchmark needs to run at 96 threads on bare-metal hardware with the same memory architecture. Anything less and you are extrapolating from a regime where the dominant phenomena are different from the phenomena that will dominate in production.
This is expensive advice. Bare-metal Graviton4 instances are not cheap to run. A benchmark session on one costs real money, and it takes real engineering time to set up, run, and analyze. The alternative — running on a cheaper instance type, at lower concurrency, for a shorter duration — is much more affordable, and it produces numbers that look plausible on a slide. The temptation to cut corners on benchmark infrastructure is strong. The reward for cutting those corners is that you ship the bugs.
The specific choices that made the benchmark honest
Let me walk through the specific decisions we made for this benchmark run, because each decision matters and each decision is one that a less disciplined approach would have done differently.
Decision 1: Target concurrency. Our production target for the signing pipeline was 96 concurrent worker threads on a bare-metal Graviton4 instance. The benchmark ran at exactly 96 worker threads. We did not benchmark at 8 or 16 or 32 and extrapolate. We did not use a thread pool that "could handle up to 96 threads but usually runs fewer"; we committed exactly 96 threads to the workload for the duration of the measurement window.
Decision 2: Target hardware. Production runs on AWS c8g.metal-48xl instances, which have 192 vCPUs of Graviton4 (ARM Neoverse V2) on bare metal, not virtualized. The benchmark ran on the same instance type. We did not use a c8g.16xlarge (a smaller Graviton4 variant) or a Nitro-virtualized c8g.48xlarge (a faster-to-provision alternative with the same vCPU count). The difference between metal and Nitro on Graviton4 is approximately 6% in sustained throughput for this workload (we measured it separately), which is small but not zero. For the integration benchmark, we wanted to eliminate the Nitro-tax variable, so we ran on metal.
Decision 3: Target workload shape. The signing pipeline does a specific sequence of operations per request: FHE inner product with 32 users per SIMD batch, followed by a ZKP lookup against the cache, followed by a SHA3 digest, followed by a Dilithium sign and verify. The benchmark replicated this exact sequence, at the exact batch size, with the exact dataset distribution that matches production. We did not use a synthetic workload that only exercised one part of the pipeline; we used the full pipeline from end to end. This is important because the cache's interaction with the surrounding pipeline is where the instrumentation bugs lived — not in the cache in isolation.
Decision 4: Sustained duration. The benchmark ran for 30 seconds of sustained throughput measurement, not a single-burst measurement. Thirty seconds is long enough for the system to reach steady state (hot code paths and data caches are warm, branch predictors are trained, allocators are in their steady-state memory layout) and for any contention-induced slowdowns to manifest. A one-second burst benchmark would have shown higher throughput because the instrumentation locks would not have had time to accumulate their contention cost yet. Thirty seconds gives the contention time to bite.
Decision 5: No retry averaging. We reported the raw benchmark results, including the run that showed 183,828 auth/sec. We did not "average out" the bad run with some good runs to produce a cleaner number for marketing. We did not skip the benchmark and call it a measurement artifact. We treated the bad number as a real signal about a real problem, and we investigated.
Decision 6: Compared against a baseline we could defend. The baseline was the raw-DashMap path in the same pipeline, on the same hardware, at the same concurrency, for the same duration. The comparison between baseline and integrated was apples-to-apples: the only difference was the cache layer between the hash lookup and the signing operation. This matters because a benchmark comparison is only meaningful if the two configurations differ in exactly the variable you are trying to measure. Changing multiple variables at once gives you a number but not a signal.
Decision 7: Three independent fixes. When we found the three bugs, we fixed each one in isolation and re-ran the benchmark after each fix. We did not bundle the fixes into a single big change and re-benchmark at the end. The isolation let us attribute the throughput recovery to each fix individually — Bug 1 fix took us from 183K to 395K, Bug 2 fix took us from 395K to approximately 1.4M, Bug 3 fix took us from 1.4M to 1.67M. If we had bundled them, we would have gotten the final number but we would not have known which fix contributed how much.
Decision 8: Published the narrative. We published the three-bug story openly, including the initial 9.3× regression, as part of the production engineering narrative around the substrate. We did not hide the regression or quietly fix it in a silent release. The narrative is more valuable than a polished benchmark graph because it shows that we know how to find and fix the bugs we were not looking for, and because it gives customers a signal about our production engineering discipline that a polished graph cannot give.
Each of these decisions individually is the kind of decision that engineers sometimes cut to save time or money. Each of them individually is defensible as a corner to cut in a specific context. Collectively, they are the difference between an honest benchmark and a benchmark that merely produces a number. A team that cuts any one of them is likely to miss bugs. A team that cuts several of them is likely to ship regressions into production.
What "honest" means operationally
I keep using the word "honestly" in the title and in the body of this post, and I want to be precise about what it means operationally.
An honest benchmark, in the sense I am using, has the following properties:
1. It runs at target scale, not scaled-down scale. The concurrency, hardware, workload shape, duration, and memory pressure should all match production as closely as possible. Shortcuts in any of these dimensions produce numbers that do not predict production behavior.
2. It reports all runs, not cherry-picked runs. If you run the benchmark ten times and one run is an outlier, report the outlier. Do not silently drop it. Investigate it. Maybe the outlier is a real bug — maybe there is a specific condition that triggers the slow path and you need to know what it is.
3. It describes the methodology. A benchmark number without methodology is meaningless. The methodology should be detailed enough that a reader can evaluate whether the number is meaningful for their workload and, if necessary, reproduce the benchmark themselves. Hidden methodology is a red flag.
4. It holds variables constant between compared configurations. If you are comparing configuration A to configuration B, the two configurations should differ in exactly one variable (the thing you are measuring). Changing multiple variables at once produces ambiguity about what the measurement means.
5. It is willing to report bad news. If the benchmark finds a problem, the benchmark run should be the source of truth about the problem, not a shield against reporting it. A team that is unwilling to report a bad benchmark result is a team that is going to ship regressions, because the only benchmarks they are willing to communicate are the ones that come back clean.
6. It is open to independent replication. Ideally, the benchmark should be runnable by anyone with access to the relevant hardware and the published code. If the benchmark cannot be replicated, the numbers are just a claim. If it can be replicated, the numbers are a testable claim. Testable claims are worth more than untestable claims.
Meeting all six of these criteria is expensive. It takes hardware budget, engineering time, and a cultural willingness to report bad news about your own systems. Most teams do not meet all six most of the time. The teams that consistently meet all six are the teams that ship the fewest regressions, and the correlation is not a coincidence.
The cost of not doing it this way
Let me be specific about the counterfactual. If we had not caught the three lock bugs during internal benchmarking, here is what would have happened.
The substrate integration would have shipped to production with the cache layer running at 183,828 auth/sec instead of the roughly 1.67 million auth/sec it reached once the three fixes were in. That is a 9× throughput shortfall. For a customer whose workload is designed around the expected 1.6M auth/sec ceiling, the actual 183K ceiling would not support their peak load. They would see elevated p99 latencies. They would see request queuing. They would see errors when the queue overflowed. They would see their SLOs breached.
The customer's first response would have been to file an incident with us. The incident investigation would have involved us trying to reproduce the regression in a test environment, which (under our original testing discipline) would not have reproduced it — because our tests ran at lower concurrency, and at lower concurrency the bugs are not visible. We would have spent days debugging. During those days, the customer would have been in degraded operation. The customer's trust in us would have eroded. The public communication around the incident would have been "we are investigating an unexpected performance regression in a recent release," which is a sentence that consumes real credibility every time it is said.
Eventually, someone would have suggested running the benchmark at the target concurrency in a test environment, and the bugs would have been found. The fixes would have been the same three fixes we actually deployed. The time-to-fix would have been hours once the bugs were identified. But the time-to-identify would have been days or weeks, because the symptom (elevated latencies) does not immediately point to the cause (lock contention in observability code), and the team would have chased many dead ends before landing on the real cause.
The total cost of the incident path versus the honest-benchmark path is something like: one afternoon of engineer time and one bare-metal instance hour, versus several days of engineer time, one degraded-service incident, one customer-trust hit, and the public communication cost of explaining what went wrong. The honest-benchmark path is the cheaper of the two by a factor of approximately twenty.
This is the operational argument for honest benchmarking. It is not a matter of engineering virtue or craftsmanship pride. It is a matter of expected cost. The honest benchmark is cheap insurance against the incident path, and the incident path is expensive enough that the insurance pays for itself many times over across any reasonable product lifetime.
What to do with the lesson
If you are building infrastructure that will run under production concurrency, I recommend the following.
Budget for load testing at target concurrency. Treat it as non-optional. Put the hardware cost in the quarterly budget. Put the engineer time in the sprint allocation. Make it a hard requirement before every significant release. If leadership pushes back on the cost, explain the incident-path alternative.
Benchmark the specific workload, not a synthetic proxy. The gap between a real workload and a synthetic workload is where bugs hide. Use the real pipeline. Use real request distributions. Use real data shapes. Do not use a loop that hammers one code path; use the full sequence of operations that a real request produces.
Report all runs, including the bad ones. Build a practice of treating benchmark results as signals, not marketing assets. If a run comes back bad, investigate it before dismissing it. The bad run is almost always telling you about a real problem that you would otherwise have missed.
Document the methodology every time. When you report a benchmark number, include the concurrency, hardware, workload, duration, and any relevant configuration flags. Reports without methodology are not reproducible, and irreproducible reports are not trusted.
Treat the three-bug story as a normal outcome. Finding three bugs in observability code during load testing is not a failure. It is the benchmark doing its job. The failure mode is not finding them at benchmark time; the failure mode is not finding them at all. The three-bug story is what benchmark discipline looks like when it works.
Be willing to publish the methodology. If you can publish the benchmark methodology along with the results, do so. The specific numbers are less valuable than the methodology, because the methodology lets readers evaluate whether the numbers are meaningful for their own workloads. Polished numbers with no methodology are marketing. Numbers with methodology are engineering.
Closing
The three-bug story is the most credible thing the substrate has going for it in terms of production engineering. More credible than our throughput numbers, more credible than our memory benchmarks, more credible than the security argument for the three-family bundle. The reason is that anyone can produce a throughput number with some effort, but very few teams voluntarily publish the account of what broke when they first integrated.
Load testing is a discipline, not a marketing exercise. When it is done as a discipline, it finds bugs. When it finds bugs, you publish what you found and how you fixed it. When you publish what you found, customers can evaluate your engineering culture from the evidence rather than from the polish. This is the kind of signal that compounds over time: customers who see you publish a bad-benchmark-to-fixed-benchmark narrative are more willing to bet on you for the next release, because they have evidence that your process catches things.
We will keep doing it this way. The next bug that the benchmark finds will get the same treatment. The substrate's production engineering story is built on the assumption that honest benchmarking is cheaper than the incidents that honest benchmarking prevents, and the evidence so far is consistent with that assumption.
The next post in this series is about Bitcoin chain metering — the specific mechanism by which we bill commercial customers from the public Bitcoin blockchain, without requiring the customer to send us any usage reports or allowing either side to dispute the billing count. See you there.
Build with the H33 Substrate
The substrate crate is available for integration. Every H33 API call now returns a substrate attestation.