The three contended-lock bugs we found at 96 workers
This is a post about a production failure, the three bugs that caused it, and the fixes. None of the bugs were in the cryptographic hot path. None of them were in the admission sketch we wrote from scratch. None of them were in the clever concurrent data structures we were nervous about. All three bugs were in the observability code we shipped alongside the cryptographic work, the kind of code you write because you need stats and metrics and pattern detection for operational visibility. All three bugs were RwLock instances on the read-fast-write-slow hot path of our two-tier cache. All three showed up only when we ran the cache at 96 concurrent worker threads on bare-metal Graviton4.
We found them during a benchmark run that was supposed to measure the cache layer's overhead against a raw concurrent hash map baseline. We expected the cache layer to add somewhere between half a percent and two percent overhead — enough to see in a histogram but small enough to be inside run-to-run variance. What we got, on the first integration attempt, was a 9.3× throughput regression compared to the raw baseline.
Our raw-DashMap path sustained 1,708,400 authentications per second in the full pipeline. Our CacheeEngine path, with the admission sketch feeding the hot tier and the pattern detector collecting workload stats and the metrics histograms recording get and put latency distributions, sustained 183,828 authentications per second. The same hardware. The same FHE pipeline. The same workload profile. The only difference was the cache layer between the hash lookup and the signing operation.
At first, we suspected the admission sketch. The sketch was the newest and least-familiar component in the cache, and it touched every cache operation. The sketch uses a Count-Min Sketch construction with four hash functions over a fixed-width counter table, and the table memory is relatively small (roughly half a megabyte at our default configuration), so we thought maybe false sharing on the counter cache lines, or maybe the hash function parameter choices, or maybe some CPU-level gotcha specific to the Neoverse V2 cores on Graviton4 was introducing overhead we hadn't seen in single-threaded benchmarks.
We were wrong. The sketch was innocent. Every lookup we ran on it in isolation produced the expected roughly 12-nanosecond cost, unchanged from what we'd measured during development. The sketch was not the problem.
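For readers who haven't met the construction, a minimal Count-Min Sketch of the shape described above (four hash rows over a fixed-width counter table) looks roughly like this. The dimensions and the hashing scheme are illustrative, not our production parameters:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative Count-Min Sketch: `rows` hash rows over a fixed-width
/// counter table, stored as one flattened Vec.
pub struct CountMinSketch {
    rows: usize,
    width: usize,
    counters: Vec<u64>,
}

impl CountMinSketch {
    pub fn new(rows: usize, width: usize) -> Self {
        CountMinSketch { rows, width, counters: vec![0; rows * width] }
    }

    /// Derive a per-row index by seeding the hasher with the row number.
    fn index<T: Hash>(&self, item: &T, row: usize) -> usize {
        let mut h = DefaultHasher::new();
        row.hash(&mut h);
        item.hash(&mut h);
        row * self.width + (h.finish() as usize % self.width)
    }

    pub fn increment<T: Hash>(&mut self, item: &T) {
        for row in 0..self.rows {
            let i = self.index(item, row);
            self.counters[i] += 1;
        }
    }

    /// Point query: minimum count across rows. Collisions can make this
    /// an overestimate, but it never underestimates.
    pub fn estimate<T: Hash>(&self, item: &T) -> u64 {
        (0..self.rows)
            .map(|row| self.counters[self.index(item, row)])
            .min()
            .unwrap_or(0)
    }
}
```

The production sketch uses atomic counters so `increment` can take `&self` from many threads; this sketch uses `&mut self` for brevity.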
The problem was elsewhere, and we found it by slicing the cache's code paths one instrumentation hook at a time, measuring throughput with each hook disabled. The pattern that emerged was telling: every RwLock on the cache's instrumentation path was costing us order-of-magnitude throughput. At 96 workers, the cost compounded.
This post walks through each of the three bugs, the fix for each, and the throughput progression as we removed the bottlenecks one at a time. It ends with some broader observations about why RwLock shows up in code where the authors "knew better" and what the general principle is for writing high-concurrency Rust observability code.
Bug 1: InternalStats wrapped in RwLock
The cache maintains an internal statistics structure — the kind you would expose via a status endpoint or a monitoring dashboard. The structure contains counters for cache hits, misses, admissions, rejections, tier migrations, and so on. In our original implementation, it was a plain Rust struct with u64 fields, stored as Arc<RwLock<InternalStats>>, with the cache's get, put, and related methods taking a write lock before incrementing any counter.
The code looked something like this (simplified):
```rust
pub struct InternalStats {
    pub hits: u64,
    pub misses: u64,
    pub admissions: u64,
    pub rejections: u64,
    // ... more fields
}

pub struct CacheeEngine<K, V> {
    // ... other fields
    stats: Arc<RwLock<InternalStats>>, // parking_lot::RwLock
}

impl<K, V> CacheeEngine<K, V> {
    pub fn get(&self, key: &K) -> Option<V> {
        let result = self.tier_lookup(key);
        let mut stats = self.stats.write(); // <<< write lock
        if result.is_some() {
            stats.hits += 1;
        } else {
            stats.misses += 1;
        }
        result
    }

    pub fn put(&self, key: K, value: V) {
        let admitted = self.admit(&key);
        let mut stats = self.stats.write(); // <<< write lock
        if admitted {
            stats.admissions += 1;
        } else {
            stats.rejections += 1;
        }
        // ... insert into appropriate tier
    }
}
```
In single-threaded or low-concurrency benchmarks, this code is fine. The write lock is uncontested, the stats update is trivially cheap, and the lock overhead is in the noise. At 96 concurrent workers hammering the cache, the write lock becomes the serialization point for the entire cache.
Here is what was happening: at every get operation (millions per second across all workers), every worker was contending on the single RwLock<InternalStats>. Write locks are exclusive, so only one worker could be inside the stats update at a time. The other workers were blocked at the lock, waiting for their turn. The effective parallelism of the cache collapsed from 96 workers to roughly 1 worker plus 95 waiting.
In the limit, the arithmetic is straightforward: if the stats update holds the lock for T nanoseconds and N workers contend for it, aggregate throughput is bounded by 1/T updates per nanosecond (that is, 10⁹/T per second) regardless of N. At our measured stats update cost of roughly 50 nanoseconds, that bound is 20 million updates per second. The observed throughput was lower still, because a contested parking_lot::RwLock adds overhead of its own (waiters are parked and woken through the kernel, which is expensive), and because the cache path carried other instrumentation costs beyond stats that we'll discuss below.
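The serialization effect is easy to reproduce. The sketch below uses std::sync::RwLock rather than parking_lot so it runs without external crates; thread and iteration counts are arbitrary. On a multi-core machine, the locked version's wall time grows far faster with thread count than the atomic version's:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

/// Every increment takes an exclusive write lock: N threads serialize here.
fn run_locked(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(RwLock::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    *c.write().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *counter.read().unwrap();
    n
}

/// A single relaxed atomic increment: no waiter queue, scales with cores.
fn run_atomic(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Wrap each call in std::time::Instant if you want to see the timing gap; both versions produce the same final count, which is the whole point.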
The fix was to drop the RwLock entirely and replace each u64 field with std::sync::atomic::AtomicU64. Every counter increment becomes a fetch_add(1, Ordering::Relaxed). No lock, no waiter queue, no contention in the parking_lot sense — just a single atomic increment per operation, which on modern hardware is a handful of cycles at most.
```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub struct InternalStats {
    pub hits: AtomicU64,
    pub misses: AtomicU64,
    pub admissions: AtomicU64,
    pub rejections: AtomicU64,
    // ... more fields
}

pub struct CacheeEngine<K, V> {
    // ... other fields
    stats: Arc<InternalStats>, // no RwLock
}

impl<K, V> CacheeEngine<K, V> {
    pub fn get(&self, key: &K) -> Option<V> {
        let result = self.tier_lookup(key);
        if result.is_some() {
            self.stats.hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.stats.misses.fetch_add(1, Ordering::Relaxed);
        }
        result
    }
}
```
Ordering::Relaxed is correct here because we do not care about ordering between the counter update and the other operations around it. We just want atomicity — the counter should never be torn. Relaxed is the cheapest atomic ordering and it gives us exactly that.
The read side of the stats (the endpoint that exposes them for monitoring) also changes. Instead of taking a read lock and cloning the struct, the read side reads each atomic with load(Ordering::Relaxed) and constructs a snapshot struct out of the loaded values. The snapshot is not guaranteed to be consistent across fields (hits and misses might be read at slightly different moments and so be off by a few), but that inconsistency is below the resolution of any meaningful monitoring system — a few operations of difference in counts that run at millions per second is statistical noise.
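A sketch of that read side, with InternalStats reduced to two fields and StatsSnapshot as an illustrative name for the plain-u64 copy:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub struct InternalStats {
    pub hits: AtomicU64,
    pub misses: AtomicU64,
}

/// Plain-u64 copy handed to the monitoring endpoint.
pub struct StatsSnapshot {
    pub hits: u64,
    pub misses: u64,
}

impl InternalStats {
    /// Each field is loaded independently, so the snapshot is not a
    /// consistent cut across fields. That is acceptable for monitoring.
    pub fn snapshot(&self) -> StatsSnapshot {
        StatsSnapshot {
            hits: self.hits.load(Ordering::Relaxed),
            misses: self.misses.load(Ordering::Relaxed),
        }
    }
}
```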
Impact of the fix: throughput went from 183,828 auth/sec to 395,816 auth/sec. A 115% improvement from removing a single RwLock on the stats path. It was the biggest single improvement in the debugging session, and it cost us roughly two hours to implement and test.
At 395,816 auth/sec we were still miles below the 1,708,400 baseline, so we kept going.
Bug 2: MetricsCollector histograms wrapped in RwLock<SimpleHistogram>
The second bug is the one that surprised us most, because the code had been written with the specific intent of being lock-free, and we were sure it was, right up until the benchmark caught us.
Our cache emits operational metrics through a separate MetricsCollector module. The collector holds several histograms — one for get latency, one for put latency, one for admission decision timing, one for cache miss penalty, and so on. Each histogram has a fixed set of buckets (say, 32 power-of-two buckets from 1 nanosecond to 10 seconds) and records the number of observations that fell into each bucket.
The underlying histogram type, which we call SimpleHistogram, is deliberately atomic. Each bucket is an AtomicU64. Recording a measurement means computing the bucket index from the measured value, then fetch_add(1) on the corresponding atomic. Reading the histogram means loading each atomic into a Vec<u64>. The histogram's public API is carefully designed to be lock-free from the start.
And we then wrapped it in RwLock<SimpleHistogram>.
Why? Because an earlier iteration of the histogram code was not lock-free. The first version used a plain Vec<u64> that required mutable access to update, so it needed to be guarded by a lock. When we rewrote the histogram to be atomic underneath, we didn't remove the outer RwLock wrapper — it stayed in place out of habit, as part of the surrounding MetricsCollector struct's public interface. The outer lock had become pure overhead, wrapping an operation that was already atomic underneath. Every histogram update was taking an exclusive write lock to call a function that did not actually need one.
```rust
pub struct SimpleHistogram {
    buckets: Vec<AtomicU64>, // <<< already atomic
}

impl SimpleHistogram {
    pub fn record(&self, value: f64) {
        let bucket = self.bucket_index(value);
        self.buckets[bucket].fetch_add(1, Ordering::Relaxed);
    }
}

// BUT:
pub struct MetricsCollector {
    get_latency: RwLock<SimpleHistogram>, // <<< wrapping an already-atomic struct
    put_latency: RwLock<SimpleHistogram>,
    // ...
}

impl MetricsCollector {
    pub fn record_get_latency(&self, elapsed: Duration) {
        let mut h = self.get_latency.write(); // <<< pure overhead
        h.record(elapsed.as_secs_f64());
    }
}
```
At 96 worker threads, this was the second-largest bottleneck in the cache. Every get operation was taking at least three write locks: one for InternalStats (Bug 1), one for the get-latency histogram (this bug), and one for some pattern detector state we'll cover in Bug 3. The three write locks serialized the cache's hot path, and at the benchmark concurrency each of them maintained a substantial queue of contended waiters on its own.
The fix was mechanical: remove the RwLock wrapper, change record to take &self instead of &mut self, and have the MetricsCollector hold the histogram by Arc or plain reference. The histogram was already atomic underneath, so the change was a no-op in terms of what the hot path was doing — we were just removing a layer of redundant locking that the original author (me) had left in place out of inertia.
```rust
pub struct MetricsCollector {
    get_latency: Arc<SimpleHistogram>, // no RwLock
    put_latency: Arc<SimpleHistogram>,
    // ...
}

impl MetricsCollector {
    pub fn record_get_latency(&self, elapsed: Duration) {
        self.get_latency.record(elapsed.as_secs_f64());
    }
}
```
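The snippets above elide bucket_index and the read side. A minimal self-contained version of the lock-free histogram, with power-of-two buckets and recording raw nanoseconds rather than f64 seconds (both simplifications, not the production layout), might look like:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

pub struct SimpleHistogram {
    buckets: Vec<AtomicU64>,
}

impl SimpleHistogram {
    /// `n` power-of-two buckets; bucket i covers [2^i, 2^(i+1)) nanoseconds.
    pub fn new(n: usize) -> Self {
        SimpleHistogram { buckets: (0..n).map(|_| AtomicU64::new(0)).collect() }
    }

    /// Power-of-two bucketing: index = floor(log2(nanos)), clamped to range.
    fn bucket_index(&self, nanos: u64) -> usize {
        let idx = 63 - nanos.max(1).leading_zeros() as usize;
        idx.min(self.buckets.len() - 1)
    }

    /// Lock-free record: one relaxed atomic increment, callable with &self.
    pub fn record_nanos(&self, nanos: u64) {
        let i = self.bucket_index(nanos);
        self.buckets[i].fetch_add(1, Ordering::Relaxed);
    }

    /// Read side: load each atomic bucket into a plain Vec<u64> snapshot.
    pub fn snapshot(&self) -> Vec<u64> {
        self.buckets.iter().map(|b| b.load(Ordering::Relaxed)).collect()
    }
}
```

Because record_nanos takes &self and touches only atomics, nothing about this type needs an outer lock.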
For the remaining bookkeeping that genuinely required mutation — the rolling throughput window, the per-second aggregation for the "requests per second" gauge — we used parking_lot::RwLock::try_write() with a skip-on-contention policy. If the rolling window update cannot get the write lock immediately, it skips this sample and moves on. The next sample will try again. Since the rolling window is for human-facing observability (not for correctness), skipping samples under contention is an acceptable trade: the gauge becomes slightly noisier but the hot path never stalls.
```rust
pub fn record_get_latency(&self, elapsed: Duration) {
    // Hot path: always atomic.
    self.get_latency.record(elapsed.as_secs_f64());

    // Bookkeeping: try-write, skip-on-contention.
    if let Some(mut window) = self.throughput_window.try_write() {
        window.record_event();
    }
}
```
Impact of the fix: throughput went from 395,816 auth/sec to approximately 1,400,000 auth/sec. The change was a 3.5× improvement and it cost us maybe one hour to implement and test. By the end of Bug 2, we were close enough to the baseline (1,708,400 auth/sec raw-DashMap) that we could see the finish line — we were at roughly 82% of baseline with the cache layer active.
There was still one more bug waiting.
Bug 3: PatternDetector::record_access() writing to RwLock<VecDeque> on every call
The third bug was in a module we had almost forgotten about. The cache includes a workload pattern detector — a component whose job is to track recent access events and detect patterns in the workload that might justify changing the cache's admission or eviction policies. The detector uses a rolling window of recent access records, and for each new access it pushes a record onto the back of the window and pops old records off the front.
The window was stored as RwLock<VecDeque<AccessRecord>>. Every cache get was taking a write lock on this structure to append an access record.
The pattern detector, in its full form, is only useful when the cache is consulting it for admission or eviction decisions, which happens periodically (every thousand or so operations) rather than on every operation. But the recording side — adding a record to the rolling window — was running on every operation, because the detector needed a sufficiently dense sample of access events to produce meaningful statistics.
At 96 workers, the write lock on the rolling window was the third serialization point, hiding behind Bugs 1 and 2. We didn't notice it until we had fixed the first two, at which point the profile showed most of the remaining contention in PatternDetector::record_access.
The fix was to add a sample rate to the detector. Instead of recording every access, the detector samples every 64th access by default. An AtomicU64 counter tracks the total number of accesses; every call increments the counter; only calls where counter % 64 == 0 actually take the write lock and push a record.
```rust
pub struct PatternDetector {
    access_history: RwLock<VecDeque<AccessRecord>>,
    total_accesses: AtomicU64,
    sample_rate: u64, // 64 by default
}

impl PatternDetector {
    pub fn record_access(&self, key_hash: u64) {
        let n = self.total_accesses.fetch_add(1, Ordering::Relaxed);
        if n % self.sample_rate == 0 {
            // Only every 64th call takes the write lock.
            let mut history = self.access_history.write();
            history.push_back(AccessRecord::new(key_hash));
            if history.len() > MAX_WINDOW_SIZE {
                history.pop_front();
            }
        }
    }
}
```
The trade: pattern detection is now approximate. At a sample rate of 64, we lose 98.4% of the access events. But pattern detection was already statistical — it was never a precise count of all accesses, it was a representative sample used to detect workload shifts. At a 64× reduction in samples, the statistical signal is still strong enough to detect the patterns the detector cares about (shifts in access distribution, hot-key emergence, workload phase changes) because those patterns are defined at timescales much longer than the sample rate. The detector is slightly slower to react to sudden workload changes, but it has an order-of-magnitude less impact on the cache's hot path.
At a sample rate of 64, the detector's contribution to hot-path cost is roughly 1/64 of its original cost, which puts it comfortably below the noise floor of other cache operations.
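To make the sampling arithmetic concrete, here is a self-contained version of the detector, using std::sync::RwLock instead of parking_lot and reducing AccessRecord to a bare key hash (both simplifications, and MAX_WINDOW_SIZE here is an assumed value): after 640 accesses at a sample rate of 64, the window holds exactly 10 records.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

const MAX_WINDOW_SIZE: usize = 1024; // assumed bound, not the production value

pub struct PatternDetector {
    access_history: RwLock<VecDeque<u64>>, // AccessRecord reduced to a key hash
    total_accesses: AtomicU64,
    sample_rate: u64,
}

impl PatternDetector {
    pub fn new(sample_rate: u64) -> Self {
        PatternDetector {
            access_history: RwLock::new(VecDeque::new()),
            total_accesses: AtomicU64::new(0),
            sample_rate,
        }
    }

    pub fn record_access(&self, key_hash: u64) {
        // fetch_add returns the previous value, so accesses 0, 64, 128, ...
        // are the ones that sample.
        let n = self.total_accesses.fetch_add(1, Ordering::Relaxed);
        if n % self.sample_rate == 0 {
            let mut history = self.access_history.write().unwrap();
            history.push_back(key_hash);
            if history.len() > MAX_WINDOW_SIZE {
                history.pop_front();
            }
        }
    }

    pub fn window_len(&self) -> usize {
        self.access_history.read().unwrap().len()
    }
}
```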
Impact of the fix: combined with Bugs 1 and 2, throughput settled at 2,216,488 auth/sec, a 2.37% delta from the raw-DashMap baseline, inside run-to-run variance, which means the two configurations are operationally equivalent from a throughput perspective. The CacheeEngine path with the full admission sketch, the atomic stats, the lock-free histograms, and the sampled pattern detector now runs within the measurement noise of a direct DashMap access. That was the target we were trying to hit when we started the integration.
The progression
For the record, here is the full throughput progression through the debugging session:
| Configuration | Auth/sec | Delta from baseline |
|---|---|---|
| Raw DashMap baseline (no cache layer) | 1,708,400 | — |
| First integration (all three bugs) | 183,828 | −89.2% |
| After Bug 1 fix (atomic stats) | 395,816 | −76.8% |
| After Bug 2 fix (lock-free histograms) | ~1,400,000 | −18.0% |
| After Bug 3 fix (sampled pattern detector) | 2,216,488 | −2.37% (inside run variance) |
From 9.3× regression to 2.37% variance-level difference, across three bug fixes. The fixes were individually small — a few lines of code each — but the composition of the three was the difference between a cache layer that was unusable in production and a cache layer that ships.
What the three bugs had in common
All three bugs were in observability code. None were in the cache's main data structures (the DashMap-backed hot tier and warm tier), none were in the Count-Min Sketch admission filter, and none were in the tier migration logic. The main cache paths, the code we had worried about most during design, were all fine. The bugs were in the measurement code, the part that exists to tell us how the cache is performing, and the measurement code had shipped with RwLock instances in place because the author (me) had not fully internalized the principle that instrumentation code must be lock-free on the hot path.
This is a principle that is easy to state in the abstract and hard to remember to apply in practice. When you are writing cache code, your attention is focused on the cache. The stats and metrics and pattern detector feel like incidental bookkeeping — supporting infrastructure for the cache, not core cache logic. And so when you reach for a synchronization primitive to protect the stats, you reach for the easiest one, which is RwLock. The lock feels safer than any atomic-based alternative, because locks are what you use when you want to protect a mutable data structure, and you are protecting a mutable data structure.
The subtle part is that this reasoning is correct for single-threaded or low-concurrency code. RwLock on a stats struct works fine at single-digit worker counts. It starts to break down around 16 workers, and at 96 workers it breaks completely. The gap between "works" and "broken" is not a performance cliff you can see by watching a small-scale benchmark. You only see it when you run the cache at the concurrency that production will actually use, which in our case was 96 worker threads on a bare-metal Graviton4 instance with no virtualization and no noisy neighbors. Under that load, every RwLock on the cache's hot path went from "invisible overhead" to "dominant cost" in a way that was obvious in hindsight and invisible in foresight.
The general principle
The general principle that came out of this debugging session is that observability code that lives on a hot path must be lock-free, because the hot path is where the concurrency is and the concurrency is what turns lock overhead from incidental into catastrophic. Specifically:
1. Counters must be atomic. Never RwLock<u64>. Always AtomicU64 with fetch_add(Relaxed).
2. Histograms must be atomic internally. A histogram with atomic buckets can be updated with &self, which means it never needs an outer RwLock. If you find yourself wrapping an already-atomic structure in RwLock, stop and ask why.
3. Rolling windows and aggregations that genuinely require mutation should use try_write() with skip-on-contention. The hot path should never stall waiting for a bookkeeping update. The statistical signal is preserved because the samples that do make it through are representative; the ones that don't are lost to contention, which is a form of natural sub-sampling.
4. Expensive operations should be sampled, not gated by locks. If you have a pattern detector or a correlation analyzer or a statistical workload characterizer that is genuinely expensive per call, reduce its per-call cost with an atomic counter and a sample rate, not with a lock that serializes the cost.
5. Benchmark at the target concurrency, not at lower concurrency. A bug that appears only at 96 workers is a bug that will not appear at 4 workers, and will not appear at 16 workers, and will only partly appear at 32 workers. If your production target is 96 workers, run your benchmark at 96 workers. Do not trust extrapolation from lower concurrency numbers.
We've applied all five of these principles to the rest of the cache code and to other cryptographic-adjacent infrastructure in our codebase. We've also added a conformance check to our CI pipeline that greps for RwLock inside modules tagged as "hot path" and flags them for review. The grep is not a perfect defense — you can always use RwLock in a hot path by hiding it inside a helper function — but it catches the cases where we forgot about the rule and wrote the obvious thing.
The larger point
I want to close with a larger point about production engineering discipline.
The reason I wrote this post is that the debugging session was instructive in a way that went beyond the specific bugs. It taught us something about the failure mode of infrastructure code under production concurrency: the bugs that matter are usually not in the code you were worried about. We spent weeks worrying about the admission sketch. The admission sketch was fine. We spent zero minutes worrying about the stats counters. The stats counters nearly killed the cache.
This is not unusual. In every debugging session that involves production concurrency, the bottleneck is almost always in the code that someone considered "obvious" or "incidental" or "supporting infrastructure." The code you were worried about gets attention and scrutiny and careful design. The code you weren't worried about gets shipped with the first thing that compiles. When production load arrives, the code you weren't worried about is the code that breaks.
The defense against this is to treat all hot-path code with the same suspicion, regardless of whether it feels important or incidental. Every RwLock that runs on a hot path deserves the same level of scrutiny as every cryptographic operation, because at sufficient concurrency the RwLock can be the thing that determines whether the cryptographic operation actually runs at scale.
We learned this the hard way, during a benchmark session that was supposed to be a dress rehearsal for a customer demo. The benchmark run that produced the 9.3× regression was the benchmark we were going to ship with. If we had not run it at the target concurrency on the target hardware, we would have shipped the regression into production and discovered it from a customer's incident report. Instead, we discovered it during internal load testing, fixed it in a single afternoon, and shipped a clean 2.37% variance-level number to our customers.
That is the payoff for doing load testing at target concurrency. It is a discipline that costs you real time and real money in hardware hours, and it saves you the specific failure mode where a production bottleneck shows up only after customer traffic hits it. The cost of running a 96-worker benchmark on a bare-metal Graviton4 instance is significant but not prohibitive; the cost of a customer-facing cache outage because of a lock contention issue you didn't catch is much higher.
Production discipline is the willingness to do the expensive load testing in advance. If you are building cryptographic infrastructure, or cache infrastructure, or any other infrastructure that will be subjected to high-concurrency load, budget for load testing at target concurrency and do not skip it to hit a launch date. The bugs you find will almost always be in the code you weren't worried about, and the fixes will almost always be mechanical, and the throughput recovery will almost always be large. The thing you would otherwise have been debugging in production is the thing you get to avoid because you did the load test.
The substrate project's cache layer now ships with all three fixes applied, and the three-bug narrative is embedded in our production engineering story because it's the kind of story that is more valuable than a polished benchmark graph. Polished graphs tell prospective customers that you know how to run a benchmark. Three-bug stories tell them that you know how to find and fix the bugs you wouldn't have been looking for.
We think the three-bug story is worth more than any specific throughput number we could have hit on the first try. If you are evaluating infrastructure to bet your company on, the second kind of story is the one you want.
The next post in the series is about QSB, the recent script-layer post-quantum Bitcoin construction from StarkWare, what it does well, and where a substrate-layer approach does something different. See you there.
Build with the H33 Substrate
The substrate crate is available for integration. Every H33 API call now returns a substrate attestation.
Get API Key Read the Docs