1,239× memory reduction from a Count-Min Sketch, measured
This post is about a specific measurement we made during the last production benchmark: the memory footprint of a Count-Min Sketch used as a cache admission filter, compared against the baseline of tracking the same information in a straightforward concurrent hash map. The measurement is unremarkable in the sense that everyone who has read about Count-Min Sketches knows they are memory-efficient. What makes this measurement worth a blog post is the specific ratio at the specific scale — 1,239× smaller at 10 million keys — and what that ratio means in production for a cache that needs to make admission decisions without paying the memory cost of tracking every key individually.
Let me explain what the measurement is, why it matters for cache admission specifically, what the sketch does and does not do well, and the operational consequences of having a memory-invariant admission filter under high load.
The baseline: tracking all keys
Start with the straightforward case. You have a cache. You want to make smart admission decisions: which keys belong in the hot tier (small, fast), which keys belong in the warm tier (bigger, slower), and which keys should be rejected from the cache entirely because they will not be accessed enough to justify the space they would occupy.
To make those decisions well, you need some estimate of how frequently each key is being accessed, so that high-frequency keys get admitted to the hot tier and low-frequency keys get admitted to the warm tier or rejected. A naive approach is to maintain a hash map from key to access count: every time you look up a key, you increment its count; every time you make an admission decision, you consult the count and compare it against a threshold. If you have N distinct keys being tracked, the hash map takes O(N) space, and each access requires an O(1) hash map update.
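As a concrete sketch of this baseline, here is a minimal exact counter in Rust. The type and method names are illustrative only, not taken from the production code:

```rust
use std::collections::HashMap;

/// The naive baseline: an exact per-key access counter.
/// Space is O(N) in the number of distinct keys tracked.
struct ExactCounter {
    counts: HashMap<[u8; 16], u64>,
}

impl ExactCounter {
    fn new() -> Self {
        ExactCounter { counts: HashMap::new() }
    }

    /// Record one access: O(1) hash map update.
    fn record(&mut self, key: [u8; 16]) {
        *self.counts.entry(key).or_insert(0) += 1;
    }

    /// Admission decision: compare the exact count against a threshold.
    fn should_admit(&self, key: &[u8; 16], threshold: u64) -> bool {
        self.counts.get(key).copied().unwrap_or(0) >= threshold
    }
}
```

The 16-byte key matches the blob size used later in the benchmark; any hashable key type would do.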
At small scales, this naive approach is fine. At 100 thousand distinct keys, a hash map from 16-byte keys to 8-byte counts takes a few megabytes of memory, well within the comfortable range for a production cache. At 1 million distinct keys, it is tens of megabytes — still acceptable. At 10 million distinct keys, it is hundreds of megabytes, and the hash map itself starts competing for L3 cache space with the actual cached data, which is the opposite of what you want. At 100 million distinct keys, the hash map is multiple gigabytes, and at a billion distinct keys, it is tens of gigabytes.
The problem is that cache admission is a metadata problem. The metadata is not what you are caching; it is what tells you what to cache. A cache admission filter that uses more memory than the cache itself has inverted the economics of caching. You would be better off just admitting everything and letting the hot tier's eviction policy handle the rest, because the admission filter is now costing you more memory than it is saving.
For a cache that is meant to scale to very large keyspaces — say, a cache tracking which attestation receipts have been seen across millions of requests per second from thousands of distinct clients — the naive hash map approach is not viable. You need a different data structure.
The Count-Min Sketch alternative
The Count-Min Sketch (CMS) is a probabilistic data structure, introduced by Cormode and Muthukrishnan in 2003, that approximates a key-to-count mapping using a fixed amount of memory that does not grow with the number of distinct keys being tracked. The structure works like this:
You pick d independent hash functions, each mapping keys to an index in a fixed range of width w. You allocate a two-dimensional array of counters with d rows and w columns, all initialized to zero. To record an access for a key, you compute all d hash functions, and for each one, you increment the counter at the corresponding (row, column) cell. To estimate the count of a key, you compute all d hash functions, read the counters at the corresponding cells, and take the minimum. The minimum is an upper bound on the true count — it is never too small, and it is approximately correct when the hash functions avoid collisions on the key you are querying.
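The update and query paths above can be sketched in a few dozen lines of Rust. This is an illustrative single-threaded version with u32 counters and std's randomly seeded hashers, not the production structure (which uses atomic 8-bit saturating counters):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// Minimal Count-Min Sketch: d rows of w counters, one hash function per row.
struct CountMinSketch {
    rows: Vec<Vec<u32>>,       // d x w counters, all zero-initialized
    hashers: Vec<RandomState>, // one independently seeded hasher per row
    width: usize,
}

impl CountMinSketch {
    fn new(depth: usize, width: usize) -> Self {
        CountMinSketch {
            rows: vec![vec![0u32; width]; depth],
            hashers: (0..depth).map(|_| RandomState::new()).collect(),
            width,
        }
    }

    /// Column index for `key` in the given row.
    fn index<K: Hash>(&self, row: usize, key: &K) -> usize {
        let mut h = self.hashers[row].build_hasher();
        key.hash(&mut h);
        (h.finish() as usize) % self.width
    }

    /// Record one access: increment one counter in every row.
    fn record<K: Hash>(&mut self, key: &K) {
        for row in 0..self.rows.len() {
            let col = self.index(row, key);
            self.rows[row][col] = self.rows[row][col].saturating_add(1);
        }
    }

    /// Estimate the count: the minimum over the d counters.
    /// This is an upper bound on the true count, never an underestimate.
    fn estimate<K: Hash>(&self, key: &K) -> u32 {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.index(row, key)])
            .min()
            .unwrap_or(0)
    }
}
```

Note that the memory is fixed at construction time: `record` never allocates, no matter how many distinct keys pass through.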
The memory footprint of a CMS is d × w × sizeof(counter), regardless of the number of distinct keys being tracked. If you pick d = 4 and w = 128,000 and use u32 counters, the total memory is 4 × 128,000 × 4 = 2,048,000 bytes, or about 2 megabytes. If you pick smaller parameters, it is smaller. A common production configuration uses 4 hash functions and 131,072 (2^17) counters per function with 8-bit saturating counters, for a total of 524,288 bytes — approximately half a megabyte. This half-megabyte sketch can track access frequencies across 100 thousand keys, or 1 million keys, or 10 million keys, or a billion keys. The memory does not grow.
The cost of this memory invariance is imperfect accuracy. The sketch's estimated count for a given key can be larger than the true count (because of hash collisions from other keys), though it is never smaller. The magnitude of the overestimate depends on the sketch's parameters and the total number of operations — larger w means fewer collisions means tighter bounds. For cache admission decisions, the overestimate is usually acceptable: if the sketch says a key has been accessed many times, the key is admitted; if the sketch is slightly pessimistic about the count (saying a key has been accessed more than it really has), the worst case is that we admit a key that does not quite deserve admission, which causes a minor efficiency loss but not a correctness problem.
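For readers who want to size a sketch from a target error, the standard Cormode–Muthukrishnan bounds say that a width of w = ceil(e/ε) and a depth of d = ceil(ln(1/δ)) keep the overestimate below εN (where N is the total number of increments) with probability at least 1 − δ. A small helper based on those textbook bounds, not on anything specific to our implementation:

```rust
/// Size a Count-Min Sketch from a target error.
/// Returns (depth, width) such that, with probability at least 1 - delta,
/// every estimate exceeds the true count by at most eps * N,
/// where N is the total number of increments recorded.
fn cms_dimensions(eps: f64, delta: f64) -> (usize, usize) {
    let width = (std::f64::consts::E / eps).ceil() as usize; // w = ceil(e / eps)
    let depth = (1.0 / delta).ln().ceil() as usize;          // d = ceil(ln(1 / delta))
    (depth, width)
}
```

Run in reverse, d = 4 rows corresponds to δ = e^-4 (roughly a 2% failure probability per query), and w = 131,072 corresponds to ε ≈ e/131,072, i.e. an overestimate of a couple dozen counts per million increments in the worst case.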
The Count-Min Sketch is not new. It has been deployed in production systems for two decades — in network flow monitoring, in streaming analytics, in database query optimization, and in cache admission filters for memcached-like systems. There is nothing novel about using it. What is worth measuring, for our specific workload, is the exact trade-off in memory size between the sketch and the baseline hash map, and what the trade-off looks like as the keyspace grows.
The benchmark: three keyspace sizes
We ran a benchmark at three keyspace sizes: 100,000 keys, 1 million keys, and 10 million keys. At each size, we measured:
1. The memory footprint of a straightforward DashMap<Key, u64> (a concurrent hash map in Rust) holding all the keys and their counts.
2. The memory footprint of our Count-Min Sketch configured with 4 hash functions, 131,072 counters per function, and 8-bit saturating counters.
3. The throughput of each data structure under both single-threaded and multi-threaded workloads, to make sure the memory savings were not coming at a prohibitive performance cost.
The keys for the benchmark were randomly generated 16-byte blobs, drawn from a distribution similar to what we expect in production (hashes of request payloads, cache keys, content hashes, and so on). Each keyspace size represents a different scale of operation for the cache — 100K is "small deployment," 1M is "medium deployment," 10M is "large production deployment."
The benchmark was run on AWS c8g.metal-48xl hardware (Graviton4, 192 vCPU, 371 GB RAM, Amazon Linux 2023). The benchmark code is in our production code base and has been running as part of our nightly CI since the substrate integration work began. The numbers in this post are from the specific run on 2026-04-11, and the full raw logs are retained for audit.
The memory numbers
Here are the raw measurements from the benchmark:
| Data structure | Keyspace | Memory footprint |
|---|---|---|
| DashMap baseline | 100,000 keys | 6.20 MiB (6,500,080 bytes) |
| DashMap baseline | 1,000,000 keys | 61.99 MiB (65,000,080 bytes) |
| DashMap baseline | 10,000,000 keys | 619.89 MiB (650,000,080 bytes) |
| Count-Min Sketch | 100,000 keys | 512.17 KiB (524,464 bytes) |
| Count-Min Sketch | 1,000,000 keys | 512.17 KiB (524,464 bytes) |
| Count-Min Sketch | 10,000,000 keys | 512.17 KiB (524,464 bytes) |
The first thing to notice is that the DashMap baseline grows roughly linearly with the number of keys, as expected. Going from 100K to 1M is a 10× increase in keyspace and a 10× increase in memory (6.20 MiB to 61.99 MiB). Going from 1M to 10M is another 10× increase in keyspace and another 10× increase in memory (61.99 MiB to 619.89 MiB). The "roughly linearly" is because DashMap has some per-shard overhead and some allocator overhead that does not scale linearly, but the dominant term is the O(N) storage for the keys and counts.
The second thing to notice is that the Count-Min Sketch is constant in memory across all three keyspace sizes. 512.17 KiB at 100K. 512.17 KiB at 1M. 512.17 KiB at 10M. The sketch's memory footprint is a property of its parameters (4 hash functions × 131,072 counters × 1 byte per counter, plus some header), and the parameters do not change as a function of keyspace. The sketch is not "taking up" 512 KiB on behalf of all the keys it tracks; it is reserving 512 KiB up front and then absorbing all the tracking work into that fixed allocation.
The third thing to notice is the ratio of baseline-to-sketch memory at each keyspace size:
| Keyspace | Baseline | Sketch | Ratio |
|---|---|---|---|
| 100K | 6.20 MiB | 512.17 KiB | 12.4× |
| 1M | 61.99 MiB | 512.17 KiB | 123.9× |
| 10M | 619.89 MiB | 512.17 KiB | 1,239.4× |
The ratio grows with keyspace, because the baseline grows and the sketch does not. At 100K keys, the sketch is about 12× smaller than the baseline. At 1M, it is 124× smaller. At 10M, it is 1,239× smaller. If we extrapolated to 100M or 1B keys, the ratio would be 12,394× and 123,940× respectively — the sketch's advantage grows unboundedly with the keyspace size because the sketch is constant and the baseline is not.
The 1,239× number at 10M keys is the specific measurement we are highlighting because 10M keys is a realistic target scale for a production cache serving attestation receipts at the kind of throughput the substrate infrastructure is designed for. At smaller scales, the ratio is still impressive but less dramatic. At larger scales, the ratio becomes absurd — a gigabyte versus a megabyte, then a terabyte versus a megabyte. The practical implication is that the sketch enables a class of cache deployments that would be infeasible with a baseline approach.
The throughput side
A memory benchmark is only useful if the data structure is also fast enough for the workload. Let me show the throughput numbers too.
At single-threaded operation, the sketch runs at roughly 25 million operations per second on a single Graviton4 core. The baseline DashMap runs at roughly 10 million operations per second under the same conditions. The sketch is faster single-threaded because it touches fewer cache lines — 4 counter increments (one per hash function) against a 512 KiB region of memory — whereas the DashMap has to do hash map lookups, equality checks, and potential collision chain walks.
At multi-threaded operation, the gap narrows. At 8 threads, sketch and DashMap are roughly equivalent at ~7 million ops/sec each. At 32 threads, both are around 11 million ops/sec. At 96 threads (the production target), sketch is around 14 million ops/sec and DashMap is around 12 million ops/sec. Both structures are bounded by the same concurrency limits at high thread counts — atomic contention on the counter cache lines for the sketch, and lock contention on the shard latches for the DashMap — and the throughputs converge.
The key takeaway is that the sketch is not slower than the DashMap at any thread count. The throughputs are comparable, with slight advantages in different directions at different concurrencies. The sketch's memory savings do not come at a throughput cost.
Why the 8-bit saturating counter is fine
One detail worth explaining: we use 8-bit saturating counters in the sketch, not 64-bit counters. An 8-bit counter can hold values from 0 to 255 and saturates at 255 (an additional increment after 255 is a no-op, not an overflow to 0). The reason we use 8-bit counters instead of 32-bit or 64-bit counters is twofold.
First, 8-bit counters are 8× smaller than 64-bit counters. Our sketch memory would be 4 megabytes instead of 512 kilobytes if we used 64-bit counters, and that 4 megabytes would displace other useful data from L1/L2/L3 caches. The smaller counter width is a direct cache-efficiency win.
Second, for cache admission decisions, we do not need the actual count — we need to know whether the count exceeds a threshold. The threshold we care about is "has this key been accessed at least T times in the current window, where T is something like 8 or 16 or 64?" — always a small number, well below the saturation limit of an 8-bit counter. If a key's count has reached 255, we know with certainty that the count is above any threshold we might care about, and we have no further decisions to make for that key (it is going to be admitted). The saturation behavior is a feature, not a bug: it gives us a compact representation of "more than enough" without losing the information we need.
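A saturating increment on an atomic 8-bit counter can be written as a compare-exchange loop. This is a sketch of the idea, not necessarily the exact update scheme the production code uses:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Saturating increment: the counter sticks at 255 instead of wrapping to 0.
fn saturating_inc(counter: &AtomicU8) {
    let mut cur = counter.load(Ordering::Relaxed);
    while cur < u8::MAX {
        match counter.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return,
            // Another thread raced us; retry from the value it installed.
            Err(actual) => cur = actual,
        }
    }
    // cur == 255: already saturated, the increment is a no-op.
}
```

The loop exits immediately once the counter reads 255, so a fully saturated hot key costs a single atomic load per increment attempt.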
The sketch also decays counters periodically. Every R operations (where R is a configurable reset interval, distinct from the admission threshold T above), the sketch halves every counter in place. This gives us a rolling window behavior — recent accesses count more than older accesses — and it prevents the sketch from saturating forever on keys that were hot once but are no longer hot. The halving operation is cheap because it is a single pass over the 512 KiB sketch array with an atomic_fetch_sub(current_value / 2) at each cell. On a Graviton4 core, the halving takes a few hundred microseconds to walk the entire sketch, and it happens every several million operations, so the amortized cost is negligible.
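The decay pass can be sketched as a single walk over the counter array. As described above, each cell has half of its observed value subtracted; an increment racing the subtraction only makes the decay slightly imprecise, never incorrect, because the cell's value can only have grown since the load:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Periodic decay: halve every counter in place in one pass.
fn halve_all(counters: &[AtomicU8]) {
    for c in counters {
        let cur = c.load(Ordering::Relaxed);
        // Subtract half of the observed value. Equivalent to halving
        // (rounding up) when no increment races the subtraction.
        c.fetch_sub(cur / 2, Ordering::Relaxed);
    }
}
```

Note the rounding: a counter at 1 stays at 1 (1/2 = 0 in integer division), so a single halving pass never erases a key's last recorded access on its own.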
What the sketch does not do
In the spirit of the honest disclosure sections from my earlier posts, here is what the Count-Min Sketch does not do.
It does not track individual key counts exactly. The sketch's estimate is an upper bound on the true count. For cache admission purposes, the upper bound is fine — if the sketch says a key is hot, it probably is hot — but you cannot use the sketch as an exact counter for any key. If you need exact counts (for billing, for audit, for compliance), you need a separate data structure.
It does not detect individual key accesses. The sketch tells you approximately how often a key has been accessed, but it does not give you a log of which keys were accessed when. If you need access logs, you need a different structure (a log, a time-series database, or whatever).
It does not handle deletions. Our sketch increments counters on access, and the counters saturate, so there is no clean way to undo a specific key's contribution after the fact. (A sketch with wide, non-saturating counters can support decrements by subtracting at the same d cells — the so-called turnstile model — but decrement support interacts badly with saturation and with conservative-update style optimizations.) If you need deletion support, you need a different configuration or a different structure entirely, and those alternatives have their own trade-offs.
It does not give you the set of keys. The sketch only remembers counter values, not the keys themselves. If you need to know "what keys are in the cache?", the sketch cannot answer that question. It can only answer "is this key in the cache often enough to admit?" for a specific key you already have.
It is not suitable for security-critical applications where an attacker can predict the hash functions. The sketch's accuracy depends on the hash functions being effectively random with respect to the keys. An adversary who can choose keys to maximize collisions in the sketch's hash functions can produce arbitrarily large overestimates. For cache admission under benign workloads, this is fine. For adversarial workloads, you need a keyed hash function with a secret seed, which is what our implementation uses.
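One way to get per-row keyed hashing with secret seeds in Rust is std's `RandomState`, which draws fresh random keys for each instance (std currently uses a SipHash-family hasher under the hood, though that is an implementation detail). This illustrates the idea; the post does not specify which keyed hash the production implementation uses:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

/// One independently seeded hasher per sketch row. An attacker who
/// cannot learn the random seeds cannot precompute colliding keys.
fn seeded_hashers(depth: usize) -> Vec<RandomState> {
    (0..depth).map(|_| RandomState::new()).collect()
}

/// Bucket index for `key` under one seeded hasher.
fn bucket<K: Hash>(hasher: &RandomState, key: &K, width: usize) -> usize {
    let mut h = hasher.build_hasher();
    key.hash(&mut h);
    (h.finish() as usize) % width
}
```

Each `RandomState` is deterministic for its own lifetime (the same key always maps to the same bucket), which is exactly what the sketch needs, while being unpredictable across processes and restarts.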
These limitations are not surprising. The sketch is a specific tool for a specific problem, and it is very good at that problem. Using it for problems outside its design space gives bad results, as with any tool.
Why this matters for the substrate cache
The substrate's commercial cache infrastructure, which we call Cachee in internal documents and which we wrote about in a different post in this series covering the three contended-lock bugs, uses the Count-Min Sketch as its admission filter. The sketch decides which keys are hot enough to get admitted to the hot tier of the two-tier cache, and which keys should go to the warm tier instead.
The reason the sketch's memory invariance matters is that we can run the same cache at any keyspace size without reconfiguring the admission filter. A deployment tracking 100K distinct content hashes uses the same 512 KiB sketch as a deployment tracking 10M distinct content hashes. No parameter tuning, no reconfiguration, no "pick the right sketch size for your workload" step during deployment. You just get a 512 KiB admission filter that works at every scale in the production target range.
This is operationally valuable for two reasons. First, it means we can ship a single default configuration that works for small and large customers. Second, it means customers do not have to predict their peak keyspace size when they provision the cache — if their workload grows from 100K to 10M distinct keys, the cache's memory footprint does not grow, and they do not have to resize anything. The cache just keeps working.
For the commercial licensing model we ship the cache under, this operational invariance is one of the properties we highlight when customers ask about how the cache scales. The honest answer is: "up to the throughput limits of the underlying hash map (which is DashMap, a sharded concurrent map), with no admission filter memory growth in the keyspace." The admission filter is effectively free at scale, which is the property that makes the cache usable for the workloads the substrate is designed to support.
The broader point about data structure choice
I want to close this post with a broader observation about choosing data structures for production systems.
The Count-Min Sketch is more than twenty years old. It is not a secret, it is not a trick, and it is not a new invention. Any competent systems engineer who has thought about approximate counting has encountered it. The reason we are writing a blog post about it is not that the sketch is surprising — it is that the measurement is clean, and the measurement illustrates a specific property (constant memory in keyspace cardinality) that is important for the substrate's production workload.
The broader point is that production data structure decisions often come down to choosing between an "obvious correct" structure (like a hash map from key to count) and a "less obvious better fit for the specific constraints" structure (like a Count-Min Sketch when exact counts are not required). The obvious choice is often fine at small scale and breaks at production scale. The less obvious choice is often less pleasant to work with, has more caveats, has worst-case behavior that needs to be understood, and requires careful thinking about what the system actually needs — but it works at scale where the obvious choice cannot.
Production engineering, most of the time, is the discipline of choosing the right "less obvious better fit" structure for each piece of the system, not the discipline of using the obvious structure everywhere and hoping it scales. The obvious structures are obvious because they are easy to reason about in the common case. They are not obvious because they are always the right choice.
For the substrate cache, a Count-Min Sketch was the right fit for the admission filter, because the admission decisions only need approximate counts, the memory invariance is worth the approximation error, and the throughput is comparable to a hash-map-based alternative. Knowing when a CMS is the right fit is a specific kind of systems-engineering judgment, and it is the kind of judgment that improves with reps and with reading about how other production systems made similar choices.
I hope this post was one of those reps for somebody. The measurement is the measurement: 1,239× at 10 million keys, constant memory, throughput comparable to the baseline. Use it, or don't, depending on your workload. But if you are building a cache admission filter and you have not considered a Count-Min Sketch, now you have seen the numbers.
The next post in this series is about batched Merkle aggregation and the specific construction that lets a single 74-byte root substrate commit to a billion leaves with log-depth inclusion proofs. See you there.
Build with the H33 Substrate
The substrate crate is available for integration. Every H33 API call now returns a substrate attestation.