Engineering 17 min read

The internet was built to move data. Substrate is what you use when it stops.

Eric Beans CEO, H33.ai

This is a vibes-heavy post, but the vibes point at something real. The argument I want to make is that an H33-74 substrate, deployed as a tag-exchange protocol over a network like Bitcoin or Nostr, lets systems trade compact tags instead of underlying data, and that the consequences for bandwidth, latency, and provenance are structural rather than marginal. This is not a new observation in the content-addressing literature — IPFS has been making versions of this argument since 2014 — but the substrate's composition of content addressing with three-family post-quantum attestation and fixed-width persistent state is a specific take on the pattern that I think is worth describing on its own terms.

Let me start with the title claim and work backwards.

The assumption the internet was built on

The internet's original design assumption was that data moves. When a client wants a piece of content, the client asks the server for it, and the server sends the bytes across the network to the client. The network's purpose is to be a conduit for those bytes, and the network's efficiency is measured by how fast and how reliably it can move them from sender to receiver. Every protocol in the internet stack — TCP, IP, HTTP, SMTP, FTP, BitTorrent, every one of them — is a mechanism for moving data from one place to another. The data is the thing; the protocols are how the data gets where it needs to go.

This assumption worked well for the internet's first few decades because the data sizes were small relative to the network's capacity. A web page in 1995 was a few kilobytes. An email was smaller. A file download was megabytes at most. The network's throughput was the bottleneck, but the network's throughput was also growing faster than the data sizes, so the bottleneck was always moving in a favorable direction.

Over the last two decades, that has changed. Data sizes have grown faster than network throughput in several specific domains. Machine learning model weights are hundreds of gigabytes to terabytes — GPT-4 is reported to be roughly 3.6 terabytes of weights (OpenAI has not published the figure; this is the widely repeated estimate), and Llama 405B is roughly 810 gigabytes at 16-bit precision. Training datasets are hundreds of terabytes. Video streaming at high resolution is tens of megabits per second per stream. Genomic data per patient is gigabytes. Scientific simulation outputs are routinely in the petabyte range. Docker images for modern applications are gigabytes. Database backups and snapshots are terabytes. The data that people actually want to move has grown much faster than the network's ability to move it.

The result is that "the network moves data" has become a partial strategy. For small data, the network still moves it. For large data, the network has increasingly become a coordination mechanism for not moving the data — a way for two systems to agree on what data they are both looking at, without actually transferring the data between them. Content delivery networks like Cloudflare and Akamai were the first major instance of this pattern: the content is cached at the edge, and the client fetches from whichever edge is closest, avoiding the cross-ocean transfer that the original internet architecture would have required. Peer-to-peer file sharing protocols like BitTorrent were another instance: the data is not moved from a central server; it is retrieved from whoever has a copy, using a coordination protocol that tells you where copies exist.

Content-addressable storage systems like IPFS took this pattern further: data is identified by its cryptographic hash, and the hash is what gets moved across the network. The actual data stays wherever it already is, and the client fetches it only when the client actually needs the content. In IPFS, a content identifier of a few dozen bytes can represent a file of any size, and the content identifier is what travels across the wire, while the file stays in whichever nodes have pinned it.

This is the architectural pattern the substrate extends. The idea is: if the data is already somewhere, and the network can agree on a small identifier for it, the network should move the identifier instead of the data. The substrate's 74-byte persistent state is the identifier in this pattern, and the substrate's three-family post-quantum signing is what makes the identifier tamper-evident and forge-resistant.

The size invariant

Here is the property that makes the substrate interesting as a network primitive. The substrate's persistent state is 74 bytes regardless of the size of the data it commits to. This is the critical invariant, and it is what lets the substrate function as an identifier for arbitrarily large data.

A substrate committing to a 1 kilobyte JSON record is 74 bytes. A substrate committing to a 1 megabyte video frame is 74 bytes. A substrate committing to a 1 gigabyte machine learning dataset is 74 bytes. A substrate committing to a 1 petabyte scientific archive is 74 bytes. The size of the substrate does not depend on the size of the data — it depends on the output width of SHA3-256 (32 bytes for the content hash) plus the other wire-format fields that are fixed regardless of content.

This is not a compression claim. The data is not being compressed to 74 bytes. The data is being addressed by 74 bytes, with the addressing being cryptographically bound to the content via the hash. If you want the actual data, you have to go fetch it from wherever it lives. The substrate tells you what to fetch, not where to fetch it from and not how to reconstruct it from the substrate alone.
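The fixed-width property is easy to see in code. The sketch below builds a 74-byte tag from a 1-byte version, the 32-byte SHA3-256 content hash, and 41 bytes of fixed metadata; this field layout is illustrative, not the actual H33 wire format, but it shows why the tag size cannot depend on the data size.

```python
import hashlib

# Hypothetical 74-byte tag layout: version (1) + SHA3-256 content hash (32)
# + fixed-width metadata (41). Illustrative only, not the H33 wire format.
TAG_SIZE = 74

def make_tag(data: bytes, metadata: bytes = b"\x00" * 41) -> bytes:
    """Build a fixed-width tag that commits to `data` via its content hash."""
    assert len(metadata) == 41
    content_hash = hashlib.sha3_256(data).digest()  # always 32 bytes, any input size
    return b"\x01" + content_hash + metadata

# The tag is 74 bytes whether the data is a kilobyte or a gigabyte's worth.
for size in (1_000, 1_000_000):
    assert len(make_tag(b"x" * size)) == TAG_SIZE
```

Note that nothing is compressed here: `make_tag` cannot reproduce the data, it can only identify it, which is exactly the addressing-not-compression distinction above.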

But the addressing property is enough to do something powerful: two systems that both have access to the underlying data can communicate about that data by exchanging 74-byte substrates, without moving the data between them. If I have a machine learning model, and you have the same machine learning model, and we both know the substrate that addresses that model, we can coordinate on "use this model" by exchanging the 74-byte substrate. The model does not move. The model is already in both places. We just need to agree on which specific model we are both using, and the substrate gives us that agreement in 74 bytes.

Generalize this to many-party coordination: if a network of systems all have the same model cached locally, and they all agree on the substrate that addresses the model, they can refer to the model by its substrate in any subsequent communication. The communication cost of referencing a 3.6 terabyte model is 74 bytes, not 3.6 terabytes. The 74 bytes is cryptographically bound to the model by the content hash, so there is no ambiguity about which model is being referenced.

Federated machine learning as a concrete application

Let me make this concrete with a federated machine learning example, because it illustrates the bandwidth-decoupling property nicely.

Consider a federation of machine learning agents that want to share training updates. In the current dominant architecture, each agent trains a local copy of the model on local data, then sends the gradient updates (or the weight deltas) to a central parameter server, which aggregates the updates into a new version of the model. The parameter server then distributes the new model to all the agents, who resume training. This is the standard federated learning loop.

The bandwidth cost of this loop is dominated by the weight transmission. Each agent sends gradient updates to the parameter server, and each update can be hundreds of megabytes or more depending on the model size. The parameter server then sends the updated model back to each agent, and the updated model can be gigabytes. For a federation of a hundred agents doing federated training on a model with billions of parameters, the total bandwidth per training round is measured in terabytes.

Now consider the same federation using substrate tags. Each agent produces its gradient update locally. Instead of sending the update to a central server, the agent computes a substrate for the update — 74 bytes — and publishes the substrate to a shared network (say, a gossipsub overlay or a blockchain anchor). The substrate's content hash commits to the update, and the substrate's three-family signature proves that this specific agent produced it. The parameter server observes the substrates from all agents, fetches each agent's update on demand from wherever it is available, aggregates them into a new model version, and then publishes a substrate for the new model. The agents see the new model's substrate on the network, fetch the new model from wherever it is available, and resume training.

The bandwidth cost of the "substrate tag" version of this loop is the bandwidth for fetching the updates and the model only from the nodes that need them. If an agent already has a copy of the current model cached, the agent does not need to refetch it — it just needs to see the substrate to know that the current model is what it already has. If two agents are geographically close and can share gradient updates directly with each other, they can do so without routing through the central server. The network of substrate-tag-aware nodes can become a content-addressable mesh where data moves only where it needs to go, with the coordination happening entirely through 74-byte tag exchange.

The bandwidth savings from this are substantial, and the savings grow with the model size. For small models (kilobytes), the substrate tags are a marginal optimization — the difference between moving 10 KB and moving 74 bytes is not a meaningful fraction of the total. For large models (gigabytes to terabytes), the savings are dramatic — the difference between moving 3.6 TB and moving 74 bytes is effectively a factor of 5×10^10. Every order of magnitude increase in model size increases the substrate's relative bandwidth advantage.

This is the "federated learning on a tag network" pattern, and it is one of the applications we have been exploring most actively. The substrate's three-family post-quantum signing gives every gradient update a tamper-evident identity — agents can cryptographically verify that a gradient update was produced by a specific other agent, rather than having to trust the network layer to transport the update honestly. The content-addressable fetch pattern gives the federation the bandwidth efficiency of IPFS or BitTorrent, with the security properties of a post-quantum signing primitive layered on top.

Why post-quantum signing is load-bearing in this pattern

You might reasonably ask: if the point is content addressing, why does the substrate need three-family post-quantum signing at all? IPFS uses content addressing without any signing — the content hash is the address, and the address is the cryptographic commitment to the content. What does the substrate's signing layer add?

The answer is that signing matters when the network participants are not fully trusted with respect to each other, which is the case in most real-world federated systems.

In a purely cooperative federation where every participant is honest, content addressing is sufficient. A participant who publishes a content hash for some data is effectively saying "here is the address of the data I produced," and other participants can fetch the data and verify the hash matches. The hash verification proves the data was not tampered with in transit, and the participant's publication proves they had the data to publish. No signing needed.

In an adversarial federation, content addressing alone is not sufficient. A malicious participant can publish a content hash for data they did not actually produce, claiming authorship of someone else's work. An adversary can publish a content hash for garbage data and then refuse to serve the data when asked, creating an unserveable but referenced object that poisons the network. An adversary can publish a content hash for data with specific known-bad properties (a backdoored model, a manipulated dataset) and leave other participants to discover the bad properties only after they have relied on the data.

The substrate's three-family signing layer addresses each of these failure modes. Every substrate is signed by a specific producer, and the signature is produced by keys that are unique to that producer and that have been provisioned through the commercial licensing process (or through the non-commercial verifier's own key custody, for open-source uses). When a federation participant sees a substrate on the network, they can cryptographically verify which producer published it, and they can decide whether to trust that producer based on their own trust list. Malicious producers can be excluded from the trust list without affecting the rest of the federation. Honest producers remain trusted and their substrates remain valid.

The content addressing and the signing compose cleanly. Content addressing gives you "which data is this pointing to," and signing gives you "who said it was pointing to that." Together, they give you the cryptographic evidence needed to build a federated network where participants can share identifiers without having to trust each other, and where the participants can independently verify the cryptographic provenance of any identifier they receive.
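The composition can be sketched directly: hash answers "which data," signature answers "who said so," and a trust list decides whose signatures count. To keep the example stdlib-only, HMAC stands in for the substrate's three post-quantum signature families — the real scheme is public-key, not shared-secret — and all names here are illustrative.

```python
import hashlib
import hmac

# HMAC is a stand-in for the substrate's three-family post-quantum
# signatures, used only to keep this sketch dependency-free.

def sign_tag(key: bytes, data: bytes) -> tuple:
    """Return (content_hash, signature): 'which data' plus 'who said it'."""
    content_hash = hashlib.sha3_256(data).digest()
    sig = hmac.new(key, content_hash, hashlib.sha3_256).digest()
    return content_hash, sig

def verify(trusted_keys: dict, producer: str, content_hash: bytes, sig: bytes) -> bool:
    """Accept a tag only if it verifies under a key on our own trust list."""
    key = trusted_keys.get(producer)
    if key is None:
        return False  # unknown or excluded producer: reject outright
    expected = hmac.new(key, content_hash, hashlib.sha3_256).digest()
    return hmac.compare_digest(sig, expected)

trusted = {"agent-7": b"agent-7-key"}
h, s = sign_tag(b"agent-7-key", b"model-weights")
assert verify(trusted, "agent-7", h, s)
assert not verify(trusted, "mallory", h, s)  # not on the trust list
```

Removing a producer from `trusted` invalidates only that producer's tags, which is the "exclude malicious participants without affecting the rest of the federation" property described above.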

The bandwidth argument, restated

To restate the bandwidth argument in direct terms: for any workload where multiple systems need to refer to the same underlying data, exchanging substrate tags instead of exchanging the data itself reduces per-reference bandwidth from O(data size) to O(1). The exchange cost is the fixed 74-byte substrate, regardless of how large the underlying data is.

This is not a subtle or marginal optimization. For large data objects, the bandwidth reduction is many orders of magnitude. For small data objects, the reduction is less dramatic but still structural — the network is doing fundamentally different work when it exchanges tags versus data, because it is exchanging a commitment rather than content.

The implications compound at high fan-out. If a million agents all need to refer to the same machine learning model, the bandwidth cost of distributing the model naively is (one million) × (model size). The bandwidth cost of distributing the substrate tag is (one million) × (74 bytes), plus the bandwidth for whoever actually needs to fetch the model and cannot find a local copy. In a well-seeded content-addressable network, most of the million agents will find a local or near-local copy of the model and will never have to fetch the full content — their only bandwidth cost is the 74-byte tag exchange.

This is the pattern that makes distributed machine learning at very large scale economically feasible. Without the tag-exchange optimization, distributing a trillion-parameter model to a million agents requires approximately a trillion (parameters) × one million (agents) × bytes-per-parameter of bandwidth, which is in the exabytes. With the tag-exchange optimization and a well-seeded content-addressable network, the same distribution requires a few terabytes of initial seeding plus 74 bytes per tag broadcast, which is in the low terabytes total. The ratio is five to six orders of magnitude, which is the difference between "feasible with aggressive infrastructure" and "not physically possible."
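A quick back-of-the-envelope check of that arithmetic, assuming 2 bytes per parameter (16-bit weights) and about 4 TB of initial content seeding — both assumptions mine, chosen for round numbers:

```python
# Naive fan-out vs. tag exchange for a trillion-parameter model
# pushed to a million agents. Assumes fp16 weights (2 bytes/param)
# and ~4 TB of initial seeding; both figures are illustrative.
PARAMS, AGENTS, BYTES_PER_PARAM, TAG_BYTES = 10**12, 10**6, 2, 74

naive = PARAMS * BYTES_PER_PARAM * AGENTS      # every agent pulls full weights
seeded = 4 * 10**12 + TAG_BYTES * AGENTS       # seed a few copies, broadcast tags

assert naive == 2 * 10**18                     # ~2 exabytes
assert 10**5 < naive // seeded < 10**6         # five to six orders of magnitude
```

The tag broadcasts themselves (74 MB for a million agents) are a rounding error next to the seeding cost, which is why the total stays in the low terabytes.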

The substrate's contribution to this pattern is not the content addressing — IPFS already had that — but the signing layer on top, which makes the pattern robust against adversarial participants. A federation that cannot trust all of its participants needs a way to cryptographically verify the provenance of every tag it receives, and the substrate is one way to provide that verification while keeping the tag exchange compact.

Beyond machine learning

Federated machine learning is the cleanest example of the pattern, but it is not the only application. Any workload where multiple systems need to reference the same data can benefit from tag exchange instead of data exchange, and there are several domains where the pattern is worth explicit consideration.

Video streaming at very large audiences. A live event streamed to a million viewers is, in the current architecture, a million concurrent bandwidth connections from the source (or from CDN edges) to the viewers. The bandwidth cost at the source is (one million viewers) × (stream bitrate), which at a few megabits per second per viewer reaches terabits per second. Under a substrate-tag approach, the stream is chunked into segments, each segment is committed to a substrate, and viewers request the current substrate. The substrates are broadcast on a gossipsub-like overlay, and viewers fetch each segment from whichever peer has it (peer-to-peer video streaming is an existing concept — BitTorrent Live, P2P WebRTC streaming). The substrate adds cryptographic provenance to the per-segment content addressing, so viewers can verify that each segment was produced by the authentic source and has not been modified.

Scientific data sharing. Genomic data, astronomical observations, climate model outputs, and other scientific data objects are often terabytes to petabytes in size, and many research groups need to refer to the same datasets. Tag-based sharing lets a research group reference a specific dataset by its 74-byte substrate without actually transferring the dataset each time it is referenced. The dataset lives in whatever storage the producing institution provides, and other research groups fetch it only when they need it. The substrate's signing layer provides cryptographic provenance, which matters for scientific reproducibility — a paper that cites a dataset by its substrate can be rerun years later against the same dataset, because the substrate cryptographically identifies the exact bytes of the dataset that the paper used.
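The reproducibility claim reduces to a byte-exact check. The sketch below assumes a paper records the content hash embedded in the dataset's substrate; years later, anyone can confirm the archived bytes are exactly the bytes the paper used (the dataset bytes here are, of course, placeholders):

```python
import hashlib

# The content hash a paper would record when citing a dataset by tag.
cited_hash = hashlib.sha3_256(b"dataset-bytes-v1").digest()

# Years later: fetch the dataset from archival storage and re-verify.
fetched = b"dataset-bytes-v1"
assert hashlib.sha3_256(fetched).digest() == cited_hash  # exact bytes confirmed
```

Any silent revision of the dataset — even a one-byte change — fails this check, which is the property that makes "rerun the paper against the same dataset" meaningful.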

Legal evidence workflows. A large legal case can involve terabytes of digital evidence — emails, documents, database records, video footage. Each piece of evidence is typically referenced multiple times during discovery and trial. Tag-based referencing lets the evidence stay in whatever secure storage the case management system provides, while the substrate tags move across the workflow. The substrate's three-family signing provides non-repudiation: once a substrate is produced, the producer cannot claim they did not produce it, because the signature is cryptographically bound to their signing keys.

Government and regulatory document workflows. Regulatory submissions, tax filings, compliance reports, and other government document workflows often involve documents that are referenced many times but should not be repeatedly copied across systems. Tag-based workflows let the authoritative copy of each document live in the regulatory agency's systems while substrate tags represent the document in all the downstream workflows. The substrate's three-family signing provides long-term verification of each document's authenticity, which matters for audit trails that must remain valid for decades.

Software supply chain attestation. Modern software builds produce large artifacts — Docker images, compiled binaries, dependency trees — that are typically stored in artifact repositories and referenced by downstream deployments. Tag-based artifact referencing lets a deployment pull a specific artifact by its substrate, with cryptographic verification that the artifact is exactly the one the build process produced. The substrate's signing layer gives software consumers cryptographic evidence of provenance — a narrower slice of what Sigstore's ecosystem provides — and the substrate can compose with Sigstore or a similar system at the artifact attestation layer.

Each of these applications has its own specific requirements, and the substrate is not necessarily the right tool for all of them. But the common pattern — data that is large, referenced many times, and produced by parties whose provenance matters — is the pattern where tag exchange provides significant value over raw data exchange, and the substrate's composition of content addressing with three-family signing is a specific take on the pattern that has specific advantages (fixed-width persistent state, post-quantum security, composition with Bitcoin anchoring) over the alternatives.

What this means for network architecture

The broader architectural observation is that networks designed around "move the data" are approaching the limits of the assumption. Bandwidth grows, but data sizes grow faster in several important domains. The next generation of network architectures will, increasingly, be designed around "move the references and cache the data wherever it is," and the specific cryptographic primitives that support this architecture will become important protocol-level infrastructure.

Content addressing is part of the answer. Tamper-evident signing on top of content addressing is another part. Post-quantum signing on top of tamper-evident signing is another part — because the networks built around tag exchange today need to still work in a post-quantum future. Persistent fixed-width state is another part, because networks that exchange tags at high volume benefit from the tags being as small as possible without sacrificing cryptographic strength. Integration with a public append-only log (like Bitcoin) is another part, because the tags need a neutral third-party witness to resolve disputes about what was signed when.

The substrate is a specific point design that combines all of these parts — content addressing via SHA3-256, tamper-evident signing via three post-quantum signature families, compact 74-byte persistent state, and Bitcoin anchoring for long-term immutability. Other combinations are possible. IPFS plus a separate post-quantum signing layer would get you something similar, with different tradeoffs. A Nostr-based tag network with post-quantum signatures would get you something similar. The substrate is one specific combination, and the combination has specific properties that make it a clean fit for the kinds of workloads I have described above.

For the broader networking community, the takeaway is that "networks move data" no longer has to be a load-bearing design principle, and the architectures that best handle the next generation of workloads are the ones that treat data movement as an optional last-resort operation, with most coordination happening through compact tag exchange. The substrate is one tool in that architectural direction, and I suspect more tools with similar properties will emerge over the next few years as the pattern becomes more widely understood.

Closing

The internet was built to move data, and it is still very good at moving small data quickly. For large data, the network increasingly functions as a coordination mechanism for not moving the data — caching at the edge, fetching from peers, referencing by hash, composing tag networks on top of existing infrastructure. The substrate is a specific primitive for the tag-exchange layer, adding three-family post-quantum signing to content addressing and producing a compact 74-byte persistent state that functions as a tamper-evident reference.

For workloads that already need content addressing but also need cryptographic provenance beyond what IPFS provides, the substrate is a specific take on the combination. For workloads that can tolerate content addressing without strong signing, IPFS alone or similar is sufficient. For workloads that do not benefit from content addressing at all (small-data workloads, real-time communication, point-to-point messaging), the substrate does not add value and you should stick with whatever protocol your workload already uses.

The broader point is that "the internet was built to move data" is an observation about an assumption that is increasingly incomplete, and the architectures that acknowledge the incompleteness are building something different on top of the existing internet — a tag layer, a coordination layer, a commitment layer — that lets large data stay where it is while references to the data travel across the network. The substrate is one entry in that space, and the specific combination of properties it provides is worth considering if your workload fits the pattern.

The next post in this series is about why every computation type should have a byte, and why append-only byte-scale registries are an underrated piece of protocol design. See you there.

Build with the H33 Substrate

The substrate crate is available for integration. Every H33 API call now returns a substrate attestation.

Get API Key Read the Docs