The Tokenization Paradox
Tokenization is the most widely deployed data protection technique in enterprise computing. Banks tokenize credit card numbers. Healthcare systems tokenize patient identifiers. Legal platforms tokenize document metadata. Insurance companies tokenize policyholder information. The logic is simple: replace sensitive data with a non-sensitive placeholder (a token), store the mapping in a secure vault, and use the token everywhere the real data would have appeared.
On paper, tokenization is elegant. In practice, it contains a paradox that the industry has spent two decades trying not to talk about: in order to tokenize data, you must first read it. A tokenization system must see the credit card number to know it is a credit card number. It must read the Social Security number to recognize it as a Social Security number. It must parse the patient identifier, the account number, the date of birth, the address -- it must see every piece of sensitive data in plaintext before it can decide what to mask and how to mask it.
That reading step is the exposure. It is not a side effect of tokenization. It is the mechanism of tokenization. Every tokenization system in production today creates a plaintext window -- a moment in time and a location in memory where sensitive data exists unencrypted, unprotected, and vulnerable. The tokenization vault itself is a high-value target: it contains every mapping from token to real value. Compromise the vault, and you have compromised every piece of data the system has ever tokenized.
The entire data protection industry has been built on this compromise. We protect data by reading it first. We mask data by seeing it first. We classify data by parsing it first. Every data loss prevention (DLP) tool, every data classification engine, every privacy-enhancing technology that relies on tokenization inherits this fundamental flaw.
H33-Agent-Zero eliminates it.
The Exposure Inventory
Before explaining how H33 eliminates tokenization, it is worth cataloging exactly how many exposure points exist in a traditional data protection pipeline. Consider a standard enterprise DLP system processing a document that may contain PII.
Step 1: Document ingestion. The document is received by the DLP system. It is decrypted from its transport encryption (TLS). The document content is now in plaintext in memory on the DLP server. Exposure point one.
Step 2: Content parsing. The DLP system parses the document to extract text, metadata, and structure. The full document content is read into parsing buffers. Exposure point two.
Step 3: Pattern matching. Regular expressions and machine learning models scan the parsed content for patterns that match PII categories: SSN patterns, credit card number patterns, email addresses, phone numbers, names, addresses. Every piece of sensitive data is read and evaluated. Exposure point three.
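To make the exposure concrete, here is a minimal sketch of the kind of plaintext scan this step performs; the regex and sample value are illustrative, not drawn from any particular DLP product:

```python
import re

# The classic DLP pattern check: an SSN regex applied to document text.
# The structural point: the matcher must hold the full document in plaintext
# to test the pattern at all -- a regex engine cannot run on ciphertext.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

document = "Employee SSN: 123-45-6789, hired 2019-04-12"  # plaintext in memory
matches = SSN_PATTERN.findall(document)  # the engine reads every candidate value
print(matches)  # ['123-45-6789']
```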
Step 4: Classification. Based on the pattern matching results, the system classifies the document: contains PII, contains PHI, contains financial data, contains privileged legal content. The classification decision is based on having seen the sensitive content. Exposure point four.
Step 5: Tokenization/masking. The system replaces detected sensitive values with tokens. To do this, it must read each sensitive value, generate or look up the corresponding token, and perform the substitution. Exposure point five.
Step 6: Vault storage. The token-to-value mappings are stored in a tokenization vault. The vault now contains every piece of sensitive data in a centralized, high-value target. Exposure point six.
Step 7: Policy enforcement. Based on the classification, the system applies policies: block transmission, allow with redaction, route to compliance review, quarantine for legal hold. The policy engine has access to the classification results, which were derived from seeing the sensitive data. Exposure point seven.
Seven exposure points in a single document processing pipeline. Each one is a location where data exists in plaintext, where a memory dump would reveal sensitive content, where an insider with access to the process could read the data, where a supply chain compromise of any component would expose everything flowing through it.
And this is the protection system. This is the system designed to keep data safe.
The H33-Agent-Zero Pipeline: Zero Exposure Steps
H33-Agent-Zero processes the same document through the same logical pipeline -- ingestion, classification, decision, policy enforcement -- but with zero plaintext exposure at any step. Here is how.
Step 1: Client-side encryption. The document never leaves the client device in plaintext. Its content is encoded as numerical feature vectors, which are encrypted using CKKS (for classification) and TFHE (for decision logic). Only the encrypted feature vectors leave the device.
Step 2: CKKS classification on ciphertext. The encrypted feature vectors are processed by a classification model running on CKKS homomorphic encryption. CKKS supports the approximate arithmetic needed for neural network inference: matrix multiplications, additions, and activation function approximations. The model produces encrypted class scores -- a vector of encrypted floating-point values representing the probability that the document belongs to each category (PII, PHI, financial, privileged, public). The server performs these computations without ever decrypting the feature vectors or the scores. It sees ciphertexts in, ciphertexts out.
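H33's implementation is not public, so the following is only a shape-of-the-computation sketch using the open-source TenSEAL library; the weights, dimensions, and encryption parameters are illustrative assumptions:

```python
import tenseal as ts

# Client side: CKKS context and encrypted feature vector.
ctx = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],  # two multiplicative levels
)
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()  # rotation keys needed for matrix-vector products

features = [0.12, -0.44, 0.93, 0.05]       # toy 4-dim document embedding
enc_x = ts.ckks_vector(ctx, features)      # plaintext never leaves the client

# Server side: one linear layer plus a squared activation, all on ciphertext.
W = [[0.2, -0.1, 0.4, 0.3, -0.2],          # 4 x 5 weight matrix (toy values)
     [0.1, 0.3, -0.5, 0.2, 0.1],
     [-0.3, 0.2, 0.1, 0.4, 0.2],
     [0.5, -0.2, 0.3, 0.1, -0.1]]
b = [0.01, 0.02, -0.01, 0.03, 0.0]         # biases for the 5 classes

enc_scores = enc_x.matmul(W) + b           # encrypted class scores
enc_scores = enc_scores * enc_scores       # x^2: a polynomial activation

# Only the client, holding the secret key, can read the result:
print(enc_scores.decrypt())
```

The server runs the matrix product, the addition, and the multiplication on ciphertexts it cannot open; decryption happens only where the secret key lives.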
Step 3: TFHE threshold on encrypted scores. The encrypted class scores need to be converted into a decision: which category? Is any score above the detection threshold? This is where TFHE encrypted control flow takes over. The encrypted scores are discretized and compared against thresholds using GT (greater-than) comparisons. Each GT comparison on an n-bit discretized score costs 2n - 1 programmable bootstraps. The argmax across categories uses a tournament of GT comparisons to select the highest-scoring category. All of this happens on encrypted data. The server evaluates the threshold checks and the argmax, but it cannot see the scores, the comparison results, or the final category selection.
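The cost model and the tournament structure can be shown with a plaintext reference sketch that mirrors the control flow running under TFHE; the `gt` and `mux` helpers stand in for their encrypted counterparts, and the 8-bit width is our assumption:

```python
N_BITS = 8                   # discretization width per score (assumption)
GT_COST = 2 * N_BITS - 1     # programmable bootstraps per GT, per the cost model

def gt(a: int, b: int) -> int:
    """Stand-in for the encrypted greater-than gate; yields a 0/1 bit."""
    return int(a > b)

def mux(sel: int, if_true: int, if_false: int) -> int:
    """Encrypted-style multiplexer: sel * if_true + (1 - sel) * if_false."""
    return sel * if_true + (1 - sel) * if_false

def tournament_argmax(scores):
    """Argmax via k - 1 sequential GT comparisons over k discretized scores."""
    best_val, best_idx, pbs = scores[0], 0, 0
    for i in range(1, len(scores)):
        is_greater = gt(scores[i], best_val)
        best_val = mux(is_greater, scores[i], best_val)
        best_idx = mux(is_greater, i, best_idx)
        pbs += GT_COST
    return best_idx, pbs

# Five categories (PII, PHI, financial, privileged, public), 8-bit scores:
category, cost = tournament_argmax([112, 201, 57, 180, 33])
print(category, cost)  # 1 (the PHI slot wins); 4 GTs x 15 PBS = 60 bootstraps
```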
Step 4: Encrypted policy tags. Based on the encrypted classification result, policy tags are applied. These tags are computed as encrypted Boolean values: "contains PII" is an encrypted bit, "requires legal hold" is an encrypted bit, "block external transmission" is an encrypted bit. Each tag is derived from the encrypted classification through additional TFHE Boolean gate evaluations. The server computes the tags but cannot read them.
Step 5: Encrypted enforcement. The encrypted policy tags drive enforcement actions. If the "block" tag is set (in its encrypted form), the document is routed to quarantine. If the "redact" tag is set, the document is marked for client-side redaction upon return. If the "allow" tag is set, the document proceeds. The server executes the routing logic on encrypted tags using encrypted MUX (multiplexer) operations. It performs the enforcement without knowing what the enforcement decision was.
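Continuing the plaintext reference model from Step 3, the routing reduces to a nest of MUX operations over the tag bits; the route codes are our own illustration:

```python
def mux(sel: int, if_true: int, if_false: int) -> int:
    """Same encrypted-style multiplexer as in the Step 3 sketch."""
    return sel * if_true + (1 - sel) * if_false

QUARANTINE, REDACT, ALLOW = 2, 1, 0   # route codes (illustrative)

def route(block_bit: int, redact_bit: int) -> int:
    """Under TFHE both tag bits stay encrypted, so the server computes the
    route without ever learning which branch it actually took."""
    return mux(block_bit, QUARANTINE, mux(redact_bit, REDACT, ALLOW))

print(route(0, 1))  # 1 -> marked for client-side redaction on return
```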
Step 6: Client-side decryption. The client receives the encrypted result: the classification, the policy tags, and the enforcement disposition. The client decrypts these with its secret key and acts on the outcome. No server, no intermediate system, no DLP engine, no tokenization vault ever saw the document content.
Zero exposure points. Not seven. Zero.
What CKKS Does and Why It Matters Here
CKKS (Cheon-Kim-Kim-Song) is a homomorphic encryption scheme designed for approximate arithmetic on real numbers. Unlike BFV or BGV, which operate on exact integers, CKKS encrypts floating-point vectors and supports addition and multiplication with controlled precision loss. This makes CKKS the natural choice for machine learning inference, where model weights and activations are floating-point values and small precision differences do not affect classification outcomes.
In the Agent-Zero pipeline, CKKS handles the heavy lifting of document classification. A document is represented as a feature vector (from an embedding model, a TF-IDF representation, or any other numerical encoding). This vector is encrypted using CKKS and sent to the server. The server evaluates the classification model on the encrypted vector: matrix-vector multiplications for linear layers, polynomial approximations for activation functions, and additions for bias terms. At the parameter sizes used here, each CKKS ciphertext packs up to 4,096 values into SIMD (Single Instruction, Multiple Data) slots, meaning a single ciphertext operation processes thousands of features simultaneously.
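A short TenSEAL sketch shows the packing arithmetic; the polynomial degree of 8,192 is an assumption consistent with the 4,096-slot figure, since CKKS provides N/2 slots at degree N:

```python
import tenseal as ts

ctx = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,             # N = 8192 -> N/2 = 4096 SIMD slots
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
ctx.global_scale = 2 ** 40

features = [0.001 * i for i in range(4096)]   # one full ciphertext of features
enc = ts.ckks_vector(ctx, features)

# A single ciphertext operation touches all 4,096 packed values at once:
enc_scaled = enc * 2.0
print(len(enc_scaled.decrypt()))              # 4096
```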
The output is an encrypted vector of class scores. These scores are precise enough for classification -- the noise introduced by CKKS is orders of magnitude smaller than the separation between class scores in a well-trained model -- but they are still encrypted. The server has no way to read them.
The transition from CKKS to TFHE happens at the decision boundary. CKKS computes the scores. TFHE decides what to do with them. This is the handoff from "encrypted compute" to "encrypted control flow," and it is the critical innovation that makes zero-exposure classification possible.
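The handoff itself is a quantization step: the encrypted floating-point scores are mapped into the small integer domain that TFHE comparisons operate on. A plaintext reference of that mapping, with the 8-bit range and clipping as our assumptions:

```python
def discretize(score: float, n_bits: int = 8) -> int:
    """Map a class score in [0, 1] to an n-bit integer for TFHE comparison.
    In the pipeline this mapping is applied homomorphically; it is shown in
    plaintext here purely to pin down the arithmetic."""
    levels = (1 << n_bits) - 1
    clipped = min(max(score, 0.0), 1.0)
    return round(clipped * levels)

print(discretize(0.79))  # 201 -- the PHI score used in the argmax sketch above
```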
Why Tokenization Cannot Be Fixed
The natural objection is: "Can we fix tokenization instead of replacing it?" The answer is no, and the reason is structural.
Tokenization requires pattern recognition on plaintext. To tokenize a credit card number, the system must recognize the pattern of a credit card number. To do that, it must read the digits. There is no way to recognize a pattern without seeing the data that forms the pattern. You could encrypt the data, but then the pattern matcher cannot operate on it -- unless you use homomorphic encryption, at which point you are no longer doing tokenization. You are doing what H33 does.
Some vendors have proposed "format-preserving tokenization" where the token looks like the original data (same length, same character types). This does not solve the exposure problem; it just makes the tokens more convenient to use in existing systems. The tokenization engine still sees the original data.
Other vendors have proposed "vaultless tokenization" where the token-to-value mapping is derived from a key rather than stored in a database. This eliminates the vault as a single point of failure, but the tokenization engine still reads the original data to compute the mapping. The exposure point is the computation, not the storage.
Differential privacy has been proposed as an alternative: add noise to the data so that individual values cannot be recovered. But differential privacy degrades the data. A classifier running on differentially private data produces less accurate classifications. You are trading data quality for privacy. With homomorphic encryption, you get both: exact classification results with zero data exposure.
Secure enclaves (Intel SGX, ARM TrustZone) have been proposed as a way to run tokenization in a protected hardware environment where even the operating system cannot access the plaintext. But enclaves have been broken repeatedly: Spectre, Meltdown, SGAxe, Plundervolt, and a steady stream of side-channel attacks that extract secrets from supposedly protected memory. Hardware trust is not cryptographic trust. It is vendor trust, and vendors have a track record of shipping vulnerable hardware.
Federated learning keeps data distributed across devices but still requires models to see local data in plaintext during training and inference. It is a distribution strategy, not an encryption strategy. The data is plaintext somewhere.
The only approach that eliminates the plaintext window entirely is homomorphic encryption. And the only production system that chains CKKS classification into TFHE decision logic into encrypted policy enforcement is H33-Agent-Zero.
Legal: Privilege Classification Without Reading Contracts
Law firms face a specific and acute version of the tokenization problem. During litigation, firms must review millions of documents to identify privileged communications -- attorney-client privilege, work product doctrine, joint defense agreements. Missing a privileged document and producing it to the opposing party can be catastrophic: waiver of privilege, malpractice liability, sanctions.
Today, privilege review requires attorneys or AI-assisted review tools to read every document. The AI model must see the content to classify it as privileged or non-privileged. This means the AI system (and whoever operates it, including cloud vendors, managed review providers, and contract attorneys) has access to the most sensitive communications in the litigation: the strategy discussions, the legal advice, the work product analysis.
With H33-Agent-Zero, privilege classification happens on encrypted documents. The document content is encrypted on the firm's device. The classification model evaluates encrypted feature vectors and produces an encrypted privilege score. The TFHE decision logic applies the privilege threshold on the encrypted score and produces an encrypted tag: privileged or not privileged. The tag is decrypted only on the firm's device.
No cloud vendor, no managed review platform, no AI model provider ever sees the content of a single document. Privilege classification is performed with mathematical precision and zero exposure. The firm maintains complete control over its most sensitive work product while still benefiting from AI-assisted review at scale.
Healthcare: PHI Tagging Without Seeing Patient Data
HIPAA requires that Protected Health Information (PHI) be identified, tracked, and protected throughout its lifecycle. In practice, this means every system that handles healthcare data must be able to identify PHI elements -- patient names, dates of service, diagnosis codes, medication lists, provider names, facility identifiers -- including the eighteen identifier categories enumerated in the Privacy Rule's de-identification standard.
Today, PHI identification requires reading the data. A DLP system scans clinical notes, lab results, radiology reports, and other documents in plaintext to detect PHI patterns. The DLP system itself becomes a PHI processor, subject to HIPAA security requirements, breach notification obligations, and Business Associate Agreement terms. Every vendor in the PHI detection pipeline has access to patient data.
With Agent-Zero, the healthcare organization encrypts clinical documents on its own systems. The encrypted documents are sent to the classification engine, which identifies PHI categories on encrypted features. TFHE threshold logic determines which PHI categories are present and applies encrypted tags. The tags are returned to the healthcare organization and decrypted locally.
The classification engine never sees a patient name, a diagnosis code, or a medication list. It is not a PHI processor. It does not require a BAA. A breach of it exposes no patient data, because it never holds any. The compliance burden collapses from "every system in the pipeline must comply with HIPAA" to "only the endpoints that hold decryption keys must comply." This is a fundamental simplification of healthcare data governance.
Banking: PII Detection Without Customer Data Exposure
Banks operate under overlapping regulatory frameworks -- GLBA, CCPA/CPRA, GDPR, PCI DSS, BSA/AML -- each of which requires identifying and protecting different categories of customer data. PII detection is the first step in all of these compliance programs. The bank must know what data it has, where it lives, and what categories it falls into before it can apply the appropriate protections.
Today, PII detection means scanning databases, documents, emails, chat logs, and application data in plaintext. The detection system reads customer names, account numbers, Social Security numbers, addresses, phone numbers, email addresses, and financial records. It classifies each element and applies tags. Every system in this pipeline -- the scanner, the classifier, the tagging engine, the policy engine -- has access to customer PII.
With Agent-Zero, the bank encrypts data at the source. The PII detection pipeline operates entirely on encrypted data. The classifier identifies PII categories without seeing the underlying values. Policy tags are applied on encrypted classifications. The bank decrypts the tags locally and applies the appropriate regulatory controls.
The result is a PII detection system that itself does not process PII. The scanner cannot be breached for customer data because it never holds customer data. The compliance surface area shrinks to the encryption endpoints, which are the bank's own systems. Third-party vendors in the detection pipeline carry zero data risk because they never see the data.
The Attestation Layer
One question arises immediately with encrypted classification: how do you prove that the classification was performed correctly? If no one can see the data or the intermediate results, how does an auditor verify that a document containing PHI was in fact classified as PHI?
H33 addresses this through its attestation infrastructure. Every encrypted computation produces a 74-byte attestation that proves the computation was performed correctly without revealing the inputs or outputs. This attestation is signed using post-quantum cryptographic signatures backed by three independent hardness assumptions, ensuring it remains valid even against quantum adversaries.
An auditor can verify that the classification model was applied to the encrypted input and that the output was derived correctly, without seeing the input, the output, or any intermediate state. The attestation proves correctness without revealing content. This is the auditing model that regulators need: verifiable compliance without data exposure.
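H33's attestation format and signing scheme are not public, so code can only sketch the interface an auditor would exercise; every name and field below is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Attestation:
    """Hypothetical container for the 74-byte attestation described above."""
    blob: bytes  # opaque proof that the encrypted computation ran correctly

def audit(att: Attestation, verify_sig: Callable[[bytes], bool]) -> bool:
    """Auditor-side check. `verify_sig` stands in for H33's post-quantum
    signature verification; the auditor sees the proof, never the data."""
    return len(att.blob) == 74 and verify_sig(att.blob)
```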
The Traditional DLP Vendor's Dilemma
If you are a DLP vendor, the analysis above presents an existential question. Your entire product is built on the assumption that you must see data to protect it. Your scanning engine reads data. Your classification model reads data. Your policy engine reads data. Your tokenization vault stores data. Every component in your stack is a data processor, and your customers pay you to process their most sensitive information.
H33-Agent-Zero does not need to see the data at any point. The classification is performed on ciphertext. The decisions are made on ciphertext. The policy tags are computed on ciphertext. The enforcement is executed on ciphertext. There is no step at which a DLP vendor would need access to plaintext, which means there is no step at which a DLP vendor adds value by reading data.
The value proposition shifts from "trust us with your data so we can protect it" to "we protect your data without ever needing your trust." This is not an incremental improvement on tokenization. It is a replacement for the entire tokenization paradigm.
Threat Model Comparison
Consider the threat models that each approach defends against:
Tokenization: Protects against unauthorized access to data at rest (if the vault is separate). Does not protect against the tokenization engine itself being compromised. Does not protect against insiders with access to the tokenization pipeline. Does not protect against memory-scraping malware on the tokenization server. Does not protect against vendor supply chain attacks on DLP components.
Secure enclaves: Protects against software-level OS/hypervisor attacks (in theory). Does not protect against hardware side channels (Spectre, SGAxe, Plundervolt). Does not protect against the enclave manufacturer. Does not protect against physical access attacks. Trust is hardware-bound, not mathematical.
H33-Agent-Zero: Protects against server compromise (data is encrypted). Protects against insider access (no plaintext exists on the server). Protects against memory-scraping malware (memory contains only ciphertexts). Protects against vendor supply chain attacks (vendors process ciphertexts, not data). Protects against quantum adversaries (post-quantum encryption primitives). Does not protect against client-side compromise (the client must hold the decryption key). This is the correct trust boundary: the entity that owns the data holds the key. Everyone else sees ciphertext.
Cost and Latency
The obvious trade-off is computation cost. Homomorphic encryption is more computationally expensive than plaintext processing. A CKKS classification that takes one millisecond in plaintext might take hundreds of milliseconds on encrypted data. A TFHE decision tree that takes microseconds in plaintext takes tens of milliseconds on encrypted data.
But that is the wrong comparison. The true cost of tokenization is not the compute time of the tokenization step. It is the cost of the data breach that tokenization fails to prevent. It is the cost of the HIPAA violation when the DLP vendor is compromised. It is the cost of the privilege waiver when the review platform leaks attorney-client communications. It is the cost of the GDPR fine when customer PII is exposed through a supply chain attack on the tokenization vendor.
Against those costs, the compute overhead of homomorphic encryption is negligible. The question is not "is encrypted classification slower than plaintext classification?" It is "is the compute cost of encrypted classification less than the expected cost of the data breaches that encrypted classification prevents?" For any organization processing sensitive data at scale, the answer is unambiguously yes.
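The comparison reduces to one expected-value inequality. Every number below is a placeholder an organization would replace with its own estimates; none are measured figures:

```python
# Hypothetical inputs -- substitute your own.
docs_per_year = 10_000_000
fhe_cost_per_doc = 0.002           # added compute cost per encrypted classification ($)
annual_breach_probability = 0.05   # chance of a pipeline breach in a given year
expected_breach_cost = 5_000_000   # fines, remediation, litigation ($)

fhe_overhead = docs_per_year * fhe_cost_per_doc                   # $20,000 / year
expected_loss = annual_breach_probability * expected_breach_cost  # $250,000 / year

print(fhe_overhead < expected_loss)  # True under these placeholder numbers
```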
The Pipeline Is the Product
H33-Agent-Zero is not a point solution. It is a pipeline that chains CKKS classification into TFHE decision logic into encrypted policy enforcement into post-quantum attestation. Each component exists to eliminate a specific exposure point in the traditional data protection stack.
CKKS eliminates the classification exposure point. TFHE eliminates the decision exposure point. Encrypted policy tags eliminate the enforcement exposure point. Post-quantum attestation eliminates the audit exposure point. The pipeline, taken as a whole, replaces tokenization entirely.
There is no tokenization vault because there are no tokens. There is no DLP scanning engine because there is no plaintext to scan. There is no pattern matcher because there is no data to match against. There is just encrypted data flowing through encrypted computation, producing encrypted results that only the data owner can read.
This is what the end of tokenization looks like. Not a better tokenization system. Not a faster tokenization engine. Not a more secure tokenization vault. The elimination of the need to tokenize at all, because the need to read data in order to protect data has been eliminated at the cryptographic level.
H33-Agent-Zero ships today. The pipeline is production-ready. The only question left is how long the industry will continue reading data to protect it when the technology to protect data without reading it already exists.