The Classification Problem
Every enterprise classifies documents. CONFIDENTIAL. PHI. PII. INTERNAL. BOARD ONLY. EXPORT CONTROLLED. The labels matter because they determine who can access what, how long it gets retained, whether it can cross a border, and what happens when regulators ask for it. Classification is not optional for any organization that handles sensitive data — and in 2026, that means every organization.
The problem is not that enterprises lack classification systems. The problem is how those systems work. Today, document classification happens through three mechanisms, and all three share the same fundamental flaw.
Humans read the document and pick a label. This is the oldest approach and still the most common in regulated industries. A person opens the file, reads enough to understand what it contains, and assigns a classification tag. It is slow, inconsistent, and expensive. Two people reading the same document will frequently assign different labels. Attrition and training costs make it worse over time. But the real issue is not the inconsistency — it is that a human had to read the document at all. Every pair of eyes on a sensitive document is an exposure event. Every analyst with access to plaintext is a potential vector for data exfiltration, whether through malice, negligence, or compromise.
DLP tools scan the plaintext. Data Loss Prevention tools parse documents looking for patterns: Social Security numbers, credit card formats, medical record identifiers, keywords from regulatory dictionaries. They are faster than humans and more consistent, but they require the same thing humans do — access to the plaintext content. The DLP engine reads every word. It holds the unencrypted document in memory. If the DLP server is compromised, the attacker gets the same access the scanner had: everything. DLP vendors will tell you the data is "processed securely." What they mean is that the plaintext is handled according to their security policy. What they do not mean is that the data stays encrypted during scanning. It does not.
LLMs classify the content. The newest approach sends documents to a large language model — sometimes hosted by a third party — and asks it to assign labels. This is faster and more accurate than pattern matching for nuanced content. An LLM can understand that a document discussing "Patient 47's response to the experimental compound" is PHI even though it contains no Social Security numbers or ICD codes. But the document was sent, in full, to a model that ingests it. If that model is hosted by a third party, the content left your boundary. If the model is self-hosted, it still processes plaintext in memory. The LLM read the document. That is the exposure.
All three approaches require something to read the document in order to classify it. A human, a scanner, or a model. The content must be plaintext at the moment classification happens. That is the problem.
What If the Classifier Could Not Read the Document?
This is the question that drives H33-Upstream. Not "how do we classify documents faster" or "how do we make classification more accurate." The question is: can a classifier assign the correct label to a document it has never seen in plaintext?
The answer is yes, and the mechanism is CKKS fully homomorphic encryption.
H33-Upstream uses a compact, customer-trained classifier running under CKKS FHE. The model evaluates encrypted feature vectors extracted from the document. It produces an encrypted classification vector — a set of encrypted scores, one per class in the customer's taxonomy. The model never sees plaintext. The feature extractor never sees plaintext. The classification engine never sees plaintext. The document enters encrypted and stays encrypted through classification, tagging, and storage.
This is not differential privacy, where the system sees the real data but adds noise to the output. This is not tokenization, where a mapping table somewhere holds the original values. This is not anonymization, where the data is transformed but potentially reversible. The data is encrypted with a key that H33 does not hold. The computation happens on ciphertext. The result is ciphertext. The only entity that can read the classification output is the customer who holds the decryption key.
CKKS is the right FHE scheme for this workload because classification inherently tolerates approximate arithmetic. A confidence score of 0.9347 versus 0.9348 does not change the classification decision. CKKS provides efficient approximate real-number computation on encrypted data — exactly what a neural network classifier needs. The alternative schemes (BFV, TFHE) operate on exact integers or individual bits, which would require expensive encoding transformations that add latency without improving classification quality.
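A toy sketch (plain Python, not H33 code) makes the point concrete: perturbing every confidence score by noise on the order of CKKS's approximation error does not change which label wins. The label names and score values below are illustrative.

```python
import random

def argmax_label(scores: dict) -> str:
    """Return the label with the highest confidence score."""
    return max(scores, key=scores.get)

exact = {"CLIENT_PII": 0.9347, "PUBLIC_RELEASE": 0.0431, "BOARD_MATERIALS": 0.0222}

# Simulate CKKS-style approximation noise (~1e-4) on every score.
random.seed(0)
approx = {k: v + random.uniform(-1e-4, 1e-4) for k, v in exact.items()}

# The decision is identical despite the noisy arithmetic.
assert argmax_label(exact) == argmax_label(approx) == "CLIENT_PII"
```

The same reasoning applies to thresholds: a comparison against 0.90 is unaffected by error in the fourth decimal place, so nothing the classifier needs is lost to approximation.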
The Pipeline
The end-to-end pipeline for encrypted document classification in H33-Upstream works as follows.
H33-Upstream Classification Pipeline
1. Document enters the system
2. Feature extraction produces a vector representation
3. Vector is encrypted under CKKS FHE
4. Encrypted vector is evaluated by the customer's classifier model
5. Classifier produces an encrypted classification vector
6. AI recommendation is attested via H33-74
7. Policy engine applies organizational rules on encrypted data
8. TFHE-encrypted tags are created based on policy output
9. Encrypted tags are attached to the encrypted document
10. The complete bundle is committed with three post-quantum signature families
The user uploads. The system handles everything. At no point does any component in the pipeline operate on plaintext. The document does not need to be decrypted for feature extraction because the feature extractor operates on a pre-encrypted representation. The classifier does not need plaintext because CKKS supports the matrix operations and activation functions that neural classifiers require. The policy engine does not need plaintext because it evaluates boolean conditions on encrypted classification outputs. The tags themselves are TFHE-encrypted because TFHE provides bit-level boolean operations that enforce access-tier logic without revealing the tag values to the infrastructure.
The three post-quantum signature families that attest the final bundle ensure that the classification decision — and the chain of custody from document ingestion to tag assignment — cannot be forged, even by a quantum adversary. This is not just encrypted classification. It is attested, policy-enforced, post-quantum-signed encrypted classification.
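The ten steps above can be sketched as a dataflow in which every stage consumes and produces only opaque ciphertext. This is an illustrative Python skeleton, not H33's implementation: the `Ct` type, the function names, the byte-prefix "encryption," and the signature-family placeholders are all stand-ins for real CKKS/TFHE operations and real post-quantum signatures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ct:
    """Opaque ciphertext. No pipeline stage can inspect its contents."""
    blob: bytes

def extract_features(doc: Ct) -> Ct:    # steps 2-3: encrypted feature vector
    return Ct(b"feat|" + doc.blob)

def classify(features: Ct) -> Ct:       # steps 4-5: encrypted score vector (CKKS)
    return Ct(b"scores|" + features.blob)

def apply_policy(scores: Ct) -> Ct:     # step 7: policy rules on encrypted output
    return Ct(b"policy|" + scores.blob)

def make_tags(actions: Ct) -> Ct:       # step 8: TFHE-encrypted tags
    return Ct(b"tags|" + actions.blob)

def commit(doc: Ct, tags: Ct) -> dict:  # steps 6, 9-10: attested, signed bundle
    return {
        "doc": doc,
        "tags": tags,
        "attestation": b"h33-74-stub",                       # placeholder
        "signatures": ["pq-sig-A", "pq-sig-B", "pq-sig-C"],  # placeholder families
    }

doc = Ct(b"<ciphertext bytes>")         # step 1: arrives already encrypted
bundle = commit(doc, make_tags(apply_policy(classify(extract_features(doc)))))
assert isinstance(bundle["tags"], Ct) and len(bundle["signatures"]) == 3
```

The type system carries the argument: because every stage's signature is `Ct -> Ct`, there is no point in the flow where a plaintext document could even be expressed.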
Customer-Defined Taxonomy
H33 does not decide what your classifications are. This is a deliberate architectural decision, not a feature gap. Classification taxonomies are organization-specific. A hospital's labels are different from a law firm's. A defense contractor's retention rules have nothing in common with a fintech's. Imposing a universal taxonomy would force customers to map their real-world categories onto H33's abstractions, creating translation errors that defeat the purpose of automated classification.
Instead, the SDK lets customers create their own taxonomy. Each taxonomy definition includes:
- Class names: Whatever labels the organization uses. CLIENT_PII, BOARD_MATERIALS, PUBLIC_RELEASE, SEC_FILING_DRAFT, EXPORT_CONTROLLED, ATTORNEY_PRIVILEGED — the system accepts any string.
- Default access tiers: Each class maps to an initial access tier that determines who can read, modify, or share documents bearing that tag.
- Retention periods: How long documents with this classification must be preserved before destruction is permitted.
- Jurisdictions: Geographic or regulatory domains that constrain where the document can be processed or stored.
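A taxonomy entry carrying those four fields might look like the following sketch. The dataclass shape, field names, and tier convention here are hypothetical, not the real SDK surface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonomyClass:
    name: str                  # any string the organization uses
    default_access_tier: int   # assumed convention: higher = more restricted
    retention_days: int        # minimum retention before destruction is permitted
    jurisdictions: tuple       # domains where processing/storage is allowed

taxonomy = (
    TaxonomyClass("CLIENT_PII", default_access_tier=5,
                  retention_days=7 * 365, jurisdictions=("US", "EU")),
    TaxonomyClass("PUBLIC_RELEASE", default_access_tier=1,
                  retention_days=365, jurisdictions=("GLOBAL",)),
    TaxonomyClass("BOARD_MATERIALS", default_access_tier=7,
                  retention_days=10 * 365, jurisdictions=("US",)),
)
```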
The classifier is trained against the customer's own labeled data. This training happens on the customer's side using the H33 SDK. The trained model weights are encrypted before being deployed to the classification pipeline. H33 never sees the taxonomy definitions. H33 never sees the training data. H33 never sees the model weights in plaintext. The customer defines the world, trains the classifier, and deploys it — all without H33 knowing what the labels mean or what the documents contain.
Three Confidence Modes
The classifier produces an encrypted score vector — one floating-point confidence value per class, all encrypted under CKKS. This vector is useless to H33 because H33 does not hold the decryption key. But the customer needs to make a decision based on this vector, and different customers have different requirements for how that decision gets made.
H33-Upstream provides three honest confidence modes:
(a) Hard Classification. This is the default. The system evaluates which encrypted score is highest using homomorphic comparison operations and outputs a single label: "Document matched: CLIENT_PII." No score is revealed. No confidence number leaks. The customer gets a classification decision and nothing else. This mode is appropriate for organizations that want fully automated classification with no human review of confidence margins.
(b) Threshold Proof. The classifier's encrypted score is compared against a customer-defined threshold using TFHE boolean comparison circuits. The output is a binary proof: the score exceeded the threshold, or it did not. The actual score is never revealed to anyone — not to H33, not even to the customer's infrastructure. Only the binary pass/fail result is returned. This mode is useful for compliance workflows where the organization needs to prove that classification confidence met a regulatory minimum without exposing the exact model output.
(c) Customer Decrypt. The encrypted classification vector is returned directly to the customer. The customer decrypts it at their own boundary using their own key. They see the full vector of confidence scores — 0.94 for CLIENT_PII, 0.03 for PUBLIC_RELEASE, 0.02 for BOARD_MATERIALS, and so on. This mode gives the customer maximum visibility but requires them to handle the decryption and decision logic themselves. It is appropriate for organizations with their own post-classification workflows or those integrating H33 output into existing DLP systems.
All three modes operate on the same encrypted pipeline. The difference is where and how the boundary decision happens — not whether the data is exposed. In all three modes, H33 never sees the plaintext document, the plaintext scores, or the plaintext classification result.
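The decision logic of the three modes can be sketched on a plaintext score vector. In the real pipeline, modes (a) and (b) run homomorphically, so the scores below would never exist in the clear; this plain-Python version only illustrates what each mode discloses.

```python
def hard_classification(scores: dict) -> str:
    """Mode (a): only the winning label is disclosed."""
    return max(scores, key=scores.get)

def threshold_proof(scores: dict, label: str, tau: float) -> bool:
    """Mode (b): only a pass/fail bit is disclosed, never the score."""
    return scores[label] >= tau

def customer_decrypt(scores: dict) -> dict:
    """Mode (c): the full vector, decrypted at the customer's boundary."""
    return dict(scores)

scores = {"CLIENT_PII": 0.94, "PUBLIC_RELEASE": 0.03, "BOARD_MATERIALS": 0.02}
assert hard_classification(scores) == "CLIENT_PII"
assert threshold_proof(scores, "CLIENT_PII", tau=0.90) is True
assert customer_decrypt(scores)["PUBLIC_RELEASE"] == 0.03
```

Reading the three return types top to bottom (a label, a bit, a vector) is the clearest way to see that the modes differ only in how much the boundary decision reveals.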
Policy Engine
Classification alone is not enough. Organizations need rules that constrain what happens after classification. The H33 policy engine runs on encrypted classification outputs and enforces organizational logic that can only upgrade, never downgrade.
This one-directional enforcement is critical. A policy can say "all SEC-jurisdiction documents get minimum access tier 5." It cannot say "reduce access tier for SEC documents from 5 to 3." A policy can say "all PHI gets 7-year retention." It cannot say "override the 7-year PHI retention with 90-day retention." The engine only tightens constraints. This eliminates an entire class of policy misconfiguration bugs where a well-meaning administrator accidentally loosens access controls on sensitive data.
Policy rules are expressed as boolean conditions on encrypted classification fields. The engine evaluates these conditions using TFHE boolean circuits — encrypted AND, OR, NOT, and comparison operations — without decrypting the classification output or the policy parameters. The result is a set of encrypted policy actions (access tier adjustments, retention overrides, jurisdiction constraints) that are applied to the encrypted tags before they are attached to the document.
The AI classifies. The policy enforces. Both run on encrypted data. Both are H33-74 attested. The audit trail records that classification happened, that policy was applied, and that the result was signed — without recording what the classification was or what the policy decided. The audit trail proves process compliance without leaking content.
Human Override and Feedback Loop
Fully automated classification is a goal, not a starting condition. Enterprise buyers need the ability to approve, correct, or override tags. Every real-world deployment of automated classification includes a human review layer, and pretending otherwise would make H33-Upstream unusable for the organizations that need it most.
The override workflow works like this. A human reviewer sees that a document was tagged PUBLIC_RELEASE but should have been tagged BOARD_MATERIALS. The reviewer submits a correction through the SDK. The correction is H33-74 attested — the reviewer's identity, the timestamp, the original tag, and the new tag are all cryptographically bound into an immutable attestation record. The corrected tag replaces the original. The attestation chain records the full history: original classification, correction, who made it, and when.
Corrections that the customer marks for retraining become new labeled examples for the customer's private classifier. The model improves from corrections without H33 ever seeing the documents. The retraining loop is entirely customer-side: the customer's SDK collects encrypted corrections, decrypts them at the customer's boundary, updates the training set, retrains the model, re-encrypts the updated weights, and redeploys. H33's infrastructure transports encrypted model updates. It never sees the corrections, the training data, or the model weights in plaintext.
This creates a virtuous cycle. The classifier gets better over time because it learns from real corrections made by the people who understand the documents best. But the corrections never leave the customer's encryption boundary. The model improves. The data stays private. The two goals are not in tension.
What This Replaces
H33-Upstream replaces three incumbent approaches to document classification, each of which requires plaintext access to function.
Manual classification. A human reads the document and picks a label. This requires plaintext access, takes minutes per document, and produces inconsistent results across reviewers. Cost scales linearly with document volume. For an organization processing 10,000 documents per day, manual classification requires a dedicated team. For 100,000 documents per day, it is not feasible at any budget. H33-Upstream eliminates the human reader entirely. Documents are classified in milliseconds without anyone seeing the content.
DLP scanning. A tool reads the plaintext document and pattern-matches against known sensitive data formats. This requires plaintext access, misses context-dependent sensitivity (a document can be highly confidential without containing a single SSN), and creates a high-value target in the DLP server itself. If your DLP engine is compromised, the attacker has access to every document the scanner has processed. H33-Upstream replaces pattern matching with learned classification on encrypted data. The classification infrastructure holds only ciphertext. Compromising it yields nothing.
LLM classification. A language model reads the document and assigns labels based on semantic understanding. This produces the most accurate labels of the three approaches, but it requires sending the document — in full, in plaintext — to a model. If the model is a third-party API, the data has left your perimeter. If the model is self-hosted, it still processes plaintext in memory. Either way, the content was read. H33-Upstream achieves comparable classification accuracy for well-defined taxonomies using a compact CKKS-encrypted classifier. The model operates on ciphertext. It never reads the document. The accuracy comes from customer-specific training, not from a foundation model's general knowledge.
All three approaches expose the data at the moment of classification. H33-Upstream does not.
The Architectural Guarantee
The guarantee H33-Upstream provides is not "we promise not to read your data." Promises are policies. Policies can be changed, violated, or circumvented. The guarantee is architectural: the system cannot read your data because it does not hold the decryption key. The classifier runs on ciphertext. The policy engine runs on ciphertext. The tags are encrypted. The bundle is signed with three post-quantum signature families. Every step is H33-74 attested.
This is not a trust model where you evaluate H33's security posture and decide whether to believe the company will protect your data. This is a model where the cryptography enforces the privacy guarantee regardless of H33's behavior. Even if H33's infrastructure were fully compromised — every server, every database, every employee credential — the attacker would obtain only ciphertext. The documents, the classifications, the tags, and the policy decisions would remain encrypted. The keys never leave the customer's boundary.
For organizations that handle truly sensitive data — legal matters under attorney-client privilege, classified government documents, patient records under HIPAA, financial filings under SEC quiet periods — this architectural guarantee is the only kind that matters. The question is not whether you trust your vendor. The question is whether the architecture makes trust unnecessary.
H33 is built so you do not have to trust us. The math handles it.
H33-Upstream — Encrypted Classification at a Glance
FHE scheme: CKKS for inference, TFHE for tags and boolean policy
Routing: H33-FHE-IQ selects optimal parameters per workload
Attestation: H33-74 (74 bytes, three PQ signature families)
Taxonomy: Customer-defined, customer-trained, customer-encrypted
Confidence modes: Hard classification, threshold proof, customer decrypt
Policy: Upgrade-only boolean rules on encrypted outputs
Override: H33-74 attested, retraining-eligible corrections
Storage: Encrypted at rest with PQ-signed bundle commitment
Test coverage: 20,000+ tests across the H33 platform
Patent pending — H33 substrate. The classification pipeline, policy engine, and encrypted tagging workflow described in this post are covered by pending patent claims. 6 patents pending, 250+ claims.
Ready to Go Quantum-Secure?
Start protecting your users with post-quantum authentication today. 1,000 free auths, no credit card required.
Get Free API Key →