
Redacting PII from documents without destroying their structure

documents · pseudonymization · PDF · DOCX · engineering

Document processing pipelines have a PII problem that is distinct from the LLM gateway problem. When you are ingesting employee records, legal contracts, or patient intake forms at scale, the question is not just "does this text contain sensitive data" — it is "how do I remove that data without making the document useless for everything downstream."

Most teams reach for the simplest tool available: find-and-replace with a generic placeholder — [REDACTED], ****, or similar — that strips the sensitive value without preserving any structural information. It works. It also quietly breaks a surprising number of things.

Why [REDACTED] is the wrong default

Consider an employment contract. The name "Sarah Chen" appears 23 times across 14 pages — in headers, in signature blocks, in "the Employee agrees that..." clauses. A simple redaction replaces each occurrence independently. The result is a document that reads coherently in isolation but loses all coreference structure.

If you are running any downstream processing on the redacted output — classification, extraction, comparison against other documents — you have now lost the signal that the same entity was referenced 23 times. You have also broken sentence-level parsing in subtle ways. Some NLP pipelines are fine with [REDACTED]. Others see it as an unknown token and behave unpredictably.
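A toy illustration of the difference (the sentence fragments here are invented for the example): per-entity mention counts survive pseudonym labels but vanish under a generic placeholder.

```python
import re
from collections import Counter

generic = "[REDACTED] agrees that [REDACTED] will notify [REDACTED]."
pseudo = "PERSON_1 agrees that PERSON_1 will notify ORGANIZATION_1."

def mention_counts(text):
    """Count placeholder mentions per entity label."""
    return Counter(re.findall(r"\b[A-Z]+_\d+\b", text))

# The pseudonymized text still shows two distinct entities, one of them
# referenced twice; the generic text has no per-entity signal at all.
print(mention_counts(pseudo))   # Counter({'PERSON_1': 2, 'ORGANIZATION_1': 1})
print(mention_counts(generic))  # Counter()
```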

The second problem is format survival. PDF and DOCX files are not plain text — they are structured containers with layout, fonts, paragraph boundaries, and metadata. Naive redaction tools that operate on extracted text do not write the sanitized content back into the original format. You get a .txt file where you needed a .docx. Your document management system now has a format mismatch it was not built to handle.

Pseudonymization: preserving structure while removing identity

The alternative is pseudonymization. Instead of replacing "Sarah Chen" with [REDACTED], you replace every occurrence with PERSON_1. The second distinct person in the document becomes PERSON_2. Email addresses become EMAIL_1, EMAIL_2, and so on.

This approach has two properties that matter for document pipelines:

Coreference is preserved. All 23 references to "Sarah Chen" are consistently labeled PERSON_1. Downstream code that counts entity mentions, checks for consistency across clauses, or builds a knowledge graph from the document still works correctly. The semantic relationships in the document survive the redaction step.

The mapping is reversible. The redaction API returns a map alongside the sanitized document: { "PERSON_1": "Sarah Chen", "EMAIL_1": "schen@example.com" }. If your application needs to re-identify specific entities after processing — for audit trails, for display to authorized users, for compliance logging — that map gives you a clean path back to the original data without storing the raw PII in your processing layer.
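Both properties can be sketched in a few lines. This is not the Expunct SDK — a production system locates entities with a detection model, not regexes — but it shows the mechanics: stable per-entity placeholders on the way in, and the returned map on the way back.

```python
import re

def pseudonymize(text, patterns):
    """Replace every distinct entity with a stable placeholder (PERSON_1, ...).

    `patterns` maps a label like "PERSON" to a regex for that entity type.
    Returns the sanitized text plus the placeholder -> original-value map.
    """
    mapping = {}   # placeholder -> original value
    seen = {}      # (label, value) -> placeholder, so repeats get the same label
    counters = {}

    def make_sub(label):
        def _sub(match):
            value = match.group(0)
            key = (label, value)
            if key not in seen:
                counters[label] = counters.get(label, 0) + 1
                seen[key] = f"{label}_{counters[label]}"
                mapping[seen[key]] = value
            return seen[key]
        return _sub

    for label, pattern in patterns.items():
        text = re.sub(pattern, make_sub(label), text)
    return text, mapping

def reidentify(text, mapping):
    """Reverse the pseudonymization using the stored map."""
    # Replace longer placeholders first so PERSON_10 is not clobbered by PERSON_1.
    for placeholder in sorted(mapping, key=len, reverse=True):
        text = text.replace(placeholder, mapping[placeholder])
    return text
```

Every occurrence of the same value maps to the same placeholder, so coreference survives, and holding the map lets an authorized path recover the original text.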

Format support

Expunct handles the three formats that cover most document ingestion pipelines:

DOCX. The redacted output is a valid DOCX file with the same structure, styles, and layout as the input. The PII has been replaced in place.

PDF. For PDFs with extractable text layers, redaction works the same way — text is replaced within the PDF structure. For scanned documents with no text layer, Expunct runs OCR first to locate entities, then overlays redaction blocks on the original image.

Images. Scanned pages, photos of ID documents, whiteboards — any image that contains text can be processed. OCR identifies the content, entity detection runs on the extracted text, and the sensitive regions are masked in the returned image.

A concrete example

Here is what document redaction looks like using the Python SDK:

from expunct import Expunct

client = Expunct(api_key="your-key")

with open("employee-records.docx", "rb") as f:
    result = client.redact_file(f, filename="employee-records.docx")

print(result.redacted_text)
# "PERSON_1 has been employed at ORGANIZATION_1 since DATE_1..."

# The entity map for re-identification if needed
print(result.entity_map)
# { "PERSON_1": "Sarah Chen", "ORGANIZATION_1": "Acme Corp", "DATE_1": "March 2019" }

# Save the redacted DOCX
with open("employee-records-redacted.docx", "wb") as f:
    f.write(result.redacted_file)

The redact_file call handles format detection automatically based on the filename. The same call works for PDFs and images — you do not need separate code paths per format.

Where this fits in a pipeline

The typical integration point is at document ingest, before anything hits persistent storage. A document arrives from an upload endpoint or an S3 event. Before it gets written to your document store or fed into an indexing pipeline, it passes through Expunct. The sanitized version is what gets stored. The original is either discarded or stored separately with appropriate access controls.

This means your search index, your vector store, your processing queue — all of them see pseudonymized data from the start. You have reduced the blast radius of a data breach at the storage layer, not just at the output layer.
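The ingest flow can be sketched as a small handler. The `redact` and `store_*` callables here are hypothetical stand-ins for the redaction call and your storage layer; the point is the ordering — nothing persists before sanitization, and the entity map never lands next to the document.

```python
def ingest_document(raw_bytes, filename, redact, store_document, store_entity_map):
    """Sanitize at ingest: the document store, search index, and vector
    store only ever see the pseudonymized bytes."""
    redacted_bytes, entity_map = redact(raw_bytes, filename)
    doc_id = store_document(filename, redacted_bytes)
    # The map goes to a separate store with its own access controls.
    store_entity_map(doc_id, entity_map)
    return doc_id
```

In practice `redact` would wrap the SDK call shown above, `store_document` your S3 or document-store write, and `store_entity_map` a locked-down secrets or key-value store.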

What to be realistic about

Pseudonymization is not anonymization. The entity map still exists somewhere. If you are storing that map alongside the redacted document, you have not solved the privacy problem — you have reorganized it. The value is in separating the map from the document so that access controls can be applied independently, and so that bulk processing of documents does not require access to the sensitive data.

Entity detection also has limits. A model fine-tuned on common PII types will miss domain-specific identifiers: internal employee IDs, proprietary product codes, custom account number formats. If your documents contain entities that do not match standard PII patterns, you will want to test coverage before relying on it in production.
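One cheap way to test coverage: keep regexes for your own identifier formats (the `EMP-`/`ACCT-` patterns below are invented examples) and scan redacted output for anything that survived. A non-empty result means you need a custom pattern pass alongside the standard detection.

```python
import re

# Hypothetical domain-specific identifier formats a stock model may miss.
CUSTOM_PATTERNS = {
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{6}\b"),
    "ACCOUNT": re.compile(r"\bACCT-[A-Z]{2}\d{8}\b"),
}

def residual_pii(redacted_text):
    """Return any domain-specific identifiers that survived redaction."""
    leaks = {}
    for label, pattern in CUSTOM_PATTERNS.items():
        hits = pattern.findall(redacted_text)
        if hits:
            leaks[label] = hits
    return leaks
```

Run this over a sample of redacted documents before production cutover; it turns "test coverage" from a vague recommendation into a pass/fail check.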

For most HR tech, legal tech, and document management use cases, the standard entity types cover the majority of the exposure surface. The 1 million token free tier is enough to run a meaningful volume of real documents before committing to a paid plan.

The API reference and SDK installation instructions are at docs.expunct.ai.