Every time you pipe a user message into GPT-4, Claude, or any other LLM, you are making a decision about what data leaves your system. Most of the time, nobody thinks about it. Then a user pastes their Social Security number into a support chat, it goes straight to the model, lands in a training dataset or a log file, and the question of "how did this happen" becomes very uncomfortable very fast.
This is the problem Expunct solves.
The problem with PII in AI pipelines
PII shows up everywhere in unstructured text: support tickets, uploaded documents, voice recordings, database exports, code comments. When developers build AI features on top of this data (summarization, classification, retrieval-augmented generation (RAG), function calling), that PII travels with it.
The naive fix is to regex-scrub obvious patterns before sending. This catches some email addresses and phone numbers but misses full names, account numbers, bearer tokens buried in log dumps, crypto wallet addresses, and dozens of other entity types. It also creates a different problem: if you replace "John Smith" with a generic placeholder (****, <PERSON>, or [REDACTED]), every redacted entity looks identical, and the model can no longer tell who did what or whether two mentions refer to the same person. Your summarization quality drops. Your classification accuracy drops. You have traded a privacy risk for a capability regression.
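To make the gap concrete, here is a minimal sketch of the kind of regex scrubber described above. The two patterns and the sample message are illustrative assumptions, not taken from any real codebase, and the patterns are deliberately simple.

```python
import re

# Two deliberately simple patterns, typical of a first-pass scrubber.
# Both are illustrative assumptions, not production-grade detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def regex_scrub(text: str) -> str:
    """Replace every matched email and US-style phone number with one generic mask."""
    text = EMAIL.sub("[REDACTED]", text)
    return PHONE.sub("[REDACTED]", text)


message = (
    "John Smith (john@example.com, 555-867-5309) asked Jane Doe "
    "to retry with Authorization: Bearer abc123.def456"
)
scrubbed = regex_scrub(message)
print(scrubbed)
```

The email and phone number are masked, but both names and the bearer token pass through untouched, and the two masked values are now indistinguishable from each other.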
There is also the DIY NER route: train or fine-tune a Named Entity Recognition model, run it in your inference stack, maintain it as entity types evolve. This is a real engineering project, not a weekend task, and it is not what most teams should be spending time on.
What Expunct does differently
Expunct is a REST API, with Python and Node.js SDKs, for PII detection and redaction. Send it text, a PDF, a DOCX, an image, or an audio file. Get back a sanitized version and a list of everything that was found.
Two design decisions separate it from simple redaction tools:
Pseudonymization, not masking. By default, Expunct replaces entities with labeled placeholders: PERSON_1, EMAIL_1, CREDIT_CARD_1. If the same person appears three times in a document, all three occurrences get the same label. The structure of the text is preserved. An LLM processing the sanitized version still understands that the same entity appears multiple times — it just does not know who that entity is. You can re-identify later if you need to, using the mapping Expunct returns alongside the redacted output.
Breadth of entity coverage. Expunct detects 27+ entity types out of the box: names, emails, phone numbers, SSNs, credit card numbers, IBANs, IP addresses, dates of birth, passport numbers, driver's license numbers, crypto wallet addresses, and credentials including Bearer tokens and API keys. This matters for AI pipelines specifically because credentials leak into text constantly — a user pastes a curl command into a support chat, a log file gets uploaded to a retrieval system.
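The pseudonymization behavior described above can be sketched in a few lines. This is not Expunct's implementation. It assumes detection has already happened and only illustrates the labeling step: the same value always gets the same numbered placeholder, and the returned mapping lets you reverse it. The function names `pseudonymize` and `reidentify` are hypothetical.

```python
from collections import defaultdict


def pseudonymize(text, entities):
    """Replace detected entities with stable, numbered placeholders.

    entities: (surface_string, entity_type) pairs that a detector has
    already found. Detection itself is out of scope for this sketch.
    Returns (redacted_text, mapping of placeholder -> original value).
    """
    counters = defaultdict(int)
    label_for = {}  # original value -> placeholder
    mapping = {}    # placeholder -> original value
    for value, kind in entities:
        if value not in label_for:  # repeated values reuse their label
            counters[kind] += 1
            label = f"{kind}_{counters[kind]}"
            label_for[value] = label
            mapping[label] = value
    redacted = text
    # Longest values first, so a short value cannot clobber part of a longer one.
    for value in sorted(label_for, key=len, reverse=True):
        redacted = redacted.replace(value, label_for[value])
    return redacted, mapping


def reidentify(text, mapping):
    """Reverse the substitution using the mapping returned by pseudonymize."""
    # Longest labels first, so PERSON_10 is restored before PERSON_1.
    for label in sorted(mapping, key=len, reverse=True):
        text = text.replace(label, mapping[label])
    return text


original = "John Smith called. Email John Smith at john@example.com, or ask Jane Doe."
found = [("John Smith", "PERSON"), ("john@example.com", "EMAIL"), ("Jane Doe", "PERSON")]
redacted, mapping = pseudonymize(original, found)
print(redacted)  # PERSON_1 called. Email PERSON_1 at EMAIL_1, or ask PERSON_2.
```

Both mentions of John Smith become PERSON_1 and Jane Doe becomes PERSON_2, so a model reading the redacted text can still tell the two people apart. Plain string replacement is a simplification here; a real implementation would work from character offsets.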
The AI gateway use case
The most common way teams use Expunct is as a lightweight sanitization layer before LLM calls:
from expunct import Expunct
client = Expunct(api_key="your-key")
result = client.sanitize_text("My name is John Smith, email john@example.com")
print(result.redacted_text) # "My name is PERSON_1, email EMAIL_1"
Drop this into your LLM middleware. The model gets sanitized input. Your logs do not contain raw PII. The entity map comes back with the result if you need it for downstream processing.
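As a rough sketch of what that middleware could look like, the wrapper below takes the sanitizer, the model call, and the logger as plain callables, so it makes no assumptions about the SDK beyond what is shown above. The name `guarded_completion`, the tuple returned by the sanitizer, and the stand-in functions are all hypothetical. How you pull the entity map out of the real result object depends on the SDK, so check the API reference.

```python
from typing import Callable, Dict, Tuple

# Hypothetical shapes: the sanitizer returns (redacted_text, placeholder -> original).
Sanitizer = Callable[[str], Tuple[str, Dict[str, str]]]
Model = Callable[[str], str]


def guarded_completion(user_message: str, sanitize: Sanitizer,
                       call_llm: Model, log: Callable[[str], None]) -> str:
    """Sanitize first, so only redacted text reaches the log and the model."""
    redacted, mapping = sanitize(user_message)
    log(redacted)
    reply = call_llm(redacted)
    # Optionally restore the original values in the reply shown to the user.
    for label in sorted(mapping, key=len, reverse=True):
        reply = reply.replace(label, mapping[label])
    return reply


# Stand-ins so the sketch runs without network access or API keys.
def fake_sanitize(text: str) -> Tuple[str, Dict[str, str]]:
    mapping = {"PERSON_1": "John Smith", "EMAIL_1": "john@example.com"}
    for label, value in mapping.items():
        text = text.replace(value, label)
    return text, mapping


seen_by_model = []


def fake_llm(prompt: str) -> str:
    seen_by_model.append(prompt)
    return "Noted: " + prompt


logs = []
reply = guarded_completion(
    "My name is John Smith, email john@example.com",
    fake_sanitize, fake_llm, logs.append,
)
print(reply)
```

In a real service, `sanitize` would wrap the Expunct client and `call_llm` would wrap your model provider. The point of the shape is that nothing downstream of the first line ever sees the raw message.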
Beyond text, the same API handles PDFs, DOCX files, images (via OCR), and audio files (transcription then redaction). If you are building a document ingestion pipeline for RAG, you can sanitize at the ingest step before anything hits your vector store.
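For the ingestion case, one reasonable ordering is to sanitize each document as a whole and only then split it into chunks, so that placeholders stay consistent across every chunk of that document. The sketch below assumes a hypothetical `ingest` helper, a stand-in sanitizer, and a plain list in place of a real vector store. Embedding is omitted.

```python
from typing import Callable, Dict, List


def ingest(documents: Dict[str, str], sanitize: Callable[[str], str],
           chunk_size: int = 200) -> List[dict]:
    """Sanitize each whole document, then chunk the redacted text for storage."""
    store = []  # stand-in for a vector store; a real one would also hold embeddings
    for doc_id, raw in documents.items():
        redacted = sanitize(raw)  # raw text never reaches the store
        for start in range(0, len(redacted), chunk_size):
            store.append({"doc_id": doc_id, "text": redacted[start:start + chunk_size]})
    return store


# Stand-in sanitizer so the sketch runs offline; a real pipeline would call the API here.
def fake_sanitize(text: str) -> str:
    for value, label in {"John Smith": "PERSON_1", "john@example.com": "EMAIL_1"}.items():
        text = text.replace(value, label)
    return text


docs = {"ticket-1": "John Smith wrote in. Reach John Smith at john@example.com."}
records = ingest(docs, fake_sanitize, chunk_size=20)
for record in records:
    print(record)
```

Fixed-width chunking is used only to keep the example short. It can split a placeholder across two chunks, which is harmless for privacy but is one reason to prefer sentence-aware or token-aware chunking in practice.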
MCP server and Claude Code integration
Expunct ships an MCP (Model Context Protocol) server, which means it works natively in Claude Desktop and in any MCP-compatible AI workflow. If you use Claude Code, there is a one-liner install to add the Expunct skill directly to your coding environment.
This is useful for teams that want PII sanitization in their AI-assisted workflows without writing any glue code.
What it is not
Expunct is a detection and redaction tool. It is not a compliance framework, and we are not going to claim otherwise. It does not make you HIPAA compliant, and it does not get you a SOC 2 report. What it does is reduce the surface area of PII exposure in your pipelines in a way that is straightforward to integrate and audit.
If you are in a regulated industry and need formal compliance guarantees, you need more than an API. But if you are a development team that wants to stop sending raw user data to third-party models — and wants to do it in a day rather than a sprint — Expunct is built for that.
Free tier
The free tier is 1 million tokens per month with no credit card required. For most teams evaluating whether this fits into their pipeline, that is enough to run real load against real data. Paid plans scale from there.
SDKs are available for Python (expunct) and Node.js (@expunct/sdk). The REST API works with any language.
Get started
Sign up at expunct.ai — no credit card, no sales call. The API reference and quickstart are in the docs. If you run into anything, open an issue or reach out directly.