Announcing langchain-textual: PII redaction and synthesis for LangChain on Tonic Textual
好的,我现在需要帮用户总结一篇文章的内容,控制在100字以内。用户已经提供了文章的英文内容,并且希望我用中文来总结。首先,我要仔细阅读文章,理解其主要内容和重点。 这篇文章主要讨论了AI组织在使用非结构化数据时面临的隐私和合规性问题。非结构化数据如支持票证、临床笔记等虽然对模型训练有效,但处理起来存在隐私风险。文章介绍了langchain-textual工具包,它通过集成Tonic Textual的PII检测和数据合成技术,帮助组织安全地使用这些数据,同时遵守隐私法规。 接下来,我需要将这些要点浓缩成100字以内的中文摘要。要注意涵盖主要问题、解决方案以及工具的功能。同时,避免使用像“文章内容总结”这样的开头词,直接描述内容即可。 最后,检查字数是否符合要求,并确保语言简洁明了。 </think> AI组织在利用非结构化数据训练模型时面临隐私合规与数据实用性的矛盾。langchain-textual工具包通过集成Tonic Textual的PII检测、红action和数据合成技术,帮助组织安全处理文本、JSON、HTML等格式的数据,在保证隐私的同时提升模型训练效率。 2026-3-20 01:55:1 Author: securityboulevard.com(查看原文) 阅读量:4 收藏

Organizations building with AI are operating inside a growing paradox: the unstructured data that makes models effective—support tickets, clinical notes, chat logs, and documents—is the same data that is hardest to use safely. Privacy regulations, internal governance, and the risk of exposing sensitive information slow or block access to the very datasets that power model training, retrieval pipelines, and agent workflows. As a result, teams are forced to choose between data utility and compliance, creating a persistent bottleneck across AI systems.

With the introduction of langchain-textual, that tradeoff no longer has to exist. By bringing Tonic Textual’s transformer-based PII detection, redaction, and high-fidelity data synthesis directly into LangChain workflows, organizations can safely use real-world data across training, RAG, and agent pipelines—without compromising privacy or velocity.

This challenge shows up across every stage of modern AI systems—and langchain-textual is designed to address it wherever sensitive data flows:

  • Model training and fine-tuning, where real-world data produces better models — but that data contains names, SSNs, and account numbers that can’t appear in training sets.
  • RAG pipelines, where documents are chunked, embedded, and stored in vector databases. PII persists at every layer: the source documents, the chunks, the embeddings, and the retrieval results that eventually reach the LLM.
  • LLM proxies, where organizations need to scrub PII from prompts before they leave internal infrastructure and from completions before they reach end users.
  • Agent workflows, where tools pull data from databases, APIs, and files — each step potentially surfacing and forwarding PII downstream.
  • Regulatory compliance — HIPAA de-identification under both Safe Harbor and Expert Determination, GDPR right-to-erasure, PCI cardholder data requirements.
  • Lower environments, where QA teams and developers need data that behaves like production without carrying the regulatory exposure.

The common thread across all of these is that regex and rule-based approaches aren’t sufficient. A pattern like \d{3}-\d{2}-\d{4} will match anything that looks like a Social Security number — including dates, product codes, and case numbers. The question isn’t whether a string matches a pattern. It’s whether, in context, “her social is 123-45-6789” means something different from “case number 123-45-6789.” That requires NER models that understand language, not string matching.

Tonic Textual provides exactly this — transformer-based PII detection and transformation across text, JSON, HTML, PDFs, images, and tabular data. With langchain-textual, those capabilities are now available as standard LangChain tools.

What Tonic Textual brings to the table

It’s worth understanding what Textual does before looking at the LangChain integration, because the integration inherits all of it.

Textual’s NER model identifies 46+ entity types across 50+ languages. These aren’t just the obvious ones like email addresses and phone numbers. The model detects names (given and family, separately), dates of birth, occupations, healthcare IDs, routing numbers, IP addresses, and more — with a confidence score for each detection.

What happens after detection is where Textual differentiates itself. There are two modes:

Redaction replaces detected PII with labeled placeholders:

1Input:  "Contact John Smith at [email protected]"
2Output: "Contact [NAME_GIVEN_a1b2] [NAME_FAMILY_c3d4] at [EMAIL_ADDRESS_e5f6]"

This is the safest option when the goal is to guarantee that no real PII appears in the output. The placeholders are tagged with their entity type and a consistent identifier, which means you can track which replacements correspond to the same original entity across a document.

Synthesis replaces PII with realistic fake data:

1Input:  "Contact John Smith at [email protected]"
2Output: "Contact Maria Chen at [email protected]"

This distinction matters more than it might seem at first glance. If you’re training a language model on de-identified clinical notes, every [NAME_GIVEN_xxxx] placeholder is a token the model learns to predict — one that has no relationship to how real names appear in text. The model’s understanding of sentence structure around names degrades. Synthesized data preserves the format and distribution of the original. A synthesized email is still a valid email. A synthesized date has the right format. Downstream models and analytics work correctly on the transformed data.

For organizations navigating HIPAA, this is particularly relevant. Expert Determination requires a qualified statistical expert to certify that the risk of re-identification is very small. Synthesis that preserves statistical properties while eliminating re-identification risk is a powerful tool in that process — far more so than blanket redaction that destroys data utility.

Textual handles all of this across multiple formats. In JSON, it understands the difference between keys (typically not PII) and values (often PII). In HTML, it preserves markup structure while redacting text content. For PDFs and images, it uses OCR to detect text and then redacts in the rendered output. It runs in the cloud or self-hosted on your own infrastructure.

The LangChain integration

langchain-textual wraps Textual’s capabilities into five LangChain tools:

Tool Input Use for
TonicTextualRedactText Plain text string Raw text, .txt file contents
TonicTextualRedactJson JSON string Raw JSON, .json file contents
TonicTextualRedactHtml HTML string Raw HTML, .html/.htm file contents
TonicTextualRedactFile File path PDFs, images (JPG, PNG), CSVs, TSVs
TonicTextualPiiTypes None List all supported PII entity types

These are standard LangChain BaseTool subclasses. They work with any agent, chain, or tool-calling model. Installation is a single line:

pip install langchain-textual

The separation into distinct tools is deliberate. When an LLM agent has access to multiple tools, it selects which one to call based on the tool’s name, description, and input schema. A single tool that handles text, JSON, HTML, and files via a format parameter forces the model to make two decisions — “should I redact?” and “what format is this?” — when it should only need to make one. Separate tools with clear descriptions let the model match the user’s intent to the right tool directly.

How the tools are structured

All redaction tools share a common base class that handles client initialization and configuration:

1class _BaseTonicTextual(BaseTool):
2    client: TextualNer = Field(default=None)
3    tonic_textual_api_key: SecretStr = Field(default=SecretStr(""))
4    tonic_textual_base_url: str | None = None
5    generator_default: Literal["Off", "Redaction", "Synthesis"] | None = None
6    generator_config: dict[str, Literal["Off", "Redaction", "Synthesis"]] = Field(
7        default_factory=dict
8    )

A Pydantic model_validator runs before instantiation, reading the API key from the constructor argument or the TONIC_TEXTUAL_API_KEY environment variable and initializing the Textual client. Each tool subclass then implements _run with a single call to the appropriate Textual API method — client.redact() for text, client.redact_json() for JSON, and so on.

Configuration options like generator_default and generator_config are defined on the base class, so every tool inherits them. A shared _build_kwargs() method assembles these into the keyword arguments that every Textual API method accepts. This keeps each tool’s _run method focused on its format-specific logic.

Teaching the agent which tool to use

When an LLM receives a tool definition, it sees three things: the tool’s name, its description, and its input schema — including the description field on each parameter. These descriptions are instructions the model reads before deciding how to call the tool.

We define explicit Pydantic schemas for each tool’s input:

1class _RedactTextInput(BaseModel):
2    text: str = Field(
3        description=(
4            "Plain text that may contain PII. "
5            "For .txt files, read the file first and pass the contents here."
6        )
7    )
8
9class _RedactFileInput(BaseModel):
10    file_path: str = Field(
11        description=(
12            "Absolute path to the file to redact. "
13            "Supported types: JPG, PNG, PDF, CSV, TSV. "
14            "Do NOT use for .txt, .json, .html, or .htm files — use the "
15            "dedicated text, JSON, or HTML redaction tools instead."
16        )
17    )

By stating what each parameter expects and explicitly noting what not to use each tool for, we significantly reduce the chance the agent picks the wrong tool. The text tool says “Do NOT use this tool for JSON, HTML, or binary files.” The file tool says “Do NOT use for .txt, .json, .html, or .htm files.” These negative instructions are as important as the positive ones — models are good at following explicit boundaries.

When the agent picks the wrong tool anyway

Even with good descriptions, agents sometimes make mistakes. The question is what happens next. If a tool returns a raw Python traceback, the agent has nothing to work with. If it returns an actionable error message, the agent can self-correct.

Each tool validates its input and redirects when appropriate:

1# In the text tool: if the input is valid JSON, redirect
2try:
3    json.loads(text)
4    return (
5        "Error: input looks like JSON, not plain text. "
6        "Use tonic_textual_redact_json for .json files (read contents first)."
7    )
8except (json.JSONDecodeError, TypeError):
9    pass
10# In the file tool: if the extension is .html, redirect
11if ext_lower in {".html", ".htm"}:
12    return (
13        "Error: .htm files are not supported by this tool. "
14        "Use tonic_textual_redact_html for .html/.htm files (read contents first)."
15    )

This works because LLM agents treat tool outputs as observations in their reasoning loop. An error that names the correct tool is just another observation the agent can act on. In practice, the agent calls the right tool on the next step, and the user never sees the intermediate error.

Putting it together: a working agent

Here’s a complete agent with PII redaction capabilities:

1from langchain_textual import (
2    TonicTextualRedactText,
3    TonicTextualRedactFile,
4    TonicTextualPiiTypes,
5)
6from langchain_openai import ChatOpenAI
7from langgraph.prebuilt import create_react_agent
8
9llm = ChatOpenAI(model="gpt-4o-mini")
10tools = [
11    TonicTextualRedactText(),
12    TonicTextualRedactFile(),
13    TonicTextualPiiTypes(),
14]
15agent = create_react_agent(llm, tools)

Two environment variables (TONIC_TEXTUAL_API_KEY and OPENAI_API_KEY), six lines of code, and you have an agent that can redact PII from text and files.

When a user asks “Redact this: My name is John Smith and my email is [email protected],” the agent calls tonic_textual_redact with the text. Textual’s NER model identifies John Smith as NAME_GIVEN + NAME_FAMILY and [email protected] as EMAIL_ADDRESS, and the tool returns the redacted text.

When a user asks “Redact the file /tmp/medical_record.pdf,” the agent calls tonic_textual_redact_file. Behind the scenes, the tool uploads the file to Textual, waits for the redaction job to complete, downloads the result, and writes it to /tmp/medical_record_redacted.pdf. The agent reports back with the output path.

When a user asks “What PII types can you detect?” the agent calls tonic_textual_pii_types, which returns all 46 entity types from the SDK’s PiiType enum — no API call needed, no latency.

Controlling redaction at the entity level

The default behavior redacts everything Textual detects. But in practice, you often want different handling for different entity types. Names might be safe to synthesize for analytical purposes, while SSNs should always be hard-redacted. Credit card numbers get redacted; organization names might be left alone entirely.

generator_config provides this control:

1tool = TonicTextualRedactText(
2    generator_default="Off",
3    generator_config={
4        "NAME_GIVEN": "Synthesis",
5        "NAME_FAMILY": "Synthesis",
6        "EMAIL_ADDRESS": "Redaction",
7        "US_SSN": "Redaction",
8    },
9)

This configuration synthesizes names (so text reads naturally), hard-redacts emails and SSNs (for maximum safety), and leaves everything else untouched. The generator_default of “Off” means any entity type not listed in generator_config passes through unchanged.

This composability is essential for compliance workflows. HIPAA’s Safe Harbor method specifies 18 identifier categories that must be removed or generalized. With generator_config, you can map each of those categories to the appropriate handling — redaction for direct identifiers, synthesis for quasi-identifiers where you need to preserve analytical utility, and “Off” for categories that don’t apply to your data.

Not sure which entity types are available? That’s what TonicTextualPiiTypes is for:

1from langchain_textual import TonicTextualPiiTypes
2TonicTextualPiiTypes().invoke("")
3# "NUMERIC_VALUE, LANGUAGE, MONEY, ..., US_SSN, CREDIT_CARD, EMAIL_ADDRESS, ..."

What’s next

This initial release covers the core redaction workflows — text, JSON, HTML, and binary files. We’re continuing to expand format support and add capabilities.

The package is open source under the MIT license:

A self-contained agent example lives in the examples/ directory of the repo — clone it, set two environment variables, and run uv run agent.py to see it in action.

For the full Tonic Textual platform — including the web UI, dataset management, custom entity training, and enterprise deployment — visit https://textual.tonic.ai/signup.


文章来源: https://securityboulevard.com/2026/03/announcing-langchain-textual-pii-redaction-and-synthesis-for-langchain-on-tonic-textual/
如有侵权请联系:admin#unsafe.sh