Creating unstructured data pipelines for retrieval augmented generation

Our initial release of Tonic Textual focused on generating redacted versions of unstructured text and image files. This is a great workflow for companies that want to make safe use of unstructured data across their data workflows, including AI model training.

Working closely with our early customers, we learned that in addition to data privacy, preparing the data for use with generative AI tools is a major impediment that affects time-to-value for enterprise AI use cases.

So we expanded Tonic Textual’s functionality to serve that new use case in tandem with the first, so that you can take your unstructured data from raw to AI-ready in just a few minutes while ensuring that sensitive data is protected.

Introducing the Textual pipeline workflow, Textual’s newest capability. It allows you to use the same types of source files to produce Markdown versions of your documents that you can import into a vector database for RAG. Textual’s built-in redaction and synthesis features ensure that your RAG content does not include sensitive values. At the same time, you can use the entities that Textual detects to enrich your embeddings with additional information that helps improve retrieval.

About the pipeline process

A Textual pipeline is a collection of files—including plain text files, Word documents, PDFs, Excel spreadsheets, images, and more. A pipeline can process either files that you upload from a local filesystem, or files and folders that you select from cloud storage.

After you create your pipeline and select your files, to produce the RAG-ready output, Textual:

  1. Extracts the raw text from the files.
  2. Detects the entities in the files. These are the same types of entities that are redacted or synthesized in the Textual redaction workflow.
  3. Converts the extracted text to Markdown.
  4. Generates JSON files that contain the detected entity list and the Markdown content.
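The JSON that step 4 produces can be consumed like any other JSON document. The exact schema isn't shown here, so the field names in this sketch (`markdown`, `entities`, `label`, `text`) are illustrative assumptions rather than the documented format:

```python
import json

# Illustrative example of pipeline output for one processed file.
# NOTE: the field names below ("markdown", "entities", "label",
# "text") are assumptions for illustration; consult the Textual
# documentation for the real schema.
sample_output = json.loads("""
{
  "markdown": "# Patient Intake\\n\\nJohn Smith visited on 2024-01-15.",
  "entities": [
    {"label": "NAME_GIVEN", "text": "John"},
    {"label": "NAME_FAMILY", "text": "Smith"},
    {"label": "DATE_TIME", "text": "2024-01-15"}
  ]
}
""")

# The Markdown goes toward your RAG content; the entity list can
# become metadata attached to each chunk for filtering or enrichment.
markdown = sample_output["markdown"]
entity_labels = {e["label"] for e in sample_output["entities"]}
print(sorted(entity_labels))
```

Pairing each chunk with the entity labels found in it is one way to enrich embeddings with extra signal, as described above.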

Viewing the file processing results in Textual

For each processed file, the Textual application provides views of the results.

Original file content

The file details include the file content both in Markdown and in a rendered format.

Detected entities in the file

Textual also displays a version of the file text that highlights the detected entities. For each detected entity, Textual shows the entity type, such as name, identifier, or address.

JSON output

Textual then provides the JSON output, which includes both the Markdown content and the list of detected entities.

Tables and key-value pairs

For PDFs and images, Textual also displays any tables and key-value pairs that are present in the file.
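The post doesn't show how tables and key-value pairs are represented, so the structure below is an assumption for illustration only. It sketches one common downstream step: flattening an extracted table into Markdown so it can join the rest of the RAG content:

```python
# Hypothetical representation of tables and key-value pairs extracted
# from a PDF or image. This structure is an assumption for
# illustration, not Textual's documented output format.
extracted = {
    "key_value_pairs": {"Invoice Number": "INV-1042", "Due Date": "2024-02-01"},
    "tables": [
        {
            "header": ["Item", "Qty", "Price"],
            "rows": [["Widget", "2", "$10.00"], ["Gadget", "1", "$25.00"]],
        }
    ],
}

def table_to_markdown(table):
    """Flatten one extracted table into a Markdown table."""
    lines = ["| " + " | ".join(table["header"]) + " |"]
    lines.append("|" + "---|" * len(table["header"]))
    for row in table["rows"]:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

print(table_to_markdown(extracted["tables"][0]))
```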

Retrieving and using the results

You can download the JSON files from Textual, or retrieve them directly from your cloud storage.

You can also use the Textual Python SDK to retrieve pipelines and pipeline results.

When you use the SDK to retrieve the processed text, you can also specify how to present each type of detected entity. For example, you can redact names and identifiers and synthesize addresses and datetime values. This helps to ensure that your RAG content does not contain sensitive data.
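Conceptually, per-entity-type handling looks like the mock below. This is a local stand-in for the idea, not the Textual SDK's actual API (whose method names and parameters aren't shown in this post); it just illustrates "redact some entity types, synthesize others" over a list of detected entities:

```python
# Local stand-in for per-entity-type handling. NOT the Textual SDK:
# a simplified mock showing the redact/synthesize/off decision.
def apply_entity_policy(text, entities, policy, synth_values):
    for ent in entities:
        action = policy.get(ent["label"], "off")
        if action == "redact":
            replacement = f"[{ent['label']}]"   # mask with the type label
        elif action == "synthesize":
            replacement = synth_values[ent["label"]]  # fake-but-plausible value
        else:
            continue  # "off": leave the original value in place
        text = text.replace(ent["text"], replacement)
    return text

entities = [
    {"label": "NAME_GIVEN", "text": "John"},
    {"label": "LOCATION", "text": "Boston"},
]
policy = {"NAME_GIVEN": "redact", "LOCATION": "synthesize"}
result = apply_entity_policy(
    "John lives in Boston.", entities, policy, {"LOCATION": "Denver"}
)
print(result)  # → [NAME_GIVEN] lives in Denver.
```

In the real SDK you express this choice as configuration when retrieving the processed text, rather than post-processing it yourself.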

Here are a couple of examples of how to use pipeline output, including how to create RAG chunks and how to add the content to a vector retrieval system.
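As a minimal, self-contained sketch of that pattern, the snippet below chunks Markdown and runs a toy similarity search. The bag-of-words "embedding" and in-memory index are stand-ins for a real embedding model and vector database:

```python
# Toy stand-in for embeddings: a bag-of-words set compared with
# Jaccard similarity. Swap in a real embedding model in practice.
def embed(text):
    return set(text.lower().split())

def similarity(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Naive chunker: accumulate paragraphs until a size limit is reached.
def chunk_markdown(markdown, max_chars=40):
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

markdown = (
    "# Policy\n\n"
    "Refunds are issued within 30 days.\n\n"
    "Shipping takes 5 business days."
)
index = [(chunk, embed(chunk)) for chunk in chunk_markdown(markdown)]

query = embed("how long do refunds take?")
best = max(index, key=lambda item: similarity(query, item[1]))
print(best[0])  # → Refunds are issued within 30 days.
```

The same loop applies unchanged when the chunks come from the pipeline's JSON output and the toy index is replaced by a vector database client.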

Recap

The Textual pipeline workflow takes unstructured text and image files and produces Markdown-based content that you can use to populate a vector database for RAG.

The pipeline scans the source files for entities and generates JSON output that contains the generated Markdown and the list of detected entities.

From Textual, you can view and download the results. You can also use the Textual Python SDK to retrieve pipelines and pipeline results. The SDK includes options to redact or synthesize the detected entities in the returned results.

From here, it’s “choose your own AI adventure”; you decide how best to leverage the data for RAG, and there are many possibilities. Chunk and embed using the strategy that works best for your data and use case, then load the results into your preferred vector database through its API.

Connect with our team to learn more, or sign up for an account today.

*** This is a Security Bloggers Network syndicated blog from Expert Insights on Synthetic Data from the Tonic.ai Blog authored by Expert Insights on Synthetic Data from the Tonic.ai Blog. Read the original post at: https://tonicfakedata.webflow.io/blog/creating-unstructured-data-pipelines-for-retrieval-augmented-generation


Article source: https://securityboulevard.com/2026/02/creating-unstructured-data-pipelines-for-retrieval-augmented-generation-2/