[Python] The unstructured library: processing and preprocessing unstructured data (PDF, Word documents, HTML, images, and more) into structured formats

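unstructured is an open-source Python library for processing and preprocessing unstructured data (PDF, Word documents, HTML, images, and so on) and converting it into structured formats for downstream machine learning (ML) or large language model (LLM) tasks. It provides modular components (called "bricks") for partitioning, cleaning, and formatting documents, and is widely used in data pipelines, RAG (Retrieval-Augmented Generation) systems, and document analysis.

The rest of this article covers the library's features, usage, and practical applications, based on information available as of 2025.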

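1. What the unstructured library does

Unstructured data processing: splits complex documents (PDF, DOCX, HTML) into structured elements such as titles, paragraphs, lists, and tables.
Modular design: provides partitioning, cleaning, and staging components that can be combined into flexible data processing pipelines.
Multi-format support: handles 25+ file types, including TXT, PDF, DOCX, PPTX, HTML, JPG, PNG, EML, CSV, and EPUB.
AI/LLM integration: produces LLM-friendly JSON suitable for RAG, data labeling, and model training.
Local and cloud support: offers local processing as well as a Serverless API (API key required), balancing performance and ease of use.
Open source and commercial products: the core library is open source (Apache 2.0 license); a paid API and platform add further features.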

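Recent developments:

Latest version (as of 2025-03-19): 0.16.17, described as offering more efficient preprocessing.
A new Serverless API supports faster processing that is better suited to production.
LangChain integration: UnstructuredLoader simplifies data loading.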

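2. Installation and environment requirements

Python version: Python 3.8+ is stated as supported (3.9+ recommended); recent releases may require 3.9 or later, so check the PyPI page for the release you install.
Core dependencies:
- beautifulsoup4: HTML parsing.
- lxml: XML processing.
- nltk: text processing.
- Optional: tesseract (OCR), poppler (PDF processing), pandoc (EPUB/RTF).
Installation commands:
- Basic install (no PDF/image processing):
```
pip install unstructured
```
- Full install (with local inference dependencies, such as PDF and image processing):
```
pip install "unstructured[local-inference]"
```
- A specific document type (such as DOCX):
```
pip install "unstructured[docx]"
```
- Using the Serverless API:
```
pip install unstructured-client
```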
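The per-format extras can be derived from the files you actually plan to process. The following is a minimal sketch, not part of the library: `pip_command` and `SUFFIX_TO_EXTRA` are hypothetical names, and the suffix-to-extra mapping is a partial assumption (the `docx` extra appears above; `pdf` and `pptx` are assumed to follow the same naming).

```python
from pathlib import Path

# Partial, assumed mapping from file suffix to the pip extra that covers it.
# Suffixes not listed here fall back to the base install.
SUFFIX_TO_EXTRA = {".docx": "docx", ".pdf": "pdf", ".pptx": "pptx"}


def pip_command(paths):
    """Return a pip command that installs only the extras the given files need."""
    extras = set()
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix in SUFFIX_TO_EXTRA:
            extras.add(SUFFIX_TO_EXTRA[suffix])
    if not extras:
        return "pip install unstructured"
    joined = ",".join(sorted(extras))
    return f'pip install "unstructured[{joined}]"'


print(pip_command(["report.PDF", "notes.docx", "appendix.pdf"]))
```

Sorting the extras keeps the generated command stable across runs, which helps when the command is written into a requirements script.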
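System dependencies (for local PDF/image processing):
- Tesseract: OCR. Installation guide: https://tesseract-ocr.github.io/.
- Poppler: PDF processing. See the pdf2image documentation: https://pdf2image.readthedocs.io/.
- Pandoc: EPUB, RTF, and similar formats; version 2.14.2+ is required.
- libmagic: file type detection (must be installed on Linux/macOS).
```
# macOS
brew install libmagic
# Ubuntu
sudo apt-get install libmagic1
```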
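Before running local PDF or image processing, it can help to confirm that the command-line tools listed above are on PATH. This is a rough sketch using only the standard library; `missing_tools` is a hypothetical helper, and the executable names are assumptions (`pdftoppm` is one of the binaries shipped with Poppler, `tesseract` and `pandoc` are the usual CLI names). libmagic is a shared library rather than an executable, so it is not covered here.

```python
import shutil

# Assumed executable names for the system dependencies listed above.
REQUIRED_TOOLS = {
    "tesseract": "OCR",
    "pdftoppm": "PDF rendering (Poppler)",
    "pandoc": "EPUB/RTF conversion",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


for name in missing_tools():
    print(f"missing: {name} ({REQUIRED_TOOLS[name]})")
```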

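Verify the installation:

    # The version string lives in the unstructured.__version__ submodule
    from unstructured.__version__ import __version__

    print(__version__)  # example output: 0.16.17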
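The installed version can also be read from package metadata without importing the library, which avoids loading its heavier dependencies. A small sketch using the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="unstructured"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version())
```

The function returns None instead of raising, so it can be used as a quick pre-flight check in setup scripts.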

Docker support:

    docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
    docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
    docker exec -it unstructured bash

You can customize the Dockerfile and comment out dependencies you do not need to speed up the build.

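3. Core features and usage

At its core, unstructured processes documents through "bricks", which fall into three groups: partitioning, cleaning, and staging. The main features and examples follow.

3.1 Partitioning (Partitioning Bricks)

Splits a document into structured elements (titles, paragraphs, tables) with automatic file type detection.

    from unstructured.partition.auto import partition

    # Parse a PDF file
    elements = partition(filename="example.pdf")
    for element in elements[:5]:
        print(f"{element.category}: {element.text}")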

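Example output:

    Title: Introduction
    NarrativeText: This is the first paragraph...
    ListItem: - Item 1

Notes:

partition detects the file type automatically and calls the matching partition function (such as partition_pdf or partition_docx).
Supported file types: TXT, PDF, DOCX, PPTX, HTML, JPG, PNG, EML, CSV, EPUB, and others.
It returns a list of Element objects, each with a category (Title, NarrativeText, ListItem, Table, and so on) and metadata.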
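Once a list of elements is available, the usual first step is to see which categories it contains and pull out the ones you care about. The sketch below uses stand-in objects so it runs without any document on disk; only the `category` and `text` attributes used in the example above are modeled, and `category_counts` and `texts_of` are hypothetical helper names.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for the objects returned by partition(); real elements carry
# more attributes (for example metadata), which are not modeled here.
elements = [
    SimpleNamespace(category="Title", text="Introduction"),
    SimpleNamespace(category="NarrativeText", text="This is the first paragraph..."),
    SimpleNamespace(category="ListItem", text="Item 1"),
    SimpleNamespace(category="ListItem", text="Item 2"),
]


def category_counts(elements):
    """Count how many elements fall into each category."""
    return Counter(el.category for el in elements)


def texts_of(elements, category):
    """Return the text of every element with the given category, in order."""
    return [el.text for el in elements if el.category == category]


print(category_counts(elements))
print(texts_of(elements, "ListItem"))
```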

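Partitioning a specific file type:

    from unstructured.partition.pdf import partition_pdf

    # High-resolution parsing (includes tables)
    elements = partition_pdf(filename="example.pdf", strategy="hi_res")

Notes:

strategy="hi_res" uses computer vision and OCR to extract tables and suits complex PDFs.
It requires tesseract and poppler to be installed.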
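After a high-resolution pass, table elements are often consumed as HTML rather than flat text. The following sketch assumes that Table elements expose their HTML as `metadata.text_as_html`; whether that field is populated depends on the library version and on table structure inference being enabled, so treat it as an assumption. The elements here are stand-ins, and `table_html` is a hypothetical helper.

```python
from types import SimpleNamespace


def table_html(elements):
    """Return the HTML of each Table element that carries one in its metadata."""
    found = []
    for el in elements:
        html = getattr(el.metadata, "text_as_html", None)
        if el.category == "Table" and html:
            found.append(html)
    return found


# Stand-in elements; real ones would come from partition_pdf(..., strategy="hi_res").
sample = [
    SimpleNamespace(category="Title", text="Results", metadata=SimpleNamespace()),
    SimpleNamespace(
        category="Table",
        text="Year Revenue 2024 10",
        metadata=SimpleNamespace(
            text_as_html="<table><tr><td>2024</td><td>10</td></tr></table>"
        ),
    ),
]

print(table_html(sample))
```

Using `getattr` with a default keeps the helper safe for elements whose metadata has no table HTML at all.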

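3.2 Cleaning (Cleaning Bricks)

Removes irrelevant content such as boilerplate text, punctuation, or sentence fragments.

    from unstructured.cleaners.core import clean, remove_punctuation

    text = "Hello, World!!! This is a test..."
    cleaned_text = clean(text, lowercase=True)  # lowercase; other clean options stay off
    cleaned_text = remove_punctuation(cleaned_text)  # remove punctuation
    print(cleaned_text)  # output: hello world this is a test

Notes:

Supported cleaning operations include lowercasing, removing punctuation, and deleting boilerplate text.
Cleaners can be combined with partitioning results to process the text of extracted Element objects.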
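Cleaning steps compose naturally as a list of functions applied in order to each element's text. The sketch below illustrates that pattern with standard-library stand-ins; `lowercase`, `strip_punctuation`, `collapse_whitespace`, and `apply_cleaners` are hypothetical names and are not the library's cleaners (for instance, `strip_punctuation` only handles ASCII punctuation).

```python
import string
from functools import reduce


def lowercase(text):
    return text.lower()


def strip_punctuation(text):
    # Stand-in cleaner: drops ASCII punctuation only.
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text):
    return " ".join(text.split())


def apply_cleaners(text, cleaners):
    """Apply each cleaner to the text in order and return the result."""
    return reduce(lambda acc, fn: fn(acc), cleaners, text)


texts = ["Hello, World!!!", "  This is   a test... "]
pipeline = [lowercase, strip_punctuation, collapse_whitespace]
cleaned = [apply_cleaners(t, pipeline) for t in texts]
print(cleaned)  # ['hello world', 'this is a test']
```

Keeping the pipeline as a plain list makes it easy to swap a stand-in for a real cleaner with the same one-argument signature.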

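3.3 Staging (Staging Bricks)

Formats data as input for downstream tasks, such as JSON or LLM training data.

    from unstructured.partition.auto import partition
    from unstructured.staging.base import convert_to_dict

    # Convert to a list of JSON-serializable dicts
    elements = partition(filename="example.docx")
    json_data = convert_to_dict(elements)
    print(json_data[:2])  # print the first two elements

Example output:

    [
      {"type": "Title", "text": "Introduction", "metadata": {...}},
      {"type": "NarrativeText", "text": "This is the first paragraph...", "metadata": {...}}
    ]

Notes:

convert_to_dict turns the element list into JSON-serializable dictionaries, suitable for LLMs or data analysis.
Other staging functions are available, such as stage_for_transformers (Hugging Face integration).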
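For pipelines that stream records, the dictionaries shown in the example output can be written as JSON Lines, one element per line. This is a minimal sketch with hand-written records shaped like that output; `to_jsonl` and `from_jsonl` are hypothetical helpers, and the `page_number` metadata value is illustrative.

```python
import json

# Hand-written records shaped like the example output above.
records = [
    {"type": "Title", "text": "Introduction", "metadata": {"page_number": 1}},
    {
        "type": "NarrativeText",
        "text": "This is the first paragraph...",
        "metadata": {"page_number": 1},
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(payload):
    """Parse JSON Lines back into a list of dictionaries."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


payload = to_jsonl(records)
print(payload)
```

`ensure_ascii=False` keeps non-ASCII text readable in the output file, which matters for multilingual documents.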

3.4 LangChain integration

Works with LangChain: documents are loaded through UnstructuredLoader.

    from langchain_unstructured import UnstructuredLoader

    # Load locally
    loader = UnstructuredLoader(file_path="example.pdf")
    docs = loader.load()
    print(docs[0].page_content[:100])  # print the extracted text

    # Use the Serverless API
    loader = UnstructuredLoader(
        file_path="example.pdf",
        api_key="your_api_key",
        partition_via_api=True,  # without this flag the file is partitioned locally
        strategy="hi_res",
    )
    docs = loader.load()

Notes:

Local loading requires unstructured and langchain_unstructured.
The Serverless API requires unstructured-client and langchain_unstructured, plus an API key (available from https://unstructured.io/).
Loaders for individual file types are also available, such as UnstructuredCSVLoader and UnstructuredHTMLLoader.

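3.5 Using the Serverless API

Processing through the API is efficient and reduces local dependencies.

    from unstructured_client import UnstructuredClient
    from unstructured_client.models import operations, shared

    client = UnstructuredClient(api_key_auth="your_api_key")
    with open("example.pdf", "rb") as f:
        files = shared.Files(content=f.read(), file_name="example.pdf")
    # Request shape follows recent unstructured-client releases; older versions differ
    params = shared.PartitionParameters(files=files, strategy=shared.Strategy.HI_RES)
    response = client.general.partition(request=operations.PartitionRequest(partition_parameters=params))
    print(response.elements[:2])

Notes:

Requires unstructured-client.
The Serverless API is positioned as the higher-performance option and suits production environments.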
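Hard-coding the API key, as in the snippet above, is fine for a quick test but risky in shared code. A common alternative is to read it from the environment. In this sketch, `load_api_key` is a hypothetical helper and the variable name `UNSTRUCTURED_API_KEY` is an assumed convention, not something the client requires.

```python
import os


def load_api_key(env=os.environ, name="UNSTRUCTURED_API_KEY"):
    """Return the API key stored in the given mapping, or raise if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage with an explicit mapping, so the example runs without a real key.
print(load_api_key({"UNSTRUCTURED_API_KEY": "demo-key"}))
```

Passing the mapping as a parameter keeps the helper easy to test; in real code you would call `load_api_key()` and hand the result to the client.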

3.6 Docker deployment

Run unstructured inside a Docker container.

    # Run this Python script inside the container
    from unstructured.partition.auto import partition

    elements = partition(filename="/data/example.pdf")
    print([str(el) for el in elements[:5]])

Notes:

The official Docker image simplifies environment setup.
Multiple platforms are supported (x86_64 and Apple Silicon).

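4. Performance and characteristics

Efficiency: the modular design lets you combine processing steps as needed.
Multi-format support: covers common document and image formats, reducing the need for format conversion.
Ease of use: the partition function parses a file in one call, lowering the learning curve.
Community support: an active GitHub repository (6K+ stars as of 2024).
Limitations:
- DOCX parsing may misidentify list items as titles or paragraphs.
- Large documents can lack parent-child annotations, which hurts LLM context understanding.
- Local processing requires many dependencies (such as tesseract and poppler).
Privacy and analytics:
- The library includes a lightweight analytics "ping", which can be disabled by setting the environment variable DO_NOT_TRACK=true.
- You can also set SCARF_NO_ANALYTICS=true to disable Scarf analytics.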
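When parent-child relationships are missing, one workaround is to rebuild a shallow hierarchy by attaching each element to the most recent Title before it. This is only a heuristic sketch over dictionaries shaped like the staging output; `group_by_title` is a hypothetical helper, and it will inherit any misclassified titles from the parser.

```python
def group_by_title(records):
    """Attach each non-Title record to the most recent Title that precedes it."""
    sections = []
    current = {"title": None, "children": []}
    for rec in records:
        if rec["type"] == "Title":
            # Close the running section if it holds anything, then start a new one.
            if current["title"] is not None or current["children"]:
                sections.append(current)
            current = {"title": rec["text"], "children": []}
        else:
            current["children"].append(rec["text"])
    if current["title"] is not None or current["children"]:
        sections.append(current)
    return sections


records = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "First paragraph."},
    {"type": "ListItem", "text": "Item 1"},
    {"type": "Title", "text": "Methods"},
    {"type": "NarrativeText", "text": "Second paragraph."},
]

for section in group_by_title(records):
    print(section["title"], "->", section["children"])
```

Text that appears before the first title ends up in a section whose title is None, so nothing is silently dropped.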

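Comparison with alternatives:

python-docx/pdfplumber: each focuses on a single format, with a narrower feature set.
LangChain built-in loaders: depend on unstructured but offer a simpler wrapper.
extractous: claims to be 25x faster than unstructured and supports similar formats, but its ecosystem is newer.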

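5. Practical use cases

RAG systems: convert documents to JSON for LLM retrieval and generation.
Data preprocessing: prepare training data for ML models (such as text classification or NER).
Document analysis: extract key information from contracts and reports.
Content extraction: pull structured content from web pages, emails, or PDFs.
Personal AI assistants: process notes and documents locally to protect privacy.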

Example (RAG pipeline):

    import json

    from langchain_openai import OpenAI
    from langchain_unstructured import UnstructuredLoader
    from unstructured.partition.auto import partition
    from unstructured.staging.base import convert_to_dict

    # Parse the document
    elements = partition(filename="report.pdf", strategy="hi_res")
    json_data = convert_to_dict(elements)

    # Save as JSON
    with open("output.json", "w") as f:
        json.dump(json_data, f, indent=2)

    # Load into LangChain
    loader = UnstructuredLoader(file_path="report.pdf")
    docs = loader.load()

    # Ask an LLM for a summary (langchain_openai replaces the deprecated langchain.llms import)
    llm = OpenAI(api_key="your_openai_key")
    response = llm.invoke(f"Summarize: {docs[0].page_content[:500]}")
    print(response)

Notes:

Parses the PDF, extracts structured elements, and saves them as JSON.
Loads the data with LangChain and uses an LLM to summarize it.
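In a retrieval setup, extracted texts are usually packed into size-bounded chunks before indexing. The sketch below shows a simple greedy packing over plain strings; `chunk_texts` is a hypothetical helper written with the standard library, and it is not the library's own chunking implementation.

```python
def chunk_texts(texts, max_chars=500):
    """Greedily pack consecutive texts into chunks of at most max_chars.

    A single text longer than max_chars is kept whole in its own chunk.
    """
    chunks = []
    current = ""
    for text in texts:
        candidate = f"{current}\n{text}" if current else text
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


print(chunk_texts(["aaaa", "bbbb", "cc"], max_chars=9))  # ['aaaa\nbbbb', 'cc']
```

Packing consecutive texts preserves reading order, which tends to keep related sentences in the same chunk.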

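6. Deployment and extension

Running locally:
- Install the dependencies, then run your Python script directly.
- Use pyenv to manage virtual environments; Python 3.8.15 was the recommended version for older releases, so check what your installed release supports.
Docker deployment:
- Pull the official image or build a custom one; this suits production environments.
Serverless API:
- Register for an API key and process files through unstructured-client.
- Suits rapid prototyping or large-scale processing.
Contributing and debugging:
- When filing a bug, use scripts/collect_env.py to gather environment information.
- GitHub repository: https://github.com/Unstructured-IO/unstructured.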

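7. Things to watch for

Dependency management:
- Local PDF/image processing requires tesseract and poppler; without them it raises errors.
- Use pip install "unstructured[all-docs]" to install the dependencies for all document types.
Performance tuning:
- For a single file type, install only its extra (such as unstructured[docx]) to reduce overhead.
- Use strategy="hi_res" to extract tables, but note that it is computationally expensive.
Limitations:
- DOCX list items may be misidentified as titles, so validate them in post-processing.
- Parent-child annotations can be missing, so context has to be reconstructed manually.
Privacy:
- Disable the analytics ping: export DO_NOT_TRACK=true or SCARF_NO_ANALYTICS=true.
Alternative tools:
- extractous: a faster alternative that supports similar formats, but with a smaller community.
- LayoutParser: focuses on document image analysis and suits complex layouts.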
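The analytics opt-out can also be applied from Python instead of the shell. The sketch below sets both variables named above; `disable_analytics` is a hypothetical helper, and setting the variables before importing the library is the cautious order, on the assumption that the ping may fire at import time.

```python
import os


def disable_analytics(env=os.environ):
    """Set the opt-out variables in the given environment mapping and return it."""
    env["DO_NOT_TRACK"] = "true"
    env["SCARF_NO_ANALYTICS"] = "true"
    return env


disable_analytics()
# import unstructured  # import only after the variables are set
print(os.environ["DO_NOT_TRACK"], os.environ["SCARF_NO_ANALYTICS"])
```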

8. Combined example

The following combined example shows partitioning, cleaning, JSON output, and LangChain integration:

    import json

    from langchain_unstructured import UnstructuredLoader
    from loguru import logger
    from unstructured.cleaners.core import clean, remove_punctuation
    from unstructured.partition.auto import partition

    # Configure logging (using loguru)
    logger.add("app.log", rotation="1 MB", level="INFO")

    # Parse the PDF
    logger.info("Starting PDF processing")
    try:
        elements = partition(filename="sample.pdf", strategy="hi_res")
    except Exception:
        logger.exception("Failed to process PDF")
        raise

    # Clean the text
    cleaned_elements = []
    for element in elements:
        text = clean(element.text, lowercase=True)
        text = remove_punctuation(text)
        cleaned_elements.append({"type": element.category, "text": text})
    logger.info("Text cleaning completed")

    # Save as JSON; cleaned_elements is already a list of plain dicts,
    # so it is written directly instead of going through convert_to_dict
    with open("output.json", "w") as f:
        json.dump(cleaned_elements, f, indent=2)
    logger.info("JSON output saved")

    # LangChain integration
    loader = UnstructuredLoader(file_path="sample.pdf", strategy="hi_res")
    docs = loader.load()
    logger.info(f"Loaded {len(docs)} documents")

    # Print the first 100 characters
    print(docs[0].page_content[:100])

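Example output (app.log; timestamps, exact line format, and document counts will differ on your machine):

    2025-05-09T01:33:56.123 | INFO | Starting PDF processing
    2025-05-09T01:33:57.124 | INFO | Text cleaning completed
    2025-05-09T01:33:57.125 | INFO | JSON output saved
    2025-05-09T01:33:57.126 | INFO | Loaded 1 documents

Notes:

Uses partition to parse the PDF, with the hi_res strategy extracting tables.
Cleans the text by removing punctuation and lowercasing.
Saves the result as JSON for downstream tasks.
Loads the document through LangChain and logs each step.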

9. Resources and documentation

Official documentation: https://docs.unstructured.io/
GitHub repository: https://github.com/Unstructured-IO/unstructured
PyPI page: https://pypi.org/project/unstructured/
LangChain integration: https://python.langchain.com/docs/integrations/document_loaders/unstructured/
Community support: Unstructured Slack community (https://unstructured.io/community)
Official quick start: https://docs.unstructured.io/open-source/introduction/quick-start
