# paddleocr-api-python

**Repository Path**: Jerry-Wu-Gitee/paddleocr-api-python

## Basic Information

- **Project Name**: paddleocr-api-python
- **Description**: Python SDK for the Paddle OCR API
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-29
- **Last Updated**: 2026-05-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# paddleocr-api-python

[English](#english) | [中文](#中文)

---

## English

An async Python SDK that wraps the [PaddleOCR AI Studio API](https://aistudio.baidu.com/) in a clean, type-safe interface. Upload a document, await the result, and get Markdown back without touching raw HTTP.

### Features

- **Async-first**: built on `httpx.AsyncClient` and `asyncio`, with native context manager support.
- **Full model coverage**: `PaddleOCR-VL-1.6` (default), `PaddleOCR-VL-1.5`, `PaddleOCR-VL`, `PP-OCRv5`, and `PaddleOCR`.
- **Flexible input**: submit by local file path, raw bytes, or remote URL.
- **Rich job control**: poll real-time state, extracted page count, start/end times, and error messages.
- **Markdown export**: get a clean Markdown document plus the URLs of all embedded images.
- **Fine-grained options**: toggle layout detection, chart/seal/table recognition, cross-page table merging, title leveling, NMS, image orientation correction, and more.

### Installation

```bash
pip install paddleocr-api-python
```

Dependencies: `aiofiles`, `httpx`, `typing-extensions`, `python-dotenv`.

### Authentication

Get an access token from [https://aistudio.baidu.com/account/accessToken](https://aistudio.baidu.com/account/accessToken).
Either pass it explicitly:

```python
from paddleocr_api import AistudioClient

client = AistudioClient(api_key="your_token_here")
```

Or set it via an environment variable (a `.env` file is loaded automatically):

```
AISTUDIO_ACCESS_TOKEN=your_token_here
```

### Quick Start

```python
import asyncio

from paddleocr_api import AistudioClient, State


async def main():
    async with AistudioClient() as client:
        job = await client.create_job(file_path="paper.pdf")

        async with job:
            # Poll until the job reaches a terminal state
            while True:
                state = await job.state
                if state == State.DONE:
                    break
                if state == State.FAILED:
                    raise RuntimeError(await job.error_message)
                await asyncio.sleep(5)

            markdown = await job.markdown
            with open("output.md", "w", encoding="utf-8") as f:
                f.write(markdown.text)


asyncio.run(main())
```

### Submitting Jobs

`create_job` accepts three input modes:

```python
# From a local path
await client.create_job(file_path="doc.pdf")

# From bytes already in memory
await client.create_job(file_bytes=pdf_bytes)

# From a public URL
await client.create_job(file_url="https://example.com/doc.pdf")
```

### Selecting a Model

```python
from paddleocr_api import Model

await client.create_job(
    file_path="doc.pdf",
    model=Model.PADDLE_OCR_VL_1_6,  # default
)
```

| Model | Notes |
|---|---|
| `PaddleOCR-VL-1.6` | Default. Latest vision-language model. |
| `PaddleOCR-VL-1.5` | Scheduled for retirement on 2026-06-17. |
| `PaddleOCR-VL` | Base VL model. |
| `PP-OCRv5` | Classic OCR pipeline. |
| `PaddleOCR` | Base OCR. |

### Optional Payload

Pass an `OptionalPayload` dict to fine-tune recognition behavior:

```python
from paddleocr_api import LayoutShapeMode, PromptLabel

await client.create_job(
    file_path="doc.pdf",
    optional_payload={
        "useLayoutDetection": True,
        "useChartRecognition": True,
        "useSealRecognition": True,
        "mergeTables": True,
        "relevelTitles": True,
        "layoutShapeMode": LayoutShapeMode.AUTO,
        "repetitionPenalty": 1.0,
        "temperature": 0.0,
        "topP": 1.0,
    },
)
```

Key options:

| Field | Default | Purpose |
|---|---|---|
| `useDocOrientationClassify` | `False` | Auto-correct 0/90/180/270° rotation. |
| `useDocUnwarping` | `False` | Flatten warped or wrinkled pages. |
| `useLayoutDetection` | `True` | Region-aware parsing. Disable for single-region docs. |
| `useChartRecognition` | `False` | Convert charts to tables. |
| `useSealRecognition` | `True` | Extract seal text. |
| `useOcrForImageBlock` | `False` | OCR inside image regions. |
| `mergeTables` | `True` | Merge tables that span pages. |
| `relevelTitles` | `True` | Infer heading hierarchy. |
| `repetitionPenalty` | `1.0` | Raise to suppress repeated output. |
| `temperature` | `0.0` | Lower for stability, higher to reduce omissions. |
| `topP` | `1.0` | Lower for more conservative output. |
| `layoutNms` | `True` | Drop overlapping detection boxes. |
| `markdownIgnoreLabels` | all | Filter headers, footers, page numbers, footnotes, etc. |

### Tracking a Job

```python
async with job:
    print(await job.state)            # State.PENDING / RUNNING / DONE / FAILED
    print(await job.total_pages)      # e.g. 8
    print(await job.extracted_pages)  # e.g. 3
    print(await job.start_time)       # datetime
    print(await job.end_time)         # datetime
    print(await job.error_message)    # str or None
```

Status queries are cached for `status_update_interval` seconds (default `2`) to avoid hammering the API.

### Working with Results

```python
from pathlib import Path

import httpx

result = await job.result      # full Result object
markdown = await job.markdown  # Markdown(text=..., images=...)

# Save Markdown
with open("doc.md", "w", encoding="utf-8") as f:
    f.write(markdown.text)

# Download embedded images
async with httpx.AsyncClient() as http:
    for rel_path, url in markdown.images.items():
        data = (await http.get(url)).content
        # Write `data` to `rel_path` (assumed here to be a relative file path)
        target = Path(rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
```

The `Result` object also exposes per-page layout details via `layout_parsing_results`, raw page sizes via `data_info`, and preprocessed image URLs via `preprocessed_images`.

### Error Handling

All exceptions inherit from `PaddleOCRError`:

- `AistudioClientError`: client configuration issues (e.g. missing token).
- `JobCreationError`: failure when submitting a job.
- `JobStatusQueryError`: failure when polling status.

Use `job.query_status_safe()` instead of `query_status()` to return the cached state on failure instead of raising.

### License

[Apache-2.0](LICENSE)

---

## 中文

An asynchronous Python SDK that wraps the [PaddleOCR AI Studio API](https://aistudio.baidu.com/) in a concise, type-safe interface. Upload a document, wait for the result, and receive Markdown, with no hand-written HTTP requests.

### Features

- **Async-first**: built on `httpx.AsyncClient` and `asyncio`, with native context manager support.
- **Full model support**: `PaddleOCR-VL-1.6` (default), `PaddleOCR-VL-1.5`, `PaddleOCR-VL`, `PP-OCRv5`, and `PaddleOCR`.
- **Flexible input**: submit via local path, byte stream, or remote URL.
- **Complete job control**: query state, extracted page count, start and end times, and error messages in real time.
- **Markdown export**: directly obtain clean Markdown text and the URLs of all embedded images.
- **Fine-grained parameters**: control layout analysis, chart/seal/table recognition, cross-page table merging, title leveling, NMS, image orientation correction, and more.

### Installation

```bash
pip install paddleocr-api-python
```

Dependencies: `aiofiles`, `httpx`, `typing-extensions`, `python-dotenv`.

### Authentication

Obtain an access token at [https://aistudio.baidu.com/account/accessToken](https://aistudio.baidu.com/account/accessToken).
It can be passed explicitly:

```python
from paddleocr_api import AistudioClient

client = AistudioClient(api_key="your_token_here")
```

It can also be supplied through an environment variable (a `.env` file is loaded automatically):

```
AISTUDIO_ACCESS_TOKEN=your_token_here
```

### Quick Start

```python
import asyncio

from paddleocr_api import AistudioClient, State


async def main():
    async with AistudioClient() as client:
        job = await client.create_job(file_path="paper.pdf")

        async with job:
            # Poll until the job reaches a terminal state
            while True:
                state = await job.state
                if state == State.DONE:
                    break
                if state == State.FAILED:
                    raise RuntimeError(await job.error_message)
                await asyncio.sleep(5)

            markdown = await job.markdown
            with open("output.md", "w", encoding="utf-8") as f:
                f.write(markdown.text)


asyncio.run(main())
```

### Submitting Jobs

`create_job` supports three input modes:

```python
# Local path
await client.create_job(file_path="doc.pdf")

# In-memory bytes
await client.create_job(file_bytes=pdf_bytes)

# Public URL
await client.create_job(file_url="https://example.com/doc.pdf")
```

### Selecting a Model

```python
from paddleocr_api import Model

await client.create_job(
    file_path="doc.pdf",
    model=Model.PADDLE_OCR_VL_1_6,  # default
)
```

| Model | Notes |
|---|---|
| `PaddleOCR-VL-1.6` | Default. Latest vision-language model. |
| `PaddleOCR-VL-1.5` | Scheduled to go offline on 2026-06-17. |
| `PaddleOCR-VL` | Base VL model. |
| `PP-OCRv5` | Classic OCR pipeline. |
| `PaddleOCR` | Base OCR. |

### Optional Parameters

Fine-tune recognition behavior with an `OptionalPayload` dict:

```python
from paddleocr_api import LayoutShapeMode, PromptLabel

await client.create_job(
    file_path="doc.pdf",
    optional_payload={
        "useLayoutDetection": True,
        "useChartRecognition": True,
        "useSealRecognition": True,
        "mergeTables": True,
        "relevelTitles": True,
        "layoutShapeMode": LayoutShapeMode.AUTO,
        "repetitionPenalty": 1.0,
        "temperature": 0.0,
        "topP": 1.0,
    },
)
```

Common parameters:

| Field | Default | Purpose |
|---|---|---|
| `useDocOrientationClassify` | `False` | Automatically correct 0/90/180/270° rotation. |
| `useDocUnwarping` | `False` | Correct distorted images, such as wrinkled or skewed pages. |
| `useLayoutDetection` | `True` | Layout region segmentation and ordering. Can be disabled when the document contains a single region. |
| `useChartRecognition` | `False` | Parse charts into tables. |
| `useSealRecognition` | `True` | Recognize seal text. |
| `useOcrForImageBlock` | `False` | Run OCR on text inside image regions. |
| `mergeTables` | `True` | Merge tables that span pages. |
| `relevelTitles` | `True` | Identify the heading level of section titles. |
| `repetitionPenalty` | `1.0` | Raise when repeated content appears. |
| `temperature` | `0.0` | Lower for more stable output, higher to reduce missed text. |
| `topP` | `1.0` | Lower to make the model more conservative. |
| `layoutNms` | `True` | Remove overlapping detection boxes. |
| `markdownIgnoreLabels` | all | Filter auxiliary elements such as headers, footers, page numbers, and footnotes. |

### Tracking a Job

```python
async with job:
    print(await job.state)            # State.PENDING / RUNNING / DONE / FAILED
    print(await job.total_pages)      # e.g. 8
    print(await job.extracted_pages)  # e.g. 3
    print(await job.start_time)       # datetime
    print(await job.end_time)         # datetime
    print(await job.error_message)    # str or None
```

Status queries are cached for `status_update_interval` seconds (default `2`) to avoid frequent requests.

### Working with Results

```python
from pathlib import Path

import httpx

result = await job.result      # full Result object
markdown = await job.markdown  # Markdown(text=..., images=...)

# Save Markdown
with open("doc.md", "w", encoding="utf-8") as f:
    f.write(markdown.text)

# Download embedded images
async with httpx.AsyncClient() as http:
    for rel_path, url in markdown.images.items():
        data = (await http.get(url)).content
        # Write `data` to `rel_path` (assumed here to be a relative file path)
        target = Path(rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
```

The `Result` object also exposes per-page layout details through `layout_parsing_results`, raw page sizes through `data_info`, and preprocessed image URLs through `preprocessed_images`.

### Exception Handling

All exceptions inherit from `PaddleOCRError`:

- `AistudioClientError`: client configuration error (e.g. missing token).
- `JobCreationError`: job submission failed.
- `JobStatusQueryError`: status query failed.

To return the cached state instead of raising when a query fails, use `job.query_status_safe()` in place of `query_status()`.

### License

[Apache-2.0](LICENSE)
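
### Example: Polling with a Timeout

The Quick Start loop polls forever if a job never reaches a terminal state. The sketch below shows one way to bound that wait. It is not part of the SDK: `wait_for_terminal_state`, `FakeState`, and `make_fake_fetcher` are hypothetical names introduced for illustration, and the helper only assumes that the current state can be fetched with an awaitable call and that state values are hashable.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


async def wait_for_terminal_state(
    fetch_state: Callable[[], Awaitable[T]],
    terminal_states: Iterable[T],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
) -> T:
    """Poll `fetch_state` until it returns a terminal state or `timeout` elapses."""
    terminal = set(terminal_states)
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        state = await fetch_state()
        if state in terminal:
            return state
        # Give up rather than sleep past the deadline.
        if loop.time() + poll_interval > deadline:
            raise TimeoutError(f"job still in state {state!r} after {timeout} s")
        await asyncio.sleep(poll_interval)


class FakeState(Enum):
    """Hypothetical stand-in for the SDK's `State` enum so the sketch runs offline."""

    RUNNING = "running"
    DONE = "done"


def make_fake_fetcher(
    states: Iterable[FakeState],
) -> Callable[[], Awaitable[FakeState]]:
    """Return a coroutine function that yields the given states, one per call."""
    remaining: Iterator[FakeState] = iter(states)

    async def fetch_state() -> FakeState:
        return next(remaining)

    return fetch_state


if __name__ == "__main__":
    fetch = make_fake_fetcher(
        [FakeState.RUNNING, FakeState.RUNNING, FakeState.DONE]
    )
    final = asyncio.run(
        wait_for_terminal_state(
            fetch, {FakeState.DONE}, poll_interval=0.01, timeout=1.0
        )
    )
    print(final)
```

With the SDK, the call would presumably look like `await wait_for_terminal_state(lambda: job.state, {State.DONE, State.FAILED})`, assuming `job.state` returns a fresh awaitable on each access as the Quick Start suggests. That usage is untested here.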