# RapidOCRPDF

**Repository Path**: chuchumaolu555/RapidOCRPDF

## Basic Information

- **Project Name**: RapidOCRPDF
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2023-10-07
- **Last Updated**: 2023-10-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## RapidOCRPDF

- Built on the [RapidOCR](https://github.com/RapidAI/RapidOCR) repository, this tool quickly extracts text from PDFs, including scanned PDFs and encrypted PDFs.
- If the text in the PDF can be copied directly, use [pdf2docx](https://github.com/dothinking/pdf2docx) instead; there is no need to reinvent the wheel.
- For scanned PDFs, layout restoration is not supported yet. It may be added later when time allows; no date is set.

### Usage

1. Install the `rapidocr_pdf` library

    ```bash
    # Based on rapidocr_onnxruntime
    pip install "rapidocr_pdf[onnxruntime]"

    # Based on rapidocr_openvino
    pip install "rapidocr_pdf[openvino]"
    ```

2. Use it

    - From a script:

        ```python
        from rapidocr_pdf import PDFExtracter

        pdf_extracter = PDFExtracter()

        pdf_path = 'tests/test_files/direct_and_image.pdf'
        texts = pdf_extracter(pdf_path)
        print(texts)
        ```

    - From the command line:

        ```bash
        $ rapidocr_pdf -h
        usage: rapidocr_pdf [-h] [-path FILE_PATH]

        options:
          -h, --help            show this help message and exit
          -path FILE_PATH, --file_path FILE_PATH
                                File path, PDF or images

        $ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
        ```

3. Input and output

    - **Input**: `Union[str, Path, bytes]`
    - **Output**: `List` of \[**page number**, **text content**, **confidence**\] entries, as in the following example:

        ```python
        [
            ['0', '人之初，性本善。性相近，习相远。', '0.8969868'],
            ['1', 'Men at their birth, are naturally good.', '0.8969868'],
        ]
        ```

### Changelog

- 2023-04-17 v0.0.2 update:
  - Improved the usage documentation
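
The output structure documented above (a list of `[page number, text, confidence]` entries, all strings in the example) usually needs a little post-processing before use. The sketch below is not part of `rapidocr_pdf`; `group_text_by_page` and `min_confidence` are hypothetical names introduced here for illustration. It assumes the result has exactly the shape shown in the README example and uses only the standard library.

```python
from collections import defaultdict
from typing import Dict, List


def group_text_by_page(
    results: List[List[str]], min_confidence: float = 0.5
) -> Dict[int, str]:
    """Join recognised text per page, skipping low-confidence entries.

    Hypothetical helper: assumes each entry is [page, text, confidence],
    as in the README example. int() and float() accept both strings and
    numbers, so the helper does not depend on the exact value types.
    """
    pages = defaultdict(list)
    for page, text, confidence in results:
        if float(confidence) >= min_confidence:
            pages[int(page)].append(text)
    # Return pages in ascending order, one newline-joined string per page.
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


# Sample data copied from the README's output example.
sample = [
    ["0", "人之初，性本善。性相近，习相远。", "0.8969868"],
    ["1", "Men at their birth, are naturally good.", "0.8969868"],
]

pages = group_text_by_page(sample, min_confidence=0.8)
for page, text in pages.items():
    print(f"page {page}: {text}")
```

In real use, `sample` would be replaced by the value returned from `pdf_extracter(pdf_path)`. The threshold is an arbitrary starting point and should be tuned against your own documents.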