# split-documents

**Repository Path**: zhch158_admin/split-documents

## Basic Information

- **Project Name**: split-documents
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-03-15
- **Last Updated**: 2024-08-17

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

### 1. Install a Python 3.10 virtual environment

Python 3.12 cannot be used: the `onnx` dependency pulled in by `unstructured` fails to install on it.

```
/d/python/python310/python -m venv .venv
. .venv/Scripts/activate
```

To use Jupyter notebooks, also install `ipykernel`.

### 2. Set environment variables

```
PYTHONPATH="${workspaceFolder};${env:PYTHONPATH}"
OPENAI_API_BASE="https://api.openai.com/v1"
OPENAI_API_KEY="sk-......"
GEMINI_API_KEY="AI......"
ZHIPU_API_KEY="b48......"
DASHSCOPE_API_KEY="sk-......"
NLTK_DATA="d:/models/nltk_data"
HF_HOME="d:/models/hf_home"
HF_ENDPOINT=https://hf-mirror.com
HF_HUB_OFFLINE=0
TORCH_HOME="d:/models/torch/"
```

### 3. Download NLTK data

Create a folder `nltk_data`, e.g. `C:\nltk_data` or `/usr/local/share/nltk_data`, with subfolders `chunkers`, `grammars`, `misc`, `sentiment`, `taggers`, `corpora`, `help`, `models`, `stemmers`, `tokenizers`.

Download individual packages from https://www.nltk.org/nltk_data/ (see the "download" links) and unzip each one into the appropriate subfolder. For example, the Brown Corpus, found at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip, is unzipped to `nltk_data/corpora/brown`.

Set your `NLTK_DATA` environment variable to point to your top-level `nltk_data` folder.

```
NLTK_DATA="d:/models/nltk_data"
cd d:/models/nltk_data
mkdir chunkers grammars misc sentiment taggers corpora help models stemmers tokenizers
```

[Installing the nltk library and the nltk_data packages](https://blog.csdn.net/Baoweijie12/article/details/121480639)

Alternatively, download the nltk data packages from https://gitee.com/qwererer2/nltk_data.git (gh-pages branch); on Gitee you only need to download the project's entire `packages` folder.

### 4. Set the Hugging Face model download path

```
export TRANSFORMERS_OFFLINE=1
export HF_HOME="d:/models/hf_home"
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download InstantX/InstantID --local-dir checkpoints
```

### 5. Download poppler

Only Windows is covered here; for macOS and Linux, see the official instructions linked from the GitHub page below. Windows users must install poppler for Windows and then add its `bin/` folder to `PATH` (Start > type "env" > Edit the system environment variables > Environment Variables... > System variables > Path).

```
https://github.com/oschwartz10612/poppler-windows
```

### 6. Install tesseract

Add the Tesseract-OCR path (e.g. `D:\models\Tesseract-OCR`) to the `PATH` environment variable.

Download the Simplified Chinese traineddata for tesseract from https://tesseract-ocr.github.io/tessdoc/Data-Files, then set the environment variable `TESSDATA_PREFIX=d:/models/Tesseract-OCR/tessdata`.

```
https://github.com/UB-Mannheim/tesseract/wiki
```

### 7. Install models

The program downloads the required models automatically to:

```
"D:\models\hf_home\hub\models--timm--resnet18.a1_in1k"
"D:\models\hf_home\hub\models--unstructuredio--yolo_x_layout"
"D:\models\hf_home\hub\models--microsoft--table-transformer-structure-recognition"
"D:\models\hf_home\hub\models--sentence-transformers--all-MiniLM-L6-v2"
```

Sketches that verify this toolchain and exercise these models follow in the appendix below.

# References

1. [LangChain - RAG: 拿什么「降伏」PDF 中的 Table 类型数据](https://zhuanlan.zhihu.com/p/662180611)
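
## Appendix: verification sketches

To sanity-check steps 2-6 (NLTK data, poppler, and tesseract), the following Python snippet can be run inside the virtual environment. It is a minimal sketch, not part of this repository's code: `sample.pdf` is a hypothetical local file, and `pytesseract` and `pdf2image` are assumed to be installed (`pip install pytesseract pdf2image`).

```python
import os
import shutil

import nltk
import pytesseract
from pdf2image import convert_from_path

# NLTK: NLTK_DATA must be visible and an installed corpus must resolve.
print("NLTK_DATA =", os.environ.get("NLTK_DATA"))
print(nltk.data.find("corpora/brown"))  # raises LookupError if brown is missing

# Tesseract: the binary must be on PATH, and TESSDATA_PREFIX must point
# at the tessdata folder configured in step 6.
print("tesseract =", shutil.which("tesseract"))
print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX"))
print("languages =", pytesseract.get_languages())  # should include "chi_sim"

# Poppler: pdf2image shells out to poppler's pdftoppm, so a successful
# conversion proves the bin/ folder from step 5 is on PATH.
pages = convert_from_path("sample.pdf", dpi=72)  # any small local PDF
print("rendered pages:", len(pages))
```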
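Steps 4 and 7 both rely on the standard Hugging Face cache layout under `HF_HOME`. As a sketch of pre-fetching one of the models listed in step 7 from Python rather than via `huggingface-cli`, `snapshot_download` honors the `HF_HOME` and `HF_ENDPOINT` variables from step 2:

```python
from huggingface_hub import snapshot_download

# Downloads (or reuses) the model under HF_HOME/hub, going through the
# hf-mirror endpoint when HF_ENDPOINT is set as in step 2.
path = snapshot_download("sentence-transformers/all-MiniLM-L6-v2")
print(path)  # .../hf_home/hub/models--sentence-transformers--all-MiniLM-L6-v2/...
```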
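Finally, a usage sketch that ties the setup together: partitioning a PDF with `unstructured`'s `hi_res` strategy, which is what pulls in the yolo_x_layout and table-transformer models from step 7 and the poppler/tesseract binaries from steps 5-6. `example.pdf` is a hypothetical input, and the argument names assume a recent `unstructured` release; this mirrors the table-extraction approach discussed in reference [1], not code from this repository.

```python
from unstructured.partition.pdf import partition_pdf

# hi_res layout detection plus table structure inference; the models are
# fetched into HF_HOME on first use (see step 7).
elements = partition_pdf(
    filename="example.pdf",        # hypothetical input file
    strategy="hi_res",
    infer_table_structure=True,    # keep each table's HTML in element metadata
    languages=["chi_sim", "eng"],  # needs the tesseract chi_sim data from step 6
)

for el in elements:
    if el.category == "Table":
        # reconstructed table as HTML, ready for a downstream RAG pipeline
        print(el.metadata.text_as_html)
```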