# python-whaleshop

**Repository Path**: jackfinal/python-whaleshop

## Basic Information

- **Project Name**: python-whaleshop
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-04
- **Last Updated**: 2026-01-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Split a PDF into Small Images Named by Alphanumeric OCR Text

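This script renders each page of a PDF as an image, then splits that image into small tiles along a grid. It runs OCR (pytesseract) on every tile and uses a string containing both letters and digits as the tile's file name.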

Files
- `pdf_to_tiles.py`: the main script
- `requirements.txt`: Python dependencies

Prerequisites
- Install Tesseract OCR on Windows. Chocolatey is the recommended route:

```powershell
choco install -y tesseract
```

Alternatively, use Winget:

```powershell
winget install --id=UB-Mannheim.TesseractOCR -e --source winget
```

- Install the Python dependencies:

```powershell
python -m pip install -r requirements.txt
```

Usage examples

```powershell
# Write output to tiles_dir, slicing into 512x512 tiles
python .\pdf_to_tiles.py .\input.pdf -o .\tiles_dir --tile-width 512 --tile-height 512

# Or split by rows and columns (for example, 4 rows by 3 columns)
python .\pdf_to_tiles.py .\input.pdf -o .\tiles_dir --rows 4 --cols 3

# Specify the OCR language pack (for example, Simplified Chinese)
python .\pdf_to_tiles.py .\input.pdf -o .\tiles_dir --lang chi_sim
```
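
The two splitting modes above (fixed tile size versus a rows-by-columns grid) come down to computing crop boxes for each page image. The following is a minimal sketch of that arithmetic, not the code in `pdf_to_tiles.py`: the function names, the 0-based row and column indices, and the choice to clip edge tiles to the page are all assumptions made for illustration. The boxes use the `(left, upper, right, lower)` layout that Pillow's `Image.crop` accepts.

```python
from typing import Iterator, Tuple

# (left, upper, right, lower), the box layout accepted by Pillow's Image.crop
Box = Tuple[int, int, int, int]


def boxes_by_tile_size(
    width: int, height: int, tile_w: int, tile_h: int
) -> Iterator[Tuple[int, int, Box]]:
    """Yield (row, col, box) for fixed-size tiles.

    Hypothetical helper: tiles on the right and bottom edges are clipped
    to the page, so they may be smaller than tile_w x tile_h.
    """
    if tile_w <= 0 or tile_h <= 0:
        raise ValueError("tile size must be positive")
    for row, top in enumerate(range(0, height, tile_h)):
        for col, left in enumerate(range(0, width, tile_w)):
            right = min(left + tile_w, width)
            bottom = min(top + tile_h, height)
            yield row, col, (left, top, right, bottom)


def boxes_by_grid(
    width: int, height: int, rows: int, cols: int
) -> Iterator[Tuple[int, int, Box]]:
    """Yield (row, col, box) for a rows x cols grid covering the whole page.

    Hypothetical helper: integer division spreads any remainder pixels
    across the grid, so the last row and column always reach the page edge.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    for row in range(rows):
        top = row * height // rows
        bottom = (row + 1) * height // rows
        for col in range(cols):
            left = col * width // cols
            right = (col + 1) * width // cols
            yield row, col, (left, top, right, bottom)
```

With a Pillow image `page`, each box could then be passed to `page.crop(box)` to obtain the tile that is handed to OCR.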

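Notes
- The script tries to extract a string containing both letters and digits from each tile's OCR text and prefers it as the file name; if none is found, the tile is named `unknown_p{page}_r{row}_c{col}.png`.
- If a file with the same name already exists, a sequence number is appended automatically to avoid overwriting it.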
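
The naming rules in these notes can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script's actual logic: the regular expression, the helper names `name_from_ocr` and `unique_filename`, and the `_1`, `_2` suffix format are all hypothetical, and an in-memory set stands in for the check against files already on disk.

```python
import re
from typing import Set

# A run of ASCII letters and digits that contains at least one letter and
# at least one digit, and is not part of a longer alphanumeric run.
ALNUM_TOKEN = re.compile(
    r"(?<![A-Za-z0-9])"
    r"(?=[A-Za-z0-9]*[A-Za-z])"
    r"(?=[A-Za-z0-9]*[0-9])"
    r"[A-Za-z0-9]+"
    r"(?![A-Za-z0-9])"
)


def name_from_ocr(text: str, page: int, row: int, col: int) -> str:
    """Return the first letter+digit token in the OCR text, or a fallback stem."""
    match = ALNUM_TOKEN.search(text)
    if match:
        return match.group(0)
    # Fallback stem mirrors the unknown_p{page}_r{row}_c{col} pattern.
    return f"unknown_p{page}_r{row}_c{col}"


def unique_filename(stem: str, taken: Set[str], ext: str = ".png") -> str:
    """Return a file name not yet in `taken`, appending _1, _2, ... on collision.

    `taken` is an assumed stand-in for the set of names already written;
    the suffix format is also an assumption.
    """
    candidate = f"{stem}{ext}"
    counter = 1
    while candidate in taken:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    taken.add(candidate)
    return candidate
```

For example, a tile whose OCR text is `Part A12 rev` would be saved under the stem `A12`, while a tile with no letter+digit token falls back to the `unknown_...` stem.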
