# GLM-OCR
đź‘‹ Join our WeChat and Discord community
📍 Use GLM-OCR's API
### Model Introduction
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
**Key Features**
- **State-of-the-Art Performance**: Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.
- **Optimized for Real-World Scenarios**: Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.
- **Efficient Inference**: With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.
- **Easy to Use**: Fully open-sourced and equipped with a comprehensive [SDK](https://github.com/zai-org/GLM-OCR) and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.
### News & Updates
- **[Coming Soon]** GLM-OCR Technical Report
- **[2026.2.12]** Fine-tuning tutorial based on LLaMA-Factory is now available. See: [GLM-OCR Fine-tuning Guide](examples/finetune/README.md)
### Download Model
| Model   | Download Links                                                                                                            | Precision |
| ------- | ------------------------------------------------------------------------------------------------------------------------- | --------- |
| GLM-OCR | [🤗 Hugging Face](https://huggingface.co/zai-org/GLM-OCR) · [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-OCR) | BF16      |
## GLM-OCR SDK
We provide an SDK that makes GLM-OCR easier and more efficient to use.
### Install SDK
> [UV Installation](https://docs.astral.sh/uv/getting-started/installation/)
```bash
# Install from source
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .
```
### Model Deployment
There are three ways to use GLM-OCR:
#### Option 1: Zhipu MaaS API (Recommended for Quick Start)
Use the hosted cloud API – no GPU needed. The cloud service runs the complete GLM-OCR pipeline internally, so the SDK simply forwards your request and returns the result.
1. Get an API key from https://open.bigmodel.cn
2. Configure `config.yaml`:
```yaml
pipeline:
maas:
enabled: true # Enable MaaS mode
api_key: your-api-key # Required
```
That's it! When `maas.enabled=true`, the SDK acts as a thin wrapper that:
- Forwards your documents to the Zhipu cloud API
- Returns the results directly (Markdown + JSON layout details)
- No local processing, no GPU required
Input note (MaaS): the upstream API accepts `file` as a URL or a `data:;base64,...` data URI. If you have raw base64 without the `data:` prefix, wrap it as a data URI (recommended). The SDK auto-wraps local file paths, bytes, and raw base64 into a data URI when calling MaaS.
API documentation: https://docs.bigmodel.cn/cn/guide/models/vlm/glm-ocr
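If you call the upstream API directly rather than through the SDK, here is a minimal sketch of the wrapping step described above, using only the Python standard library (the input path is illustrative):
```python
import base64
import mimetypes
from pathlib import Path

def to_data_uri(path: str) -> str:
    """Wrap a local file's bytes as a data URI for the MaaS `file` field."""
    mime = mimetypes.guess_type(path)[0] or ""
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"

print(to_data_uri("examples/source/code.png")[:60])  # data:image/png;base64,...
```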
#### Option 2: Self-host with vLLM / SGLang
Deploy the GLM-OCR model locally for full control. The SDK provides the complete pipeline: layout detection, parallel region OCR, and result formatting.
##### Using vLLM
Install vLLM:
```bash
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
# Or use Docker
docker pull vllm/vllm-openai:nightly
```
Launch the service:
```bash
# Inside the Docker container, uv may not be needed to install transformers
uv pip install git+https://github.com/huggingface/transformers.git
# Run with MTP for better performance
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' --served-model-name glm-ocr
```
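Once the server is up, you can sanity-check it before wiring in the SDK. A minimal sketch, assuming the `openai` Python package, the port and `--served-model-name` from the command above, and an illustrative prompt (in normal use the SDK constructs these requests for you):
```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; no real key is configured locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-ocr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "OCR this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```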
##### Using SGLang
Install SGLang:
```bash
docker pull lmsysorg/sglang:dev
# Or build from source
uv pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python
```
Launch the service:
```bash
# Inside the Docker container, uv may not be needed to install transformers
uv pip install git+https://github.com/huggingface/transformers.git
# Run with MTP for better performance
python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --served-model-name glm-ocr
# Adjust the speculative config based on your device
```
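Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so either launch can be verified the same way. A quick check that the expected model name is being served, assuming the `requests` package and the default port used above:
```python
import requests

# Lists the models the server is serving; the output should include "glm-ocr".
resp = requests.get("http://localhost:8080/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```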
##### Update Configuration
After launching the service, configure `config.yaml`:
```yaml
pipeline:
maas:
enabled: false # Disable MaaS mode (default)
ocr_api:
api_host: localhost # or your vLLM/SGLang server address
api_port: 8080
```
#### Option 3: Ollama/MLX
For specialized deployment scenarios, see the detailed guides:
- **[Apple Silicon with mlx-vlm](examples/mlx-deploy/README.md)** - Optimized for Apple Silicon Macs
- **[Ollama Deployment](examples/ollama-deploy/README.md)** - Simple local deployment with Ollama
### SDK Usage Guide
#### CLI
```bash
# Parse a single image
glmocr parse examples/source/code.png
# Parse a directory
glmocr parse examples/source/
# Set output directory
glmocr parse examples/source/code.png --output ./results/
# Use a custom config
glmocr parse examples/source/code.png --config my_config.yaml
# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG
```
#### Python API
```python
from glmocr import GlmOcr, parse
# Simple function
result = parse("image.png")
result = parse(["img1.png", "img2.jpg"])
result = parse("https://example.com/image.png")
result.save(output_dir="./results")
# Note: a list is treated as pages of a single document.
# Class-based API
with GlmOcr() as parser:
result = parser.parse("image.png")
print(result.json_result)
result.save()
```
#### Flask Service
```bash
# Start service
python -m glmocr.server
# With debug logging
python -m glmocr.server --log-level DEBUG
# Call API
curl -X POST http://localhost:5002/glmocr/parse \
-H "Content-Type: application/json" \
  -d '{"images": ["./examples/source/code.png"]}'
```
Semantics:
- `images` can be a string or a list.
- A list is treated as pages of a single document.
- For multiple independent documents, call the endpoint multiple times (one document per request).
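For programmatic callers, here is a Python equivalent of the curl command above (a sketch assuming the `requests` package and the default port 5002):
```python
import requests

# One request per document; a list of images is treated as pages.
resp = requests.post(
    "http://localhost:5002/glmocr/parse",
    json={"images": ["./examples/source/code.png"]},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```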
### Configuration
Full configuration in `glmocr/config.yaml`:
```yaml
# Server (for glmocr.server)
server:
host: "0.0.0.0"
port: 5002
debug: false
# Logging
logging:
level: INFO # DEBUG enables profiling
# Pipeline
pipeline:
# OCR API connection
ocr_api:
api_host: localhost
api_port: 8080
api_key: null # or set API_KEY env var
connect_timeout: 300
request_timeout: 300
# Page loader settings
page_loader:
max_tokens: 16384
temperature: 0.01
image_format: JPEG
min_pixels: 12544
max_pixels: 71372800
# Result formatting
result_formatter:
output_format: both # json, markdown, or both
# Layout detection (optional)
enable_layout: false
```
See [config.yaml](glmocr/config.yaml) for all options.
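If you prefer to derive a custom file rather than editing the shipped one, here is a sketch using PyYAML (the host value is a placeholder); pass the result to the CLI via `--config`:
```python
import yaml

# Start from the shipped defaults and override only what you need.
with open("glmocr/config.yaml") as f:
    config = yaml.safe_load(f)

config["pipeline"]["ocr_api"]["api_host"] = "my-ocr-host"  # placeholder host
config["pipeline"]["result_formatter"]["output_format"] = "markdown"

with open("my_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# Then: glmocr parse examples/source/code.png --config my_config.yaml
```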
### Output Formats
Here are two examples of output formats:
- JSON
```json
[[{ "index": 0, "label": "text", "content": "...", "bbox_2d": null }]]
```
- Markdown
```markdown
# Document Title
Body...
| Table | Content |
| ----- | ------- |
| ... | ... |
```
### Full Pipeline Example
You can run the example code like this:
```bash
python examples/example.py
```
Output structure (one folder per input):
- `result.json` – structured OCR result
- `result.md` – Markdown result
- `imgs/` – cropped image regions (when layout mode is enabled)
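A short sketch for consuming these outputs downstream, assuming the JSON schema from the Output Formats section (a list of pages, each a list of blocks); the folder name here is hypothetical:
```python
import json
from pathlib import Path

result_dir = Path("./results/code")  # hypothetical output folder for one input
pages = json.loads((result_dir / "result.json").read_text())

for page_no, blocks in enumerate(pages):
    for block in sorted(blocks, key=lambda b: b["index"]):
        print(page_no, block["label"], block["content"][:60])
```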
### Modular Architecture
GLM-OCR uses composable modules for easy customization:
| Component | Description |
| --------------------- | -------------------------------------- |
| `PageLoader` | Preprocessing and image encoding |
| `OCRClient` | Calls the GLM-OCR model service |
| `PPDocLayoutDetector` | PP-DocLayout layout detection |
| `ResultFormatter` | Post-processing, outputs JSON/Markdown |
You can extend the behavior by creating custom pipelines:
```python
from glmocr.dataloader import PageLoader
from glmocr.ocr_client import OCRClient
from glmocr.postprocess import ResultFormatter
class MyPipeline:
def __init__(self, config):
self.page_loader = PageLoader(config)
self.ocr_client = OCRClient(config)
self.formatter = ResultFormatter(config)
def process(self, request_data):
# Implement your own processing logic
pass
```
## Acknowledgement
This project is inspired by the excellent work of the following projects and communities:
- [PP-DocLayout-V3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [MinerU](https://github.com/opendatalab/MinerU)
## License
The code in this repository is licensed under the Apache License 2.0.
The GLM-OCR model is released under the MIT License.
The complete OCR pipeline integrates [PP-DocLayoutV3](https://huggingface.co/PaddlePaddle/PP-DocLayoutV3) for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.
## Citation
GLM-OCR technical report is coming soon.