# mineru-vl-utils

A Python package for interacting with the MinerU Vision-Language Model. It's a lightweight wrapper that simplifies the process of sending requests to and handling responses from the MinerU Vision-Language Model.

## About Backends

We provide 6 different backends (deployment modes):

1. **http-client**: An HTTP client for interacting with an OpenAI-compatible model server.
2. **transformers**: A backend for using HuggingFace Transformers models. (slow but simple to install)
3. **mlx-engine**: A backend for Apple Silicon devices running macOS.
4. **lmdeploy-engine**: A backend for using the LMDeploy engine.
5. **vllm-engine**: A backend for using the vLLM synchronous batching engine.
6. **vllm-async-engine**: A backend for using the vLLM asynchronous engine. (requires async programming)

## About Output Format

The MinerU Vision-Language Model handles document layout detection and text/table/equation recognition tasks in a single model. The output of the model is a list of `ContentBlock` objects, each representing a detected block in the document along with its content recognition results.

Each `ContentBlock` contains the following attributes:

- `type` (str): The type of the block, e.g., 'text', 'image', 'table', 'equation'.
  - For a complete list of supported block types, please refer to [structs.py](mineru_vl_utils/structs.py).
- `bbox` (list of floats): The bounding box of the block in the format [xmin, ymin, xmax, ymax], with coordinates normalized to the range [0, 1].
- `angle` (int or None): The rotation angle of the block, one of [0, 90, 180, 270].
  - `0` means upward.
  - `90` means rightward.
  - `180` means upside down.
  - `270` means leftward.
  - `None` means the angle is not specified.
- `content` (str or None): The recognized content of the block, if applicable.
  - For 'text' blocks, this is the recognized text.
  - For 'table' blocks, this is the recognized table in HTML format.
  - For 'equation' blocks, this is the recognized LaTeX code.
  - For 'image' blocks, this is `None`.
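For example, here is a minimal sketch of consuming these blocks (it assumes the `http-client` backend shown later in this README, but the loop itself is the same for any backend). Since `bbox` is normalized to [0, 1], scaling by the image size recovers pixel coordinates:

```python
from PIL import Image
from mineru_vl_utils import MinerUClient

client = MinerUClient(backend="http-client", server_url="http://127.0.0.1:8000")

image = Image.open("/path/to/the/test/image.png")
blocks = client.two_step_extract(image)  # -> list[ContentBlock]

for block in blocks:
    # bbox is normalized to [0, 1]; scale by the image size to get pixels.
    xmin, ymin, xmax, ymax = block.bbox
    pixel_bbox = (
        round(xmin * image.width), round(ymin * image.height),
        round(xmax * image.width), round(ymax * image.height),
    )
    if block.type == "image":
        # 'image' blocks have content=None; crop the region from the page instead.
        figure = image.crop(pixel_bbox)
        print("image", pixel_bbox, figure.size)
    else:
        # 'text' -> plain text, 'table' -> HTML, 'equation' -> LaTeX.
        print(block.type, block.angle, pixel_bbox, block.content)
```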
## Installation

For the `http-client` backend, just install the package via pip:

```bash
pip install -U mineru-vl-utils
```

For the `transformers` backend, install the package with the `transformers` extra:

```bash
pip install -U "mineru-vl-utils[transformers]"
```

For the `vllm-engine` and `vllm-async-engine` backends, install the package with the `vllm` extra:

```bash
pip install -U "mineru-vl-utils[vllm]"
```

For the `mlx-engine` backend, install the package with the `mlx` extra:

```bash
pip install -U "mineru-vl-utils[mlx]"
```

For the `lmdeploy-engine` backend, install the package with the `lmdeploy` extra:

```bash
pip install -U "mineru-vl-utils[lmdeploy]"
```

> [!NOTE]
> To use the `http-client` backend, you still need a separate `vllm`
> (or other LLM deployment tool) environment to serve the model as an HTTP server.

## Serving the Model (Optional)

> This is only needed if you want to use the `http-client` backend.

You can use `vllm` or another LLM deployment tool to serve the model. Here we only demonstrate how to do it with `vllm`.

With vllm>=0.10.1, you can use the following command to serve the model. The logits processor enables the `no_repeat_ngram_size` sampling parameter, which helps the model avoid generating repeated content.

```bash
vllm serve opendatalab/MinerU2.5-2509-1.2B --host 127.0.0.1 --port 8000 \
  --logits-processors mineru_vl_utils:MinerULogitsProcessor
```

If you are using vllm<0.10.1, the `no_repeat_ngram_size` sampling parameter is not supported. You can still serve the model without the logits processor:

```bash
vllm serve opendatalab/MinerU2.5-2509-1.2B --host 127.0.0.1 --port 8000
```

## Using `MinerUClient` in Code

Now you can use the `MinerUClient` class to interact with the model. The following are examples of using the different backends.

### `http-client` Example

```python
from PIL import Image
from mineru_vl_utils import MinerUClient

client = MinerUClient(
    backend="http-client",
    server_url="http://127.0.0.1:8000"
)

image = Image.open("/path/to/the/test/image.png")
extracted_blocks = client.two_step_extract(image)
print(extracted_blocks)
```

### `transformers` Example

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
from mineru_vl_utils import MinerUClient

# for transformers>=4.56.0
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B",
    dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B",
    use_fast=True
)

client = MinerUClient(
    backend="transformers",
    model=model,
    processor=processor
)

image = Image.open("/path/to/the/test/image.png")
extracted_blocks = client.two_step_extract(image)
print(extracted_blocks)
```

If you are using an older version of `transformers` (`transformers<4.56.0`), use `torch_dtype` instead of `dtype`:

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "opendatalab/MinerU2.5-2509-1.2B",
    torch_dtype="auto",
    device_map="auto"
)
```

### `mlx-engine` Example

```python
from mlx_vlm import load as mlx_load
from PIL import Image
from mineru_vl_utils import MinerUClient

model, processor = mlx_load("opendatalab/MinerU2.5-2509-1.2B")

client = MinerUClient(
    backend="mlx-engine",
    model=model,
    processor=processor
)

image = Image.open("/path/to/the/test/image.png")
extracted_blocks = client.two_step_extract(image)
print(extracted_blocks)
```

### `lmdeploy-engine` Example

For the default inference engine (currently `turbomind`):

```python
from lmdeploy.serve.vl_async_engine import VLAsyncEngine
from mineru_vl_utils import MinerUClient
from PIL import Image

if __name__ == "__main__":
    lmdeploy_engine = VLAsyncEngine("opendatalab/MinerU2.5-2509-1.2B")

    client = MinerUClient(
        backend="lmdeploy-engine",
        lmdeploy_engine=lmdeploy_engine,
    )

    image = Image.open("/path/to/the/test/image.png")
    extracted_blocks = client.two_step_extract(image)
    print(extracted_blocks)
```

For the PyTorch inference engine with the `ascend` accelerator:

```python
from lmdeploy import PytorchEngineConfig
from lmdeploy.serve.vl_async_engine import VLAsyncEngine
from mineru_vl_utils import MinerUClient
from PIL import Image

if __name__ == "__main__":
    lmdeploy_engine = VLAsyncEngine(
        "opendatalab/MinerU2.5-2509-1.2B",
        backend="pytorch",
        backend_config=PytorchEngineConfig(
            device_type="ascend",
        ),
    )

    client = MinerUClient(
        backend="lmdeploy-engine",
        lmdeploy_engine=lmdeploy_engine,
    )

    image = Image.open("/path/to/the/test/image.png")
    extracted_blocks = client.two_step_extract(image)
    print(extracted_blocks)
```

### `vllm-engine` Example

```python
from vllm import LLM
from PIL import Image
from mineru_vl_utils import MinerUClient
from mineru_vl_utils import MinerULogitsProcessor  # if vllm>=0.10.1

llm = LLM(
    model="opendatalab/MinerU2.5-2509-1.2B",
    logits_processors=[MinerULogitsProcessor]  # if vllm>=0.10.1
)

client = MinerUClient(
    backend="vllm-engine",
    vllm_llm=llm
)

image = Image.open("/path/to/the/test/image.png")
extracted_blocks = client.two_step_extract(image)
print(extracted_blocks)
```

### `vllm-async-engine` Example

```python
import io
import asyncio
import aiofiles

from vllm.v1.engine.async_llm import AsyncLLM
from vllm.engine.arg_utils import AsyncEngineArgs
from PIL import Image
from mineru_vl_utils import MinerUClient
from mineru_vl_utils import MinerULogitsProcessor  # if vllm>=0.10.1

async_llm = AsyncLLM.from_engine_args(
    AsyncEngineArgs(
        model="opendatalab/MinerU2.5-2509-1.2B",
        logits_processors=[MinerULogitsProcessor]  # if vllm>=0.10.1
    )
)

client = MinerUClient(
    backend="vllm-async-engine",
    vllm_async_llm=async_llm,
)

async def main():
    image_path = "/path/to/the/test/image.png"
    async with aiofiles.open(image_path, "rb") as f:
        image_data = await f.read()
    image = Image.open(io.BytesIO(image_data))
    extracted_blocks = await client.aio_two_step_extract(image)
    print(extracted_blocks)

asyncio.run(main())

async_llm.shutdown()
```

## Other APIs

Besides the `two_step_extract` method, `MinerUClient` also provides other APIs for interacting with the model. The following are the main APIs:

```python
class MinerUClient:
    def layout_detect(self, image: Image.Image) -> list[ContentBlock]: ...
    def batch_layout_detect(self, images: list[Image.Image]) -> list[list[ContentBlock]]: ...
    async def aio_layout_detect(self, image: Image.Image) -> list[ContentBlock]: ...
    async def aio_batch_layout_detect(self, images: list[Image.Image]) -> list[list[ContentBlock]]: ...
    def two_step_extract(self, image: Image.Image) -> list[ContentBlock]: ...
    def batch_two_step_extract(self, images: list[Image.Image]) -> list[list[ContentBlock]]: ...
    async def aio_two_step_extract(self, image: Image.Image) -> list[ContentBlock]: ...
    async def aio_batch_two_step_extract(self, images: list[Image.Image]) -> list[list[ContentBlock]]: ...
```
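For example, here is an illustrative sketch of the batch variants (assuming the `http-client` backend, and assuming the result lists come back in input order; the page paths are placeholders):

```python
from PIL import Image
from mineru_vl_utils import MinerUClient

client = MinerUClient(backend="http-client", server_url="http://127.0.0.1:8000")

image_paths = ["/path/to/page-1.png", "/path/to/page-2.png"]
images = [Image.open(p) for p in image_paths]

# batch_two_step_extract returns list[list[ContentBlock]]:
# one block list per input image (assumed to be in input order).
results = client.batch_two_step_extract(images)
for path, blocks in zip(image_paths, results):
    print(path, "->", len(blocks), "blocks")

# layout_detect presumably runs only the layout-detection step,
# returning blocks without the content-recognition pass.
layout_blocks = client.layout_detect(images[0])
```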
## Limitations

The `transformers` backend is slow and not suitable for production use.

`MinerUClient` only supports standalone image(s) as input. PDF and DOCX files are not planned to be supported. Cross-page and cross-document operations are not planned to be supported either.

For production use cases, please use [MinerU](https://github.com/opendatalab/mineru), which is a more complete toolkit for document analysis and data extraction.