# FunASR-Nano ONNX

ONNX export and inference implementation for the FunASR-Nano model.

## Requirements

- Python >= 3.8
- PyTorch >= 2.0
- ONNX Runtime >= 1.15
- transformers
- funasr (for feature extraction)
- modelscope (for downloading models)

Install dependencies:

```bash
pip install -r requirements.txt
pip install modelscope
```

## Quick Start

### 1. Download Models

Download the pre-trained ONNX models from ModelScope into the `models/` directory:

```bash
modelscope download --model zengshuishui/FunASR-nano-onnx --output_dir models
```

After downloading, the `models/` directory will contain:

- `encoder_adaptor.onnx` and `encoder_adaptor.onnx.data`
- `llm.onnx` and `llm.onnx.data`
- `encoder_adaptor.int8.onnx` and `llm.int8.onnx` (INT8 quantized versions)
- `embedding.onnx` and `embedding.int8.onnx`

### 2. Run Inference

**With ONNX Embedding Model (Recommended)**:

```bash
python inference.py \
  --encoder-adaptor-model models/encoder_adaptor.onnx \
  --llm-model models/llm.onnx \
  --llm-tokenizer models/Qwen3-0.6B \
  --embedding-model models/embedding.onnx \
  --wave examples/zh.mp3 \
  --prompt "语音转写:" \
  --max-new-tokens 512 \
  --device auto
```

**Without ONNX Embedding Model (requires `model.safetensors`)**:

```bash
python inference.py \
  --encoder-adaptor-model models/encoder_adaptor.onnx \
  --llm-model models/llm.onnx \
  --llm-tokenizer models/Qwen3-0.6B \
  --wave examples/zh.mp3 \
  --prompt "语音转写:" \
  --max-new-tokens 512 \
  --device auto
```

**Parameters**:

- `--device`: Inference device: `cpu`, `cuda`, or `auto` (default: `auto`; the LLM runs on CPU by default due to float16 issues with the CUDA execution provider)
- `--embedding-model`: Path to the ONNX embedding model (optional; if not provided, the embedding weights are loaded from the PyTorch model under `--llm-tokenizer`, which requires `model.safetensors`)
- `--seed`: Random seed for reproducible results (default: 42)
- `--temperature`: Sampling temperature (default: 0.3)
- `--top-p`: Top-p (nucleus) sampling threshold (default: 0.8; see the sampling sketch after the model description below)

### 3. Export ONNX Models (Optional)

If you need to export the ONNX models from the original `model.pt`:

#### Export Encoder+Adaptor

```bash
python scripts/export_encoder_adaptor_onnx.py \
  --model-pt /path/to/model.pt \
  --output-filename models/encoder_adaptor.onnx \
  --opset-version 18
```

#### Export LLM

```bash
python scripts/export_llm_onnx.py \
  --model-pt /path/to/model.pt \
  --llm-config-path /path/to/Qwen3-0.6B \
  --output-filename models/llm.onnx \
  --opset-version 18
```

#### Export Embedding Layer (Optional, Recommended)

To avoid loading the full PyTorch model during inference, you can export the embedding layer to ONNX:

```bash
python scripts/export_embedding_onnx.py \
  --llm-config-path /path/to/Qwen3-0.6B \
  --output-filename models/embedding.onnx \
  --opset-version 18 \
  --verify
```

This script will:

1. Check that `model.safetensors` exists in the LLM config directory
2. Export the embedding layer to ONNX format
3. Create an INT8 quantized version
4. Verify the exported model by comparing its outputs with the PyTorch model's (if `--verify` is passed)

**Note**: The embedding ONNX model eliminates the need for `model.safetensors` during inference, reducing memory usage and startup time. The INT8 quantized version further reduces model size while maintaining accuracy.

## Model Description

### Encoder+Adaptor Model

- **Input**: Audio features `(batch, time, 560)`
- **Output**: LLM embeddings `(batch, time, 1024)`
- **Supports dynamic sequence length**

### LLM Model

- **Input**:
  - `inputs_embeds`: `(batch, sequence_length, 1024)`
  - `attention_mask`: `(batch, sequence_length)`
- **Output**: `logits`: `(batch, sequence_length, vocab_size)`
- **Supports dynamic sequence length**

### Embedding Model (Optional)

- **Input**: `input_ids`: `(batch, sequence_length)`, int64 token IDs
- **Output**: `embeddings`: `(batch, sequence_length, 1024)`, token embeddings
- **Supports dynamic sequence length**
- **Purpose**: Converts token IDs to embeddings, eliminating the need for the full PyTorch model during inference
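The shapes above are enough to drive the three models directly with `onnxruntime`. The following is a minimal sketch of the inference flow, not the project's actual `inference.py`; the tensor names (`feats`, `input_ids`, `inputs_embeds`, `attention_mask`) and the prompt/audio ordering are assumptions, so check `session.get_inputs()` and `session.get_outputs()` against the real exports:

```python
# Minimal decoding sketch. Tensor names below are assumed, not confirmed;
# inspect the exported models if they differ.
import numpy as np
import onnxruntime as ort

enc = ort.InferenceSession("models/encoder_adaptor.onnx", providers=["CPUExecutionProvider"])
emb = ort.InferenceSession("models/embedding.onnx", providers=["CPUExecutionProvider"])
llm = ort.InferenceSession("models/llm.onnx", providers=["CPUExecutionProvider"])

def embed(token_ids):
    """(n,) int64 token IDs -> (1, n, 1024) token embeddings."""
    ids = np.asarray(token_ids, dtype=np.int64)[None, :]
    return emb.run(None, {"input_ids": ids})[0]

def transcribe(feats, prompt_ids, eos_id, max_new_tokens=512):
    """feats: (1, time, 560) float32 audio features from the funasr frontend."""
    audio = enc.run(None, {"feats": feats})[0]                # (1, t, 1024)
    # Assumed ordering: prompt embeddings followed by audio embeddings.
    seq = np.concatenate([embed(prompt_ids), audio], axis=1)
    tokens = []
    for _ in range(max_new_tokens):
        mask = np.ones(seq.shape[:2], dtype=np.int64)
        logits = llm.run(None, {"inputs_embeds": seq, "attention_mask": mask})[0]
        nxt = int(logits[0, -1].argmax())                     # greedy; see sampling below
        if nxt == eos_id:
            break
        tokens.append(nxt)
        # The exported LLM takes only inputs_embeds/attention_mask (no KV cache
        # inputs are documented), so the whole sequence is re-fed each step.
        seq = np.concatenate([seq, embed([nxt])], axis=1)
    return tokens
```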
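The greedy `argmax` above is the deterministic case. The `--temperature`, `--top-p`, and `--seed` parameters suggest temperature-scaled nucleus sampling; a sketch of what that typically looks like (the actual logic in `inference.py` may differ):

```python
import numpy as np

def sample_top_p(last_logits, temperature=0.3, top_p=0.8, rng=None):
    """Pick one token ID from a (vocab_size,) logit vector."""
    rng = rng or np.random.default_rng(42)     # --seed
    scaled = last_logits / temperature         # --temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    cum = np.cumsum(probs[order])
    # Smallest set of tokens whose cumulative probability reaches --top-p.
    keep = order[: np.searchsorted(cum, top_p) + 1]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```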
## GPU Acceleration

Make sure `onnxruntime-gpu` is installed:

```bash
pip install onnxruntime-gpu
```

Note: Due to CUDA execution provider issues with float16, the LLM model runs on CPU by default. The Encoder+Adaptor model can use the GPU if one is available.

Use the GPU for the Encoder+Adaptor (the LLM stays on CPU):

```bash
python inference.py \
  --encoder-adaptor-model models/encoder_adaptor.onnx \
  --llm-model models/llm.onnx \
  --llm-tokenizer models/Qwen3-0.6B \
  --wave examples/zh.mp3 \
  --device cuda
```

Use the CPU for all models:

```bash
python inference.py \
  --encoder-adaptor-model models/encoder_adaptor.onnx \
  --llm-model models/llm.onnx \
  --llm-tokenizer models/Qwen3-0.6B \
  --wave examples/zh.mp3 \
  --device cpu
```

## License

Please refer to the license of the original FunASR project.

## Acknowledgments

- Based on the [FunASR](https://github.com/alibaba-damo-academy/FunASR) project.
- Code structure and ONNX export implementation inspired by [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx).