---
license: apache-2.0
base_model:
- MedAIBase/AntAngelMed
---

# AntAngelMed-eagle3

## Model Overview

**AntAngelMed-eagle3** is a high-performance draft model designed for inference acceleration. Built on EAGLE3 speculative sampling, it balances inference speed against output stability. The model is trained on **high-quality medical datasets**, significantly boosting inference throughput while maintaining high accuracy, making it well suited to high-load production environments.

## Key Features

- **Speculative Sampling Optimization**: Based on EAGLE3, achieving a high verification pass rate at a speculative length of 4
- **Outstanding Throughput**: The FP8 quantization + EAGLE3 configuration improves throughput by up to nearly 90%
- **Production-Grade Optimization**: 3267 tokens/s output throughput on a single NVIDIA H200

## Performance

### Speculative Sampling Efficiency

Average acceptance length at a speculative length of 4:

| Benchmark | Average Acceptance Length |
|-----------|---------------------------|
| HumanEval | 2.816 |
| GSM8K | 3.24 |
| Math-500 | 3.326 |
| Med_MCPA | 2.600 |
| Health_Bench | 2.446 |

### Throughput Improvement

Throughput improvement of **FP8 quantization + EAGLE3** over FP8-only at a concurrency of 16:

| Benchmark | Throughput Improvement |
|-----------|------------------------|
| HumanEval | **+67.3%** |
| GSM8K | **+58.6%** |
| Math-500 | **+89.8%** |
| Med_MCPA | **+46%** |
| Health_Bench | **+45.3%** |

### Ultimate Inference Performance

- **Hardware Environment**: single NVIDIA H200 GPU

![1](https://hackmd.io/_uploads/BJF9a7MNZe.png)
![2](https://hackmd.io/_uploads/H15K1NMV-e.png)
![3](https://hackmd.io/_uploads/H16nT7fN-e.png)

*Figure: Throughput performance comparison and accuracy metrics under equal compute on 1xH200*

## Technical Specifications

- **Model Architecture**: LlamaForCausalLMEagle3
- **Number of Layers**: 1 (draft model)
- **Hidden Size**: 4096
- **Attention Heads**: 32 (KV heads: 8)
- **Intermediate Size**: 14336
- **Vocabulary Size**: 157,184
- **Max Position Embeddings**: 32,768
- **Data Type**: bfloat16

## Quick Start

### Requirements

- GPU with H200-class compute
- CUDA 12.0+
- PyTorch 2.0+

### Installation

```bash
pip install sglang==0.5.6
```

Then apply the changes from [PR #15119](https://github.com/sgl-project/sglang/pull/15119).

### Inference with SGLang

```bash
python3 -m sglang.launch_server \
    --model-path MedAIBase/AntAngelMed-FP8 \
    --host 0.0.0.0 --port 30012 \
    --trust-remote-code \
    --attention-backend fa3 \
    --mem-fraction-static 0.9 \
    --tp-size 1 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path MedAIBase/AntAngelMed-eagle3 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```
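Once the server is running, it can be queried through SGLang's OpenAI-compatible HTTP API. The minimal sketch below assumes the host/port from the launch command above; the `model` value and the prompt are illustrative placeholders, not part of the original card.

```python
# Minimal client sketch for the server launched above (assumptions:
# server reachable at localhost:30012; model name matching the served path).
# Speculative decoding runs entirely server-side, so the client code is
# the same as for any OpenAI-compatible endpoint.
import requests

resp = requests.post(
    "http://localhost:30012/v1/chat/completions",
    json={
        "model": "MedAIBase/AntAngelMed-FP8",  # target model; the draft model is transparent to clients
        "messages": [
            {"role": "user", "content": "List common contraindications of aspirin."}  # example prompt
        ],
        "max_tokens": 256,
        "temperature": 0.0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The draft model does not change what clients see: draft tokens are verified by the target model, so outputs follow the target model's distribution while generation runs faster.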
## Training Data

- **Data Quality**: Rigorously filtered and cleaned to ensure high-quality training data

## Use Cases

- High-concurrency inference services
- Real-time dialogue systems
- Code generation and completion
- Mathematical reasoning and computation
- Production environments requiring low-latency responses

## Open Source Contribution

We actively contribute back to the open-source community. Related optimization work has been submitted to the **SGLang community**:

- PR #15119: [EAGLE3 Optimization Implementation](https://github.com/sgl-project/sglang/pull/15119)

## Limitations and Notes

- This is a draft model; it must be paired with a target model to perform speculative sampling
- FP8 quantization of the target model is recommended for optimal performance
- Performance may vary across hardware platforms
- Medical-domain applications must comply with relevant regulations; model outputs are for reference only

## License

This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE).