# 2025-hpk-hw2

**Repository Path**: gpuap/2025-hpk-hw2

## Basic Information

- **Project Name**: 2025-hpk-hw2
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-25
- **Last Updated**: 2026-01-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

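# Simple vLLM Inference Service
> Project code for Major Assignment 2 of the Fall 2025 UCAS course "GPU Architecture and Programming": an LLM inference service on NVIDIA devices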


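## Project Structure

- `Dockerfile`: Configuration file for building the container image.
- `serve.py`: Core code of the inference service. This program cannot access the Internet.
- `requirements.txt`: Python dependency list. You may add any libraries you need.
- `.gitignore`: List of files ignored by Git version control.
- `download_model.py`: Script that downloads the model weights. You may modify it as needed.
- `README.md`: This document.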

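## API Specification

### Sequential Inference

When the health check endpoint is configured to return `{"status": "ok"}`, the judge system sends single inference `POST` requests to the `/predict` endpoint one at a time. The JSON body has the following format: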

```json
{
  "prompt": "Your question here"
}
```
Your service must handle this request correctly and return a JSON response in the following format:

```json
{
  "response": "Your model's answer here"
}
```

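### Batch Inference
When the health check endpoint is configured to return `{"status": "batch"}`, the judge system sends all inference requests to the `/predict` endpoint in a single `POST`, as a list. The JSON body has the following format: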

```json
[
  {
    "prompt": "Your question1 here"
  },
  {
    "prompt": "Your question2 here"
  }
  ...
]
```
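Your service must handle this request correctly and return all responses as a JSON list, in the same order as the requests, in the following format: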

```json
[
  {
    "response": "Your model's answer1 here"
  },
  {
    "response": "Your model's answer2 here"
  }
  ...
]
```

**Be sure to keep this API contract unchanged!**
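
The two modes above can be served by one handler that dispatches on the shape of the request body. The following is a minimal sketch, not the reference implementation: `health_payload`, `handle_predict`, `GenerateFn`, and `echo_generate` are hypothetical names introduced here, and the model call is injected as a plain function so the contract logic can be checked without a GPU. The path of the health check endpoint is not given in this document, so the sketch only builds its response body.

```python
from typing import Callable, List, Union

# Hypothetical type: any function mapping a list of prompts to a list of
# answers in the same order (for example, a wrapper around a batched model call).
GenerateFn = Callable[[List[str]], List[str]]


def health_payload(batch_mode: bool) -> dict:
    """Health check body: "batch" selects batch mode, "ok" selects sequential mode."""
    return {"status": "batch" if batch_mode else "ok"}


def handle_predict(body: Union[dict, list], generate: GenerateFn) -> Union[dict, list]:
    """Answer a /predict body in either mode, preserving request order."""
    if isinstance(body, dict):
        # Sequential mode: one {"prompt": ...} object in, one {"response": ...} out.
        return {"response": generate([body["prompt"]])[0]}
    # Batch mode: a list of {"prompt": ...} objects in, a same-length list out.
    prompts = [item["prompt"] for item in body]
    answers = generate(prompts)
    if len(answers) != len(prompts):
        raise ValueError("generate() must return exactly one answer per prompt")
    return [{"response": answer} for answer in answers]


def echo_generate(prompts: List[str]) -> List[str]:
    """Stand-in for the real model, used only to exercise the handler."""
    return [f"echo: {p}" for p in prompts]
```

In a real `serve.py`, `echo_generate` would be replaced by a call into the inference engine, and `handle_predict` would be invoked from whatever web framework route serves `/predict`.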

## Platform Configuration

The judge machine is configured as follows:

``` text
os: ubuntu24.04
cpu: 14 cores
memory: 120GB
GPU: RTX5090 (VRAM: 32GB)
cuda version: 13.0
network bandwidth: 100Mbps
```

The judge system is configured as follows:

``` text
docker build stage: 1500s
docker run - health check stage: 180s
docker run - predict stage: 360s
```
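
The limits above bound how much data can be fetched before the service starts. The sketch below works through the arithmetic. It assumes that weights are downloaded during the build stage (since `serve.py` has no Internet access) and that the full 100 Mbps is available, which is an upper bound rather than a measured rate. The function names and the 1 GB example size are hypothetical.

```python
# Figures taken from the judge configuration above.
BANDWIDTH_MBPS = 100   # network bandwidth in megabits per second
BUILD_LIMIT_S = 1500   # docker build stage limit in seconds


def download_seconds(size_gb: float, bandwidth_mbps: float = BANDWIDTH_MBPS) -> float:
    """Ideal transfer time for size_gb gigabytes (1 GB = 1000 MB, 8 bits per byte)."""
    return size_gb * 1000 * 8 / bandwidth_mbps


def max_download_gb(limit_s: float = BUILD_LIMIT_S,
                    bandwidth_mbps: float = BANDWIDTH_MBPS) -> float:
    """Largest download that fits in limit_s seconds at the ideal rate."""
    return limit_s * bandwidth_mbps / 8 / 1000


# Hypothetical 1 GB checkpoint: 80 s at the ideal rate.
print(download_seconds(1.0))   # 80.0
# Ceiling for the whole build stage, ignoring every other build step: 18.75 GB.
print(max_download_gb())       # 18.75
```

Real builds also spend time installing dependencies, so the usable download budget is smaller than this ceiling.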

## Models
- Fine-tuning dataset: https://modelscope.cn/datasets/modeledom/ucas_gpu_data
- Fine-tuned model (Qwen2.5-0.5B): https://modelscope.cn/models/modeledom/ucas_gpu
- Pretrained small model (Qwen2-88M-instruct): https://modelscope.cn/models/modeledom/qwen2-88M-instruct
- Fine-tuned small model (Qwen2-88M): https://modelscope.cn/models/modeledom/ucas_gpu_titan
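
One way to fetch any of the models listed above is ModelScope's `snapshot_download`. This is a sketch under assumptions, not the contents of the repository's `download_model.py`: the short labels in `MODEL_IDS`, the `download` helper, and the `./models` cache directory are hypothetical, while the model IDs are taken from the URLs above. It requires the `modelscope` package and network access.

```python
# Hypothetical short labels mapped to the ModelScope model IDs from the list above.
MODEL_IDS = {
    "finetuned-0.5b": "modeledom/ucas_gpu",
    "pretrained-88m": "modeledom/qwen2-88M-instruct",
    "finetuned-88m": "modeledom/ucas_gpu_titan",
}


def download(name: str, cache_dir: str = "./models") -> str:
    """Download the model registered under `name` and return its local directory."""
    model_id = MODEL_IDS[name]  # raises KeyError for an unknown label
    # Imported lazily so this module can be loaded without modelscope installed.
    from modelscope import snapshot_download
    return snapshot_download(model_id, cache_dir=cache_dir)
```

Calling `download("finetuned-0.5b")` returns a local path that the serving code can load from without network access.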