# Local LLM Deployment & Invocation - Streaming Output

**Repository Path**: fubob/project_localllm_deploy_calling_encapsulation

## Basic Information

- **Project Name**: Local LLM Deployment & Invocation - Streaming Output (本地模型部署与调用-流式输出)
- **Description**: Deploy a large model locally, call it for testing, and encapsulate it with FastAPI
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-10
- **Last Updated**: 2025-08-05

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

### 🧠Project Introduction

Build a lightweight local REST API that simulates a core feature of ModelVault's product: receiving a prompt and returning a generated response.

### 🧪Project Structure

```
minivault-api/
├── app.py              # API code
├── logs/app.log        # Logs of prompts/responses
├── requirements.txt    # pip install requirements
├── test.gif            # Test walkthrough
└── README.md
```

### ⚙️Deploy Local LLM

- **Deploy CLI**

```
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model="/root/autodl-tmp/Qwen2.5-0.5B-Instruct" \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization=0.9 \
    --max-num-seqs=256 \
    --max-model-len=2048 \
    --max-num-batched-tokens=2048 \
    --served-model-name Qwen \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code
```

![image-20250709231846340](https://gitee.com/fubob/note-pic/raw/master/image/image-20250709231846340.png)

- **API Test**

```
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen",
    "messages": [
      {
        "role": "user",
        "content": "hello, who are you?"
      }
    ],
    "max_tokens": 250,
    "top_k": -1,
    "top_p": 1,
    "temperature": 0,
    "ignore_eos": false,
    "stream": false
  }'
```

### ⚙️Minimal Local REST API

Request the local REST API through the **requests** library, with optional parameters such as **max_tokens, top_k, top_p, temperature, ignore_eos, stream**, etc.

```python
def client_infer(self, prompt):
    logger.info(f"Starting client_infer with prompt: {prompt}")
    try:
        data = {
            "model": self.model_name,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "max_tokens": 800,
            "top_k": -1,
            "top_p": 1,
            "temperature": 0.7,
            "ignore_eos": False,
            "stream": False
        }
        response = requests.post(self.url, json=data)
        result_all = json.loads(response.text)
        result = result_all['choices'][0]['message']['content']
        logger.info("client_infer completed successfully")
        return {"response": result}
    except Exception as e:
        logger.error(f"Error in client_infer: {e}", exc_info=True)
        raise
```

****

### ⚙️Stream Output

Set `stream` to `true` and consume the byte stream returned by the model:

```python
def client_infer_stream(self, prompt):
    logger.info(f"Starting client_infer_stream with prompt: {prompt}")
    try:
        data = {
            "model": self.model_name,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "max_tokens": 800,
            "top_k": -1,
            "top_p": 1,
            "temperature": 0.7,
            "ignore_eos": False,
            "stream": True
        }
        # stream=True keeps the connection open so chunks arrive incrementally
        response = requests.post(self.url, json=data, stream=True)
        if response.status_code == 200:
            # Process and yield each chunk from the response
            for line in response.iter_lines():
                if not line:
                    # Skip blank lines (such as SSE heartbeats or end-of-event padding)
                    continue
                # Remove the possible "data:" prefix (compatible with SSE format)
                line = line.decode('utf-8').strip()
                if line.startswith("data:"):
                    line = line[5:].strip()
                # Check for the end-of-stream flag (such as OpenAI's [DONE])
                if line == "[DONE]":
                    break
                try:
                    json_data = json.loads(line)
                    # Extract and yield only the content delta from the chunk
                    if (
                        isinstance(json_data, dict)
                        and "choices" in json_data
                        and len(json_data["choices"]) > 0
                        and "delta" in json_data["choices"][0]
                        and "content" in json_data["choices"][0]["delta"]
                    ):
                        content = json_data["choices"][0]["delta"]["content"]
                        if content:  # Yield only non-empty content
                            logger.debug(f"Stream content: {content}")
                            yield f"data: {content} \n\n"
                except json.JSONDecodeError:
                    logger.warning(f"JSON decode error, raw data: {line}")
                    yield f"JSON decode error, raw data: {line}"
                except KeyError as e:
                    logger.warning(f"Missing expected field: {e}, raw data: {line}")
                    yield f"Missing expected field: {e}, raw data: {line}"
                except Exception as e:
                    logger.error(f"Unknown error: {e}, raw data: {line}", exc_info=True)
                    yield f"Unknown error: {e}, raw data: {line}"
        else:
            logger.error(f"Request failed with status code: {response.status_code}")
            yield f"Request failed with status code: {response.status_code}"
    except Exception as e:
        logger.error(f"Error in client_infer_stream: {e}", exc_info=True)
        yield f"Error in client_infer_stream: {e}"
```

****

### ⚙️Interface Encapsulation

Encapsulate the **generate** and **generate_stream** interfaces with FastAPI:

```python
@app.post("/generate/")
async def generate(item: Item):
    logger.info(f"Generate endpoint called with prompt: {item.prompt}")
    result = local_llm_client.client_infer(item.prompt)
    return result


@app.post("/generate_stream/")
async def generate_stream(item: Item):
    logger.info(f"Generate_stream endpoint called with prompt: {item.prompt}")
    return StreamingResponse(local_llm_client.client_infer_stream(item.prompt),
                             media_type="text/event-stream")
```

****

### 🔧Test

**Test the API with the Apifox tool**

![test](https://gitee.com/fubob/note-pic/raw/master/image/test.gif)

**Test the API with the CLI**

```cmd
curl --location --request POST 'http://localhost:50000/generate/' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prompt": "hello, who are you?"
}'
```

```cmd
curl --location --request POST 'http://localhost:50000/generate_stream/' \
--header 'Accept: text/event-stream' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prompt": "hello, who are you?"
}'
```
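The per-chunk parsing steps inside `client_infer_stream` (strip the `data:` prefix, detect the `[DONE]` flag, pull out `choices[0].delta.content`) can be factored into a small helper that is easy to unit-test without a running server. A minimal sketch — the helper name `extract_delta` is illustrative and not part of this project:

```python
import json


def extract_delta(line: str):
    """Return the content delta carried by one SSE line of a streaming
    chat-completions response, or None for heartbeats, the [DONE]
    sentinel, and chunks without a content delta."""
    line = line.strip()
    if line.startswith("data:"):       # strip the SSE "data:" prefix
        line = line[5:].strip()
    if not line or line == "[DONE]":   # blank keep-alive or end-of-stream flag
        return None
    payload = json.loads(line)
    choices = payload.get("choices", [])
    if choices and "content" in choices[0].get("delta", {}):
        return choices[0]["delta"]["content"]
    return None


# Example chunks, as they would arrive via response.iter_lines():
print(extract_delta('data: {"choices": [{"delta": {"content": "Hello"}}]}'))  # Hello
print(extract_delta("data: [DONE]"))  # None
```

Keeping the parsing pure like this lets the generator in `client_infer_stream` shrink to network I/O plus a call to the helper, and the SSE edge cases can be covered by plain assertions instead of end-to-end tests.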