# Local LLM Deployment & Invocation - Streaming Output

**Repository Path**: fubob/project_localllm_deploy_calling_encapsulation

## Basic Information

- **Project Name**: Local LLM Deployment & Invocation - Streaming Output (本地模型部署与调用-流式输出)
- **Description**: Deploy a large model locally, call it for testing, and encapsulate it with FastAPI
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-10
- **Last Updated**: 2025-08-05

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

### 🧠Project Introduction

Build a lightweight local REST API that simulates a core feature of ModelVault's product: receiving a prompt and returning a generated response.

### 🧪Project Structure

```
minivault-api/
├── app.py              # API code
├── logs/app.log        # Logs of prompts/responses
├── requirements.txt    # pip install requirements
├── test.gif            # Test walkthrough
└── README.md
```

### ⚙️Deploy Local LLM

- **Deploy CLI**

```
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model="/root/autodl-tmp/Qwen2.5-0.5B-Instruct" \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization=0.9 \
    --max-num-seqs=256 \
    --max-model-len=2048 \
    --max-num-batched-tokens=2048 \
    --served-model-name Qwen \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code
```

![image-20250709231846340](https://gitee.com/fubob/note-pic/raw/master/image/image-20250709231846340.png)

- **API Test**

```
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen",
    "messages": [
      {
        "role": "user",
        "content": "hello, who are you?"
      }
    ],
    "max_tokens": 250,
    "top_k": -1,
    "top_p": 1,
    "temperature": 0,
    "ignore_eos": false,
    "stream": false
  }'
```

### ⚙️Minimal Local REST API

Request the local REST API through the **requests** library, with optional parameters such as **max_tokens, top_k, top_p, temperature, ignore_eos, stream**, etc.

```python
def client_infer(self, prompt):
    logger.info(f"Starting client_infer with prompt: {prompt}")
    try:
        data = {
            "model": self.model_name,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "max_tokens": 800,
            "top_k": -1,
            "top_p": 1,
            "temperature": 0.7,
            "ignore_eos": False,
            "stream": False
        }
        response = requests.post(self.url, json=data)
        result_all = json.loads(response.text)
        result = result_all['choices'][0]['message']['content']
        logger.info("client_infer completed successfully")
        return {"response": result}
    except Exception as e:
        logger.error(f"Error in client_infer: {e}", exc_info=True)
        raise
```

****

### ⚙️Stream Output

Set `stream` to `true` and consume the byte stream returned by the model:

```python
def client_infer_stream(self, prompt):
    logger.info(f"Starting client_infer_stream with prompt: {prompt}")
    try:
        data = {
            "model": self.model_name,
            "messages": [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "max_tokens": 800,
            "top_k": -1,
            "top_p": 1,
            "temperature": 0.7,
            "ignore_eos": False,
            "stream": True
        }
        # stream=True keeps the connection open so chunks arrive incrementally
        response = requests.post(self.url, json=data, stream=True)
        if response.status_code == 200:
            # Process and yield each chunk from the response
            for line in response.iter_lines():
                if not line:
                    # Skip blank lines (such as SSE heartbeats or end-of-event padding)
                    continue
                # Remove the possible "data:" prefix (compatible with SSE format)
                line = line.decode('utf-8').strip()
                if line.startswith("data:"):
                    line = line[5:].strip()
                # Check for the end-of-stream flag (such as OpenAI's [DONE])
                if line == "[DONE]":
                    break
                try:
                    json_data = json.loads(line)
                    # Extract and yield only the content delta from the chunk
                    if (
                        isinstance(json_data, dict)
                        and "choices" in json_data
                        and len(json_data["choices"]) > 0
                        and "delta" in json_data["choices"][0]
                        and "content" in json_data["choices"][0]["delta"]
                    ):
                        content = json_data["choices"][0]["delta"]["content"]
                        if content:  # Yield only non-empty content
                            logger.debug(f"Stream content: {content}")
                            yield f"data: {content} \n\n"
                except json.JSONDecodeError:
                    logger.warning(f"JSON decode error, raw data: {line}")
                    yield f"JSON decode error, raw data: {line}"
                except KeyError as e:
                    logger.warning(f"Missing expected field: {e}, raw data: {line}")
                    yield f"Missing expected field: {e}, raw data: {line}"
                except Exception as e:
                    logger.error(f"Unknown error: {e}, raw data: {line}", exc_info=True)
                    yield f"Unknown error: {e}, raw data: {line}"
        else:
            logger.error(f"Request failed with status code: {response.status_code}")
            yield f"Request failed with status code: {response.status_code}"
    except Exception as e:
        logger.error(f"Error in client_infer_stream: {e}", exc_info=True)
        yield f"Error in client_infer_stream: {e}"
```

****

### ⚙️Interface Encapsulation

Encapsulate the **generate** and **generate_stream** interfaces with FastAPI:

```python
@app.post("/generate/")
async def generate(item: Item):
    logger.info(f"Generate endpoint called with prompt: {item.prompt}")
    result = local_llm_client.client_infer(item.prompt)
    return result


@app.post("/generate_stream/")
async def generate_stream(item: Item):
    logger.info(f"Generate_stream endpoint called with prompt: {item.prompt}")
    return StreamingResponse(local_llm_client.client_infer_stream(item.prompt),
                             media_type="text/event-stream")
```

****

### 🔧Test

**Test the API with the Apifox tool**

![test](https://gitee.com/fubob/note-pic/raw/master/image/test.gif)

**Test the API with the CLI**

```cmd
curl --location --request POST 'http://localhost:50000/generate/' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prompt": "hello, who are you?"
}'
```

```cmd
curl --location --request POST 'http://localhost:50000/generate_stream/' \
--header 'Accept: text/event-stream' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prompt": "hello, who are you?"
}'
```
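The per-chunk parsing steps inside `client_infer_stream` (strip the `data:` prefix, detect the `[DONE]` flag, pull out `choices[0].delta.content`) can be factored into a small helper that is easy to unit-test without a running server. A minimal sketch — the helper name `extract_delta` is illustrative and not part of this project:

```python
import json


def extract_delta(line: str):
    """Return the content delta carried by one SSE line of a streaming
    chat-completions response, or None for heartbeats, the [DONE]
    sentinel, and chunks without a content delta."""
    line = line.strip()
    if line.startswith("data:"):       # strip the SSE "data:" prefix
        line = line[5:].strip()
    if not line or line == "[DONE]":   # blank keep-alive or end-of-stream flag
        return None
    payload = json.loads(line)
    choices = payload.get("choices", [])
    if choices and "content" in choices[0].get("delta", {}):
        return choices[0]["delta"]["content"]
    return None


# Example chunks, as they would arrive via response.iter_lines():
print(extract_delta('data: {"choices": [{"delta": {"content": "Hello"}}]}'))  # Hello
print(extract_delta("data: [DONE]"))  # None
```

Keeping the parsing pure like this lets the generator in `client_infer_stream` shrink to network I/O plus a call to the helper, and the SSE edge cases can be covered by plain assertions instead of end-to-end tests.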