# mlx-server-OAI-compat
## Description
This repository hosts a high-performance API server that exposes OpenAI-compatible endpoints for MLX models. Built with Python on the FastAPI framework, it offers an efficient, scalable, and user-friendly way to run MLX-based vision and language models locally behind an OpenAI-compatible interface.
> **Note:** This project currently supports **macOS on Apple Silicon (M-series) chips** only, as it relies on MLX, Apple's machine-learning framework optimized for Apple Silicon.
## Demo
### 🚀 See It In Action
Check out our [video demonstration](https://youtu.be/BMXOWK1Okk4) to see the server in action! The demo showcases:
- Setting up and launching the server
- Using the OpenAI Python SDK for seamless integration
## OpenAI Compatibility
This server implements the OpenAI API interface, allowing you to use it as a drop-in replacement for OpenAI's services in your applications. It supports:
- Chat completions (both streaming and non-streaming)
- Vision-language model interactions
- Text embeddings generation (with text-only models)
- Standard OpenAI request/response formats
- Common OpenAI parameters (temperature, top_p, etc.)
## Supported Model Types
The server supports two types of MLX models:
1. **Text-only models** (`--model-type lm`) - Uses the `mlx-lm` library for pure language models
2. **Vision-language models** (`--model-type vlm`) - Uses the `mlx-vlm` library for multimodal models that can process both text and images
## Installation
Follow these steps to set up the MLX-powered server:
### Prerequisites
- macOS on an Apple Silicon (M-series) Mac
- Python 3.11 or later (native ARM version)
- pip package manager
### Setup Steps
1. Create a virtual environment for the project:
   ```bash
   python3 -m venv oai-compat-server
   ```
2. Activate the virtual environment:
   ```bash
   source oai-compat-server/bin/activate
   ```
3. Install the package:
   ```bash
   # Option 1: Install directly from GitHub
   pip install git+https://github.com/cubist38/mlx-server-OAI-compat.git

   # Option 2: Clone and install in development mode
   git clone https://github.com/cubist38/mlx-server-OAI-compat.git
   cd mlx-server-OAI-compat
   pip install -e .
   ```
### Troubleshooting
**Issue:** My OS and Python versions meet the requirements, but `pip` cannot find a matching distribution.
**Cause:** You might be using a non-native Python version. Run the following command to check:
```bash
python -c "import platform; print(platform.processor())"
```
If the output is `i386` (on an M-series machine), you are using a non-native Python build. Switch to a native arm64 Python; one convenient approach is to use [Conda](https://stackoverflow.com/questions/65415996/how-to-specify-the-architecture-or-platform-for-a-new-conda-environment-apple).
## Usage
### Starting the Server
To start the MLX server, activate the virtual environment and run the main application file:
```bash
source oai-compat-server/bin/activate
python -m app.main \
  --model-path <path-or-hf-repo> \
  --model-type <lm|vlm> \
  --max-concurrency 1 \
  --queue-timeout 300 \
  --queue-size 100
```
#### Server Parameters
- `--model-path`: Path to the MLX model directory (local path or Hugging Face model repository)
- `--model-type`: Type of model to run (`lm` for text-only models, `vlm` for vision-language models). Default: `lm`
- `--max-concurrency`: Maximum number of concurrent requests (default: 1)
- `--queue-timeout`: Request timeout in seconds (default: 300)
- `--queue-size`: Maximum queue size for pending requests (default: 100)
- `--port`: Port to run the server on (default: 8000)
- `--host`: Host to run the server on (default: 0.0.0.0)
#### Example Configurations
Text-only model:
```bash
python -m app.main \
  --model-path mlx-community/gemma-3-4b-it-4bit \
  --model-type lm \
  --max-concurrency 1 \
  --queue-timeout 300 \
  --queue-size 100
```
> **Note:** Text embeddings via the `/v1/embeddings` endpoint are only available with text-only models (`--model-type lm`).
Vision-language model:
```bash
python -m app.main \
  --model-path mlx-community/llava-phi-3-vision-4bit \
  --model-type vlm \
  --max-concurrency 1 \
  --queue-timeout 300 \
  --queue-size 100
```
### Using the API
The server provides OpenAI-compatible endpoints that you can use with standard OpenAI client libraries. Here are some examples:
#### Text Completion
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # API key is not required for the local server
)

response = client.chat.completions.create(
    model="local-model",  # model name doesn't matter for the local server
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
```
#### Vision-Language Model
```python
import openai
import base64

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Load and encode the image
with open("image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="local-vlm",  # model name doesn't matter for the local server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
```
#### Embeddings
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Generate an embedding for a single text
embedding_response = client.embeddings.create(
    model="local-model",  # model name doesn't matter for the local server
    input=["The quick brown fox jumps over the lazy dog"]
)
print(f"Embedding dimension: {len(embedding_response.data[0].embedding)}")

# Generate embeddings for multiple texts
batch_response = client.embeddings.create(
    model="local-model",
    input=[
        "Machine learning algorithms improve with more data",
        "Natural language processing helps computers understand human language",
        "Computer vision allows machines to interpret visual information"
    ]
)
print(f"Number of embeddings: {len(batch_response.data)}")
```
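Once you have embeddings, similarity comparisons are straightforward. The sketch below is an illustration, not part of the server: it assumes `numpy` is installed and reuses the placeholder `local-model` name. It ranks two candidate texts against a query by cosine similarity:
```python
# Illustrative sketch: rank candidate texts against a query by cosine
# similarity. Assumes numpy is installed; "local-model" is a placeholder,
# since the server always uses whichever model it loaded.
import numpy as np
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

texts = [
    "The quick brown fox jumps over the lazy dog",   # query
    "A fast auburn fox leaps over a sleepy canine",  # paraphrase
    "Stock prices fell sharply on Monday",           # unrelated
]
response = client.embeddings.create(model="local-model", input=texts)
vectors = np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare each candidate against the query (the first text).
for text, vector in zip(texts[1:], vectors[1:]):
    print(f"{cosine_similarity(vectors[0], vector):.3f}  {text}")
```
The paraphrase should score noticeably higher than the unrelated sentence.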
### CLI Usage
You can also use the provided CLI command to launch the server:
```bash
mlx-server launch --model-path <path-or-hf-repo> --model-type <lm|vlm> --port 8000
```
All parameters available with `python -m app.main` are also accepted by the CLI:
```bash
mlx-server launch \
  --model-path mlx-community/gemma-3-4b-it-4bit \
  --model-type vlm \
  --port 8000 \
  --max-concurrency 1 \
  --queue-timeout 300 \
  --queue-size 100
```
#### CLI Commands
```bash
# Get help
mlx-server --help
mlx-server launch --help
# Check version
mlx-server --version
```
## Request Queue System
The server implements a robust request queue system to manage and optimize MLX model inference requests. This system ensures efficient resource utilization and fair request processing.
### Key Features
- **Concurrency Control**: Limits the number of simultaneous model inferences to prevent resource exhaustion
- **Request Queuing**: Implements a fair, first-come-first-served queue for pending requests
- **Timeout Management**: Automatically handles requests that exceed the configured timeout
- **Real-time Monitoring**: Provides endpoints to monitor queue status and performance metrics
### Architecture
The queue system consists of two main components:
1. **RequestQueue**: An asynchronous queue implementation that:
   - Manages pending requests with a configurable queue size
   - Controls concurrent execution using semaphores
   - Handles timeouts and errors gracefully
   - Provides real-time queue statistics
2. **Model Handlers**: Specialized handlers for different model types:
   - `MLXLMHandler`: Manages text-only model requests
   - `MLXVLMHandler`: Manages vision-language model requests
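To make the design concrete, here is a minimal sketch of the semaphore-plus-queue pattern described above. It is illustrative only and does not mirror the project's actual `RequestQueue` implementation; the class and method names are invented for the example:
```python
# Illustrative sketch of the semaphore-based queue pattern; names and
# details are simplified and do not mirror the project's code.
import asyncio

class SimpleRequestQueue:
    def __init__(self, max_concurrency: int = 1, queue_size: int = 100,
                 timeout: float = 300.0):
        self._semaphore = asyncio.Semaphore(max_concurrency)  # caps concurrent inferences
        self._queue_size = queue_size
        self._timeout = timeout
        self._pending = 0

    async def submit(self, job):
        """Run `job` (a zero-arg coroutine function) under the concurrency cap."""
        # Reject new work when the queue is at capacity (maps to HTTP 429).
        if self._pending >= self._queue_size:
            raise RuntimeError("Too many requests. Service is at capacity.")
        self._pending += 1
        try:
            async with self._semaphore:  # wait for a free inference slot
                return await asyncio.wait_for(job(), timeout=self._timeout)
        finally:
            self._pending -= 1
```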
### Queue Monitoring
Monitor queue statistics using the `/v1/queue/stats` endpoint:
```bash
curl http://localhost:8000/v1/queue/stats
```
Example response:
```json
{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 5,
    "max_concurrency": 2
  }
}
```
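If you would rather watch these numbers from Python than from `curl`, a small poller along the following lines works (a sketch, assuming the third-party `requests` package is installed):
```python
# Sketch of a queue-stats poller; assumes the `requests` package is installed.
import time
import requests

def watch_queue(url: str = "http://localhost:8000/v1/queue/stats",
                interval: float = 5.0) -> None:
    """Print queue depth and active request count every `interval` seconds."""
    while True:
        stats = requests.get(url, timeout=10).json()["queue_stats"]
        print(f"queued={stats['queue_size']} active={stats['active_requests']}")
        time.sleep(interval)

watch_queue()
```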
### Error Handling
The queue system handles various error conditions:
1. **Queue Full (429)**: When the queue reaches its maximum size
   ```json
   {
     "detail": "Too many requests. Service is at capacity."
   }
   ```
2. **Request Timeout**: When a request exceeds the configured timeout
   ```json
   {
     "detail": "Request processing timed out after 300 seconds"
   }
   ```
3. **Model Errors**: When the model encounters an error during inference
   ```json
   {
     "detail": "Failed to generate response: <error message>"
   }
   ```
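Clients should be prepared to back off and retry when the queue is full. The sketch below uses the OpenAI SDK, which surfaces HTTP 429 responses as `openai.RateLimitError`; the retry policy itself is an illustrative choice, not something the server mandates:
```python
# Sketch of client-side backoff for the 429 "queue full" case.
import time
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def chat_with_retry(messages, retries: int = 5, backoff: float = 2.0):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="local-model",
                                                  messages=messages)
        except openai.RateLimitError:       # server returned 429: queue is full
            time.sleep(backoff * (attempt + 1))  # linear backoff, then retry
    raise RuntimeError("Server stayed at capacity; giving up.")

response = chat_with_retry([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)
```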
## Performance Monitoring
The server includes comprehensive performance monitoring to help track and optimize model performance.
### Key Metrics
- **Tokens Per Second (TPS)**: Real-time tracking of token generation speed
- **Time To First Token (TTFT)**: Measures latency from request to first token generation
- **Throughput**: Tracks overall request processing capacity (requests per second)
- **Total Requests**: Cumulative count of all processed requests
- **Error Count**: Tracks the number of failed requests
### Metrics System Architecture
The performance metrics system uses a `RequestMetrics` class that:
1. **Rolling Averages**: Maintains running averages for key performance indicators
2. **Request-type Tracking**: Logs metrics by request type (chat completions, embeddings, etc.)
3. **Automatic Logging**: Records key metrics for each request completion
4. **Token Estimation**: Provides approximate token count estimation for text inputs
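As an illustration of the rolling-average idea (this is not the project's actual `RequestMetrics` class; the names are invented for the example), per-request samples can be folded into a running mean incrementally:
```python
# Illustrative running-average tracker; not the project's RequestMetrics class.
from collections import defaultdict

class RollingMetrics:
    def __init__(self):
        self._count = defaultdict(int)
        self._mean = defaultdict(float)

    def record(self, name: str, value: float) -> None:
        """Fold one sample into the running mean for `name`."""
        self._count[name] += 1
        n = self._count[name]
        self._mean[name] += (value - self._mean[name]) / n

    def get(self, name: str) -> float:
        return self._mean[name]

metrics = RollingMetrics()
metrics.record("tps", 95.0)
metrics.record("tps", 103.0)
print(f"average tps: {metrics.get('tps'):.1f}")  # average tps: 99.0
```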
### Performance Metrics
Access detailed performance metrics through the `/v1/queue/stats` endpoint:
```bash
curl http://localhost:8000/v1/queue/stats
```
Example response with performance data:
```json
{
  "status": "ok",
  "queue_stats": {
    "running": true,
    "queue_size": 3,
    "max_queue_size": 100,
    "active_requests": 5,
    "max_concurrency": 2
  },
  "metrics": {
    "total_requests": 100,
    "performance": {
      "tps": 99.0,
      "ttft": 150.0,
      "throughput": 2.5
    },
    "error_count": 2
  }
}
```
### Performance Optimization
Based on metrics collected, you can optimize server performance:
1. **Concurrency Tuning**: Adjust `--max-concurrency` based on throughput metrics
2. **Queue Size Management**: Configure `--queue-size` to handle expected request volumes
3. **Model Selection**: Compare TPS across different quantization levels (4-bit vs 8-bit)
4. **Resource Allocation**: Monitor TTFT to ensure acceptable response latency
## API Usage
### Text-Only Model Example
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-4b-it-4bit",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "stream": false,
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
### Vision Model Example
You can make vision requests to analyze images using the `/v1/chat/completions` endpoint when running with a VLM model:
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-phi-3-vision-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
          }
        ]
      }
    ],
    "stream": false,
    "max_tokens": 256
  }'
```
### Embeddings Example
```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-MLX-Q8",
    "input": ["The quick brown fox jumps over the lazy dog"]
  }'
```
Response format:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0123, ..., 0.9876],
      "index": 0
    }
  ],
  "model": "mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-MLX-Q8"
}
```
You can also generate embeddings for multiple texts in a single request:
```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/DeepSeek-R1-Distill-Qwen-1.5B-MLX-Q8",
    "input": [
      "The quick brown fox jumps over the lazy dog",
      "Machine learning models require training data",
      "Neural networks are inspired by biological neurons"
    ]
  }'
```
> **Note:** The server currently supports text embeddings only with `--model-type lm` (text-only models). Embeddings are not yet supported with vision-language models. See the included `examples/embeddings_examples.ipynb` notebook for detailed examples of using embeddings for semantic search, similarity comparison, and other applications.
> **Warning:** Make sure you're running the server with `--model-type vlm` when making vision requests. If you send a vision request to a server running with `--model-type lm` (text-only model), you'll receive a 400 error with a message that vision requests are not supported with text-only models.
### Request Format
- `model`: Optional model identifier (the server will use the loaded model regardless)
- `messages`: Array of message objects containing:
  - `role`: The role of the message sender ("user", "assistant", or "system")
  - `content`:
    - For text models: a string containing the message
    - For vision models: an array of content objects:
      - `type`: Either "text" or "image_url"
      - `text`: The text prompt (for type "text")
      - `image_url`: Object containing the image URL (for type "image_url")
- `stream`: Optional boolean to enable streaming responses
- Additional parameters: `temperature`, `max_tokens`, `top_p`, etc.
For embeddings:
- `model`: Optional model identifier
- `input`: String or array of strings to generate embeddings for
### Response Format
The server will return responses in OpenAI-compatible format:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ]
}
```
### Streaming Responses
For streaming responses, add `"stream": true` to your request. The response will be in Server-Sent Events (SSE) format:
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Tell me about Paris"
      }
    ]
  }'
```
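The same stream is easier to consume from Python: when you pass `stream=True`, the OpenAI SDK parses the SSE chunks for you and yields them as objects:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local-model",  # placeholder; the server uses its loaded model
    messages=[{"role": "user", "content": "Tell me about Paris"}],
    stream=True,          # request Server-Sent Events
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()
```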
### Multi-turn Conversations
The API supports multi-turn conversations for both text-only and vision models:
#### Text-Only Model Example:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    {
      "role": "user",
      "content": "Tell me some interesting facts about it."
    }
  ]
}
```
#### Vision Model Example:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that describes images."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The image shows a wooden boardwalk..."
    },
    {
      "role": "user",
      "content": "Are there any people in the image?"
    }
  ]
}
```
## API Response Schemas
The server implements comprehensive Pydantic schemas for request and response handling, ensuring type safety and validation:
### Request Schemas
- `ChatCompletionRequest`: Handles chat completion requests with support for:
  - Text and vision messages
  - Streaming options
  - Model parameters (temperature, top_p, etc.)
  - Tool calls and function calling
- `EmbeddingRequest`: Manages embedding generation requests
### Response Schemas
- `ChatCompletionResponse`: Standard chat completion responses
- `ChatCompletionChunk`: Streaming response chunks
- `EmbeddingResponse`: Embedding generation responses
- `ErrorResponse`: Standardized error responses
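To give a sense of the shape of these schemas, here is a simplified Pydantic sketch. Field names follow the OpenAI API conventions shown elsewhere in this README, but this is not the server's actual source code:
```python
# Simplified sketch of OpenAI-style Pydantic schemas; field names follow the
# OpenAI API, but this is not the server's actual code.
from typing import List, Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str                      # "system", "user", or "assistant"
    content: str

class ChatCompletionRequest(BaseModel):
    model: Optional[str] = None    # ignored; the server uses its loaded model
    messages: List[ChatMessage]
    stream: bool = False
    temperature: float = 1.0
    max_tokens: Optional[int] = None

class Choice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Optional[str] = None

class ChatCompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    choices: List[Choice]
```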
Example response structure:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-model",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The response content"
    },
    "finish_reason": "stop"
  }]
}
```
### Streaming Response Chunks
The server supports streaming responses with proper chunk formatting:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion.chunk",
  "created": 1234567890,
  "model": "local-model",
  "choices": [{
    "index": 0,
    "delta": {"content": "chunk of text"},
    "finish_reason": null
  }]
}
```
## Example Notebooks
The repository includes example notebooks to help you get started with different aspects of the API:
- **vision_examples.ipynb**: A comprehensive guide to using the vision capabilities of the API, including:
  - Processing image inputs in various formats
  - Vision analysis and object detection
  - Multi-turn conversations with images
  - Using vision models for detailed image description and analysis
- **embeddings_examples.ipynb**: A comprehensive guide to using the embeddings API, including:
  - Generating embeddings for single and batch inputs
  - Computing semantic similarity between texts
  - Building a simple vector-based search system
  - Comparing semantic relationships between concepts
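As a taste of what `embeddings_examples.ipynb` covers, a tiny vector-based search can be built directly on the `/v1/embeddings` endpoint. The sketch below is illustrative only (it assumes `numpy` is installed and reuses the `local-model` placeholder name):
```python
# Sketch of a tiny vector search over the embeddings endpoint; assumes numpy.
import numpy as np
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

corpus = [
    "Machine learning algorithms improve with more data",
    "Natural language processing helps computers understand human language",
    "Computer vision allows machines to interpret visual information",
]

def embed(texts):
    """Embed texts and unit-normalize so dot products give cosine similarity."""
    response = client.embeddings.create(model="local-model", input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

corpus_vectors = embed(corpus)
query_vector = embed(["How do computers read text?"])[0]

# Rank corpus entries by cosine similarity (dot product of unit vectors).
for idx in np.argsort(corpus_vectors @ query_vector)[::-1]:
    print(f"{(corpus_vectors[idx] @ query_vector):.3f}  {corpus[idx]}")
```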
## Contributing
We welcome contributions to improve this project! Here's how you can contribute:
1. Fork the repository to your GitHub account.
2. Create a new branch for your feature or bug fix:
   ```bash
   git checkout -b feature-name
   ```
3. Commit your changes with clear and concise messages:
   ```bash
   git commit -m "Add feature-name"
   ```
4. Push your branch to your forked repository:
   ```bash
   git push origin feature-name
   ```
5. Open a pull request to the main repository for review.
## License
This project is licensed under the [MIT License](LICENSE). You are free to use, modify, and distribute it under the terms of the license.
## Support
If you encounter any issues or have questions, please:
- Open an issue in the repository.
- Contact the maintainers via the provided contact information.
Stay tuned for updates and enhancements!
## Acknowledgments
We extend our heartfelt gratitude to the following individuals and organizations whose contributions have been instrumental in making this project possible:
### Core Technologies
- [MLX team](https://github.com/ml-explore/mlx) for developing the groundbreaking MLX framework, which provides the foundation for efficient machine learning on Apple Silicon
- [mlx-lm](https://github.com/ml-explore/mlx-lm) for efficient large language models support
- [mlx-vlm](https://github.com/Blaizzy/mlx-vlm/tree/main) for pioneering multimodal model support within the MLX ecosystem
- [mlx-community](https://huggingface.co/mlx-community) for curating and maintaining a diverse collection of high-quality MLX models
### Open Source Community
We deeply appreciate the broader open-source community for their invaluable contributions. Your dedication to:
- Innovation in machine learning and AI
- Collaborative development practices
- Knowledge sharing and documentation
- Continuous improvement of tools and frameworks
Your collective efforts continue to drive progress and make projects like this possible. We are proud to be part of this vibrant ecosystem.
### Special Thanks
A special acknowledgment to all contributors, users, and supporters who have helped shape this project through their feedback, bug reports, and suggestions. Your engagement helps make this project better for everyone.