# tiny-llm - LLM Serving in a Week

[![CI (main)](https://github.com/skyzh/tiny-llm/actions/workflows/main.yml/badge.svg)](https://github.com/skyzh/tiny-llm/actions/workflows/main.yml)

A course on LLM serving using MLX for system engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations. The goal is to learn the techniques behind efficiently serving a large language model (e.g., the Qwen2 models).

Why MLX: nowadays it is easier to get a macOS-based local development environment than to set up an NVIDIA GPU.

Why Qwen2: this was the first LLM I interacted with -- it is the go-to example in the vllm documentation. I spent some time reading the vllm source code and built up some knowledge around it.

## Book

The tiny-llm book is available at [https://skyzh.github.io/tiny-llm/](https://skyzh.github.io/tiny-llm/). You can follow the guide and start building.

## Community

You may join skyzh's Discord server and study with the tiny-llm community.

[![Join skyzh's Discord Server](book/src/discord-badge.svg)](https://skyzh.dev/join/discord)

## Roadmap

Week 1 is complete. Week 2 is in progress.

| Week + Chapter | Topic | Code | Test | Doc |
| -------------- | ---------------------------------------------------------- | ---- | ---- | --- |
| 1.1 | Attention | ✅ | ✅ | ✅ |
| 1.2 | RoPE | ✅ | ✅ | ✅ |
| 1.3 | Grouped Query Attention | ✅ | ✅ | ✅ |
| 1.4 | RMSNorm and MLP | ✅ | ✅ | ✅ |
| 1.5 | Load the Model | ✅ | ✅ | ✅ |
| 1.6 | Generate Responses (aka Decoding) | ✅ | ✅ | ✅ |
| 1.7 | Sampling | ✅ | ✅ | ✅ |
| 2.1 | Key-Value Cache | ✅ | 🚧 | 🚧 |
| 2.2 | Quantized Matmul and Linear - CPU | ✅ | 🚧 | 🚧 |
| 2.3 | Quantized Matmul and Linear - GPU | ✅ | 🚧 | 🚧 |
| 2.4 | Flash Attention 2 - CPU | ✅ | 🚧 | 🚧 |
| 2.5 | Flash Attention 2 - GPU | ✅ | 🚧 | 🚧 |
| 2.6 | Continuous Batching | ✅ | 🚧 | 🚧 |
| 2.7 | Chunked Prefill | ✅ | 🚧 | 🚧 |
| 3.1 | Paged Attention - Part 1 | 🚧 | 🚧 | 🚧 |
| 3.2 | Paged Attention - Part 2 | 🚧 | 🚧 | 🚧 |
| 3.3 | MoE (Mixture of Experts) | 🚧 | 🚧 | 🚧 |
| 3.4 | Speculative Decoding | 🚧 | 🚧 | 🚧 |
| 3.5 | Prefill-Decode Separation (requires two Macintosh devices) | 🚧 | 🚧 | 🚧 |
| 3.6 | Parallelism | 🚧 | 🚧 | 🚧 |
| 3.7 | AI Agent / Tool Calling | 🚧 | 🚧 | 🚧 |

Other topics not covered: quantized/compressed KV cache, prefix/prompt cache; sampling, fine-tuning; smaller kernels (softmax, silu, etc.).
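## A Taste of the Exercises

To give a sense of what "MLX array APIs only, no high-level neural network APIs" means in practice, here is a minimal sketch in the spirit of chapter 1.1: scaled dot-product attention written with nothing but `mlx.core` ops. The function name, shapes, and toy sizes are my own illustration, not the course's reference solution.

```python
# A minimal sketch (assumptions: function name and toy shapes are illustrative),
# showing attention built from raw MLX array ops, as chapter 1.1 asks you to do.
import math

import mlx.core as mx


def attention(q: mx.array, k: mx.array, v: mx.array) -> mx.array:
    # q, k, v: (..., seq_len, head_dim)
    scale = 1.0 / math.sqrt(q.shape[-1])
    # scores: (..., seq_len, seq_len), softmax over the key dimension
    scores = mx.softmax(mx.matmul(q, mx.swapaxes(k, -2, -1)) * scale, axis=-1)
    return mx.matmul(scores, v)


q = k = v = mx.random.normal(shape=(1, 8, 16))  # (batch, seq_len, head_dim)
print(attention(q, k, v).shape)  # (1, 8, 16)
```

Each chapter of the book has you build a piece like this from scratch and test it, rather than calling a prebuilt layer.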