# Llama.cpp.cuda10
**Repository Path**: goldenalien/llama.cpp.cuda10
## Basic Information
- **Project Name**: Llama.cpp.cuda10
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: jetson
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-06-26
- **Last Updated**: 2026-01-17
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Llama.cpp

Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++.
## Description
The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
- Plain C/C++ implementation without any dependencies
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
## Building the fork
This fork was created using [this](https://gist.github.com/kreier/6871691130ec3ab907dd2815f9313c5d) instruction, based on `gcc 8.5` and `nvcc 10.2`. To use this, you will need the following software packages installed. The section "[Install prerequisites](https://gist.github.com/kreier/6871691130ec3ab907dd2815f9313c5d#install-prerequisites)" describes the process in detail. The installation of `gcc 8.5` and `cmake 3.27` of these might take several hours.
- Nvidia CUDA Compiler nvcc 10.2 - `nvcc --version`
- GCC and CXX (g++) 8.5 - `gcc --version`
- cmake >= 3.14 - `cmake --version`
- `nano`, `curl`, `libcurl4-openssl-dev`, `python3-pip` and `jtop`
We need to add a few extra flags to the recommended first instruction `cmake -B build`, otherwise there are several error like *Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler extensions).* that would stop the compilation. There will we a few *warning: constexpr if statements are a C++17 feature* after the second instruction, but we can ignore them. Let's start with the first one:
``` sh
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=14 -DCMAKE_CUDA_STANDARD_REQUIRED=true -DGGML_CPU_ARM_ARCH=armv8-a -DGGML_NATIVE=off
```
And 15 seconds later we're ready for the last step, the instruction that will take **85 minutes** to have llama.cpp compiled:
``` sh
cmake --build build --config Release
```
Now you can use binaries from `build/bin` folder. If you want to make binaries globally available, add this to your `~/.bashrc` file:
``` sh
export PATH="$PATH:$HOME/Llama.cpp/build/bin"
```
## [`llama-server`](tools/server)
#### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
-
Start a local HTTP server with default configuration on port 8080
```bash
llama-server -m model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
```
-
Support multiple-users and parallel decoding
```bash
# up to 4 concurrent requests, each with 4096 max context
llama-server -m model.gguf -c 16384 -np 4
```
-
Enable speculative decoding
```bash
# the draft.gguf model should be a small variant of the target model.gguf
llama-server -m model.gguf -md draft.gguf
```
-
Serve an embedding model
```bash
# use the /embedding endpoint
llama-server -m model.gguf --embedding --pooling cls -ub 8192
```
-
Serve a reranking model
```bash
# use the /reranking endpoint
llama-server -m model.gguf --reranking
```
-
Constrain all outputs with a grammar
```bash
# custom grammar
llama-server -m model.gguf --grammar-file grammar.gbnf
# JSON
llama-server -m model.gguf --grammar-file grammars/json.gbnf
```