# Llama.cpp.cuda10 **Repository Path**: goldenalien/llama.cpp.cuda10 ## Basic Information - **Project Name**: Llama.cpp.cuda10 - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: jetson - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-06-26 - **Last Updated**: 2026-01-17 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Llama.cpp ![llama](media/llama-nvidia.png) Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++. ## Description The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware. - Plain C/C++ implementation without any dependencies - 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use - Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA) - CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity ## Building the fork This fork was created using [this](https://gist.github.com/kreier/6871691130ec3ab907dd2815f9313c5d) instruction, based on `gcc 8.5` and `nvcc 10.2`. To use this, you will need the following software packages installed. The section "[Install prerequisites](https://gist.github.com/kreier/6871691130ec3ab907dd2815f9313c5d#install-prerequisites)" describes the process in detail. The installation of `gcc 8.5` and `cmake 3.27` of these might take several hours. - Nvidia CUDA Compiler nvcc 10.2 - `nvcc --version` - GCC and CXX (g++) 8.5 - `gcc --version` - cmake >= 3.14 - `cmake --version` - `nano`, `curl`, `libcurl4-openssl-dev`, `python3-pip` and `jtop` We need to add a few extra flags to the recommended first instruction `cmake -B build`, otherwise there are several error like *Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler extensions).* that would stop the compilation. There will we a few *warning: constexpr if statements are a C++17 feature* after the second instruction, but we can ignore them. Let's start with the first one: ``` sh cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=14 -DCMAKE_CUDA_STANDARD_REQUIRED=true -DGGML_CPU_ARM_ARCH=armv8-a -DGGML_NATIVE=off ``` And 15 seconds later we're ready for the last step, the instruction that will take **85 minutes** to have llama.cpp compiled: ``` sh cmake --build build --config Release ``` Now you can use binaries from `build/bin` folder. If you want to make binaries globally available, add this to your `~/.bashrc` file: ``` sh export PATH="$PATH:$HOME/Llama.cpp/build/bin" ``` ## [`llama-server`](tools/server) #### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs. -

Start a local HTTP server with default configuration on port 8080

```bash llama-server -m model.gguf --port 8080 # Basic web UI can be accessed via browser: http://localhost:8080 # Chat completion endpoint: http://localhost:8080/v1/chat/completions ```

Support multiple-users and parallel decoding

```bash # up to 4 concurrent requests, each with 4096 max context llama-server -m model.gguf -c 16384 -np 4 ```

Enable speculative decoding

```bash # the draft.gguf model should be a small variant of the target model.gguf llama-server -m model.gguf -md draft.gguf ```

Serve an embedding model

```bash # use the /embedding endpoint llama-server -m model.gguf --embedding --pooling cls -ub 8192 ```

Serve a reranking model

```bash # use the /reranking endpoint llama-server -m model.gguf --reranking ```

Constrain all outputs with a grammar

```bash # custom grammar llama-server -m model.gguf --grammar-file grammar.gbnf # JSON llama-server -m model.gguf --grammar-file grammars/json.gbnf ```