# woolycore

A thin C wrapper around the [llama.cpp library](https://github.com/ggerganov/llama.cpp), aiming for a high-level API that provides a surface to build FFI libraries on top of for other languages.

Upstream llama.cpp is pinned to commit [EA5D747](https://github.com/ggerganov/llama.cpp/releases/tag/b3649) from Aug 31, 2024.

At present, the library is in development and the API is unstable, though no breaking changes are envisioned.

Supported operating systems: Windows, MacOS, Linux, iOS, Android

Note: Android support appears to be non-accelerated.

## List of programming language integrations

The following projects use woolycore to create wrappers around the [llama.cpp library](https://github.com/ggerganov/llama.cpp):

* Dart: [woolydart](https://github.com/tbogdala/woolydart)
* Rust: [woolyrust](https://github.com/tbogdala/woolyrust)

## License

MIT licensed, like the core upstream `llama.cpp` it wraps. See `LICENSE` for details.

## Features

* Basic samplers of llama.cpp, including: temp, top-k, top-p, min-p, tail-free sampling, locally typical sampling, and mirostat.
* Support for llama.cpp's BNF-like grammar rules for sampling.
* Ability to cache the processed prompt data in memory so that it can be reused to speed up regeneration using the exact same prompt.

## Build notes

To build, use the following commands:

```bash
cmake -B build
cmake --build build --config Release
```

This should automatically include Metal support and embed the shaders if the library is being built on MacOS. Other platforms will only have CPU support without additional flags.

For systems that support CUDA acceleration, like Linux, you'll need to enable it with an additional compilation flag as follows:

```bash
cmake -B build -DGGML_CUDA=On
cmake --build build --config Release
```

On Windows, you'll need a few more flags to make sure that all of the functions are exported and available in the compiled library. (The resulting `build/Release/woolycore.dll` will need to be able to find `build/bin/Release/ggml.dll` and `build/bin/Release/llama.dll` at runtime when deployed, so copy the three DLLs into the same directory.)

```bash
cmake -B build -DGGML_CUDA=On -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=TRUE -DBUILD_SHARED_LIBS=TRUE
cmake --build build --config Release
```

## Unit tests

The unit tests use the [Unity library](https://github.com/ThrowTheSwitch/Unity) and get compiled when building the main library bindings. You should see them as executables in the `build` folder (e.g. `build/test_predictions`). Simply running the program will execute the unit tests contained in it.

By convention, the unit tests require an environment variable (`WOOLY_TEST_MODEL_FILE`) to be set to the path of the GGUF file for the model to use during testing.
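For orientation, a test executable in this style typically reads that environment variable before exercising the library. The following is a minimal, hypothetical sketch of what such a Unity test file could look like; it only checks that the model file is reachable and does not call the actual woolycore API, whose function names are not shown here.

```c
/* Hypothetical sketch of a Unity-based test file. It only verifies the
   WOOLY_TEST_MODEL_FILE convention and does not call woolycore itself. */
#include <stdio.h>
#include <stdlib.h>
#include "unity.h"

void setUp(void) {}    /* required by Unity; runs before each test */
void tearDown(void) {} /* required by Unity; runs after each test */

static void test_model_file_is_available(void) {
    const char *path = getenv("WOOLY_TEST_MODEL_FILE");
    TEST_ASSERT_NOT_NULL_MESSAGE(path, "Set WOOLY_TEST_MODEL_FILE to a GGUF model path.");

    FILE *f = fopen(path, "rb");
    TEST_ASSERT_NOT_NULL_MESSAGE(f, "WOOLY_TEST_MODEL_FILE does not point to a readable file.");
    fclose(f);
}

int main(void) {
    UNITY_BEGIN();
    RUN_TEST(test_model_file_is_available);
    return UNITY_END();
}
```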
In a Unix environment, you can set that variable and run the unit tests like this:

```bash
export WOOLY_TEST_MODEL_FILE=models/example-llama-3-8b.gguf
build/test_predictions
```

## Git updates

This project uses submodules for upstream projects, so make sure to update with the appropriate parameters:

```bash
git pull --recurse-submodules
```

### Developer Notes

* This library creates its own versions of the structures to avoid failure cases where upstream llama.cpp structs have C++ members and cannot be wrapped automatically by tooling. While inconvenient, I chose that approach over an opaque pointer with getter/setter functions. (A sketch of this pattern appears at the end of this file.)

### TODO

* Re-enable some advanced sampling features, like logit biases.
* Missing calls to just tokenize text and to pull embeddings out from text.
* Maybe a dynamic LoRA layer, trained roughly every time enough tokens fill up the context space.
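To make the struct-mirroring note in the Developer Notes more concrete, here is a minimal, hypothetical sketch of the pattern. The names (`wooly_gpt_params`, `wooly_fill_upstream`, and the fields shown) are illustrative assumptions, not the actual woolycore or llama.cpp definitions; the point is only that the exported surface contains nothing but plain C types that FFI tooling can translate mechanically.

```c
/* Hypothetical sketch of the struct-mirroring pattern; none of these names
   are the real woolycore or llama.cpp definitions. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for an upstream llama.cpp struct that has C++ members and
   therefore cannot be wrapped automatically by FFI tooling; kept opaque. */
typedef struct upstream_params upstream_params;

/* The wrapper's own mirror struct: plain C types only, so binding generators
   for Dart, Rust, etc. can translate it field-for-field. */
typedef struct wooly_gpt_params {
    const char *prompt;       /* plain C string instead of std::string */
    int32_t     n_threads;
    int32_t     n_predict;
    float       temp;
    float       top_p;
    int32_t     top_k;
    bool        cache_prompt; /* reuse processed prompt data on regeneration */
} wooly_gpt_params;

/* Inside the library (a C++ translation unit), a routine like this would copy
   the plain C fields into the upstream struct before calling llama.cpp. */
void wooly_fill_upstream(const wooly_gpt_params *src, upstream_params *dst);

int main(void) {
    wooly_gpt_params params = {
        .prompt = "Hello", .n_threads = 4, .n_predict = 128,
        .temp = 0.8f, .top_p = 0.95f, .top_k = 40, .cache_prompt = true,
    };
    printf("predicting %d tokens\n", params.n_predict);
    return 0;
}
```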