# xet-core


🤗 xet-core - xet client tech, used in huggingface_hub

## Welcome

xet-core enables huggingface_hub to utilize xet storage for uploading and downloading to HF Hub. Xet storage provides chunk-based deduplication, efficient storage/retrieval with local disk caching, and backwards compatibility with Git LFS. This library is not meant to be used directly, and is instead intended to be used from [huggingface_hub](https://pypi.org/project/huggingface-hub) (see the example below).

## Key features

♻ **chunk-based deduplication implementation**: avoid transferring and storing chunks that are shared across binary files (models, datasets, etc.).

🤗 **Python bindings**: bindings for the [huggingface_hub](https://github.com/huggingface/huggingface_hub/) package.

↔ **network communications**: concurrent communication to HF Hub Xet backend services (CAS).

🔖 **local disk caching**: chunk-based cache that sits alongside the existing [huggingface_hub disk cache](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
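To see where xet-core sits in practice, the sketch below shows a plain `huggingface_hub` download with `hf_xet` installed. It is only an illustration: the repo id and filename are examples, and the Xet transfer path, deduplication, and chunk cache are applied transparently by hf-xet rather than called directly.

```python
# Minimal sketch: hf-xet is never imported directly; huggingface_hub drives it.
# Assumes `huggingface_hub` and `hf_xet` are installed; repo id/filename are examples.
from huggingface_hub import hf_hub_download

# For Xet-enabled repos the transfer goes through the Xet backend (CAS),
# with chunk-based deduplication and the local chunk cache handled by hf-xet.
path = hf_hub_download(repo_id="Qwen/Qwen2.5-VL-3B-Instruct", filename="config.json")
print(path)
```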

## Contributions (feature requests, bugs, etc.) are encouraged & appreciated 💙💚💛💜🧡❤️

Please join us in making xet-core better. We value everyone's contributions. Code is not the only way to help. Answering questions, helping each other, improving documentation, and filing issues all help immensely. If you are interested in contributing (please do!), check out the [contribution guide](https://github.com/huggingface/xet-core/blob/main/CONTRIBUTING.md) for this repository.

## Issues, Diagnostics & Debugging

If you encounter an issue when using `hf-xet`, please help us fix it by collecting diagnostic information and attaching it when creating a [new Issue](https://github.com/huggingface/xet-core/issues/new/choose).

Download the [hf-xet-diag-linux.sh](hf-xet-diag-linux.sh), [hf-xet-diag-macos.sh](hf-xet-diag-macos.sh), or [hf-xet-diag-windows.sh](hf-xet-diag-windows.sh) script for your operating system, then re-run the Python command that resulted in the issue. The diagnostic scripts download and install debug symbols, set up logging, and take periodic stack traces throughout process execution, collecting everything in a diagnostics directory that is easy to analyze, package, and upload.
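The per-OS examples below invoke a small `hf-download.py` helper that reproduces the problematic download. That script is your own reproduction case and is not part of this repository; a minimal sketch, assuming `huggingface_hub`'s `snapshot_download` and a repo id passed on the command line, might look like:

```python
# hf-download.py - hypothetical reproduction script used in the diagnostic examples.
# Not part of this repository; assumes huggingface_hub (and hf_xet) are installed.
import sys

from huggingface_hub import snapshot_download

if __name__ == "__main__":
    repo_id = sys.argv[1]  # e.g. "Qwen/Qwen2.5-VL-3B-Instruct"
    # Download the full repo snapshot; hf-xet performs the actual transfers.
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"Downloaded {repo_id} to {local_dir}")
```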
### Diagnostics - Linux (`hf-xet-diag-linux.sh`)

* Uses `gdb` + `gcore` to periodically snapshot stacks and produce core dumps.
* Supports an optional ptrace preload helper for debugging.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.

**Requirements:**

```bash
sudo apt-get install gdb build-essential
```

**Example usage:**

```bash
./hf-xet-diag-linux.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```

### Diagnostics - Windows (Git-Bash) (`hf-xet-diag-windows.sh`)

* Runs in **Git-Bash**, keeping usage consistent with Linux.
* Uses **Sysinternals ProcDump** for periodic mini dumps (`-mp`).
* Auto-downloads `procdump.exe` if not found.
* Downloads and installs the matching `hf_xet.pdb` debug symbol into the package directory.

**Requirements:**

* Git-Bash (from [Git for Windows](https://gitforwindows.org/))
* Python installed
* Internet access (first run downloads ProcDump and debug symbols)

**Example usage:**

```bash
./hf-xet-diag-windows.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```

### Diagnostics - MacOS (`hf-xet-diag-macos.sh`)

* Uses `sample` + `lldb` to periodically snapshot stacks and produce core dumps.
* Downloads and installs the appropriate `hf_xet-*.dbg` symbol file automatically.

**Requirements:**

```bash
sudo xcode-select --install
```

**Example usage:**

```bash
./hf-xet-diag-macos.sh -- python hf-download.py "Qwen/Qwen2.5-VL-3B-Instruct"
```

---

### Output Layout

The diagnostic scripts produce a diagnostics directory named:

```
diag__/
├── console.log   # Combined stdout/stderr of the process
├── env.log       # System/environment info
├── pid           # Child PID file
├── stacks/       # Periodic stack traces / dumps
└── dumps/        # (Linux only) full gcore dumps
```

This unified layout makes it easier to compare diagnostics across platforms.

---

### Analyzing Dumps

Use the [hf-xet-diag-analyze-latest.sh](hf-xet-diag-analyze-latest.sh) script to automatically find and open the most recent dump in the appropriate debugger for your platform.

**Usage:**

```bash
./hf-xet-diag-analyze-latest.sh
```

* Auto-detects your OS (Linux, macOS, or Windows)
* Finds the most recent `diag_*` directory
* Opens the latest dump in the platform-appropriate debugger:
  * **Linux:** `gdb` with core dumps from `dumps/`
  * **macOS:** `lldb` with `.core` files from `dumps/`
  * **Windows (Git-Bash):** `windbg` with `.dmp` files from `stacks/`

You can also specify a diagnostics directory:

```bash
./hf-xet-diag-analyze-latest.sh diag_python_hfxet_test_20250127120000
```

**Manual Analysis**

If you prefer to analyze dumps manually:

**Linux**

* Stack traces: `stacks/*.txt` (plain text, captured periodically)
* Core dumps: `dumps/core_*`
* Analysis:

```bash
gdb python dumps/core_.
(gdb) bt                    # backtrace of current thread
(gdb) thread apply all bt   # backtrace of all threads
(gdb) info threads          # list all threads
```

* Ensure debug symbols (`hf_xet-*.so.dbg`) are in the `hf_xet` package directory

**macOS**

* Stack traces: `stacks/*.txt` (from the `sample` command)
* Core dumps: `dumps/dump__.core`
* Analysis:

```bash
lldb -c dumps/dump__.core python3
(lldb) bt                     # backtrace of current thread
(lldb) thread backtrace all   # backtrace of all threads
(lldb) thread list            # list all threads
```

* Ensure debug symbols (`hf_xet-*.dylib.dSYM`) are in the `hf_xet` package directory

**Windows**

* Dumps: `stacks/dump_.dmp`
* Install [WinDbg via Windows SDK](https://developer.microsoft.com/en-us/windows/downloads/windows-sdk/)
* Analysis:

```cmd
windbg -z stacks\dump_.dmp
```

* Common WinDbg commands:

```
!analyze -v   # automatic analysis
~* kb         # backtrace of all threads
~             # list all threads
lm            # list loaded modules (verify hf_xet.pdb loaded)
```

* Ensure debug symbols (`hf_xet.pdb`) are in the `hf_xet` package directory

---

⚠️ **Tip:** Share the full `diag__/` directory when reporting issues; it contains the logs, environment info, and dumps needed to reproduce and diagnose problems.

### Debugging

To limit the size of our built binaries, we release Python wheels with binaries that are stripped of debugging symbols. If you encounter a panic while running hf-xet, you can use the debug symbols to help identify the part of the library that failed. Here are the recommended steps:

1. Download and unzip our [debug symbols package](https://github.com/huggingface/xet-core/releases/download/latest/dbg-symbols.zip).
2. Determine the location of the hf-xet package using `pip show hf-xet`. The `Location` field shows the location of all the site packages; the `hf_xet` package will be within that directory (see the Python sketch after this list for another way to find it).
3. Determine the symbols to copy based on the system you are running:
   * Windows: use `hf_xet.pdb`
   * Mac: use `libhf_xet-macosx-x86_64.dylib.dSYM` for Intel-based Macs and `libhf_xet-macosx-aarch64.dylib.dSYM` for Apple Silicon.
   * Linux: the choice depends on the architecture and wheel distribution used. To get this information, `cat` the `WHEEL` file within the `hf_xet.dist-info` directory in your site packages; the wheel file has the Linux build and architecture in its name. Eg: `cat /home/ubuntu/.venv/lib/python3.12/site-packages/hf_xet-*.dist-info/WHEEL`. You will use the file named `hf_xet-<distribution>-<platform>.abi3.so.dbg`, choosing the distribution and platform that matches your wheel. Eg: `hf_xet-manylinux-x86_64.abi3.so.dbg`.
4. Copy the symbols to the site package path from step 2 above + `hf_xet`. Eg: `cp -r hf_xet-1.1.2-manylinux-x86_64.abi3.so.dbg /home/ubuntu/.venv/lib/python3.12/site-packages/hf_xet`
5. Run your Python binary with `RUST_BACKTRACE=full` and recreate your failure.
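For steps 2 and 4, a quick way to find the exact directory to copy the symbols into is to ask Python where the installed `hf_xet` package lives. This is only a convenience sketch; it assumes `hf_xet` is importable in the environment you are debugging.

```python
# Print the hf_xet package directory, i.e. the destination for the debug symbols.
# Assumes hf_xet is installed and importable in the current environment.
import os

import hf_xet

print(os.path.dirname(hf_xet.__file__))
```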
#### Debugging Environment Variables

To enable logging and see more debugging / diagnostics information, set the following:

```
RUST_BACKTRACE=full
RUST_LOG=info
HF_XET_LOG_FILE=/tmp/xet.log
```

Note: `HF_XET_LOG_FILE` expects a full writable path. If one isn't found, logging falls back to the stdout console.
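These variables are normally exported in the shell before launching Python. Setting them from Python should also work, provided it happens before `huggingface_hub` / `hf_xet` is first imported; the sketch below assumes that ordering, and the log path and repo id are only examples.

```python
# Sketch: enable hf-xet backtraces and logging from Python.
# Set the variables before hf_xet is first imported/used so they are picked up.
import os

os.environ["RUST_BACKTRACE"] = "full"
os.environ["RUST_LOG"] = "info"
os.environ["HF_XET_LOG_FILE"] = "/tmp/xet.log"  # full writable path, otherwise logs go to stdout

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="Qwen/Qwen2.5-VL-3B-Instruct", filename="config.json")
```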
## Local Development

### Repo Organization - Rust Crates

* [cas_client](./cas_client): communication with CAS backend services, which include APIs for Xorbs and Shards.
* [cas_object](./cas_object): CAS object (Xorb) format and associated APIs, including chunks (ranges within Xorbs).
* [cas_types](./cas_types): common types shared across crates in xet-core and xetcas.
* [chunk_cache](./chunk_cache): local disk cache of Xorb chunks.
* [chunk_cache_bench](./chunk_cache_bench): benchmarking crate for chunk_cache.
* [data](./data): main driver for client operations - FilePointerTranslator drives hydrating or shrinking files; chunking + deduplication happen here.
* [error_printer](./error_printer): utility for printing errors conveniently.
* [file_utils](./file_utils): SafeFileCreator utility, used by chunk_cache.
* [hf_xet](./hf_xet): Python integration with the Rust code; uses maturin to build the `hf-xet` Python package. Main integration with the HF Hub Python package.
* [mdb_shard](./mdb_shard): Shard operations, including the Shard format, dedupe probing, benchmarks, and utilities.
* [merklehash](./merklehash): MerkleHash type, a 256-bit hash, widely used across many crates.
* [progress_reporting](./progress_reporting): offers ReportedWriter so progress for Writer operations can be displayed.
* [utils](./utils): general utilities, including singleflight, progress, serialization_utils and threadpool.

### Build, Test & Benchmark

To build xet-core, look at the requirements in the [GitHub Actions CI Workflow](.github/workflows/ci.yml) for the Rust toolchain to install. Follow the Rust documentation for installing rustup and that version of the toolchain. Use the following steps for building, testing, and benchmarking. Many of us on the team use [VSCode](https://code.visualstudio.com/), so we have checked in some settings in the .vscode directory. Install the rust-analyzer extension.

Build:
```
cargo build
```

Test:
```
cargo test
```

Benchmark:
```
cargo bench
```

Linting:
```
cargo clippy -r --verbose -- -D warnings
```

Formatting (requires nightly toolchain):
```
cargo +nightly fmt --manifest-path ./Cargo.toml --all
```

### Building Python package and running locally (on *nix systems)

1. Create a Python3 virtualenv: `python3 -m venv ~/venv`
2. Activate the virtualenv: `source ~/venv/bin/activate`
3. Install maturin: `pip3 install maturin ipython`
4. Go to the hf_xet crate: `cd hf_xet`
5. Build: `maturin develop`
6. Test:
```
ipython
import hf_xet as hfxet
hfxet.upload_files()
hfxet.download_files()
```

#### Developing with tokio console

> Prerequisite is installing tokio-console (`cargo install tokio-console`). See [https://github.com/tokio-rs/console](https://github.com/tokio-rs/console)

To use tokio-console with hf-xet, compile hf_xet with the following command:

```sh
RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console
```

Then while hf_xet is running (via an `hf` cli command or `huggingface_hub` python code), `tokio-console` will be able to connect.

Example:

```bash
# In one terminal:
pip install huggingface_hub
RUSTFLAGS="--cfg tokio_unstable" maturin develop -r --features tokio-console
hf download openai/gpt-oss-20b

# In another terminal
cargo install tokio-console
tokio-console
```

#### Building a universal whl for MacOS

From the hf_xet directory:
```
MACOSX_DEPLOYMENT_TARGET=10.9 maturin build --release --target universal2-apple-darwin --features openssl_vendored
```

Note: You may need to add the x86_64 target: `rustup target add x86_64-apple-darwin`

### Testing

Unit tests are run with `cargo test`, and benchmarks are run with `cargo bench`. Some crates have a main.rs that can be run for manual testing.

## References & History

* [Technical Blog posts](https://xethub.com/)
* ["Git is for Data" CIDR paper](https://xethub.com/blog/git-is-for-data-published-in-cidr-2023)
* History: xet-core is adapted from [xet-core](https://github.com/xetdata/xet-core), which contains deep git integration, along with a very different backend services implementation.