lineCodeJm / web-rwkv

forked from cryscan / web-rwkv 

Web-RWKV

crates.io docs.rs

Web-RWKV is an inference engine for the RWKV language model, implemented in pure WebGPU.

Features

  • No dependencies on CUDA/Python.
  • Supports Nvidia/AMD/Intel GPUs, including integrated GPUs.
  • Vulkan/Dx12/OpenGL backends.
  • Batched inference.
  • Int8 and NF4 quantization.
  • Very fast.
  • LoRA merging at loading time.
  • Supports RWKV V4, V5 and V6.

Note that web-rwkv is only an inference engine. It provides only the following functionality:

  • A tokenizer.
  • Model loading.
  • State creation and updating.
  • A run function that takes in prompt tokens and returns logits (which become predicted next-token probabilities once softmax is applied).
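
To make the relationship between logits and probabilities concrete, here is a minimal sketch of a CPU-side softmax. The function name `softmax` and its signature are chosen for illustration only; this is not a description of the web-rwkv API.

```rust
/// Convert raw logits into a probability distribution.
/// Subtracting the maximum first keeps `exp` from overflowing.
/// (Illustrative helper; not part of the web-rwkv API.)
fn softmax(logits: &[f32]) -> Vec<f32> {
    if logits.is_empty() {
        return Vec::new();
    }
    // Largest logit, used to shift all values to be <= 0.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // Normalize so the outputs sum to 1.
    exps.into_iter().map(|e| e / sum).collect()
}
```

The output has one entry per vocabulary token, and a sampler can then draw the next token from it.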

It does not provide the following:

  • OpenAI API or APIs of any kind.
    • If you would like to deploy an API server, check AI00 RWKV Server which is a fully-functional OpenAI-compatible API server built upon web-rwkv.
    • You could also check the web-rwkv-axum project if you want some fancy inference pipelines, including Classifier-Free Guidance (CFG), Backus–Naur Form (BNF) guidance, and more.
  • Samplers. The examples implement a basic nucleus sampler, but it is not part of the library itself.
  • State caching or management system.
  • Bindings for Python or any other language.
  • A runtime. Having no built-in runtime makes web-rwkv easy to integrate into any application, from servers and front-end apps (yes, web-rwkv can run in a browser) to game engines.
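
Since sampling is left to the caller, a nucleus (top-p) sampler could look roughly like the sketch below. It is written from the general definition of nucleus sampling rather than copied from the examples, so the name `sample_nucleus` and its parameters are assumptions. To stay dependency-free, the caller supplies the uniform random number `u` in [0, 1).

```rust
/// Nucleus (top-p) sampling over a probability distribution.
/// `probs` should sum to about 1, `top_p` lies in (0, 1], and `u` is a
/// uniform random number in [0, 1) supplied by the caller.
/// Returns the index of the sampled token. Panics if `probs` is empty.
/// (Illustrative helper; not part of the web-rwkv API.)
fn sample_nucleus(probs: &[f32], top_p: f32, u: f32) -> usize {
    // Sort token indices by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0;
    let mut cutoff = order.len();
    for (i, &token) in order.iter().enumerate() {
        cumulative += probs[token];
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    let nucleus = &order[..cutoff];

    // Draw from the nucleus, scaled by its total probability mass.
    let mass: f32 = nucleus.iter().map(|&t| probs[t]).sum();
    let mut threshold = u * mass;
    for &token in nucleus {
        threshold -= probs[token];
        if threshold < 0.0 {
            return token;
        }
    }
    // Rounding-error fallback: the least likely token that was kept.
    *nucleus.last().expect("probs must not be empty")
}
```

In a real application `u` would come from a random number generator, and `probs` from applying softmax to the logits returned by the engine.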

Compile and Run

  1. Install Rust.
  2. Download the model from HuggingFace, and convert it using convert_safetensors.py. Put the .st model under assets/models.
  3. To generate 100 tokens and measure the time cost, run
    $ cargo run --release --example gen
  4. To chat with the model, run
    $ cargo run --release --example chat
  5. To generate 4 batches of text with various lengths simultaneously, run
    $ cargo run --release --example batch
  6. To specify the location of your safetensors model, use
    $ cargo run --release --example chat -- --model /path/to/model
  7. To load custom prompts for chat, use
    $ cargo run --release --example chat -- --prompt /path/to/prompt
    See assets/prompt.json for details.
  8. To specify layer quantization, use --quant <LAYERS> or --quant-nf4 <LAYERS> to quantize the first <LAYERS> layers. For example, use
    $ cargo run --release --example chat -- --quant 32
    to quantize all 32 layers.
  9. Use the --turbo flag to switch to an alternative GEMM kernel when inferring long prompts.

Use in Your Project

To use web-rwkv in your own Rust project, add web-rwkv = "0.4" as a dependency in your Cargo.toml. See the examples for how to create the environment and the tokenizer, and how to run the model.

Explanation of Batched Inference

Since v0.2.4, the engine supports batched inference, i.e., running inference on a batch of prompts of different lengths in parallel. This is achieved by a modified WKV kernel.

When building the model, the user specifies token_chunk_size (default: 32, but for powerful GPUs this can be much higher), which is the maximum number of tokens the engine can process in one run() call.

After creating the model, the user creates a ModelState with max_batch specified. This means that there are max_batch slots that can consume inputs in parallel.

Before calling run(), the user fills each slot with some tokens as prompt. If a slot is empty, no inference will be run for it.

After calling run(), some (but not necessarily all) input tokens are consumed, and logits appear in the corresponding returned slot if inference for that slot finished during this run. Since at most token_chunk_size tokens are processed during each run() call, the results may contain no logits at all.
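
The slot behaviour described above can be illustrated with a toy simulation. This is not the web-rwkv API: `run_step` is a hypothetical stand-in for `run()`, slots are plain vectors of token IDs, the "logits" are replaced by an echo of the last consumed token, and the greedy in-order split of the token budget across slots is an assumption made for simplicity.

```rust
/// Toy stand-in for one `run()` call. Consumes at most `token_chunk_size`
/// tokens in total across all slots. A slot yields an output only when its
/// last pending token is consumed in this call.
/// NOTE: the greedy, in-order budget split is an assumption for
/// illustration; it is not taken from the web-rwkv source.
fn run_step(slots: &mut [Vec<u32>], token_chunk_size: usize) -> Vec<Option<u32>> {
    let mut budget = token_chunk_size;
    let mut outputs = vec![None; slots.len()];
    for (slot, output) in slots.iter_mut().zip(outputs.iter_mut()) {
        if slot.is_empty() || budget == 0 {
            continue; // empty slots are skipped; no budget means no progress
        }
        let take = slot.len().min(budget);
        let consumed: Vec<u32> = slot.drain(..take).collect();
        budget -= take;
        if slot.is_empty() {
            // Placeholder for logits: echo the last consumed token.
            *output = consumed.last().copied();
        }
    }
    outputs
}
```

With three slots holding 3, 0, and 5 tokens and a chunk size of 4, the first call finishes the first slot and consumes one token of the third, so only the first slot reports a result. The second call finishes the third slot. A single slot holding more tokens than the chunk size reports nothing on its first call, which is the "no logits at all" case.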

Convert Models

If you are building from source, you must download a model and put it in assets/models before running. Converted models are now available for download here.

You may download the official RWKV World series models from HuggingFace, and convert them via the provided convert_safetensors.py.

If you don't have Python installed or would rather not install it, there is a pure Rust converter that you can run:

$ cd ./crates/web-rwkv-converter
$ cargo run --release -- --input /path/to/model.pth

Troubleshoot

  • "thread 'main' panicked at 'called Result::unwrap() on an Err value: HeaderTooLarge'"

    Your model file is broken, most likely because you cloned the repo without setting up git-lfs. Please download the model manually and overwrite the one in assets/models.

  • "thread 'main' panicked at 'Error in Queue::submit: parent device is lost'"

    Your GPU is not responding. You may be running a model that is too big for your device. If the model doesn't fit into your VRAM, the driver has to constantly swap and transfer the model parameters, making inference about 10x slower. Try quantizing your model first.
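
As a rough illustration of why quantization helps here, the sketch below estimates weight memory from the parameter count and the number of bits stored per parameter. The helper `weight_gib` is hypothetical, and the figures of 16, 8, and 4 bits for unquantized, Int8, and NF4 weights are assumptions that ignore the scale factors quantized formats also store, as well as activations and state.

```rust
/// Approximate weight memory in GiB for `params` parameters stored at
/// `bits_per_param` bits each. Ignores activations, state, and the extra
/// scale factors that quantized formats carry.
/// (Illustrative helper; not part of the web-rwkv API.)
fn weight_gib(params: u64, bits_per_param: u32) -> f64 {
    let bytes = params as f64 * bits_per_param as f64 / 8.0;
    bytes / (1024.0 * 1024.0 * 1024.0)
}
```

For a hypothetical 7-billion-parameter model, `weight_gib(7_000_000_000, 16)` is about 13.0 GiB, while 8 bits gives about 6.5 GiB and 4 bits about 3.3 GiB. Comparing these numbers against your available VRAM gives a first guess at how many layers need quantizing.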

Credits

  • Tokenizer is implemented by @koute.

This repository is dual-licensed under either

  • MIT License (docs/LICENSE-MIT or http://opensource.org/licenses/MIT)
  • Apache License, Version 2.0 (docs/LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)

at your option.