This is an inference engine for the RWKV language model, implemented in pure WebGPU.
Note that `web-rwkv` is only an inference engine. It only provides the following functionalities:

- Tokenizer.
- Model loading.
- State creation and updating.
- A `run` function that takes in prompt tokens and returns logits (predicted next-token probabilities after calling `softmax`).

It does not provide the following:

- APIs of any kind. Check out the `web-rwkv-axum` project, built upon `web-rwkv`, if you want some fancy inference pipelines, including Classifier-Free Guidance (CFG), Backus–Naur Form (BNF) guidance, and more.
- A runtime. Having no runtime makes it easy to integrate `web-rwkv` into any application, from servers to front-end UIs (yes, `web-rwkv` can run in the browser) to game engines.

To run the examples, download a model from HuggingFace and convert it via the provided `convert_safetensors.py`. Put the `.st` model under `assets/models`, then compile and run:
```bash
$ cargo run --release --example gen
$ cargo run --release --example chat
$ cargo run --release --example batch
```
To specify the location of your safetensors model, use the `--model` option:

```bash
$ cargo run --release --example chat -- --model /path/to/model
```

To load a custom prompt for the chat example, use the `--prompt` option (check `assets/prompt.json` for details on the format):

```bash
$ cargo run --release --example chat -- --prompt /path/to/prompt
```

To quantize the model, use `--quant <LAYERS>` (int8) or `--quant-nf4 <LAYERS>` (NF4) to quantize the first `<LAYERS>` layers. For example, use

```bash
$ cargo run --release --example chat -- --quant 32
```

to quantize all 32 layers.

Use the `--turbo` flag to switch to an alternative GEMM kernel when inferring long prompts.

To use `web-rwkv` in your own Rust project, simply add `web-rwkv = "0.4"` as a dependency in your `Cargo.toml`. Check the examples on how to create the environment, the tokenizer, and how to run the model.
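As a rough orientation, here is a minimal sketch of that flow, condensed from the crate's `chat` example. The module paths, builder names, and signatures (`Instance`, `ContextBuilder`, `ModelBuilder`, `ModelState::new`, `Model::run`) are approximations of the 0.4-era API and may differ in your version; treat this as a sketch of the call sequence, not a drop-in snippet.

```rust
use anyhow::Result;
// Module paths and type names below are assumed from the 0.4-era API.
use web_rwkv::{
    context::{ContextBuilder, Instance},
    model::{Model, ModelBuilder, ModelState},
    tokenizer::Tokenizer,
};

async fn demo() -> Result<()> {
    // 1. Create the WebGPU environment (adapter, device, queue).
    let instance = Instance::new();
    let adapter = instance
        .adapter(wgpu::PowerPreference::HighPerformance)
        .await?;
    let context = ContextBuilder::new(adapter).build().await?;

    // 2. Load the tokenizer from the vocab file shipped in `assets`.
    let vocab = std::fs::read_to_string("assets/rwkv_vocab_v20230424.json")?;
    let tokenizer = Tokenizer::new(&vocab)?;

    // 3. Load the converted `.st` model and create an inference state.
    let data = std::fs::read("assets/models/model.st")?;
    let model: Model = ModelBuilder::new(&context, &data).build()?; // signature assumed
    let state = ModelState::new(&context, model.info(), 1); // one batch slot; assumed

    // 4. Encode a prompt and run the model; `run` returns logits to sample from.
    let mut tokens = vec![tokenizer.encode("Hello".as_bytes())?];
    let _logits = model.run(&mut tokens, &state).await?; // signature assumed
    Ok(())
}
```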
Since version v0.2.4, the engine supports batched inference, i.e., inference on a batch of prompts of different lengths in parallel. This is achieved by a modified `WKV` kernel.

When building the model, the user specifies `token_chunk_size` (default: 32, but for powerful GPUs this could be much higher), which is the maximum number of tokens the engine can process in one `run` call.

After creating the model, the user creates a `ModelState` with `max_batch` specified. This means that there are `max_batch` slots that can consume inputs in parallel.

Before calling `run()`, the user fills each slot with some tokens as the prompt. If a slot is empty, no inference is run for it.

After calling `run()`, some (but not necessarily all) input tokens are consumed, and logits appear in the corresponding returned slots if inference for those slots finished during this run. Since only `token_chunk_size` tokens are processed per `run()` call, it can happen that no logits appear in the results at all.
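To make the slot mechanics concrete, here is a hypothetical sketch of a batched decoding loop, paraphrased from the crate's `batch` example. The assumed `run` contract (per-slot token queues in a `Vec<Vec<u16>>`, consumed tokens drained in place, one `Option<Vec<f32>>` of logits returned per slot) and the `sample` helper are illustrative assumptions, not the library's verbatim API.

```rust
// Assumes `model`, `state` (created with max_batch = 3 slots), `tokenizer`,
// and a user-provided `sample` function exist; none of these are shown here.
let mut tokens: Vec<Vec<u16>> = vec![
    tokenizer.encode("The Eiffel Tower is".as_bytes())?,
    tokenizer.encode("Once upon a time,".as_bytes())?,
    vec![], // empty slot: no inference is run for it
];

for _ in 0..100 {
    // Each call processes at most `token_chunk_size` tokens across all slots,
    // draining the consumed tokens from `tokens` in place.
    let outputs = model.run(&mut tokens, &state).await?; // signature assumed

    // A slot yields logits only once its whole prompt has been consumed, so
    // with long prompts several consecutive calls may return no logits at all.
    for (slot, output) in outputs.into_iter().enumerate() {
        if let Some(logits) = output {
            // `sample` is a user-supplied sampler (the library provides none).
            let next = sample(&logits);
            tokens[slot].push(next); // feed the chosen token back into its slot
        }
    }
}
```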
If you are building from source, you must download a model and put it in `assets/models` before running. You can download converted models here. Alternatively, you may download the official RWKV World series models from HuggingFace and convert them via the provided `convert_safetensors.py`.
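For reference, a typical invocation would look like the line below; the `--input` and `--output` flag names mirror the Rust converter that follows and are an assumption, so check the script's `--help` for its actual interface.

```bash
$ python convert_safetensors.py --input /path/to/model.pth --output assets/models/model.st
```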
If you don't have Python installed or don't want to use it, there is a pure Rust converter that you can run instead:

```bash
$ cd ./crates/web-rwkv-converter
$ cargo run --release -- --input /path/to/model.pth
```
"thread 'main' panicked at 'called Result::unwrap()
on an Err
value: HeaderTooLarge'"
Your model is broken, mainly because you cloned the repo but did not set up git-lfs.Please download the model manually and overwrite that one in assets/models
.
"thread 'main' panicked at 'Error in Queue::submit: parent device is lost'"
Your GPU is not responding. Maybe you are running a model that is just too big for your device. If the model doesn't fit into your VRAM, the driver has to constantly swap and transfer the model parameters, making inference about 10x slower. Try quantizing your model first.