Torchchat is an easy-to-use library for running LLMs on edge devices, including mobile phones and desktops.
The following steps require that you have Python 3.10 and [virtualenv](https://virtualenv.pypa.io/en/latest/installation.html) installed.
# set up a virtual environment
python3 -m virtualenv .venv/torchchat
source .venv/torchchat/bin/activate
# get the code and dependencies
git clone https://github.com/pytorch/torchchat.git
cd torchchat
pip install -r requirements.txt
# ensure everything installed correctly. If this command works you'll see a welcome message and some details
python torchchat.py --help
python torchchat.py generate stories15M
That’s all there is to it! Read on to learn how to use the full power of torchchat.
For the full details on all commands and parameters run python torchchat.py --help
For supported models, torchchat can download model weights. Most models use HuggingFace as the distribution channel, so you will need to create a HuggingFace account and install huggingface-cli. To install huggingface-cli, run pip install huggingface-cli. After installing, create a user access token as documented here. Run huggingface-cli login, which will prompt for the newly created token. Once this is done, torchchat will be able to download model artifacts from HuggingFace.
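Put together, the login flow looks like this (the login step prompts interactively for the token):
# install the HuggingFace CLI
pip install huggingface-cli
# log in with the user access token you created
huggingface-cli login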
python torchchat.py download llama2
Designed for interactive and conversational use. In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation. This mode is typically what you see in applications aimed at simulating conversational partners or providing customer support.
For more information run python torchchat.py chat --help
Examples
# Chat with some parameters
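# NOTE: a sketch only; the flags mirror the generate example in the next section
python torchchat.py chat llama2 --device=cpu --dtype=fp16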
Aimed at producing content based on specific prompts or instructions. In generate mode, the LLM focuses on creating text based on a detailed prompt or instruction. This mode is often used for generating written content like articles, stories, reports, or even creative writing like poetry.
For more information run python torchchat.py generate --help
Examples
python torchchat.py generate llama2 --device=cpu --dtype=fp16
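A prompt can also be passed directly; the --prompt flag here is the same one used in the AOTI example further down:
python torchchat.py generate llama2 --prompt "Hello my name is" --device=cpu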
Compiles a model for different use cases
For more information run python torchchat.py export --help
Examples
python torchchat.py export stories15M --output-pte-path=stories15m.pte
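Export can also produce a shared library for the AOT Inductor path described further down. The command below is a sketch combining the alias form above with the --output-dso-path flag from the AOTI section; the exact combination of arguments may differ:
python torchchat.py export stories15M --output-dso-path stories15m.so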
Run a chatbot in your browser, backed by the model you specify in the command.
Examples
# Run torchchat with --browser
python torchchat.py browser --device cpu --checkpoint-path ${MODEL_PATH} --temperature 0 --num-samples 10
The terminal should print Running on http://127.0.0.1:5000. Click the link or go to http://127.0.0.1:5000 in your browser to start interacting with the model.
Enter some text in the input box, then hit the Enter key or click the “SEND” button. After a second or two, the text you entered along with the generated text will be displayed. Repeat to have a conversation.
Uses the lm_eval library to evaluate model accuracy on a variety of tasks. It defaults to wikitext and can be controlled manually using the tasks and limit args.
For more information run python torchchat.py eval --help
Examples
Eager mode:
# Eval example with some parameters
python torchchat.py eval --device cuda --checkpoint-path ${MODEL_PATH} -d fp32 --limit 5
To test the perplexity of a lowered or quantized model, pass it in the same way you would to generate:
python3 torchchat.py eval --pte <pte> -p <params.json> -t <tokenizer.model> --limit 5
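Tasks can also be selected explicitly. The sketch below assumes eval accepts a model alias positionally and a --tasks flag, following the tasks and limit args mentioned above:
python torchchat.py eval stories15M --tasks wikitext --limit 5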
These are the supported models:
See the documentation on GGUF to learn how to use GGUF files.
Examples
# Llama3
# Stories
# CodeLlama
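The commands below are sketches patterned on the download command shown earlier; the alias names (llama3, stories15M, codellama) are assumptions and may differ from the actual aliases:
python torchchat.py download llama3
python torchchat.py download stories15M
python torchchat.py download codellama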
AOT compiles models into machine code before execution, enhancing performance and predictability. It's particularly beneficial for frequently used models or those requiring quick start times. AOTI also increases security by not exposing the model at runtime. However, it may lead to larger binary sizes and lacks the flexibility of runtime optimization.
Examples
The following example uses the Stories15M model.
TODO: Update after the CLI gets fixed. Use real paths so users can copy and paste.
# Compile
python torchchat.py export --checkpoint-path ${MODEL_PATH} --device {cuda,cpu} --output-dso-path ${MODEL_OUT}/${MODEL_NAME}.so
# Execute
python torchchat.py generate --device {cuda,cpu} --dso-path ${MODEL_OUT}/${MODEL_NAME}.so --prompt "Hello my name is"
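As a concrete, copy-and-paste style sketch for the Stories15M model; the checkpoint path and output name are assumptions, so substitute the paths from your own setup:
# compile Stories15M to a shared library (paths are illustrative)
python torchchat.py export --checkpoint-path checkpoints/stories15M/stories15M.pt --device cpu --output-dso-path ./stories15M.so
# run the compiled model
python torchchat.py generate --device cpu --dso-path ./stories15M.so --prompt "Hello my name is"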
NOTE: The exported model will be large. We suggest you quantize the model, explained further down, before deploying the model for use.
ExecuTorch enables you to optimize your model for execution on a mobile or embedded device.
Use ExecuTorch if you want to:
- deploy and execute a model within your iOS app
- deploy and execute a model within your Android app
- deploy and execute a model on an edge device
- experiment with our sample apps (check out our iOS and Android sample apps)
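The mobile path starts from a .pte export like the one shown earlier, for example:
python torchchat.py export stories15M --output-pte-path=stories15m.pte
The resulting .pte file is the artifact the iOS and Android sample apps are designed to run.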
Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect, maintaining a balance between efficiency and accuracy.
Read the Quantization documentation for more details.
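As a minimal sketch, quantization is applied through a config passed at generate or export time; the --quantize flag and the config path below are assumptions, so check the Quantization docs for the exact interface:
# apply a quantization config at generation time (flag and path are assumptions)
python torchchat.py generate llama2 --quantize path/to/quant_config.json --prompt "Hello my name is"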
Prerequisites
Install ExecuTorch