A fast, affordable, scalable and open system framework for enabling end-to-end Reinforcement Learning Human Feedback (RLHF) training experience to generate high-quality ChatGPT-style models at all scales.
In the spirit of democratizing ChatGPT-style models and their capabilities, DeepSpeed is proud to introduce a general system framework for enabling an end-to-end training experience for ChatGPT-like models, named DeepSpeed Chat. It can automatically take your favorite pre-trained large language models though an OpenAI InstructGPT style three stages to produce your very own high-quality ChatGPT-style model. DeepSpeed Chat makes training for high-quality ChatGPT-style models easy, fast, affordable and scalable.
With just one click, you can train, generate and serve a 1.3 billion parameter ChatGPT model within 1.36 hours on a single consumer-grade NVIDIA A6000 GPU with 48GB memory. On a single DGX node with 8 NVIDIA A100-40G GPUs, DeepSpeed-Chat enables training for a 13 billion parameter ChatGPT model in 13.6 hours. On multi-GPU multi-node systems (cloud scenarios),i.e., 8 DGX nodes with 8 NVIDIA A100 GPUs/node, DeepSpeed-Chat can train a 66 billion parameter ChatGPT model under 9 hours. Finally, it enables 15X faster training over the existing RLHF systems, and can handle training of ChatGPT-like models with over 200 billion parameters: another impossible feat with the existing systems. For the full range of discussion on various model sizes and low training cost enabled by DeepSpeed-Chat, please refer to the Release Blog and Training Performance Evaluation.
Beyond this release, DeepSpeed system has been proudly serving as the system backend for accelerating a range of on-going efforts for fast training/fine-tuning Chat-Style models (e.g., LLaMA). The following are some of the open-source examples that are powered by DeepSpeed:
DeepSpeed Chat is evolving fast to accommodate the increasing demand for system-level acceleration support for training/finetuning as well as serving emerging models. Please stay tuned with our upcoming milestones at Roadmap.
A summary of DeepSpeed Chat includes:
Please check out our Blog Release and Documentation and Tutorial for more details on our training methodology and new system technologies.
pip install deepspeed>=0.9.0
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt
If you only have around **1-2 hour** for coffee or lunch break, you can also try to train a small/toy model with DeepSpeed-Chat. For example, we prepared a training example for a **1.3B** model with a single dataset to test our framework on your consumer-grade GPUs. The best part is that you will have your model checkpoint ready to play with when you are back from your lunch break!
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
See the following table for the E2E time breakdown for training a 1.3 billion parameter ChatGPT model via DeepSpeed-Chat on a single commodity NVIDIA A6000 GPU with 48GB memory.
Model Size (A6000-48G) | Step 1 | Step 2 | Step 3 | Total |
---|---|---|---|---|
Actor: OPT-1.3B Reward: OPT-350M | 2900 Sec | 670 Sec | 1.2hr | 2.2hr |
If you only have around half a day and only a single server node, we suggest to use an example of pretrained OPT-13B as the actor model and OPT-350M as the reward model in the following single script to generate a final 13B ChatGPT-style model:
python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
See the following table for the E2E time breakdown for training a 13 billion parameter ChatGPT model via DeepSpeed-Chat on a single DGX node with 8 NVIDIA A100-40G GPUs.
Model Size (A100-40G) | Step 1 | Step 2 | Step 3 | Total |
---|---|---|---|---|
Actor: OPT-13B Reward: OPT-350M | 2.5hr | 0.25hr | 10.8hr | 13.6hr |
Want to try different model sizes and configurations? You got it! With DeepSpeed-Chat, users can easily do that. For example, if you have access to multi-nodes cluster or cloud resources and prefer to train a larger and higher-quality model for your research or business, you can simply use a similar script with your desired model sizes, e.g., 66B and GPU counts=64:
python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node
See the following table for E2E time breakdown for training a 66 billion parameter ChatGPT model via DeepSpeed-Chat on 8 DGX nodes with 8 NVIDIA A100-80G GPUs/node.
Model Size (A100-80g) | Step 1 | Step 2 | Step 3 | Total |
---|---|---|---|---|
Actor: OPT-66B Reward: OPT-350M | 82mins | 5mins | 7.5hr | 9hr |
For your detailed budgeting for training a wide spectrum of model sizes on potentially available computing resources (e.g., Microsoft Azure), please visit Training Performance Evaluation.
The train.py script has an easy-to-use command-line interface and can be launched with several arguments including model type, model size, and number of GPUs to run. Considering users who would like to use DeepSpeed-Chat to only fine-tune their pretrained models in Step 1 or 2, or just use their own actor and reward model checkpoints directly to perform Step 3 in our RLHF pipeline, DeepSpeed-Chat provides greater configurability and flexibility to accommodate individual step fine-tuning:
# Move into the first step of the pipeline
cd training/step1_supervised_finetuning/
# Run the training script
bash training_scripts/single_gpu/run_1.3b.sh
# Evaluate the model
bash evaluation_scripts/run_prompt.sh
# Move into the second step of the pipeline
cd training/step2_reward_model_finetuning
# Run the training script
bash training_scripts/run_350m.sh
# Evaluate the model
bash evaluation_scripts/run_eval.sh
Figure 1: The illustration of DeepSpeed Chat’s RLHF training pipeline with optional features.
As the most complex step of the entire 3-step InstructGPT pipeline, DeepSpeed Chat's Hyrbid Engine has enabled sufficient acceleration to aovid large training time (cost) implications. Refer to Step3: Reinforcement Learning Human Feedback (RLHF) for more information. If you already have your fine-tuned actor and reward model checkpoints, you can simply run the following scripts to enable the PPO training.
# Move into the final step of the pipeline
cd training/step3_rlhf_finetuning/
# Run the training script
bash training_scripts/single_gpu/run_1.3b.sh
DeepSpeed-Chat allows users to build their very own RLHF training pipeline using our flexible APIs shown below, which users can use to reconstruct their own RLHF training strategy. This enables a general interface and backend for creating a wide range of RLHF algorithms for research exploration.
engine = DeepSpeedRLHFEngine(
actor_model_name_or_path=args.actor_model_name_or_path,
critic_model_name_or_path=args.critic_model_name_or_path,
tokenizer=tokenizer,
num_total_iters=num_total_iters,
args=args)
trainer = DeepSpeedPPOTrainer(engine=engine, args=args)
for prompt_batch in prompt_train_dataloader:
out = trainer.generate_experience(prompt_batch)
actor_loss, critic_loss = trainer.train_rlhf(out)
For quickly testing your final models trained by DeepSpeed-Chat, we provide a simple script below. For users who want to use our trained models to create different LLM applications such as personal assistant, chatbot and code understanding, please refer to LangChain.
# serve the final model
python chat.py --path ${PATH-to-your-actor-model}
Example 1: Q&A Session from serving a 1.3B final model trained from DeepSpeed-Chat
Example 2: Multi-Round Conversations from serving a model trained from DeepSpeed-Chat
A comprehensive view of the scale and end-to-end training times enabled by DeepSpeed-RLHF system are presented in Table 1. It also demonstrates the most cost-effective way to train models in Azure Cloud along with the associated cost.
GPU SKUs | OPT-1.3B | OPT-6.7B | OPT-13.2B | OPT-30B | OPT-66B | Bloom-175B |
---|---|---|---|---|---|---|
1x V100 32G | 1.8 days | |||||
1x A6000 48G | 1.1 days | 5.6 days | ||||
1x A100 40G | 15.4 hrs | 3.4 days | ||||
1x A100 80G | 11.7 hrs | 1.7 days | 4.9 days | |||
8x A100 40G | 2 hrs | 5.7 hrs | 10.8 hrs | 1.85 days | ||
8x A100 80G | 1.4 hrs($45) | 4.1 hrs ($132) | 9 hrs ($290) | 18 hrs ($580) | 2.1 days ($1620) | |
64x A100 80G | 31 minutes | 51 minutes | 1.25 hrs ($320) | 4 hrs ($1024) | 7.5 hrs ($1920) | 20 hrs ($5120) |
Table 1. End-to-end RLHF training (Step 3) for different actor model sizes and a fixed 350M critical model running on hardware ranging from single consumer grade GPU (NVIDIA A6000) to a more powerful cloud setup (64xA100-80GPUs).
The numbers in the table above are for Stage 3 of the training and based on actual measured training throughput on DeepSpeed-RLHF curated dataset and training recipe which trains for one epoch on a total of 135M tokens (6 open-sourced datasets with 40% used for RLHF training stage, i.e., Dahoas/rm-static, Dahoas/full-hh-rlhf, Dahoas/synthetic-instruct-gptj-pairwise, yitingxie/rlhf-reward-datasets, openai/webgpt_comparisons, and stanfordnlp/SHP from Huggingface Datasets. More specifically, we have in total 67.5M query tokens (131.9k queries with sequence length 256) and 67.5M generated tokens (131.9k answers with sequence length 256), and a maximum global batch size per step of 0.5M tokens (1024 query-answer pairs). We urge readers to pay attention to these specifications before making any cost and e2e time comparisons with DeepSpeed-RLHF.
(I) Single-GPU's Model Scale and Throughput Comparision
With over an order of magnitude higher throughput, DeepSpeed-Chat unlocks the ability to train significantly larger actor models under the same latency budget or train models of similar size at much lower cost, compared to the existing systems like Colossal-AI or HuggingFace-DDP. For example, on a single GPU, DeepSpeed enables over 10X throughput improvement for RLHF training on a single GPU. While both CAI-Coati and HF-DDP can run a max model size of 1.3B, DeepSpeed can run 6.5B model on the same hardware, 5x higher.
Figure 2: Step 3 throughput comparison against two other system frameworks (Colossal AI's Coati and Huggingface-DDP) for accelerating RLHF training on a single NVIDIA A100-40G commodity GPU. No icons represent OOM scenarios.
(II) Single-Node Multi-GPU Model Scale and Throughput Comparision
On multi-GPUs of a single node, DeepSpeed-Chat enables 6-19X speedup over CAI-Coati and 1.4-10.5X speedup over HF-DDP (Figure 3) with respect to system throughput.
Figure 3. End-to-end training throughput comparison for step 3 of the training pipeline (the most time consuming portion) with different model sizes on a single DGX node equipped with 8 NVIDIA A100-40G GPUs. No icons represent OOM scenarios.
(III) Superior Generation Phase Acceleration in Step3
One of the key reasons that result in Figure 3 is our Hyrbid Engine's superior generation phase acceleration, shown below.
Figure 4. Superior generation phase acceleration from DeepSpeed Chat’s Hybrid Engine: A time/sequence breakdown for training OPT-1.3B actor model + OPT-350M reward model on a single DGX node with 8 A100-40G GPUs.
For other detailed results and in-depth analysis, including effective throughput and scalability of DeepSpeed-Chat, please refer to our Blog Post.
Currently, we support the following model families. We will continue to grow over time to including emerging models for ChatGPT-style training! See Roadmap for more details.
model family | size range |
---|---|
opt | 0.1B - 66B |
bloom | 0.3B - 176B |
gpt_neox | 1.3B - 20B |
gptj | 1.4B - 6B |
gpt_neo | 0.1B - 2.7B |
gpt2 | 0.3B - 1.5B |
codegen | 0.35b - 16B |
For more APIs, example scripts, and evaluation results, please refer to
Our future plan includes but not limited to :
Just like how the success of the BLOOM model was supported by both DeepSpeed Team and many open source contributors, we welcome all AI developers/practitioners/researchers to join this on-going effort for DeepSpeed-Chat. To participate:
We thank the following papers and open-source repositories:
[1] Schulman, John, et al. "Introducing ChatGPT", https://openai.com/blog/chatgpt (2022).
[2] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022). This is also referred as InstructGPT
[3] Stiennon, Nisan, et al. "Learning to summarise with human feedback." Advances in Neural Information Processing Systems 33 (2020): 3008-3021.
[4] Transformers Hugging Face (github.com)
[5] CarperAI, https://github.com/CarperAI/trlx
[6] lvwerra/trl: Train transformer language models with reinforcement learning. (github.com)
[7] pg-is-all-you-need/02.PPO.ipynb at master · MrSyee/pg-is-all-you-need (github.com)
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。