# zero-bubble-pipeline-parallelism

**Repository Path**: li-baokui/zero-bubble-pipeline-parallelism

## Basic Information

- **Project Name**: zero-bubble-pipeline-parallelism
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-07-17
- **Last Updated**: 2024-07-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

This repository is a fork of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/). The original README can be found [here](Megatron.md).

# Zero Bubble Pipeline Parallelism

Zero Bubble Pipeline Parallelism is a novel pipeline parallelism algorithm that reduces the pipeline bubble to almost zero while preserving synchronous semantics.

Check out our paper at:
* [Arxiv Version with ZBV](https://arxiv.org/abs/2401.10241)
* [ICLR Accepted version with ZB1P and ZB2P](https://openreview.net/pdf?id=tuzTN0eIO5)

A playground for zero bubble schedulers: [Zero Bubble Pipeline Parallelism Scheduler Playground](https://huggingface.co/spaces/sail/zero-bubble-pipeline-parallellism)

**Quick settings to enable Zero Bubble:**
```
--zero-bubble-v-schedule
--allow-padding-num-layers
--enable-optimizer-post-validation
```
You can also try it out with
`ZERO_BUBBLE_V_SCHEDULE=1 examples/pretrain_zero_bubble.sh`

**Lightweight alternatives to enable the ZB H1 schedule in your own Megatron fork**
* Option 1: Apply a tiny ~40-line patch to your repository as described in [zb-h1-quick-start](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/blob/zb-h1-quick-start/README.md)
* Option 2: Install our pre-built zbpp packages and enable them in your own training scripts (e.g. `pretrain_gpt.py`)
```
# installed by pip install zbpp_light
import zbpp_light
zbpp_light.patch_megatron()

import megatron
...
```

**Acceleration**

Experiments show that zero bubble pipeline parallelism can accelerate training by up to 30% with similar memory consumption. A detailed table of experiments is coming soon.

**Notices**

* The ZBV schedule requires the number of layers per pipeline stage to be even, so that each stage can be split evenly into two virtual stages.
* For better throughput, we recommend setting `--num-layers` to `k * pipeline-model-parallel-size - 2`, where k can be any integer $\ge1$. This compensates for the additional embedding layer on the first/last pipeline stages, which would otherwise introduce bubbles on all other stages.

## Zero Bubble Schedules

The key to achieving zero bubble is to break the backward pass into a $B$ pass and a $W$ pass. $B$ on one stage depends only on the $B$ of its next stage, whereas in 1F1B it depends on both $B$ and $W$.

![image](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/assets/2740430/0ab6f76c-1cf0-4962-a664-124fcb3886d6)

### Comparison of Schedules

* 1F1B
![image](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/assets/2740430/1658cba3-7fef-4c41-a227-69c6b4581f50)
* ZB1P
![image](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/assets/2740430/876bd529-c454-41ab-ad85-30dfb5e1c8fa)
* ZB2P
![image](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/assets/2740430/373f6a27-6a7d-4a0e-92cb-a581c2c13cd5)
* ZBV - Each device is assigned exactly 2 chunks (virtual stages); white text marks the first chunk and black text marks the second chunk. The sequence of dependencies among model chunks follows a "V" shape pattern for both the forward and backward passes.
![image](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/assets/2740430/1e9490a9-e593-4bda-833e-8babbaea045b)

| | 1F1B | ZB1P | ZB2P | ZBV (Recommended) |
| ----------------------------------------------------- | ------- | -------- | ---- | --- |
| Bubble Rate | $(p-1)/m$ | $(p-1)/(3m)$ | 0 | 0 |
| Activation Memory (Compared to 1F1B) | 1x | 1x | 2x | 1x |
| Pipeline Communication Volume (Compared to 1F1B) | 1x | 1x | 1x | 2x |

* p: number of pipeline stages; m: number of microbatches

* Assuming $T_F = T_B = T_W$

* Communication volume of DP and TP stays the same
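
As a quick illustration of the bubble rates in the table, the sketch below evaluates them for a given number of pipeline stages `p` and microbatches `m`. The function name `bubble_rate` is hypothetical and not part of this repository; the formulas are taken directly from the table and hold only under the $T_F = T_B = T_W$ assumption stated above.

```python
def bubble_rate(schedule: str, p: int, m: int) -> float:
    """Bubble rate from the comparison table, assuming T_F = T_B = T_W.

    p: number of pipeline stages, m: number of microbatches.
    This helper is illustrative only; it is not part of the repository.
    """
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive")
    if schedule == "1F1B":
        return (p - 1) / m
    if schedule == "ZB1P":
        return (p - 1) / (3 * m)
    if schedule in ("ZB2P", "ZBV"):
        return 0.0
    raise ValueError(f"unknown schedule: {schedule}")


if __name__ == "__main__":
    p, m = 8, 24
    for name in ("1F1B", "ZB1P", "ZB2P", "ZBV"):
        print(f"{name}: {bubble_rate(name, p, m):.3f}")
```

For example, with `p = 8` and `m = 24` the table gives a bubble rate of 7/24 for 1F1B and 7/72 for ZB1P, a third of the 1F1B value.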

## Zero Bubble Command Line Arguments

* `--enable-zero-bubble` Enables zero bubble schedules.
* `--zero-bubble-v-schedule` Enables the ZBV schedule recommended above. Implies `--enable-zero-bubble`.
* `--enable-optimizer-post-validation` Enables optimizer post validation, explained in [Optimizer Post Validation](#Optimizer-Post-Validation).
* `--allow-padding-num-layers` Allows the number of layers to NOT be a multiple of the number of pipeline stages. This lets the first and last pipeline stages hold one less layer each, to compensate for the bubble caused by the embedding layers.
* `--zero-bubble-max-pending-backward` Controls the memory limit of zero bubble schedules. Setting this to 1x the number of pipeline stages yields a schedule like ZB1P, while setting it to 2x the number of pipeline stages yields ZB2P. It has no effect on the ZBV schedule enabled by `--zero-bubble-v-schedule`.
* `--zero-bubble-pipeline-timers-start-iter` and `--zero-bubble-pipeline-timers-end-iter` Control the start/end iterations during which the ZB scheduler profiles each F/B/W pass to measure $T_F$, $T_B$ and $T_W$.

## Optimizer Post Validation

In most PP setups there is an all-reduce across all pipeline stages for numerical robustness, e.g. the global gradient norm for gradient clipping, the INF/NAN check for mixed precision training, etc. This all-reduce breaks the parallelogram and makes zero bubble impossible.
Based on the observation that during stable training both gradient clipping and INF/NAN are rarely triggered, we replace the beforehand synchronizations with a post-update validation.

![image](https://github.com/sail-sg/zero-bubble-pipeline-parallelism/assets/2740430/40be4651-7240-4962-bd2a-246557752768)

We eagerly step the optimizers, assuming the gradient clipping and INF/NAN conditions are not triggered. If an amendment to the gradient turns out to be required, a rollback is issued and the optimizer step is redone based on the fully reduced global state.

To enable this feature, add `--enable-optimizer-post-validation`. Experiments show that NOT enabling it causes a ~8% performance loss.
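
The step-then-validate idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the repository's implementation: it uses plain SGD on Python lists, the function name `eager_step_with_post_validation` is hypothetical, and the globally reduced gradient norm is passed in as an argument rather than obtained from communication across pipeline stages. Skipping the step on a non-finite norm is also an assumption, borrowed from common mixed precision practice.

```python
import math


def eager_step_with_post_validation(params, grads, lr, global_grad_norm, max_norm):
    """Toy SGD step: update first, validate afterwards, roll back if needed.

    Hypothetical helper for illustration; global_grad_norm stands in for the
    fully reduced norm that only becomes available after the eager step.
    """
    backup = list(params)  # state kept for a possible rollback

    # Optimistic update: assume no clipping and no INF/NAN.
    stepped = [p - lr * g for p, g in zip(params, grads)]

    # Post validation against the globally reduced norm.
    if not math.isfinite(global_grad_norm):
        # Assumed policy: roll back and skip the step entirely.
        return backup, "rolled back, step skipped"
    if global_grad_norm > max_norm:
        # Roll back, then redo the step with clipped gradients.
        scale = max_norm / global_grad_norm
        redone = [p - lr * g * scale for p, g in zip(backup, grads)]
        return redone, "rolled back, redone with clipping"
    return stepped, "kept"


if __name__ == "__main__":
    params, grads = [1.0, 2.0], [0.5, 0.5]
    for norm in (0.7, 2.0, float("inf")):
        print(norm, eager_step_with_post_validation(params, grads, 0.1, norm, 1.0))
```

In the common case the eager result is kept and no synchronization sits on the critical path; the rollback branches only run in the rare iterations where clipping or an INF/NAN check fires.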