For a new model that uses Sharding Propagation to configure its parallelization strategy, the key question is which operators should be given explicit sharding strategies to achieve good performance. Since the goal of strategy propagation is to minimize the cost of tensor redistribution rather than to minimize the end-to-end iteration time, it is important to configure appropriate sharding strategies for the "key operators". However, there is no explicit rule dictating which operators must be configured with a sharding strategy. Nevertheless, based on our experience in training large models, there are some principles that can guide new users in configuring parallel strategies. Here, we list four empirical principles.
First, the sharding strategy for parameter weights is very important. For large models in particular, the memory consumed by parameter weights accounts for a large portion of the total memory needed for training. Therefore, operators involving weights usually need an explicitly configured sharding strategy. In the two examples below, the Gather and MatMul operators that involve weights are configured with sharding strategies, while the other operators are not. These correspond to the data-parallel VocabEmbedding layer and the hybrid-parallel FeedForward layer in MindFormers, respectively.
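The following sketch illustrates both cases under assumed parallel factors dp and mp; the class names, shapes, and values are simplified stand-ins for the MindFormers layers, not their actual source. Only the Gather and MatMul operators that consume weights carry explicit strategies; everything else is left to propagation.

```python
from mindspore import nn, ops, Parameter
from mindspore.common.initializer import initializer

dp, mp = 2, 4  # assumed data-parallel / model-parallel factors

class VocabEmbedding(nn.Cell):
    """Data-parallel embedding: only the weight-consuming Gather is configured."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding_table = Parameter(
            initializer('normal', [vocab_size, hidden_size]), name='embedding_table')
        # Replicate the embedding table, split the token ids along the batch dimension.
        self.gather = ops.Gather().shard(((1, 1), (dp, 1)))

    def construct(self, input_ids):
        return self.gather(self.embedding_table, input_ids, 0)

class FeedForward(nn.Cell):
    """Hybrid-parallel feed-forward: only the weight-consuming MatMuls are configured."""
    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        self.w1 = Parameter(initializer('normal', [hidden_size, ffn_hidden_size]), name='w1')
        self.w2 = Parameter(initializer('normal', [ffn_hidden_size, hidden_size]), name='w2')
        # First MatMul: split the weight's output dimension along the model-parallel axis.
        self.matmul1 = ops.MatMul().shard(((dp, 1), (1, mp)))
        # Second MatMul: split the reduction dimension along the model-parallel axis.
        self.matmul2 = ops.MatMul().shard(((dp, mp), (mp, 1)))
        # The activation is left unconfigured; sharding propagation derives its strategy.
        self.gelu = ops.GeLU()

    def construct(self, x):
        x = self.matmul1(x, self.w1)
        x = self.gelu(x)
        return self.matmul2(x, self.w2)
```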
Second, the operators of deep learning frameworks can be broadly categorized into two types: operators that are semantically simple and dimension-preserving, and operators that change the dimensions of the input tensor. For dimension-preserving operators, the strategy propagation algorithm can propagate the sharding strategy easily. For dimension-changing operators, however, explicitly configuring the sharding strategy is the only reliable way to express the user's intent and to prevent the propagation algorithm from deriving a sharding strategy the user does not expect. Common dimension-changing and axis-changing operators include ReduceMean, ReduceSum, Transpose, StridedSlice, MatMul, and BatchMatMul. In the example below, ReduceMean and MatMul are dimension-changing operators that are configured with sharding strategies.
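A possible sketch, again with assumed dp/mp values and an illustrative network: the dimension-changing MatMul and ReduceMean are configured explicitly, while the dimension-preserving ReLU is left to the propagation algorithm.

```python
from mindspore import nn, ops, Parameter
from mindspore.common.initializer import initializer

dp, mp = 2, 4  # assumed parallel factors

class Subnet(nn.Cell):
    """Dimension-changing operators carry explicit strategies; ReLU does not."""
    def __init__(self, hidden_size):
        super().__init__()
        self.weight = Parameter(initializer('normal', [hidden_size, hidden_size]), name='weight')
        # MatMul changes the tensor layout between inputs and output: configure it explicitly.
        self.matmul = ops.MatMul().shard(((dp, 1), (1, mp)))
        # Element-wise ReLU is dimension-preserving and is left to propagation.
        self.relu = ops.ReLU()
        # ReduceMean removes the last axis; the reduced axis is split by mp here,
        # so the framework inserts the communication needed to finish the reduction.
        self.reduce_mean = ops.ReduceMean(keep_dims=False).shard(((dp, mp),))

    def construct(self, x):
        x = self.matmul(x, self.weight)
        x = self.relu(x)
        return self.reduce_mean(x, -1)
```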
Third, some strategies are deliberately chosen by the user based on the model and the hardware. For ResNet-like models, different parts of the model prefer different parallel modes: the first half performs best with data parallelism and the second half with model parallelism. For Llama-like large models, when vocab_size is very large, model-parallel slicing may be chosen for memory reasons; when sequence_length is very large, sequence parallelism may also be chosen. Sharding Propagation is a plain algorithm that searches for the minimal redistribution cost; it does not discover such carefully tuned strategies automatically, so operator strategies deliberately tuned by the user must be configured explicitly. In the example below, the first MatMul is configured with a data-parallel strategy, which propagates the data-parallel strategy forward through the first half of the model, while the second MatMul is configured with a model-parallel strategy, which propagates the model-parallel strategy backward through the second half of the model.
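A minimal sketch of this pattern, assuming an 8-device setup and an illustrative two-stage network (the names and shapes are not from any particular model):

```python
from mindspore import nn, ops, Parameter
from mindspore.common.initializer import initializer

class TwoStageNet(nn.Cell):
    """First half anchored to data parallelism, second half to model parallelism."""
    def __init__(self, hidden_size):
        super().__init__()
        self.w1 = Parameter(initializer('normal', [hidden_size, hidden_size]), name='w1')
        self.w2 = Parameter(initializer('normal', [hidden_size, hidden_size]), name='w2')
        # Data parallel: split the activation's batch dimension across the 8 devices.
        self.matmul1 = ops.MatMul().shard(((8, 1), (1, 1)))
        self.relu1 = ops.ReLU()  # unconfigured: obtains its strategy by propagation
        # Model parallel: split the weight's output dimension across the 8 devices.
        self.matmul2 = ops.MatMul().shard(((1, 1), (1, 8)))
        self.relu2 = ops.ReLU()  # unconfigured: obtains its strategy by propagation

    def construct(self, x):
        x = self.relu1(self.matmul1(x, self.w1))
        return self.relu2(self.matmul2(x, self.w2))
```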
Fourth, large fusion operators such as FlashAttentionScore and rms_norm also require the user to configure strategies manually. The input and output logic of a fusion operator is relatively complex, and the strategy obtained by propagation (which only avoids redistribution) is not necessarily the one the user expects. These operators therefore require explicit configuration of their operator-level strategy.
Users working with strategy propagation need some understanding not only of the propagation algorithm itself, but also of the parallelization of the model to be trained. If the parallel strategy determined by the propagation algorithm for some operator does not meet the user's expectations, this can always be resolved by configuring an additional operator-level strategy. In practice, for a new model, it usually takes several attempts to obtain an overall parallel configuration with good performance.
Taking the encapsulated class RowParallelLinear in MindFormers as an example:
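A simplified stand-in that captures the sharding pattern (not the actual MindFormers source; the class name, shapes, and dp/mp values are assumptions): the weight-consuming MatMul splits its reduction dimension along the model-parallel axis, while the bias add is left to propagation.

```python
from mindspore import nn, ops, Parameter
from mindspore.common.initializer import initializer

dp, mp = 2, 4  # assumed parallel factors

class RowParallelLinearSketch(nn.Cell):
    """Row-parallel linear layer: the weight's input (row) dimension is split by mp."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.weight = Parameter(initializer('normal', [output_size, input_size]), name='weight')
        self.bias = Parameter(initializer('zeros', [output_size]), name='bias')
        # The activation and the weight are both split along the reduction dimension (mp);
        # the framework inserts the AllReduce required to complete the matrix product.
        self.matmul = ops.MatMul(transpose_b=True).shard(((dp, mp), (1, mp)))
        # The bias add is left unconfigured; propagation derives a consistent strategy.
        self.add = ops.Add()

    def construct(self, x):
        out = self.matmul(x, self.weight)
        return self.add(out, self.bias)
```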
Another example is CoreAttention, which is configured in a similar way:
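A simplified stand-in for CoreAttention (not the MindFormers source; the dp/mp values are assumptions): the dimension-changing BatchMatMuls carry explicit strategies that split the batch dimension by dp and the head dimension by mp, while Softmax and the scaling are left to propagation.

```python
from mindspore import nn, ops

dp, mp = 2, 4  # assumed parallel factors

class CoreAttentionSketch(nn.Cell):
    """Attention scores and context: the BatchMatMuls carry explicit strategies."""
    def __init__(self, scale):
        super().__init__()
        self.scale = scale
        # q @ k^T with inputs laid out as (B, N, S, D): batch split by dp, heads by mp.
        self.score_matmul = ops.BatchMatMul(transpose_b=True).shard(
            ((dp, mp, 1, 1), (dp, mp, 1, 1)))
        # scores @ v: same batch/head sharding.
        self.context_matmul = ops.BatchMatMul().shard(
            ((dp, mp, 1, 1), (dp, mp, 1, 1)))
        # Softmax and the scaling Mul are left to propagation.
        self.softmax = ops.Softmax(axis=-1)
        self.mul = ops.Mul()

    def construct(self, query, key, value):
        scores = self.mul(self.score_matmul(query, key), self.scale)
        probs = self.softmax(scores)
        return self.context_matmul(probs, value)
```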
Next, consider the FlashAttention example:
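A sketch of configuring the FlashAttentionScore fusion operator directly; head_num, the input layout, and the dp/mp values are illustrative, and the exact strategy tuple must match the inputs that are actually fed to the operator.

```python
from mindspore.ops.operations.nn_ops import FlashAttentionScore

dp, mp = 2, 4  # assumed parallel factors

# head_num and input_layout are illustrative; the number and order of strategy
# entries must match the inputs actually passed to the operator.
flash_attention = FlashAttentionScore(head_num=16, input_layout="BNSD")
flash_attention.shard((
    (dp, mp, 1, 1),  # query (B, N, S, D): batch split by dp, heads split by mp
    (dp, mp, 1, 1),  # key
    (dp, mp, 1, 1),  # value
    (dp, 1, 1, 1),   # attn_mask: only the batch dimension is split
))
```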
If the open-source classes that already come with configured strategies in MindFormers are used directly, the outer network does not need to configure shard strategies for their operators again, e.g., LlamaForCausalLM.
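For instance, a minimal usage sketch with illustrative configuration values; no additional shard() calls are needed on the outer network.

```python
from mindformers import LlamaConfig, LlamaForCausalLM

# Illustrative configuration values. The operators inside LlamaForCausalLM already
# carry shard strategies configured by MindFormers, so no shard() calls are needed here.
config = LlamaConfig(num_layers=2, hidden_size=512, num_heads=8, seq_length=1024)
model = LlamaForCausalLM(config)
```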
When the user is unsure whether an operator needs an explicit strategy, it can be left unconfigured and the algorithm will propagate a strategy for it, although the result may not be the best possible parallelization. When the user knows which strategy an operator should have, configuring it explicitly helps the algorithm obtain the desired result.