The following error is raised during the model's dataset preprocessing stage:
Traceback (most recent call last):
File "pretrain_gpt.py", line 121, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'}
File "/home/ma-user/modelarts/user-job-dir/GPT-3-kernel_ID2728_for_PyTorch_zgcl/megatron/training.py", line 150, in pretrain
process_non_loss_data_func)
File "/home/ma-user/modelarts/user-job-dir/GPT-3-kernel_ID2728_for_PyTorch_zgcl/megatron/training.py", line 689, in train
opt_param_scheduler)
File "/home/ma-user/modelarts/user-job-dir/GPT-3-kernel_ID2728_for_PyTorch_zgcl/megatron/training.py", line 417, in train_step
optimizer, fwd_bwd_timers, forward_only=False)
File "/home/ma-user/modelarts/user-job-dir/GPT-3-kernel_ID2728_for_PyTorch_zgcl/megatron/schedules.py", line 654, in forward_backward_pipelining_without_interleaving
timers, collect_non_loss_data)
File "/home/ma-user/modelarts/user-job-dir/GPT-3-kernel_ID2728_for_PyTorch_zgcl/megatron/schedules.py", line 118, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "pretrain_gpt.py", line 84, in forward_step
data_iterator)
File "pretrain_gpt.py", line 45, in get_batch
data = next(data_iterator)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
return self.collate_fn(data)
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 157, in default_collate
return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 157, in <dictcomp>
return elem_type({key: default_collate([d[key] for d in batch]) for key in elem})
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 146, in default_collate
return default_collate([torch.as_tensor(b) for b in batch])
File "/home/ma-user/anaconda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 138, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [8193] at entry 0 and [8246] at entry 1
In megatron/core/datasets/helpers.cpp, the build_sample_idx() function allocates sample_idx as an int32 array to record each sample's index, while each sample's index is computed from doc_idx_index, an int64 variable. The assignment sample_idx[2 * sample_index] = doc_idx_index; can therefore overflow.
When the documents in the dataset are short and the product of training steps * global batch size * sequence length is large, doc_idx_index exceeds the range of int32, the stored index overflows, and the resulting corrupted sample boundaries ultimately surface as the tensor-size mismatch seen in the traceback above.
This is an upstream Megatron-LM issue, and because the affected code is C++, it is difficult to work around via monkey patching. Fix PRs have been submitted upstream multiple times, but Megatron-LM appears fairly closed: the relevant code seems unmaintained, and community contributions are not accepted.