1. Problem description (with error log context):
As shown in the screenshot, the following error is raised during training: IndexError: The shape of the mask [4097] at index 0 does not match the shape of the indexed tensor [4096] at index 0
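The mismatch can be reproduced in miniature: a boolean mask that is one element longer than the tensor it indexes raises exactly this kind of IndexError. NumPy is used below purely for illustration (the real error comes from PyTorch tensor indexing inside MindSpeed-LLM, and the exact message differs); note that 4097 is seq_length + 1 for the seq_length of 4096 shown in the log, which is a hypothesis about where the extra element comes from, not something the log confirms:

```python
import numpy as np

# Indexed tensor of length 4096 (matches seq_length in the log below).
tensor = np.zeros(4096)
# Boolean mask of length 4097, i.e. seq_length + 1 (one element too long).
mask = np.ones(4097, dtype=bool)

try:
    _ = tensor[mask]  # boolean-mask indexing with mismatched lengths
    raised = False
except IndexError as e:
    # NumPy raises IndexError here, analogous to the PyTorch error above.
    raised = True
    print(type(e).__name__)
```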
2. Software versions:
-- CANN version: 8.0.RC3.alpha003
-- PyTorch version: 2.1.0
-- Python version: 3.8
-- OS version: 2.0 (SP10)
3. Steps to reproduce:
ckpt_convert_deepseek3_hf2mcore.sh: skipped for now, since I only wanted to try pretraining and the model is very large.
data_convert_deepseek3_pretrain.sh: the dataset was converted with this script. The tokenizer was downloaded from https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main, and the dataset is the first file under https://huggingface.co/datasets/lsb/enwiki20230101/tree/main/data.
pretrain_deepseek3_671b_4k_ptd.sh: this is the pretraining script I ran. The parallel strategy is as follows:
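As a sanity check of the parallel strategy, the world size reported in the log below should equal the product of the parallel dimensions Megatron prints (a simple arithmetic sketch, assuming the values from the log):

```python
# Values as reported in the log: tensor-model-parallel 8,
# pipeline-model-parallel 1, data-parallel 1, context-parallel 1.
tp, pp, dp, cp = 8, 1, 1, 1

# Megatron requires world_size == TP * PP * DP * CP.
world_size = tp * pp * dp * cp
print(world_size)
```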
4. Log output:
The full log is as follows:
using world size: 8, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 8, pipeline-model-parallel size: 1
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
When context_parallel is not activated, kv_head_repeat_before_uly_alltoall would be set to False for reducing memory usage.
[INFO] Setting args.create_attention_mask_in_dataloader to False since reset_data=False or alibi_without_flash_attn=False or args.tokenizer_padding_side=right
------------------------ MindSpeed-LLM Arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adaptive_cp_dynamic_attn_mask ................... False
adaptive_cp_manually_set_mask_list .............. False
adaptive_cp_only_reschedule ..................... False
adaptive_cp_without_coarse ...................... False
adaptive_recompute_device_size .................. -1
adaptive_recompute_device_swap .................. False
adaptive_recompute_profiling_step ............... 10
add_bias_linear ................................. False
add_dense_bias .................................. False
add_eos_token ................................... []
add_output_layer_bias ........................... False
add_position_embedding .......................... True
add_qkv_bias .................................... False
add_rmsnorm_offset .............................. False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
ampipe_degree ................................... 0
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.0
attention_mask_on_cpu ........................... False
attention_mask_type ............................. causal
attention_softmax_in_fp32 ....................... True
attn_logit_softcapping .......................... None
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
calculate_per_token_loss ........................ False
check_for_nan_in_loss_and_grad .................. True
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clip_ratio ...................................... 0.2
cliprange_value ................................. 0.2
clone_scatter_output_in_embedding ............... True
coc_fused_kernel ................................ False
coc_mode ........................................ -1
coc_parallel_num ................................ 1
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_algo ........................... ulysses_cp_algo
context_parallel_size ........................... 1
cp_attention_mask_type .......................... causal
cp_window_size .................................. 1
create_attention_mask_in_dataloader ............. False
critic_mini_batch_size .......................... 1
critic_update_epochs ............................ 1
cross_entropy_loss_fusion ....................... False
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... ['./dataset/enwiki_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... False
ddp_bucket_size ................................. None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
defer_embedding_wgrad_compute ................... False
delay_grad_reduce ............................... True
delay_param_gather .............................. False
deterministic_mode .............................. False
dim_model_base .................................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_gloo_group .............................. None
disable_straggler_on_startup .................... False
dist_ckpt_format ................................ torch_dist
dist_ckpt_strictness ............................ assume_ok_unexpected
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 45
do_sample ....................................... False
dpo_beta ........................................ 0.1
dpo_label_smoothing ............................. 0.0
dpo_loss_type ................................... sigmoid
dynamic_factor .................................. 1.0
embed_layernorm ................................. False
embedding_multiplier_scale ...................... 1.0
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_backward_overlap_ag_with_matmul .......... False
enable_hbmfault_repair .......................... False
enable_high_availability ........................ False
enable_one_logger ............................... True
enable_optimizer_state_local_copy ............... False
enable_overlap_ag_with_matmul ................... False
enable_overlap_matmul_with_rs ................... False
enable_recompute_layers_per_pp_rank ............. False
enable_token_rearrange_opt ...................... False
encoder_num_layers .............................. 2
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.01
entropy_coeff ................................... 0.001
eod_mask_loss ................................... False
eval_interval ................................... 2000
eval_iters ...................................... 0
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_interval ................................. 1
expert_model_parallel_size ...................... 1
ffn_hidden_size ................................. 18432
fill_neg_inf .................................... False
finetune ........................................ False
first_k_dense_replace ........................... 1
fix_router ...................................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
full_shuffle_instruction_dataset ................ False
gamma ........................................... 1.0
gamma_beta_ratio ................................ 1.4
geglu ........................................... False
gelu_tanh ....................................... False
gemm_gradient_accumulation_fusion ............... True
global_batch_size ............................... 768
gradient_accumulation_fusion .................... False
group_query_attention ........................... False
hccl_group_buffer ............................... None
head_lr_mult .................................... 1.0
hf_chat_template ................................ False
hidden_dropout .................................. 0.0
hidden_size ..................................... 7168
high_freq_factor ................................ None
history_turns ................................... 3
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
inference_tensor_model_parallel_size ............ 1
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 65536.0
input_embeds_norm ............................... False
input_jitter .................................... True
input_layernorm_in_fp32 ......................... False
interleave_sliding_window ....................... None
is_instruction_dataset .......................... False
is_pairwise_dataset ............................. False
iter_per_epoch .................................. 1250
jit_compile ..................................... False
kl_coef ......................................... 0.3
kv_channels ..................................... 56
kv_head_repeat_before_uly_alltoall .............. False
kv_lora_rank .................................... 128
lam ............................................. 0.95
lazy_mpu_init ................................... None
load ............................................ None
load_checkpoint_loosely ......................... False
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
long_factor ..................................... None
long_mscale ..................................... None
longrope_freqs_type ............................. mul
lora_alpha ...................................... 32
lora_fusion ..................................... False
lora_load ....................................... None
lora_modules_to_save ............................ None
lora_r .......................................... 16
lora_register_forward_hook ...................... ['word_embeddings', 'input_layernorm']
lora_target_modules ............................. []
loss_scale ...................................... None
loss_scale_window ............................... 1000
low_freq_factor ................................. None
lr .............................................. 1e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 500
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
make_vocab_size_divisible_by .................... 1
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_length ...................................... 256
max_new_tokens .................................. 128
max_position_embeddings ......................... 163840
max_prompt_length ............................... 512
max_tokens_to_oom ............................... 12000
md5_validate .................................... False
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-07
missing_eos_penalty ............................. 0.0
mmap_bin_files .................................. True
mock_data ....................................... 0
moe_allgather_overlap_comm ...................... False
moe_alltoall_overlap_comm ....................... False
moe_aux_loss_coeff .............................. 0.0
moe_comm_aux_loss_coeff ......................... 0.0
moe_device_level_aux_loss_coeff ................. 0.0
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_grouped_gemm ................................ True
moe_input_jitter_eps ............................ None
moe_intermediate_size ........................... 2048
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_per_layer_logging ........................... False
moe_permutation_async_comm ...................... True
moe_router_bias_update_rate ..................... 0.001
moe_router_enable_expert_bias ................... True
moe_router_load_balancing_type .................. noaux_tc
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... sigmoid
moe_router_topk ................................. 8
moe_token_dispatcher_type ....................... alltoall
moe_token_drop_policy ........................... probs
moe_tp_extend_ep ................................ False
moe_train_capacity_factor ....................... 1.0
moe_without_activation .......................... False
moe_z_loss_coeff ................................ 0.0
moe_zero_memory ................................. disable
moe_zero_memory_num_layers ...................... None
mtp_loss_scale .................................. 0.3
multi_head_latent_attention ..................... True
n_samples_per_prompt ............................ 1
n_shared_experts ................................ 1
nccl_communicator_config_path ................... None
nd1_dim1_size ................................... 1
nd2_dim1_size ................................... 1
next_tockens .................................... 0
no_cut_token .................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_post_layer_norm .............................. False
no_save_optim ................................... True
no_save_rng ..................................... True
no_shared_storage ............................... True
no_shuffle ...................................... False
noisy_gate_policy ............................... None
noop_layers ..................................... None
norm_epsilon .................................... 1e-06
norm_topk_prob .................................. True
normalization ................................... RMSNorm
num_attention_heads ............................. 128
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_experts ..................................... 256
num_gpus_for_infer .............................. None
num_gpus_for_train .............................. None
num_layer_list .................................. None
num_layers ...................................... 2
num_layers_per_virtual_pipeline_stage ........... None
num_nextn_predict_layers ........................ 1
num_query_groups ................................ 1
num_samples_per_step ............................ 1
num_workers ..................................... 2
o2_gradient ..................................... False
o2_optimizer .................................... False
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
original_max_position_embeddings ................ None
output_bert_embeddings .......................... False
output_layer_slice_num .......................... 1
output_logit_softcapping ........................ None
output_multiplier_scale ......................... None
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
pad_to_multiple_of .............................. 8
padded_base_length .............................. 128
padded_vocab_size ............................... 129280
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
placeholder_token ............................... ки
position_embedding_type ......................... rope
post_norm ....................................... False
ppo_epochs ...................................... 1
ppo_mini_batch_size ............................. 1
pre_tockens ..................................... 65536
pref_ftx ........................................ 0.0
pretrained_checkpoint ........................... None
profile ......................................... False
profile_data_simplification ..................... False
profile_export_type ............................. text
profile_level ................................... level0
profile_ranks ................................... [-1]
profile_record_shapes ........................... False
profile_save_path ............................... ./profile_dir
profile_step_end ................................ 12
profile_step_start .............................. 10
profile_with_cpu ................................ False
profile_with_memory ............................. False
profile_with_stack .............................. False
prompt_type ..................................... None
prompt_type_path ................................ /root/kimchou/MindSpeed-LLM/configs/finetune/templates.json
q_lora_rank ..................................... 128
qk_layernorm .................................... True
qk_nope_head_dim ................................ 128
qk_rope_head_dim ................................ 64
qlora ........................................... False
qlora_save_dequantize ........................... False
query_in_block_prob ............................. 0.1
query_pre_attn_scalar ........................... None
rampup_batch_size ............................... None
rank ............................................ 0
recompute_activation_function ................... False
recompute_activation_function_num_layers ........ None
recompute_granularity ........................... full
recompute_in_advance ............................ False
recompute_in_bubble ............................. False
recompute_method ................................ uniform
recompute_mtp_norm .............................. True
recompute_num_layers ............................ 1
reduce_recompute_for_last_chunk ................. False
ref_model ....................................... None
refer_model_iter ................................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_fp32_param ................................ True
reward_model .................................... False
reward_tokens ................................... []
rope_scaling_beta_fast .......................... 32
rope_scaling_beta_slow .......................... 1
rope_scaling_factor ............................. 40.0
rope_scaling_mscale ............................. 1.0
rope_scaling_mscale_all_dim ..................... 1.0
rope_scaling_original_max_position_embeddings ... 4096
rope_scaling_type ............................... yarn
rotary_base ..................................... 10000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
routed_scaling_factor ........................... 2.5
router_gating_in_fp32 ........................... False
s3_cache_path ................................... None
sample_rate ..................................... 1.0
save ............................................ ./model_weights/deepseek3-mcore
save_interval ................................... 2000
scale_depth ..................................... None
scale_emb ....................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_aux ......................................... False
seq_length ...................................... 4096
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
shape_order ..................................... BNSD
share_mtp_embedding_and_output_weight ........... True
shared_expert_gate .............................. False
shared_expert_gate_output_dimension ............. 1
short_factor .................................... None
short_mscale .................................... None
short_seq_prob .................................. 0.1
shuffle_minibatch ............................... False
simpo_beta ...................................... 2.5
simpo_label_smoothing ........................... 0.0
simpo_loss_type ................................. sigmoid
skip_bias_add ................................... True
skip_train ...................................... False
sliding_window .................................. None
sparse_mode ..................................... 0
spec ............................................ ['mindspeed_llm.tasks.models.spec.deepseek_spec', 'layer_spec']
split ........................................... 1,1,1
square_alibi_mask ............................... False
squared_relu .................................... False
stage ........................................... None
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
swap_attention .................................. False
swap_modules .................................... None
swiglu .......................................... True
swin_backbone_type .............................. tiny
task ............................................ None
temperature ..................................... 0.7
tensor_model_parallel_size ...................... 8
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_kwargs ................................ None
tokenizer_model ................................. None
tokenizer_name_or_path .......................... ./model_from_hf/deepseek3-hf
tokenizer_not_use_fast .......................... True
tokenizer_padding_side .......................... right
tokenizer_type .................................. PretrainedFromHF
top_k ........................................... 50
top_p ........................................... 0.95
topk_group ...................................... 4
tp_2d ........................................... False
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
tp_x ............................................ 1
tp_y ............................................ 1
train_data_path ................................. None
train_iters ..................................... 2000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
ulysses_degree_in_cp ............................ None
untie_embeddings_and_output_weights ............. True
use_ascend_coc .................................. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cp_send_recv_overlap ........................ False
use_cpu_initialization .......................... None
use_deter_comp .................................. False
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_mlp ................................... False
use_fused_moe_token_permute_and_unpermute ....... True
use_fused_ring_attention_update ................. False
use_fused_rmsnorm ............................... True
use_fused_rotary_pos_emb ........................ True
use_fused_rotary_pos_emb_new .................... False
use_fused_swiglu ................................ True
use_glm_rope .................................... False
use_kv_cache .................................... False
use_legacy_models ............................... False
use_mc2 ......................................... False
use_mcore_models ................................ True
use_nanopipe .................................... False
use_nd_matmul ................................... False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. True
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 8
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... 129280
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of MindSpeed-LLM Arguments ---------------------
INFO:megatron.core.num_microbatches_calculator:setting number of micro-batches to constant 768
building PretrainFromHF tokenizer. Vocab file is un-used, loading tokenizer from pre-trained model
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
initializing torch distributed ...
all tp groups [[0, 1, 2, 3, 4, 5, 6, 7]]
all ep groups [[0], [1], [2], [3], [4], [5], [6], [7]]
all dp groups [[0], [1], [2], [3], [4], [5], [6], [7]]
all_dp_modulo_exp_group_ranks [[0], [1], [2], [3], [4], [5], [6], [7]]
all_tensor_and_expert_group_ranks [[0, 1, 2, 3, 4, 5, 6, 7]]
all_data_parallel_group_ranks_with_cp [[0], [1], [2], [3], [4], [5], [6], [7]]
initialized tensor model parallel with size 8
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/root/kimchou/MindSpeed-LLM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/root/kimchou/MindSpeed-LLM/megatron/core/datasets'
done with dataset index builder. Compilation time: 0.125 seconds
time to initialize megatron (seconds): -36.104
[after megatron is initialized] datetime: 2025-02-18 20:36:43
building GPT model ...
number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (3, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (7, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (4, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (5, 0): 3133012736
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=False, use_distributed_optimizer=True, check_for_nan_in_grad=True, bucket_size=None, average_in_collective=False)
number of parameters on (tensor, pipeline) model parallel rank (2, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (6, 0): 3133012736
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
Params for bucket 1 (3133012736 elements):
module.mtp_layers.0.transformer_layer.mlp.shared_experts.linear_fc2.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_kvb.weight
module.mtp_layers.0.transformer_layer.input_layernorm.weight
module.decoder.layers.1.mlp.experts.weight2
module.decoder.layers.1.mlp.router.weight
module.decoder.layers.0.mlp.linear_fc1.weight
module.mtp_layers.0.transformer_layer.mlp.experts.weight2
module.mtp_layers.0.transformer_layer.self_attention.k_layernorm.weight
module.mtp_layers.0.transformer_layer.self_attention.q_layernorm.weight
module.decoder.final_layernorm.weight
module.decoder.layers.1.pre_mlp_layernorm.weight
module.decoder.layers.0.self_attention.linear_qkv.weight
module.embedding.word_embeddings.weight
module.mtp_layers.0.final_layernorm.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_qb.weight
module.mtp_layers.0.hnorm.weight
module.decoder.layers.1.self_attention.linear_qkv.weight
module.decoder.layers.0.input_layernorm.weight
module.mtp_layers.0.transformer_layer.pre_mlp_layernorm.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_proj.weight
module.decoder.layers.1.mlp.experts.weight1
module.decoder.layers.1.self_attention.linear_qb.weight
module.decoder.layers.0.pre_mlp_layernorm.weight
module.decoder.layers.0.self_attention.linear_proj.weight
module.mtp_layers.0.eh_proj.weight
module.output_layer.weight
module.decoder.layers.1.self_attention.k_layernorm.weight
module.decoder.layers.0.self_attention.k_layernorm.weight
module.decoder.layers.0.self_attention.q_layernorm.weight
module.mtp_layers.0.transformer_layer.mlp.shared_experts.linear_fc1.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_qkv.weight
module.mtp_layers.0.enorm.weight
module.decoder.layers.1.self_attention.linear_proj.weight
module.decoder.layers.1.input_layernorm.weight
module.decoder.layers.0.self_attention.linear_qb.weight
module.mtp_layers.0.transformer_layer.mlp.experts.weight1
module.mtp_layers.0.transformer_layer.mlp.router.weight
module.decoder.layers.1.self_attention.linear_kvb.weight
module.decoder.layers.1.self_attention.q_layernorm.weight
module.decoder.layers.0.self_attention.linear_kvb.weight
module.decoder.layers.1.mlp.shared_experts.linear_fc2.weight
module.decoder.layers.1.mlp.shared_experts.linear_fc1.weight
module.decoder.layers.0.mlp.linear_fc2.weight
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=1e-05, min_lr=1e-07, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.01, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=65536.0, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0xffff74758280>)
ninja: no work to do.
learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-02-18 20:36:48
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 1536000
validation: 0
test: 0
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.3333333333333333), (0.3333333333333333, 0.6666666666666666), (0.6666666666666666, 1.0)]
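For context on the `split_matrix` line above: it is the cumulative form of the `--split 1,1,1` setting, i.e. the three weights normalized into train/valid/test interval boundaries over [0, 1]. The sketch below is an illustrative re-implementation of that normalization, not the actual Megatron-LM source (which lives in `megatron.core.datasets.blended_megatron_dataset_config`):

```python
def split_to_matrix(split: str):
    """Turn a comma-separated weight string (e.g. '1,1,1') into
    cumulative (start, end) interval boundaries over [0, 1]."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    matrix, acc = [], 0.0
    for w in weights:
        start = acc
        acc += w / total
        matrix.append((start, acc))
    return matrix

# '1,1,1' yields three equal thirds, matching the logged split_matrix
print(split_to_matrix("1,1,1"))
```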
building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(1536000, 0, 0), and config=GPTDatasetConfig(random_seed=1234, sequence_length=4096, blend=(['./dataset/enwiki_text_document'], None), blend_per_split=[None, None, None], split='1,1,1', split_matrix=[(0, 0.3333333333333333), (0.3333333333333333, 0.6666666666666666), (0.6666666666666666, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<mindspeed_llm.training.tokenizer.tokenizer._AutoTokenizer object at 0xffff861901f0>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=False, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, s3_cache_path=None)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ./dataset/enwiki_text_document.idx
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset: Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 156994
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 156994
INFO:mindspeed_llm.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the document index from 1d6853d8f310f93961152c3757e6b739-GPTDataset-document_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the sample index from 1d6853d8f310f93961152c3757e6b739-GPTDataset-sample_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the shuffle index from 1d6853d8f310f93961152c3757e6b739-GPTDataset-shuffle_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset:> total number of samples: 1541528
INFO:mindspeed_llm.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the document index from a3926395be06957d42fec173e9eb7235-GPTDataset-document_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the sample index from a3926395be06957d42fec173e9eb7235-GPTDataset-sample_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the shuffle index from a3926395be06957d42fec173e9eb7235-GPTDataset-shuffle_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset:> total number of samples: 10081
INFO:mindspeed_llm.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the document index from e08b14f953d1197a7e9dda13bed270ad-GPTDataset-document_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the sample index from e08b14f953d1197a7e9dda13bed270ad-GPTDataset-sample_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the shuffle index from e08b14f953d1197a7e9dda13bed270ad-GPTDataset-shuffle_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset:> total number of samples: 10220
finished creating GPT datasets ...
Calling _query_document_sample_shuffle_indices with idx: 0
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 2
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 1
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 3
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 0
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 2
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 1
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 3
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 0
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
[after dataloaders are built] datetime: 2025-02-18 20:36:48
done with setup ...
Calling _query_document_sample_shuffle_indices with idx: 2
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 1
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 3
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (4432.19, 4527.12)
train/valid/test-data-iterators-setup ..........: (78.29, 349.61)
training ...
[before the start of training step] datetime: 2025-02-18 20:36:49
Calling _query_document_sample_shuffle_indices with idx: 4
Labels shape on device 0 before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
[ERROR] 2025-02-18-20:36:51 (PID:3160063, Device:0, RankID:0) ERR99999 UNKNOWN application exception
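The failing pattern is visible in the shape prints above: labels still has 4097 elements while loss_mask has 4096, so any boolean indexing of the [4096] tensor with a mask derived from the [4097] labels raises exactly the reported error. A standalone reproduction (illustrative only, not the actual MindSpeed-LLM code path):

```python
import torch

seq_len = 4096
labels = torch.zeros(seq_len + 1, dtype=torch.long)  # [4097], as in the log
loss_mask = torch.ones(seq_len)                      # [4096], as in the log

try:
    # Boolean mask built from labels ([4097]) used to index loss_mask ([4096])
    _ = loss_mask[labels != 0]
except IndexError as e:
    # IndexError: The shape of the mask [4097] at index 0 does not match
    # the shape of the indexed tensor [4096] at index 0
    print(e)
```

This suggests the extra token appended per sequence (`add_extra_token_to_sequence=True` in the GPTDatasetConfig above) is being stripped from one of the two tensors but not the other before the loss-mask indexing.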