
Ascend/MindSpeed-LLM

[deepseekV3] Shape mismatch between the mask and the indexed tensor during pretraining

WIP
Training issue
Created on 2025-02-18 21:28

1. Problem description (with error log context):
Error log:
During training, the following error is raised:
IndexError: The shape of the mask [4097] at index 0 does not match the shape of the indexed tensor [4096] at index 0
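For reference, this error class reproduces in isolation. Below is a minimal sketch (my own construction, not the framework's code) of the same mismatch, using the label and loss-mask shapes printed by the debug output in section 4:

```python
import torch

# Shapes as printed by the debug output in section 4 below:
labels = torch.zeros(4097, dtype=torch.long)  # labels are one element too long
loss_mask = torch.ones(4096)                  # loss mask is built for seq_length=4096

# Indexing the 4096-element loss_mask with a boolean mask derived from the
# 4097-element labels raises exactly the reported error:
loss_mask[labels == 0] = 0.0
# IndexError: The shape of the mask [4097] at index 0 does not match the shape
# of the indexed tensor [4096] at index 0
```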
2. Software versions:
-- CANN version: 8.0.RC3.alpha003
-- PyTorch version: 2.1.0
-- Python version: 3.8
-- OS version: 2.0 (SP10)

3. Test steps:
ckpt_convert_deepseek3_hf2mcore.sh: skipped for now, since I only wanted to try pretraining and the model is very large.
data_convert_deepseek3_pretrain.sh: the dataset was prepared with this script; the tokenizer was downloaded from https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main, and the dataset is the first file at https://huggingface.co/datasets/lsb/enwiki20230101/tree/main/data.
pretrain_deepseek3_671b_4k_ptd.sh: pretraining was launched with this script. The parallel strategy is as follows (see also the command sketch after the screenshot):
[Screenshot: parallel strategy configuration]
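For completeness, the invocation order was roughly the following. The script paths are assumptions from my local checkout and may differ across repository versions:

```bash
# 1. Weight conversion: skipped, since this run only tests pretraining
# bash examples/mcore/deepseek3/ckpt_convert_deepseek3_hf2mcore.sh

# 2. Dataset preprocessing with the DeepSeek-V3 tokenizer
bash examples/mcore/deepseek3/data_convert_deepseek3_pretrain.sh

# 3. Pretraining launch (TP=8, PP=1, EP=1, world size 8; see the log below)
bash examples/mcore/deepseek3/pretrain_deepseek3_671b_4k_ptd.sh
```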

4. Log output:
The full log is as follows:
using world size: 8, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 8, pipeline-model-parallel size: 1
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
When context_parallel is not activated, kv_head_repeat_before_uly_alltoall would be set to False for reducing memory usage.
[INFO] Setting args.create_attention_mask_in_dataloader to False since reset_data=False or alibi_without_flash_attn=False or args.tokenizer_padding_side=right
------------------------ MindSpeed-LLM Arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adaptive_cp_dynamic_attn_mask ................... False
adaptive_cp_manually_set_mask_list .............. False
adaptive_cp_only_reschedule ..................... False
adaptive_cp_without_coarse ...................... False
adaptive_recompute_device_size .................. -1
adaptive_recompute_device_swap .................. False
adaptive_recompute_profiling_step ............... 10
add_bias_linear ................................. False
add_dense_bias .................................. False
add_eos_token ................................... []
add_output_layer_bias ........................... False
add_position_embedding .......................... True
add_qkv_bias .................................... False
add_rmsnorm_offset .............................. False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
ampipe_degree ................................... 0
app_tag_run_name ................................ None
app_tag_run_version ............................. 0.0.0
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_save ...................................... None
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.0
attention_mask_on_cpu ........................... False
attention_mask_type ............................. causal
attention_softmax_in_fp32 ....................... True
attn_logit_softcapping .......................... None
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
calculate_per_token_loss ........................ False
check_for_nan_in_loss_and_grad .................. True
check_weight_hash_across_dp_replicas_interval ... None
ckpt_assume_constant_structure .................. False
ckpt_fully_parallel_load ........................ False
ckpt_fully_parallel_save ........................ True
ckpt_fully_parallel_save_deprecated ............. False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clip_ratio ...................................... 0.2
cliprange_value ................................. 0.2
clone_scatter_output_in_embedding ............... True
coc_fused_kernel ................................ False
coc_mode ........................................ -1
coc_parallel_num ................................ 1
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_algo ........................... ulysses_cp_algo
context_parallel_size ........................... 1
cp_attention_mask_type .......................... causal
cp_window_size .................................. 1
create_attention_mask_in_dataloader ............. False
critic_mini_batch_size .......................... 1
critic_update_epochs ............................ 1
cross_entropy_loss_fusion ....................... False
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... ['./dataset/enwiki_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
ddp_average_in_collective ....................... False
ddp_bucket_size ................................. None
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
defer_embedding_wgrad_compute ................... False
delay_grad_reduce ............................... True
delay_param_gather .............................. False
deterministic_mode .............................. False
dim_model_base .................................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
disable_gloo_group .............................. None
disable_straggler_on_startup .................... False
dist_ckpt_format ................................ torch_dist
dist_ckpt_strictness ............................ assume_ok_unexpected
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 45
do_sample ....................................... False
dpo_beta ........................................ 0.1
dpo_label_smoothing ............................. 0.0
dpo_loss_type ................................... sigmoid
dynamic_factor .................................. 1.0
embed_layernorm ................................. False
embedding_multiplier_scale ...................... 1.0
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_backward_overlap_ag_with_matmul .......... False
enable_hbmfault_repair .......................... False
enable_high_availability ........................ False
enable_one_logger ............................... True
enable_optimizer_state_local_copy ............... False
enable_overlap_ag_with_matmul ................... False
enable_overlap_matmul_with_rs ................... False
enable_recompute_layers_per_pp_rank ............. False
enable_token_rearrange_opt ...................... False
encoder_num_layers .............................. 2
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.01
entropy_coeff ................................... 0.001
eod_mask_loss ................................... False
eval_interval ................................... 2000
eval_iters ...................................... 0
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_interval ................................. 1
expert_model_parallel_size ...................... 1
ffn_hidden_size ................................. 18432
fill_neg_inf .................................... False
finetune ........................................ False
first_k_dense_replace ........................... 1
fix_router ...................................... False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
full_shuffle_instruction_dataset ................ False
gamma ........................................... 1.0
gamma_beta_ratio ................................ 1.4
geglu ........................................... False
gelu_tanh ....................................... False
gemm_gradient_accumulation_fusion ............... True
global_batch_size ............................... 768
gradient_accumulation_fusion .................... False
group_query_attention ........................... False
hccl_group_buffer ............................... None
head_lr_mult .................................... 1.0
hf_chat_template ................................ False
hidden_dropout .................................. 0.0
hidden_size ..................................... 7168
high_freq_factor ................................ None
history_turns ................................... 3
hybrid_attention_ratio .......................... 0.0
hybrid_mlp_ratio ................................ 0.0
hybrid_override_pattern ......................... None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
inference_tensor_model_parallel_size ............ 1
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 65536.0
input_embeds_norm ............................... False
input_jitter .................................... True
input_layernorm_in_fp32 ......................... False
interleave_sliding_window ....................... None
is_instruction_dataset .......................... False
is_pairwise_dataset ............................. False
iter_per_epoch .................................. 1250
jit_compile ..................................... False
kl_coef ......................................... 0.3
kv_channels ..................................... 56
kv_head_repeat_before_uly_alltoall .............. False
kv_lora_rank .................................... 128
lam ............................................. 0.95
lazy_mpu_init ................................... None
load ............................................ None
load_checkpoint_loosely ......................... False
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... False
log_straggler ................................... False
log_throughput .................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
logging_level ................................... None
long_factor ..................................... None
long_mscale ..................................... None
longrope_freqs_type ............................. mul
lora_alpha ...................................... 32
lora_fusion ..................................... False
lora_load ....................................... None
lora_modules_to_save ............................ None
lora_r .......................................... 16
lora_register_forward_hook ...................... ['word_embeddings', 'input_layernorm']
lora_target_modules ............................. []
loss_scale ...................................... None
loss_scale_window ............................... 1000
low_freq_factor ................................. None
lr .............................................. 1e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 500
lr_warmup_samples ............................... 0
lr_wsd_decay_iters .............................. None
lr_wsd_decay_samples ............................ None
lr_wsd_decay_style .............................. exponential
make_vocab_size_divisible_by .................... 1
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_length ...................................... 256
max_new_tokens .................................. 128
max_position_embeddings ......................... 163840
max_prompt_length ............................... 512
max_tokens_to_oom ............................... 12000
md5_validate .................................... False
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-07
missing_eos_penalty ............................. 0.0
mmap_bin_files .................................. True
mock_data ....................................... 0
moe_allgather_overlap_comm ...................... False
moe_alltoall_overlap_comm ....................... False
moe_aux_loss_coeff .............................. 0.0
moe_comm_aux_loss_coeff ......................... 0.0
moe_device_level_aux_loss_coeff ................. 0.0
moe_expert_capacity_factor ...................... None
moe_extended_tp ................................. False
moe_grouped_gemm ................................ True
moe_input_jitter_eps ............................ None
moe_intermediate_size ........................... 2048
moe_layer_freq .................................. 1
moe_layer_recompute ............................. False
moe_pad_expert_input_to_capacity ................ False
moe_per_layer_logging ........................... False
moe_permutation_async_comm ...................... True
moe_router_bias_update_rate ..................... 0.001
moe_router_enable_expert_bias ................... True
moe_router_load_balancing_type .................. noaux_tc
moe_router_pre_softmax .......................... False
moe_router_score_function ....................... sigmoid
moe_router_topk ................................. 8
moe_token_dispatcher_type ....................... alltoall
moe_token_drop_policy ........................... probs
moe_tp_extend_ep ................................ False
moe_train_capacity_factor ....................... 1.0
moe_without_activation .......................... False
moe_z_loss_coeff ................................ 0.0
moe_zero_memory ................................. disable
moe_zero_memory_num_layers ...................... None
mtp_loss_scale .................................. 0.3
multi_head_latent_attention ..................... True
n_samples_per_prompt ............................ 1
n_shared_experts ................................ 1
nccl_communicator_config_path ................... None
nd1_dim1_size ................................... 1
nd2_dim1_size ................................... 1
next_tockens .................................... 0
no_cut_token .................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_post_layer_norm .............................. False
no_save_optim ................................... True
no_save_rng ..................................... True
no_shared_storage ............................... True
no_shuffle ...................................... False
noisy_gate_policy ............................... None
noop_layers ..................................... None
norm_epsilon .................................... 1e-06
norm_topk_prob .................................. True
normalization ................................... RMSNorm
num_attention_heads ............................. 128
num_channels .................................... 3
num_classes ..................................... 1000
num_dataset_builder_threads ..................... 1
num_experts ..................................... 256
num_gpus_for_infer .............................. None
num_gpus_for_train .............................. None
num_layer_list .................................. None
num_layers ...................................... 2
num_layers_per_virtual_pipeline_stage ........... None
num_nextn_predict_layers ........................ 1
num_query_groups ................................ 1
num_samples_per_step ............................ 1
num_workers ..................................... 2
o2_gradient ..................................... False
o2_optimizer .................................... False
one_logger_async ................................ False
one_logger_project .............................. megatron-lm
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
original_max_position_embeddings ................ None
output_bert_embeddings .......................... False
output_layer_slice_num .......................... 1
output_logit_softcapping ........................ None
output_multiplier_scale ......................... None
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
pad_to_multiple_of .............................. 8
padded_base_length .............................. 128
padded_vocab_size ............................... 129280
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
placeholder_token ............................... ки
position_embedding_type ......................... rope
post_norm ....................................... False
ppo_epochs ...................................... 1
ppo_mini_batch_size ............................. 1
pre_tockens ..................................... 65536
pref_ftx ........................................ 0.0
pretrained_checkpoint ........................... None
profile ......................................... False
profile_data_simplification ..................... False
profile_export_type ............................. text
profile_level ................................... level0
profile_ranks ................................... [-1]
profile_record_shapes ........................... False
profile_save_path ............................... ./profile_dir
profile_step_end ................................ 12
profile_step_start .............................. 10
profile_with_cpu ................................ False
profile_with_memory ............................. False
profile_with_stack .............................. False
prompt_type ..................................... None
prompt_type_path ................................ /root/kimchou/MindSpeed-LLM/configs/finetune/templates.json
q_lora_rank ..................................... 128
qk_layernorm .................................... True
qk_nope_head_dim ................................ 128
qk_rope_head_dim ................................ 64
qlora ........................................... False
qlora_save_dequantize ........................... False
query_in_block_prob ............................. 0.1
query_pre_attn_scalar ........................... None
rampup_batch_size ............................... None
rank ............................................ 0
recompute_activation_function ................... False
recompute_activation_function_num_layers ........ None
recompute_granularity ........................... full
recompute_in_advance ............................ False
recompute_in_bubble ............................. False
recompute_method ................................ uniform
recompute_mtp_norm .............................. True
recompute_num_layers ............................ 1
reduce_recompute_for_last_chunk ................. False
ref_model ....................................... None
refer_model_iter ................................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
reuse_fp32_param ................................ True
reward_model .................................... False
reward_tokens ................................... []
rope_scaling_beta_fast .......................... 32
rope_scaling_beta_slow .......................... 1
rope_scaling_factor ............................. 40.0
rope_scaling_mscale ............................. 1.0
rope_scaling_mscale_all_dim ..................... 1.0
rope_scaling_original_max_position_embeddings ... 4096
rope_scaling_type ............................... yarn
rotary_base ..................................... 10000
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
routed_scaling_factor ........................... 2.5
router_gating_in_fp32 ........................... False
s3_cache_path ................................... None
sample_rate ..................................... 1.0
save ............................................ ./model_weights/deepseek3-mcore
save_interval ................................... 2000
scale_depth ..................................... None
scale_emb ....................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_aux ......................................... False
seq_length ...................................... 4096
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
shape_order ..................................... BNSD
share_mtp_embedding_and_output_weight ........... True
shared_expert_gate .............................. False
shared_expert_gate_output_dimension ............. 1
short_factor .................................... None
short_mscale .................................... None
short_seq_prob .................................. 0.1
shuffle_minibatch ............................... False
simpo_beta ...................................... 2.5
simpo_label_smoothing ........................... 0.0
simpo_loss_type ................................. sigmoid
skip_bias_add ................................... True
skip_train ...................................... False
sliding_window .................................. None
sparse_mode ..................................... 0
spec ............................................ ['mindspeed_llm.tasks.models.spec.deepseek_spec', 'layer_spec']
split ........................................... 1,1,1
square_alibi_mask ............................... False
squared_relu .................................... False
stage ........................................... None
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
straggler_ctrlr_port ............................ 65535
straggler_minmax_count .......................... 1
swap_attention .................................. False
swap_modules .................................... None
swiglu .......................................... True
swin_backbone_type .............................. tiny
task ............................................ None
temperature ..................................... 0.7
tensor_model_parallel_size ...................... 8
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
tiktoken_num_special_tokens ..................... 1000
tiktoken_pattern ................................ None
tiktoken_special_tokens ......................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_kwargs ................................ None
tokenizer_model ................................. None
tokenizer_name_or_path .......................... ./model_from_hf/deepseek3-hf
tokenizer_not_use_fast .......................... True
tokenizer_padding_side .......................... right
tokenizer_type .................................. PretrainedFromHF
top_k ........................................... 50
top_p ........................................... 0.95
topk_group ...................................... 4
tp_2d ........................................... False
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_overlap_rs_dgrad ........................ False
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
tp_x ............................................ 1
tp_y ............................................ 1
train_data_path ................................. None
train_iters ..................................... 2000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
ulysses_degree_in_cp ............................ None
untie_embeddings_and_output_weights ............. True
use_ascend_coc .................................. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cp_send_recv_overlap ........................ False
use_cpu_initialization .......................... None
use_deter_comp .................................. False
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_fused_mlp ................................... False
use_fused_moe_token_permute_and_unpermute ....... True
use_fused_ring_attention_update ................. False
use_fused_rmsnorm ............................... True
use_fused_rotary_pos_emb ........................ True
use_fused_rotary_pos_emb_new .................... False
use_fused_swiglu ................................ True
use_glm_rope .................................... False
use_kv_cache .................................... False
use_legacy_models ............................... False
use_mc2 ......................................... False
use_mcore_models ................................ True
use_nanopipe .................................... False
use_nd_matmul ................................... False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. True
use_tp_pp_dp_mapping ............................ False
v_head_dim ...................................... 8
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... 129280
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
world_size ...................................... 8
yaml_cfg ........................................ None
-------------------- end of MindSpeed-LLM Arguments ---------------------
INFO:megatron.core.num_microbatches_calculator:setting number of micro-batches to constant 768

building PretrainFromHF tokenizer. Vocab file is un-used, loading tokenizer from pre-trained model
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
initializing torch distributed ...
all tp gourps [[0, 1, 2, 3, 4, 5, 6, 7]]
all ep groups [[0], [1], [2], [3], [4], [5], [6], [7]]
all dp groups [[0], [1], [2], [3], [4], [5], [6], [7]]
all_dp_modulo_exp_group_ranks [[0], [1], [2], [3], [4], [5], [6], [7]]
all_tensor_and_expert_group_ranks [[0, 1, 2, 3, 4, 5, 6, 7]]
all_data_parallel_group_ranks_with_cp [[0], [1], [2], [3], [4], [5], [6], [7]]
initialized tensor model parallel with size 8
initialized pipeline model parallel with size 1
setting random seeds to 1234 ...
compiling dataset index builder ...
make: Entering directory '/root/kimchou/MindSpeed-LLM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/root/kimchou/MindSpeed-LLM/megatron/core/datasets'

done with dataset index builder. Compilation time: 0.125 seconds
time to initialize megatron (seconds): -36.104
[after megatron is initialized] datetime: 2025-02-18 20:36:43
building GPT model ...
number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (3, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (7, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (4, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (5, 0): 3133012736
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=False, use_distributed_optimizer=True, check_for_nan_in_grad=True, bucket_size=None, average_in_collective=False)
number of parameters on (tensor, pipeline) model parallel rank (2, 0): 3133012736
number of parameters on (tensor, pipeline) model parallel rank (6, 0): 3133012736
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
Params for bucket 1 (3133012736 elements):
module.mtp_layers.0.transformer_layer.mlp.shared_experts.linear_fc2.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_kvb.weight
module.mtp_layers.0.transformer_layer.input_layernorm.weight
module.decoder.layers.1.mlp.experts.weight2
module.decoder.layers.1.mlp.router.weight
module.decoder.layers.0.mlp.linear_fc1.weight
module.mtp_layers.0.transformer_layer.mlp.experts.weight2
module.mtp_layers.0.transformer_layer.self_attention.k_layernorm.weight
module.mtp_layers.0.transformer_layer.self_attention.q_layernorm.weight
module.decoder.final_layernorm.weight
module.decoder.layers.1.pre_mlp_layernorm.weight
module.decoder.layers.0.self_attention.linear_qkv.weight
module.embedding.word_embeddings.weight
module.mtp_layers.0.final_layernorm.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_qb.weight
module.mtp_layers.0.hnorm.weight
module.decoder.layers.1.self_attention.linear_qkv.weight
module.decoder.layers.0.input_layernorm.weight
module.mtp_layers.0.transformer_layer.pre_mlp_layernorm.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_proj.weight
module.decoder.layers.1.mlp.experts.weight1
module.decoder.layers.1.self_attention.linear_qb.weight
module.decoder.layers.0.pre_mlp_layernorm.weight
module.decoder.layers.0.self_attention.linear_proj.weight
module.mtp_layers.0.eh_proj.weight
module.output_layer.weight
module.decoder.layers.1.self_attention.k_layernorm.weight
module.decoder.layers.0.self_attention.k_layernorm.weight
module.decoder.layers.0.self_attention.q_layernorm.weight
module.mtp_layers.0.transformer_layer.mlp.shared_experts.linear_fc1.weight
module.mtp_layers.0.transformer_layer.self_attention.linear_qkv.weight
module.mtp_layers.0.enorm.weight
module.decoder.layers.1.self_attention.linear_proj.weight
module.decoder.layers.1.input_layernorm.weight
module.decoder.layers.0.self_attention.linear_qb.weight
module.mtp_layers.0.transformer_layer.mlp.experts.weight1
module.mtp_layers.0.transformer_layer.mlp.router.weight
module.decoder.layers.1.self_attention.linear_kvb.weight
module.decoder.layers.1.self_attention.q_layernorm.weight
module.decoder.layers.0.self_attention.linear_kvb.weight
module.decoder.layers.1.mlp.shared_experts.linear_fc2.weight
module.decoder.layers.1.mlp.shared_experts.linear_fc1.weight
module.decoder.layers.0.mlp.linear_fc2.weight
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=1e-05, min_lr=1e-07, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.01, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=65536.0, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0xffff74758280>)
ninja: no work to do.
learning rate decay style: cosine
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-02-18 20:36:48
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 1536000
validation: 0
test: 0
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.3333333333333333), (0.3333333333333333, 0.6666666666666666), (0.6666666666666666, 1.0)]
building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(1536000, 0, 0), and config=GPTDatasetConfig(random_seed=1234, sequence_length=4096, blend=(['./dataset/enwiki_text_document'], None), blend_per_split=[None, None, None], split='1,1,1', split_matrix=[(0, 0.3333333333333333), (0.3333333333333333, 0.6666666666666666), (0.6666666666666666, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<mindspeed_llm.training.tokenizer.tokenizer._AutoTokenizer object at 0xffff861901f0>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=False, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True, s3_cache_path=None)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from ./dataset/enwiki_text_document.idx
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset: Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 156994
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 156994
INFO:mindspeed_llm.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the document index from 1d6853d8f310f93961152c3757e6b739-GPTDataset-document_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the sample index from 1d6853d8f310f93961152c3757e6b739-GPTDataset-sample_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the shuffle index from 1d6853d8f310f93961152c3757e6b739-GPTDataset-shuffle_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset:> total number of samples: 1541528
INFO:mindspeed_llm.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the document index from a3926395be06957d42fec173e9eb7235-GPTDataset-document_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the sample index from a3926395be06957d42fec173e9eb7235-GPTDataset-sample_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the shuffle index from a3926395be06957d42fec173e9eb7235-GPTDataset-shuffle_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset:> total number of samples: 10081
INFO:mindspeed_llm.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the document index from e08b14f953d1197a7e9dda13bed270ad-GPTDataset-document_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the sample index from e08b14f953d1197a7e9dda13bed270ad-GPTDataset-sample_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset: Load the shuffle index from e08b14f953d1197a7e9dda13bed270ad-GPTDataset-shuffle_index.npy
INFO:mindspeed_llm.core.datasets.gpt_dataset:> total number of samples: 10220
finished creating GPT datasets ...
Calling _query_document_sample_shuffle_indices with idx: 0
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 2
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 1
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 3
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 0
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 2
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 1
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 3
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 0
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
[after dataloaders are built] datetime: 2025-02-18 20:36:48
done with setup ...
Calling _query_document_sample_shuffle_indices with idx: 2
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 1
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
Calling _query_document_sample_shuffle_indices with idx: 3
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (4432.19, 4527.12)
train/valid/test-data-iterators-setup ..........: (78.29, 349.61)
training ...
[before the start of training step] datetime: 2025-02-18 20:36:49
Calling _query_document_sample_shuffle_indices with idx: 4
Labels shape on device 0before: (4098,)
Labels shape on device 0: torch.Size([4097])
Loss mask shape on device 0: torch.Size([4096])
[ERROR] 2025-02-18-20:36:51 (PID:3160063, Device:0, RankID:0) ERR99999 UNKNOWN application exception
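
My reading of the shapes above (an assumption, not verified against the dataset source): with num_nextn_predict_layers=1 (MTP enabled), each sample appears to carry seq_length + 2 tokens, so the sliced labels end up one element longer than the loss mask:

```python
# Off-by-one implied by the debug prints above (assumed derivation):
seq_length = 4096                # from the training args
num_nextn_predict_layers = 1     # MTP is enabled in this config

sample_len = seq_length + 1 + num_nextn_predict_layers  # 4098, the "before" shape
labels_len = sample_len - 1                             # 4097, labels = sample[1:]
loss_mask_len = seq_length                              # 4096, built from seq_length

assert labels_len == loss_mask_len + 1  # the mismatch that triggers the IndexError
```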

Comments (1)

太阳终于出来啦 created this training issue (3 months ago)
shenjiarun changed the task status from TODO to WIP (3 months ago)
