Ascend / MindSpeed-LLM
qwen-7b: program crashes when converting weights from huggingface format to megatron format
DONE
#I9IRFI
Requirement
tiandk
Created on 2024-04-22 21:06
Running the command below crashes. My machine has two Duo cards (310P3, four chips), CANN 7.0, driver 23.0.1, and torch 2.1.0+cpu.

python3 tools/checkpoint/convert_ckpt.py --model-type GPT --loader qwen_hf --saver megatron --target-tensor-parallel-size 4 --load-dir ./model_from_hf/Qwen-7B-Chat --save-dir ./model_weights/Qwen-7B-v0.1-tp4-pp1 --tokenizer-model ./model_from_hf/Qwen-7B-Chat/qwen.tiktoken --add-qkv-bias

The full session transcript and output log follow.

root @ /home/watrix/data/tiandk/ModelLink # python3
Python 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.1.0+cpu'
>>>
root @ /home/watrix/data/tiandk/ModelLink # npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 23.0.1                                   Version: 23.0.1                                        |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)    Temp(C)           Hugepages-Usage(page)  |
| Chip    Device                | Bus-Id          | AICore(%)   Memory-Usage(MB)                         |
+===============================+=================+======================================================+
| 21760   310P3                 | OK              | NA          37                0    / 0               |
| 0       0                     | 0000:56:00.0    | 0           1422 / 44232                             |
+-------------------------------+-----------------+------------------------------------------------------+
| 21760   310P3                 | OK              | NA          36                0    / 0               |
| 1       1                     | 0000:56:00.0    | 0           1423 / 43741                             |
+===============================+=================+======================================================+
| 21888   310P3                 | OK              | NA          39                0    / 0               |
| 0       2                     | 0000:57:00.0    | 0           1347 / 44232                             |
+-------------------------------+-----------------+------------------------------------------------------+
| 21888   310P3                 | OK              | NA          35                0    / 0               |
| 1       3                     | 0000:57:00.0    | 0           1495 / 43741                             |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name            | Process memory(MB)         |
+===============================+=================+======================================================+
| No running processes found in NPU 21760                                                                 |
+===============================+=================+======================================================+
| No running processes found in NPU 21888                                                                 |
+===============================+=================+======================================================+
root @ /home/watrix/data/tiandk/ModelLink # lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
Stepping:                        6
Frequency boost:                 enabled
CPU MHz:                         2196.890
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5600.00
Virtualization:                  VT-x
L1d cache:                       2.3 MiB
L1i cache:                       1.5 MiB
L2 cache:                        60 MiB
L3 cache:                        72 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities
root @ /home/watrix/data/tiandk/ModelLink # cd /usr/local/Ascend/ascend-toolkit/
root @ /usr/local/Ascend/ascend-toolkit # ls
7.0  7.0.0  latest  set_env.sh
root @ /usr/local/Ascend/ascend-toolkit # npu-smi info
+--------------------------------------------------------------------------------------------------------+
| npu-smi 23.0.1                                   Version: 23.0.1                                        |
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Name                  | Health          | Power(W)    Temp(C)           Hugepages-Usage(page)  |
| Chip    Device                | Bus-Id          | AICore(%)   Memory-Usage(MB)                         |
+===============================+=================+======================================================+
| 21760   310P3                 | OK              | NA          37                0    / 0               |
| 0       0                     | 0000:56:00.0    | 0           1423 / 44232                             |
+-------------------------------+-----------------+------------------------------------------------------+
| 21760   310P3                 | OK              | NA          36                0    / 0               |
| 1       1                     | 0000:56:00.0    | 0           1423 / 43741                             |
+===============================+=================+======================================================+
| 21888   310P3                 | OK              | NA          39                0    / 0               |
| 0       2                     | 0000:57:00.0    | 0           1347 / 44232                             |
+-------------------------------+-----------------+------------------------------------------------------+
| 21888   310P3                 | OK              | NA          34                0    / 0               |
| 1       3                     | 0000:57:00.0    | 0           1496 / 43741                             |
+===============================+=================+======================================================+
+-------------------------------+-----------------+------------------------------------------------------+
| NPU     Chip                  | Process id      | Process name            | Process memory(MB)         |
+===============================+=================+======================================================+
| No running processes found in NPU 21760                                                                 |
+===============================+=================+======================================================+
| No running processes found in NPU 21888                                                                 |
+===============================+=================+======================================================+
root @ /usr/local/Ascend/ascend-toolkit # cd /home/watrix/data/tiandk/ModelLink/
root @ /home/watrix/data/tiandk/ModelLink # cat run.sh
#!/bin/bash
#python3 tools/checkpoint/convert_ckpt.py --model-type GPT --loader qwen_hf --saver megatron --target-tensor-parallel-size 4 --load-dir ./model_from_hf/Qwen-VL-Chat/ --save-dir ./model_weights/Qwen-7B-v0.1-tp4-pp1/ --tokenizer-model ./model_from_hf/Qwen-VL-Chat/qwen.tiktoken --add-qkv-bias
python3 tools/checkpoint/convert_ckpt.py --model-type GPT --loader qwen_hf --saver megatron --target-tensor-parallel-size 4 --load-dir ./model_from_hf/Qwen-7B-Chat --save-dir ./model_weights/Qwen-7B-v0.1-tp4-pp1 --tokenizer-model ./model_from_hf/Qwen-7B-Chat/qwen.tiktoken --add-qkv-bias
root @ /home/watrix/data/tiandk/ModelLink # gdb --args python3 tools/checkpoint/convert_ckpt.py --model-type GPT --loader qwen_hf --saver megatron --target-tensor-parallel-size 4 --load-dir ./model_from_hf/Qwen-7B-Chat --save-dir ./model_weights/Qwen-7B-v0.1-tp4-pp1 --tokenizer-model ./model_from_hf/Qwen-7B-Chat/qwen.tiktoken --add-qkv-bias
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(No debugging symbols found in python3)
(gdb) r
Starting program: /usr/bin/python3 tools/checkpoint/convert_ckpt.py --model-type GPT --loader qwen_hf --saver megatron --target-tensor-parallel-size 4 --load-dir ./model_from_hf/Qwen-7B-Chat --save-dir ./model_weights/Qwen-7B-v0.1-tp4-pp1 --tokenizer-model ./model_from_hf/Qwen-7B-Chat/qwen.tiktoken --add-qkv-bias
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 138874]
[New Thread 0x7fff412b2700 (LWP 138937)]
[New Thread 0x7fff2ee7f700 (LWP 138938)]
/usr/local/lib/python3.8/dist-packages/torch_npu/dynamo/__init__.py:18: UserWarning: Register eager implementation for the 'npu' backend of dynamo, as torch_npu was not compiled with torchair.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch_npu/contrib/transfer_to_npu.py:164: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below: torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty ************************************************************************************************************* warnings.warn(msg, ImportWarning) [Thread 0x7fffdc2f0700 (LWP 138875) exited] [Detaching after fork from child process 138940] Zarr-based strategies will not be registered because of missing packages /usr/local/lib/python3.8/dist-packages/torch_npu/contrib/transfer_to_npu.py:124: RuntimeWarning: torch.jit.script will be disabled by transfer_to_npu, which currently does not support it. warnings.warn(msg, RuntimeWarning) warning: Loadable section ".note.gnu.property" outside of ELF segments /usr/local/lib/python3.8/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? warn( Loaded loader_qwen_hf as the loader. Loaded saver_megatron as the saver. Starting saver... [Detaching after fork from child process 138941] Starting loader... using world size: 1, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication using torch.float16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. False adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.999 adam_eps ........................................ 1e-08 adaptive_recompute_device_size .................. -1 adaptive_recompute_device_swap .................. False adaptive_recompute_profiling_step ............... 10 add_bias_linear ................................. False add_dense_bias .................................. False add_position_embedding .......................... True add_qkv_bias .................................... True adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_layernorm_1p .............................. False apply_query_key_layer_scaling ................... 
False apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... False attention_dropout ............................... 0.1 attention_softmax_in_fp32 ....................... False barrier_with_L1_time ............................ True bert_binary_head ................................ True bert_embedder_type .............................. megatron bert_load ....................................... None bf16 ............................................ False bias_dropout_fusion ............................. False bias_gelu_fusion ................................ False biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None check_for_nan_in_loss_and_grad .................. True classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 clone_scatter_output_in_embedding ............... True consumed_train_samples .......................... 0 consumed_valid_samples .......................... 0 context_parallel_size ........................... 1 data_cache_path ................................. None data_parallel_random_init ....................... False data_parallel_size .............................. 1 data_path ....................................... None data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single decoder_num_layers .............................. None decoder_seq_length .............................. None delay_grad_reduce ............................... True delay_param_gather .............................. False dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_saved_activations .................... False distributed_backend ............................. nccl distributed_timeout_minutes ..................... 10 embed_layernorm ................................. False embedding_path .................................. None empty_unused_memory_level ....................... 0 encoder_num_layers .............................. 32 encoder_seq_length .............................. 8192 end_weight_decay ................................ 0.01 eod_mask_loss ................................... False eval_interval ................................... 1000 eval_iters ...................................... 100 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... None exit_on_missing_checkpoint ...................... False exit_signal_handler ............................. False expert_interval ................................. 1 expert_model_parallel_size ...................... 1 ffn_hidden_size ................................. 11008 fill_neg_inf .................................... False finetune ........................................ False fp16 ............................................ 
True fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False fp8 ............................................. None fp8_amax_compute_algo ........................... most_recent fp8_amax_history_len ............................ 1 fp8_interval .................................... 1 fp8_margin ...................................... 0 fp8_wgrad ....................................... True global_batch_size ............................... 1024 gradient_accumulation_fusion .................... False group_query_attention ........................... False head_lr_mult .................................... 1.0 hidden_dropout .................................. 0.1 hidden_size ..................................... 4096 hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_h ........................................... 224 img_w ........................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference_batch_times_seqlen_threshold .......... 512 init_method_std ................................. 0.02 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 is_instruction_dataset .......................... False iter_per_epoch .................................. 1250 iteration ....................................... 1 kv_channels ..................................... 128 lazy_mpu_init ................................... None llama ........................................... {'architectures': ['QWenLMHeadModel'], 'auto_map': {'AutoConfig': 'configuration_qwen.QWenConfig', 'AutoModelForCausalLM': 'modeling_qwen.QWenLMHeadModel'}, 'attn_dropout_prob': 0.0, 'bf16': False, 'emb_dropout_prob': 0.0, 'fp16': False, 'fp32': False, 'hidden_size': 4096, 'intermediate_size': 22016, 'initializer_range': 0.02, 'kv_channels': 128, 'layer_norm_epsilon': 1e-06, 'max_position_embeddings': 8192, 'model_type': 'qwen', 'no_bias': True, 'num_attention_heads': 32, 'num_hidden_layers': 32, 'onnx_safe': None, 'rotary_emb_base': 10000, 'rotary_pct': 1.0, 'scale_attn_weights': True, 'seq_length': 8192, 'tie_word_embeddings': False, 'tokenizer_class': 'QWenTokenizer', 'transformers_version': '4.32.0', 'use_cache': True, 'use_dynamic_ntk': True, 'use_flash_attn': 'auto', 'use_logn_attn': True, 'vocab_size': 151936} load ............................................ ./model_from_hf/Qwen-7B-Chat local_rank ...................................... None log_batch_size_to_tensorboard ................... False log_interval .................................... 100 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_memory_to_tensorboard ....................... False log_num_zeros_in_grad ........................... False log_params_norm ................................. False log_throughput .................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False log_world_size_to_tensorboard ................... False lora_alpha ...................................... 32 lora_load ....................................... None lora_modules_to_save ............................ None lora_r .......................................... 16 lora_register_forward_hook ...................... 
['word_embeddings', 'input_layernorm'] lora_target_modules ............................. [] loss_scale ...................................... None loss_scale_window ............................... 1000 lr .............................................. None lr_decay_iters .................................. None lr_decay_samples ................................ None lr_decay_style .................................. linear lr_warmup_fraction .............................. None lr_warmup_init .................................. 0.0 lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 make_vocab_size_divisible_by .................... 128 manual_gc ....................................... False manual_gc_eval .................................. True manual_gc_interval .............................. 0 mask_factor ..................................... 1.0 mask_prob ....................................... 0.15 mask_type ....................................... random masked_softmax_fusion ........................... False max_position_embeddings ......................... 8192 max_tokens_to_oom ............................... 12000 merge_file ...................................... None micro_batch_size ................................ 1 min_loss_scale .................................. 1.0 min_lr .......................................... 0.0 moe_aux_loss_coeff .............................. 0.0 moe_router_load_balancing_type .................. aux_loss moe_router_topk ................................. 2 moe_train_capacity_factor ....................... 1.0 moe_z_loss_coeff ................................ 0.0 nccl_communicator_config_path ................... None next_tockens .................................... 0 no_load_optim ................................... True no_load_rng ..................................... True no_persist_layer_norm ........................... False no_save_optim ................................... True no_save_rng ..................................... True noisy_gate_policy ............................... None norm_epsilon .................................... 1e-06 normalization ................................... RMSNorm num_attention_heads ............................. 32 num_channels .................................... 3 num_classes ..................................... 1000 num_experts ..................................... None num_layer_list .................................. None num_layers ...................................... 32 num_layers_per_virtual_pipeline_stage ........... None num_query_groups ................................ 1 num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimize_recomp_communication_level ............. 0 optimizer ....................................... adam output_bert_embeddings .......................... False overlap_grad_reduce ............................. False overlap_p2p_comm ................................ False overlap_param_gather ............................ False override_opt_param_scheduler .................... False padded_vocab_size ............................... 151936 params_dtype .................................... torch.float16 patch_dim ....................................... 16 perform_initialization .......................... False pipeline_model_parallel_size .................... 1 pipeline_model_parallel_split_rank .............. 
None position_embedding_type ......................... rope pre_tockens ..................................... 65536 profile ......................................... False profile_level ................................... level0 profile_ranks ................................... [0] profile_record_shapes ........................... False profile_save_path ............................... ./profile_dir profile_step_end ................................ 12 profile_step_start .............................. 10 profile_with_cpu ................................ False profile_with_memory ............................. False profile_with_stack .............................. False query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None rank ............................................ 0 recompute_granularity ........................... None recompute_method ................................ None recompute_num_layers ............................ None reset_attention_mask ............................ False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 retro_add_retriever ............................. False retro_cyclic_train_iters ........................ None retro_encoder_attention_dropout ................. 0.1 retro_encoder_hidden_dropout .................... 0.1 retro_encoder_layers ............................ 2 retro_num_neighbors ............................. 2 retro_num_retrieved_chunks ...................... 2 retro_return_doc_ids ............................ False retro_verify_neighbor_count ..................... True retro_workdir ................................... None reuse_fp32_param ................................ False rotary_base ..................................... None rotary_percent .................................. 1.0 rotary_seq_len_interpolation_factor ............. None sample_rate ..................................... 1.0 save ............................................ None save_interval ................................... None scatter_gather_tensors_in_pipeline .............. True seed ............................................ 1234 seq_length ...................................... 8192 sequence_parallel ............................... False sgd_momentum .................................... 0.9 shape_order ..................................... SBH short_seq_prob .................................. 0.1 skip_bias_add ................................... True skip_train ...................................... False spec ............................................ None split ........................................... 969, 30, 1 square_alibi_mask ............................... False squared_relu .................................... False standalone_embedding_stage ...................... False start_weight_decay .............................. 0.01 swiglu .......................................... True swin_backbone_type .............................. tiny tensor_model_parallel_size ...................... 1 tensorboard_dir ................................. None tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 test_data_path .................................. None timing_log_level ................................ 0 timing_log_option ............................... 
minmax titles_data_path ................................ None tokenizer_kwargs ................................ None tokenizer_model ................................. ./model_from_hf/Qwen-7B-Chat/qwen.tiktoken tokenizer_name_or_path .......................... None tokenizer_not_use_fast .......................... True tokenizer_padding_side .......................... right tokenizer_type .................................. PretrainedFromHF tp_comm_bulk_dgrad .............................. True tp_comm_bulk_wgrad .............................. True tp_comm_overlap ................................. False tp_comm_overlap_cfg ............................. None tp_comm_split_ag ................................ True tp_comm_split_rs ................................ True train_data_path ................................. None train_iters ..................................... None train_samples ................................... None transformer_impl ................................ local transformer_pipeline_model_parallel_size ........ 1 untie_embeddings_and_output_weights ............. True use_checkpoint_args ............................. False use_checkpoint_opt_param_scheduler .............. False use_cpu_initialization .......................... True use_distributed_optimizer ....................... False use_flash_attn .................................. False use_fused_rmsnorm ............................... False use_fused_rotary_pos_emb ........................ False use_fused_swiglu ................................ False use_mcore_models ................................ False use_one_sent_docs ............................... False use_ring_exchange_p2p ........................... False use_rotary_position_embeddings .................. True valid_data_path ................................. None variable_seq_lengths ............................ False virtual_pipeline_model_parallel_size ............ None vision_backbone_type ............................ vit vision_pretraining .............................. False vision_pretraining_type ......................... classify vocab_extra_ids ................................. 0 vocab_file ...................................... None vocab_size ...................................... 151936 wandb_exp_name .................................. wandb_project ................................... wandb_save_dir .................................. weight_decay .................................... 0.01 weight_decay_incr_style ......................... constant world_size ...................................... 1 -------------------- end of arguments --------------------- setting number of micro-batches to constant 1024 [New Thread 0x7fffcebf8700 (LWP 138942)] [New Thread 0x7fffcebd7700 (LWP 138943)] [New Thread 0x7fffcebb6700 (LWP 138944)] [Detaching after fork from child process 138945] [New Thread 0x7fff412b2700 (LWP 138956)] [Detaching after fork from child process 139061] [Detaching after fork from child process 139062] [New Thread 0x7fff43ab3700 (LWP 139071)] [New Thread 0x7fff462b4700 (LWP 139072)] [New Thread 0x7fff48ab5700 (LWP 139073)] WARNING:transformers_modules.Qwen-7B-Chat.modeling_qwen:The model is automatically converting to fp16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". WARNING:transformers_modules.Qwen-7B-Chat.modeling_qwen:Try importing flash-attention for faster inference... 
WARNING:transformers_modules.Qwen-7B-Chat.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary WARNING:transformers_modules.Qwen-7B-Chat.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm WARNING:transformers_modules.Qwen-7B-Chat.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention [New Thread 0x7fff912d2700 (LWP 139351)] Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s][New Thread 0x7fff8ead1700 (LWP 139352)] [New Thread 0x7fff8c2d0700 (LWP 139353)] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.13it/s] building GPT model ... [New Thread 0x7ffe897fd700 (LWP 139446)] set layer states: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 12.91it/s] [New Thread 0x7ffedbfff700 (LWP 139447)] sending embeddings [New Thread 0x7ffed8ff9700 (LWP 139453)] Overwriting default ffn_hidden_size value None with value from checkpoint 11008. Overwriting default kv_channels value None with value from checkpoint 128. Overwriting default use_rotary_position_embeddings value False with value from checkpoint True. Overwriting default normalization value LayerNorm with value from checkpoint RMSNorm. Overwriting default norm_epsilon value 1e-05 with value from checkpoint 1e-06. Overwriting default swiglu value False with value from checkpoint True. Overwriting default global_batch_size value None with value from checkpoint 1024. Overwriting default dataloader_type value None with value from checkpoint single. Overwriting default load value None with value from checkpoint ./model_from_hf/Qwen-7B-Chat. Overwriting default overlap_p2p_comm value True with value from checkpoint False. Overwriting default vocab_size value None with value from checkpoint 151936. Overwriting default padded_vocab_size value None with value from checkpoint 151936. Overwriting default add_qkv_bias value False with value from checkpoint True. Checkpoint had argument iteration but new arguments does not have this. Checkpoint had argument llama but new arguments does not have this. Checkpoint had argument transformer_pipeline_model_parallel_size but new arguments does not have this. Checkpoint had argument data_parallel_size but new arguments does not have this. Checkpoint had argument consumed_train_samples but new arguments does not have this. Checkpoint had argument consumed_valid_samples but new arguments does not have this. Checkpoint had argument disable_bias_linear but new arguments does not have this. Checkpoint had argument model_type but new arguments does not have this. using world size: 4, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 4, pipeline-model-parallel size: 1 WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication using torch.float16 for parameters ... ------------------------ arguments ------------------------ [New Thread 0x7ffed87f8700 (LWP 139454)] accumulate_allreduce_grads_in_fp32 .............. False adam_beta1 ...................................... 
0.9 adam_beta2 ...................................... 0.999 adam_eps ........................................ 1e-08 adaptive_recompute_device_size .................. -1 adaptive_recompute_device_swap .................. False adaptive_recompute_profiling_step ............... 10 add_bias_linear ................................. False add_dense_bias .................................. False add_position_embedding .......................... True add_qkv_bias .................................... True adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_layernorm_1p .............................. False apply_query_key_layer_scaling ................... False apply_residual_connection_post_layernorm ........ False async_tensor_model_parallel_allreduce ........... False attention_dropout ............................... 0.1 attention_softmax_in_fp32 ....................... False barrier_with_L1_time ............................ True bert_binary_head ................................ True bert_embedder_type .............................. megatron bert_load ....................................... None bf16 ............................................ False bias_dropout_fusion ............................. False bias_gelu_fusion ................................ False biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None check_for_nan_in_loss_and_grad .................. True classes_fraction ................................ 1.0 clip_grad ....................................... 1.0 clone_scatter_output_in_embedding ............... True consumed_train_samples .......................... 0 consumed_valid_samples .......................... 0 context_parallel_size ........................... 1 data_cache_path ................................. None data_parallel_random_init ....................... False data_parallel_size .............................. 1 data_path ....................................... None data_per_class_fraction ......................... 1.0 data_sharding ................................... True dataloader_type ................................. single decoder_num_layers .............................. None decoder_seq_length .............................. None delay_grad_reduce ............................... True delay_param_gather .............................. False dino_bottleneck_size ............................ 256 dino_freeze_last_layer .......................... 1 dino_head_hidden_size ........................... 2048 dino_local_crops_number ......................... 10 dino_local_img_size ............................. 96 dino_norm_last_layer ............................ False dino_teacher_temp ............................... 0.07 dino_warmup_teacher_temp ........................ 0.04 dino_warmup_teacher_temp_epochs ................. 30 distribute_saved_activations .................... False distributed_backend ............................. nccl distributed_timeout_minutes ..................... 10 embed_layernorm ................................. False embedding_path .................................. None empty_unused_memory_level ....................... 0 encoder_num_layers .............................. 32 encoder_seq_length .............................. 8192 end_weight_decay ................................ 0.01 eod_mask_loss ................................... 
False eval_interval ................................... 1000 eval_iters ...................................... 100 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... None exit_on_missing_checkpoint ...................... False exit_signal_handler ............................. False expert_interval ................................. 1 expert_model_parallel_size ...................... 1 ffn_hidden_size ................................. 11008 fill_neg_inf .................................... False finetune ........................................ False fp16 ............................................ True fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False fp8 ............................................. None fp8_amax_compute_algo ........................... most_recent fp8_amax_history_len ............................ 1 fp8_interval .................................... 1 fp8_margin ...................................... 0 fp8_wgrad ....................................... True global_batch_size ............................... 1024 gradient_accumulation_fusion .................... False group_query_attention ........................... False head_lr_mult .................................... 1.0 hidden_dropout .................................. 0.1 hidden_size ..................................... 4096 hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_h ........................................... 224 img_w ........................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference_batch_times_seqlen_threshold .......... 512 init_method_std ................................. 0.02 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 is_instruction_dataset .......................... False iter_per_epoch .................................. 1250 kv_channels ..................................... 128 lazy_mpu_init ................................... None load ............................................ ./model_from_hf/Qwen-7B-Chat local_rank ...................................... None log_batch_size_to_tensorboard ................... False log_interval .................................... 100 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_memory_to_tensorboard ....................... False log_num_zeros_in_grad ........................... False log_params_norm ................................. False log_throughput .................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False log_world_size_to_tensorboard ................... False lora_alpha ...................................... 32 lora_load ....................................... None lora_modules_to_save ............................ None lora_r .......................................... 16 lora_register_forward_hook ...................... ['word_embeddings', 'input_layernorm'] lora_target_modules ............................. [] loss_scale ...................................... None loss_scale_window ............................... 
1000 lr .............................................. None lr_decay_iters .................................. None lr_decay_samples ................................ None lr_decay_style .................................. linear lr_warmup_fraction .............................. None lr_warmup_init .................................. 0.0 lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 make_vocab_size_divisible_by .................... 1 manual_gc ....................................... False manual_gc_eval .................................. True manual_gc_interval .............................. 0 mask_factor ..................................... 1.0 mask_prob ....................................... 0.15 mask_type ....................................... random masked_softmax_fusion ........................... False max_position_embeddings ......................... 8192 max_tokens_to_oom ............................... 12000 merge_file ...................................... None micro_batch_size ................................ 1 min_loss_scale .................................. 1.0 min_lr .......................................... 0.0 moe_aux_loss_coeff .............................. 0.0 moe_router_load_balancing_type .................. aux_loss moe_router_topk ................................. 2 moe_train_capacity_factor ....................... 1.0 moe_z_loss_coeff ................................ 0.0 nccl_communicator_config_path ................... None next_tockens .................................... 0 no_load_optim ................................... True no_load_rng ..................................... True no_persist_layer_norm ........................... False no_save_optim ................................... True no_save_rng ..................................... True noisy_gate_policy ............................... None norm_epsilon .................................... 1e-06 normalization ................................... RMSNorm num_attention_heads ............................. 32 num_channels .................................... 3 num_classes ..................................... 1000 num_experts ..................................... None num_layer_list .................................. None num_layers ...................................... 32 num_layers_per_virtual_pipeline_stage ........... None num_query_groups ................................ 1 num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimize_recomp_communication_level ............. 0 optimizer ....................................... adam output_bert_embeddings .......................... False overlap_grad_reduce ............................. False overlap_p2p_comm ................................ False overlap_param_gather ............................ False override_opt_param_scheduler .................... False padded_vocab_size ............................... 151936 params_dtype .................................... torch.float16 patch_dim ....................................... 16 perform_initialization .......................... False pipeline_model_parallel_size .................... 1 pipeline_model_parallel_split_rank .............. None position_embedding_type ......................... rope pre_tockens ..................................... 65536 profile ......................................... 
False profile_level ................................... level0 profile_ranks ................................... [0] profile_record_shapes ........................... False profile_save_path ............................... ./profile_dir profile_step_end ................................ 12 profile_step_start .............................. 10 profile_with_cpu ................................ False profile_with_memory ............................. False profile_with_stack .............................. False query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None rank ............................................ 0 recompute_granularity ........................... None recompute_method ................................ None recompute_num_layers ............................ None reset_attention_mask ............................ False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 retro_add_retriever ............................. False retro_cyclic_train_iters ........................ None retro_encoder_attention_dropout ................. 0.1 retro_encoder_hidden_dropout .................... 0.1 retro_encoder_layers ............................ 2 retro_num_neighbors ............................. 2 retro_num_retrieved_chunks ...................... 2 retro_return_doc_ids ............................ False retro_verify_neighbor_count ..................... True retro_workdir ................................... None reuse_fp32_param ................................ False rotary_base ..................................... None rotary_percent .................................. 1.0 rotary_seq_len_interpolation_factor ............. None sample_rate ..................................... 1.0 save ............................................ ./model_weights/Qwen-7B-v0.1-tp4-pp1 save_interval ................................... 1 scatter_gather_tensors_in_pipeline .............. True seed ............................................ 1234 seq_length ...................................... 8192 sequence_parallel ............................... False sgd_momentum .................................... 0.9 shape_order ..................................... SBH short_seq_prob .................................. 0.1 skip_bias_add ................................... True skip_train ...................................... False spec ............................................ None split ........................................... 969, 30, 1 square_alibi_mask ............................... False squared_relu .................................... False standalone_embedding_stage ...................... False start_weight_decay .............................. 0.01 swiglu .......................................... True swin_backbone_type .............................. tiny tensor_model_parallel_size ...................... 4 tensorboard_dir ................................. None tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 test_data_path .................................. None timing_log_level ................................ 0 timing_log_option ............................... minmax titles_data_path ................................ None tokenizer_kwargs ................................ None tokenizer_model ................................. 
None tokenizer_name_or_path .......................... None tokenizer_not_use_fast .......................... True tokenizer_padding_side .......................... right tokenizer_type .................................. PretrainedFromHF tp_comm_bulk_dgrad .............................. True tp_comm_bulk_wgrad .............................. True tp_comm_overlap ................................. False tp_comm_overlap_cfg ............................. None tp_comm_split_ag ................................ True tp_comm_split_rs ................................ True train_data_path ................................. None train_iters ..................................... None train_samples ................................... None transformer_impl ................................ local transformer_pipeline_model_parallel_size ........ 1 untie_embeddings_and_output_weights ............. True use_checkpoint_args ............................. False use_checkpoint_opt_param_scheduler .............. False use_cpu_initialization .......................... True use_distributed_optimizer ....................... False use_flash_attn .................................. False use_fused_rmsnorm ............................... False use_fused_rotary_pos_emb ........................ False use_fused_swiglu ................................ False use_mcore_models ................................ False use_one_sent_docs ............................... False use_ring_exchange_p2p ........................... False use_rotary_position_embeddings .................. True valid_data_path ................................. None variable_seq_lengths ............................ False virtual_pipeline_model_parallel_size ............ None vision_backbone_type ............................ vit vision_pretraining .............................. False vision_pretraining_type ......................... classify vocab_extra_ids ................................. 0 vocab_file ...................................... None vocab_size ...................................... 151936 w_pack .......................................... False wandb_exp_name .................................. wandb_project ................................... wandb_save_dir .................................. weight_decay .................................... 0.01 weight_decay_incr_style ......................... constant world_size ...................................... 4 -------------------- end of arguments --------------------- setting number of micro-batches to constant 1024 Setting consumed_train_samples to 0 and consumed_valid_samples to 0 [New Thread 0x7ffe7b7fe700 (LWP 139494)] --Type <RET> for more, q to quit, c to continue without paging-- Thread 185 "python3" received signal SIGBUS, Bus error. 
[Switching to Thread 0x7ffecaffd700 (LWP 139463)]
0x00007fffe52057c3 in void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<at::native::AVX2::VectorizedLoop2d<at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#1}::operator()() const::{lambda(unsigned char)#1}, at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#3}::operator()() const::{lambda(at::vec::AVX2::Vectorized<unsigned char>)#2}> >(long, char**, long const*, long, long) () from /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so
(gdb) bt
#0  0x00007fffe52057c3 in void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<at::native::AVX2::VectorizedLoop2d<at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#1}::operator()() const::{lambda(unsigned char)#1}, at::native::AVX2::direct_copy_kernel(at::TensorIteratorBase&)::{lambda()#3}::operator()() const::{lambda()#3}::operator()() const::{lambda(at::vec::AVX2::Vectorized<unsigned char>)#2}> >(long, char**, long const*, long, long) () from /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so
#1  0x00007fffe0a20cee in at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long const*, long, long)>, at::Range) const () from /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so
#2  0x00007fffe0a20e22 in void at::internal::invoke_parallel<at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long const*, long, long)>, long)::{lambda(long, long)#1}>(long, long, long, at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long const*, long, long)>, long)::{lambda(long, long)#1} const&) [clone ._omp_fn.0] () from /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so
#3  0x00007ffff6efa405 in ?? () from /usr/local/lib/python3.8/dist-packages/torch/lib/libgomp-a34b3233.so.1
#4  0x00007ffff7db1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007ffff7eeb353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) quit
A debugging session is active.
        Inferior 1 [process 138870] will be killed.
Quit anyway? (y or n) y
root @ /home/watrix/data/tiandk/ModelLink #
Process ForkServerPoolWorker-3:
Process ForkServerPoolWorker-9:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

BrokenPipeError: [Errno 32] Broken pipe
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 97 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
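A note on the failure mode: the SIGBUS fires inside a plain CPU copy loop (direct_copy_kernel in libtorch_cpu.so) while the loader and saver exchange weight shards through multiprocessing queues, and torch.multiprocessing moves tensors between processes via shared memory. A bus error at that point is a common symptom of the tmpfs backing /dev/shm, or host RAM, running out while fp16 shards are in flight. A minimal pre-flight check, assuming a standard Linux setup; none of these commands appear in the original report:

# Hypothetical pre-flight checks before re-running the conversion.
# Qwen-7B in fp16 is roughly 15 GB of weights, and the loader, saver
# workers, and queue buffers can each hold copies while shards are in flight.
df -h /dev/shm    # tmpfs headroom; torch.multiprocessing stages shared tensors here
free -h           # host RAM headroom for the loader plus the saver workers
ulimit -v -l      # rule out restrictive virtual/locked memory limits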
执行命令如下命令崩溃。我的机器是2张duo卡,cann版本7.0,驱动23.0.1,torch 2.1.0+cpu。 python3 tools/checkpoint/convert_ckpt.py --model-type GPT --loader qwen_hf --saver megatron --target-tensor-parallel-size 4 --load-dir ./model_from_hf/Qwen-7B-Chat --save-dir ./model_weights/Qwen-7B-v0.1-tp4-pp1 --tokenizer-model ./model_from_hf/Qwen-7B-Chat/qwen.tiktoken --add-qkv-bias 附完整操作记录和输出日志。  root @ /home/watrix/data/tiandk/ModelLink # python3 Python 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> torch.__version__ '2.1.0+cpu' >>> root @ /home/watrix/data/tiandk/ModelLink # npu-smi info +--------------------------------------------------------------------------------------------------------+ | npu-smi 23.0.1 Version: 23.0.1 | +-------------------------------+-----------------+------------------------------------------------------+ | NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page) | | Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) | +===============================+=================+======================================================+ | 21760 310P3 | OK | NA 37 0 / 0 | | 0 0 | 0000:56:00.0 | 0 1422 / 44232 | +-------------------------------+-----------------+------------------------------------------------------+ | 21760 310P3 | OK | NA 36 0 / 0 | | 1 1 | 0000:56:00.0 | 0 1423 / 43741 | +===============================+=================+======================================================+ | 21888 310P3 | OK | NA 39 0 / 0 | | 0 2 | 0000:57:00.0 | 0 1347 / 44232 | +-------------------------------+-----------------+------------------------------------------------------+ | 21888 310P3 | OK | NA 35 0 / 0 | | 1 3 | 0000:57:00.0 | 0 1495 / 43741 | +===============================+=================+======================================================+ +-------------------------------+-----------------+------------------------------------------------------+ | NPU Chip | Process id | Process name | Process memory(MB) | +===============================+=================+======================================================+ | No running processes found in NPU 21760 | +===============================+=================+======================================================+ | No running processes found in NPU 21888 | +===============================+=================+======================================================+ root @ /home/watrix/data/tiandk/ModelLink # lscpu 架构: x86_64 CPU 运行模式: 32-bit, 64-bit 字节序: Little Endian Address sizes: 46 bits physical, 57 bits virtual CPU: 96 在线 CPU 列表: 0-95 每个核的线程数: 2 每个座的核数: 24 座: 2 NUMA 节点: 2 厂商 ID: GenuineIntel CPU 系列: 6 型号: 106 型号名称: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz 步进: 6 Frequency boost: enabled CPU MHz: 2196.890 CPU 最大 MHz: 3500.0000 CPU 最小 MHz: 800.0000 BogoMIPS: 5600.00 虚拟化: VT-x L1d 缓存: 2.3 MiB L1i 缓存: 1.5 MiB L2 缓存: 60 MiB L3 缓存: 72 MiB NUMA 节点0 CPU: 0-23,48-71 NUMA 节点1 CPU: 24-47,72-95 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected 标记: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 