@lnwbzmt
lnwbzmt: no bio provided
Ascend LLM distributed training framework
TensorProbe (code name: kj600) is an LLM pretraining debugger that collects and aggregates tensors from the model's torch modules, optimizer state, and collective communication. It also supports rule-based alerts.