@lnwbzmt
lnwbzmt: no bio yet
Fault injection for PyTorch via torch_dispatch
TensorProbe (code name: kj600) is an LLM pretraining debugger that collects and aggregates tensors from the model's torch modules, optimizer state, and collective communication. It also supports rule-based alerts.
An LLM pretraining early-warning system with built-in collection and aggregation of the model's torch module and optimizer state.
Markdown image management