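# Checkpoint-Resume Training Acceleration

## Overall Results

### Training Environment

| Item | Value |
| --- | --- |
| Server model | Atlas 800T A2 |
| NPU model | 910B2 (64GB) |
| Driver version | 23.0.3 |
| CANN | 8.0.RC2 |
| Python | 3.10.14 |
| MindSpore | 2.3.0 |
| Mindformers | 1.2.0 |

### Training Configuration

| Parameter | Value |
| --- | --- |
| Epochs | 350 |
| Learning Rate | 6.e-5 |
| Global Batch Size | 32768 |
| Batch Size | 1 |
| Micro Batch Size | 128 |
| Sequence Length | 4096 |
| Data Parallel (DP) | 256 |
| Model Parallel (MP) | 4 |
| Pipeline Parallel (PP) | 9 |
| max_device_memory | 54GB |
| jit_level | O2 |

### Training Results

| Metric | Value |
| --- | --- |
| Throughput (tokens/s/p) | 366.915 |
| MFU, relative to chip compute (%) | 43.061 |
| MFU, relative to CUBE compute (%) | 45.867 |

### Checkpoint Resume

Total checkpoint size: 22 TB, of which the checkpoint for card 0 is 2.9 GB.

#### Fault 1: Application failure (kill all Python processes)

| Metric | Value |
| --- | --- |
| Fault detection time | 7.2 s |
| Fault handling time | 231.7 s (3.86 min) |
| Fault recovery time | 458 s (7.63 min) |
| Checkpoint load time | 0.28 min |
| Card 0 checkpoint load speed (GB/s) | 0.99 |

#### Fault 2: Node heartbeat failure (label removed from the node)

| Metric | Value |
| --- | --- |
| Fault detection time | 18.9 s (0.315 min) |
| Fault handling time | 279.8 s (4.64 min) |
| Fault recovery time | 478 s (7.96 min) |
| Card 0 checkpoint load time | 0.3 min |
| Card 0 checkpoint load speed (GB/s) | 1.01 |

#### Fault 3: Node down failure (reboot)

| Metric | Value |
| --- | --- |
| Fault detection time | 18.3 s (0.3 min) |
| Fault handling time | 465 s (7.75 min) |
| Fault recovery time | 546 s (9.1 min) |
| Card 0 checkpoint load time | 0.98 min |
| Card 0 checkpoint load speed (GB/s) | 0.1 |

#### Fault 4: Network failure (NIC link down)

| Metric | Value |
| --- | --- |
| Fault detection time | 895 s (600 s HCCL), 14.9 min |
| Fault handling time | 300 s (5 min) |
| Fault recovery time | 472.1 s (7.86 min) |
| Card 0 checkpoint load time | 0.32 min |
| Card 0 checkpoint load speed (GB/s) | 0.96 |

#### Fault 5: PCIe failure (simulated NPU drop-off)

| Metric | Value |
| --- | --- |
| Fault detection time | 78.7 s (1.3 min) |
| Fault handling time | 267.1 s (4.4 min) |
| Fault recovery time | 516.9 s (8.6 min) |
| Card 0 checkpoint load time | 0.32 min |
| Card 0 checkpoint load speed (GB/s) | 0.98 |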
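
The training configuration figures can be cross-checked with a little arithmetic. The sketch below is only an illustration and rests on assumptions the report does not state: that the global batch size is the product of Batch Size, Micro Batch Size, and DP; that each combination of DP, MP, and PP maps to exactly one NPU; and that "p" in tokens/s/p means one NPU. The variable names are made up for this example.

```python
# Values copied from the training configuration and results.
batch_size = 1
micro_batch_size = 128        # listed as "Micro Batch Size"
data_parallel = 256           # DP
model_parallel = 4            # MP
pipeline_parallel = 9         # PP
sequence_length = 4096
throughput_per_npu = 366.915  # tokens/s/p; "p" is assumed to mean one NPU

# Assumption: global batch = per-replica batch x micro batches x DP replicas.
global_batch_size = batch_size * micro_batch_size * data_parallel

# Assumption: one NPU per (DP, MP, PP) coordinate.
num_devices = data_parallel * model_parallel * pipeline_parallel

# Tokens consumed by one optimizer step over the whole cluster.
tokens_per_step = global_batch_size * sequence_length

# Aggregate throughput and the step time it implies.
cluster_tokens_per_s = throughput_per_npu * num_devices
step_time_s = tokens_per_step / cluster_tokens_per_s

print(f"global batch size : {global_batch_size}")
print(f"NPUs in the job   : {num_devices}")
print(f"tokens per step   : {tokens_per_step}")
print(f"cluster tokens/s  : {cluster_tokens_per_s:.2f}")
print(f"implied step time : {step_time_s:.2f} s")
```

Under these assumptions the product reproduces the listed Global Batch Size of 32768 and gives a job size of 9216 NPUs. The implied step time is only as reliable as the per-NPU reading of the throughput figure.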