Training hangs at the end while calling dist.barrier()
System Info
- `transformers` version: 4.18.0
- Platform: Linux-5.4.0-1073-azure-x86_64-with-glibc2.27
- Python version: 3.8.0
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.10.1+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: DDP
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I am working on a custom TokenClassificationTask. For a specific model type, the process hangs at the end of training. After setting TORCH_DISTRIBUTED_DEBUG=DETAIL and adding rank numbers to the logs (I overrode the `train` method of the `Trainer` class with additional logging; a sketch of that setup follows below the trace), the training failed and I received the stack trace below.
Training completed for rank 6. Do not forget to share your model on huggingface.co/models =)
2022-05-30 15:54:58 INFO nlp_ner_layoutlm.layoutlm.trainers.re_trainer Before barrier for rank 6
2022-05-30 15:54:58 INFO nlp_ner_layoutlm.layoutlm.trainers.re_trainer Entering into barrier for rank 6
2022-05-30 15:54:59 INFO transformers.modeling_utils Model weights saved in ./data/tmpm8wxl12l/checkpoint-590/pytorch_model.bin
2022-05-30 15:55:01 INFO transformers.trainer Deleting older checkpoint [data/tmpm8wxl12l/checkpoint-585] due to args.save_total_limit
2022-05-30 15:55:01 ERROR __main__ Detected mismatch between collectives on ranks. Rank 6 is running inconsistent collective: CollectiveFingerPrint(OpType=BARRIER
Traceback (most recent call last):
File "nlp_ner_layoutlm/train_pipeline/training_step/training_script.py", line 53, in <module>
train_model(
File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 160, in train_model
raise e
File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 158, in train_model
trainer.train(resume_from_checkpoint=get_last_checkpoint(checkpoint_dir))
File "/app/nlp_ner_layoutlm/layoutlm/trainers/re_trainer.py", line 698, in train
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 6 is running inconsistent collective: CollectiveFingerPrint(OpType=BARRIER
2022-05-30 15:55:01 ERROR __main__ Detected mismatch between collectives on ranks. Rank 3 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Traceback (most recent call last):
File "nlp_ner_layoutlm/train_pipeline/training_step/training_script.py", line 53, in <module>
train_model(
File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 160, in train_model
raise e
File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 158, in train_model
trainer.train(resume_from_checkpoint=get_last_checkpoint(checkpoint_dir))
File "/app/nlp_ner_layoutlm/layoutlm/trainers/re_trainer.py", line 603, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2011, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2043, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 878, in forward
self._sync_params()
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1379, in _sync_params
self._distributed_broadcast_coalesced(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
2022-05-30 15:55:01 ERROR __main__ Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Traceback (most recent call last):
... (Same error for other processes)
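For reference, a minimal sketch of how the rank-tagged logging mentioned above could be wired up. The class and logger names here are hypothetical, not the project's actual `re_trainer` code:

```python
# Hypothetical sketch: rank-aware logging around Trainer.train().
# TORCH_DISTRIBUTED_DEBUG=DETAIL must be exported before the process group
# is created (e.g. in the launch environment), not inside train().
import logging
import torch.distributed as dist
from transformers import Trainer

logger = logging.getLogger(__name__)

class RankLoggingTrainer(Trainer):
    def train(self, *args, **kwargs):
        rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
        logger.info("Entering train for rank %d", rank)
        try:
            return super().train(*args, **kwargs)
        finally:
            logger.info("Leaving train for rank %d", rank)
```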
According to the trace above, while the process with rank 6 is running `dist.barrier()` from trainer.py line 1536, the other processes are still running a forward call. I think this mismatch in collectives is what makes the training hang. When I searched for similar issues, I came across the same problem reported in speechbrain, which they fixed with a PR. At the moment I can't understand why the processes end up at different places in the code, and I can't figure out how to fix it.
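To make the failure mode concrete, here is a minimal, self-contained sketch (not the project's code) of two ranks issuing different collectives, mirroring the rank-6-at-barrier / other-ranks-at-broadcast split in the logs:

```python
# Minimal repro sketch of the failure mode, not the project's code: rank 0
# has decided training is finished and waits at barrier(), while every
# other rank is still inside forward() and issues a broadcast. With
# TORCH_DISTRIBUTED_DEBUG=DETAIL this raises "Detected mismatch between
# collectives on ranks"; without it, the job simply hangs.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # nccl in the real multi-GPU run
    rank = dist.get_rank()
    x = torch.zeros(514, dtype=torch.long)
    if rank == 0:
        dist.barrier()            # "training completed" path
    else:
        dist.broadcast(x, src=0)  # DDP parameter sync inside forward()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=2 repro.py`, this reproduces the same kind of collective mismatch shown in the traces above.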
Expected behavior
As far as I understand, all processes should meet at `dist.barrier()` and training should finish successfully. Could you help me or point me to a fix that I can work on?
Comments
Hello, could you please elaborate on the solution?
I think I have found the issue: my custom model produces outputs with variable lengths, and I wasn't gathering them with the `distributed_concat` function because they are not torch tensors. Without gathering, each process computes metrics on different outputs, so the metrics differ across processes. In addition, I am using `EarlyStoppingCallback` during training. Since the metrics differ per process, one process can stop training and enter `dist.barrier()` while the others keep training, which results in the hang. I haven't implemented the fix yet; once I have, I will confirm here and close the issue. Thanks for your time anyway.
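For anyone hitting the same problem, here is a sketch of the kind of fix described above. The helper name is made up, and it assumes the per-step outputs are plain Python objects rather than tensors, so `distributed_concat` cannot be used directly:

```python
# Hypothetical helper, not the final fix from this issue: gather
# variable-length, non-tensor outputs from all ranks so every process
# computes metrics on the same data. With identical metrics,
# EarlyStoppingCallback makes the same stop/continue decision everywhere
# and all ranks reach dist.barrier() together.
import torch.distributed as dist

def gather_object_outputs(local_outputs):
    """local_outputs: list of per-example predictions (arbitrary Python objects)."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_outputs
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_outputs)
    # Flatten the per-rank lists into one list seen identically by all ranks.
    return [item for per_rank in gathered for item in per_rank]
```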