Training hangs at the end while calling dist.barrier()

See original GitHub issue

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.0-1073-azure-x86_64-with-glibc2.27
- Python version: 3.8.0
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.10.1+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: DDP

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I am working on a custom TokenClassificationTask. For a specific model type, the process hangs at the end of training. After setting TORCH_DISTRIBUTED_DEBUG=DETAIL and adding rank numbers to the logs (to do this, I overrode the train method of the Trainer class with additional logging), the training failed and I received the stack trace below.

Training completed for rank 6. Do not forget to share your model on huggingface.co/models =)

2022-05-30 15:54:58 INFO     nlp_ner_layoutlm.layoutlm.trainers.re_trainer Before barrier for rank 6
2022-05-30 15:54:58 INFO     nlp_ner_layoutlm.layoutlm.trainers.re_trainer Entering into barrier for rank 6
2022-05-30 15:54:59 INFO     transformers.modeling_utils Model weights saved in ./data/tmpm8wxl12l/checkpoint-590/pytorch_model.bin
2022-05-30 15:55:01 INFO     transformers.trainer Deleting older checkpoint [data/tmpm8wxl12l/checkpoint-585] due to args.save_total_limit
2022-05-30 15:55:01 ERROR    __main__   Detected mismatch between collectives on ranks. Rank 6 is running inconsistent collective: CollectiveFingerPrint(OpType=BARRIER
Traceback (most recent call last):
  File "nlp_ner_layoutlm/train_pipeline/training_step/training_script.py", line 53, in <module>
    train_model(
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 160, in train_model
    raise e
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 158, in train_model
    trainer.train(resume_from_checkpoint=get_last_checkpoint(checkpoint_dir))
  File "/app/nlp_ner_layoutlm/layoutlm/trainers/re_trainer.py", line 698, in train
    dist.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 6 is running inconsistent collective: CollectiveFingerPrint(OpType=BARRIER
2022-05-30 15:55:01 ERROR    __main__   Detected mismatch between collectives on ranks. Rank 3 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Traceback (most recent call last):
  File "nlp_ner_layoutlm/train_pipeline/training_step/training_script.py", line 53, in <module>
    train_model(
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 160, in train_model
    raise e
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 158, in train_model
    trainer.train(resume_from_checkpoint=get_last_checkpoint(checkpoint_dir))
  File "/app/nlp_ner_layoutlm/layoutlm/trainers/re_trainer.py", line 603, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2011, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2043, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
2022-05-30 15:55:01 ERROR    __main__   Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Traceback (most recent call last):
... (Same error for other processes)

According to the trace, while the process with rank 6 is running dist.barrier() (from trainer.py line 1536), the other processes are running a forward call. I think this mismatch is the issue, and because of this miscommunication the training hangs. When I searched for similar issues on the web, I came across this issue from speechbrain. It is exactly the same problem, and they fixed it with a PR. At the moment, I can't understand why the processes end up in different places in the code and can't figure out how to fix this.
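
The divergence is easy to reproduce outside of Trainer. Below is a minimal standalone sketch (my own illustrative example, not taken from the training code): it assumes the gloo backend, two local processes, and that setting TORCH_DISTRIBUTED_DEBUG before init_process_group takes effect (normally the variable is exported before launching). One rank calls dist.barrier() while the other issues a broadcast, which is exactly the divergence shown in the trace; with DETAIL enabled it fails with the same "Detected mismatch between collectives on ranks" error instead of hanging silently.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # DETAIL makes the process group verify that all ranks issue the same
    # collective; a mismatch raises instead of hanging. Normally exported
    # before launching; set here to keep the example self-contained.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # This rank has "finished training" and waits for the others.
        dist.barrier()
    else:
        # The other rank is still inside a training step and broadcasts
        # parameters, so the collectives no longer line up across ranks.
        t = torch.zeros(514, dtype=torch.long)
        dist.broadcast(t, src=1)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```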

Expected behavior

As far as I understand, all processes should meet at `dist.barrier()` and training should finish successfully. Could you help me or point me to a fix that I can work on?

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
Shreyas-bilagi commented, Sep 19, 2022

> I have just tested my fix and concluded that it is related to what I mentioned above. Thanks for your time @sgugger. I am closing the issue.

Hello, could you please elaborate on the solution?

1 reaction
hasansalimkanmaz commented, Jun 1, 2022

I think I have found the issue: my custom model produces outputs with variable lengths, and I wasn't gathering them with the distributed_concat function because they are not torch tensors. Without gathering, each process ends up with different outputs and therefore different metrics. In addition, I am using EarlyStoppingCallback during training. Because the metrics differ across processes, one process can stop training and enter dist.barrier() while the others keep training, which causes the hang.

I haven't implemented the fix yet. Once it is done, I will confirm here and close the issue. Thanks for your time anyway.
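
The fix direction described in the comment above might look roughly like the sketch below. This is an illustrative example, not the actual change made in the project: it assumes the per-rank outputs are picklable Python objects and uses torch.distributed.all_gather_object (which, unlike Trainer's distributed_concat, does not require equal-shaped tensors) so that every rank computes its metrics from the same combined outputs.

```python
import torch.distributed as dist


def gather_across_ranks(local_outputs):
    """Collect arbitrary picklable per-rank outputs on every rank.

    distributed_concat in transformers only handles tensors, so
    variable-length / non-tensor outputs (lists of spans, dicts, ...) can be
    exchanged with all_gather_object instead. Every rank then sees the same
    combined list and computes identical metrics, which keeps
    EarlyStoppingCallback decisions in sync across processes.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return local_outputs
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_outputs)
    # Flatten the per-rank lists into one list, in the same order on every rank.
    return [item for per_rank in gathered for item in per_rank]
```

In a custom evaluation loop this would be applied to the raw predictions before compute_metrics, so the metric that EarlyStoppingCallback monitors is identical on every rank and all processes reach `dist.barrier()` together.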
