Training hangs at the end while calling dist.barrier()

See original GitHub issue

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.0-1073-azure-x86_64-with-glibc2.27
- Python version: 3.8.0
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.10.1+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: DDP

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I am working on a custom TokenClassificationTask. For a specific model type, the process hangs at the end of training. After setting TORCH_DISTRIBUTED_DEBUG=DETAIL and adding rank numbers to the logs (to do this, I overrode the train method of the Trainer class with additional logging), the training failed and I received the stack trace below.

Training completed for rank 6. Do not forget to share your model on huggingface.co/models =)

2022-05-30 15:54:58 INFO     nlp_ner_layoutlm.layoutlm.trainers.re_trainer Before barrier for rank 6
2022-05-30 15:54:58 INFO     nlp_ner_layoutlm.layoutlm.trainers.re_trainer Entering into barrier for rank 6
2022-05-30 15:54:59 INFO     transformers.modeling_utils Model weights saved in ./data/tmpm8wxl12l/checkpoint-590/pytorch_model.bin
2022-05-30 15:55:01 INFO     transformers.trainer Deleting older checkpoint [data/tmpm8wxl12l/checkpoint-585] due to args.save_total_limit
2022-05-30 15:55:01 ERROR    __main__   Detected mismatch between collectives on ranks. Rank 6 is running inconsistent collective: CollectiveFingerPrint(OpType=BARRIER
Traceback (most recent call last):
  File "nlp_ner_layoutlm/train_pipeline/training_step/training_script.py", line 53, in <module>
    train_model(
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 160, in train_model
    raise e
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 158, in train_model
    trainer.train(resume_from_checkpoint=get_last_checkpoint(checkpoint_dir))
  File "/app/nlp_ner_layoutlm/layoutlm/trainers/re_trainer.py", line 698, in train
    dist.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Detected mismatch between collectives on ranks. Rank 6 is running inconsistent collective: CollectiveFingerPrint(OpType=BARRIER
2022-05-30 15:55:01 ERROR    __main__   Detected mismatch between collectives on ranks. Rank 3 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Traceback (most recent call last):
  File "nlp_ner_layoutlm/train_pipeline/training_step/training_script.py", line 53, in <module>
    train_model(
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 160, in train_model
    raise e
  File "/app/nlp_ner_layoutlm/layoutlm/utils/training_utils.py", line 158, in train_model
    trainer.train(resume_from_checkpoint=get_last_checkpoint(checkpoint_dir))
  File "/app/nlp_ner_layoutlm/layoutlm/trainers/re_trainer.py", line 603, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2011, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2043, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Detected mismatch between collectives on ranks. Rank 3 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
2022-05-30 15:55:01 ERROR    __main__   Detected mismatch between collectives on ranks. Rank 1 is running inconsistent collective: CollectiveFingerPrint(OpType=BROADCAST, TensorShape=[514], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))
Traceback (most recent call last):
... (Same error for other processes)

According to the trace, while the process with rank 6 is running dist.barrier() (from trainer.py line 1536), the other processes are running a forward call. I think this mismatch is the issue, and because of this miscommunication the training hangs. When I searched for similar issues on the web, I came across this issue from speechbrain. It is exactly the same problem, and they fixed it with a PR. At the moment, I can't understand why the processes end up in different places in the code and can't figure out how to fix this.
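
The divergence is easy to reproduce outside of Trainer. Below is a minimal standalone sketch (my own illustrative example, not taken from the training code): it assumes the gloo backend, two local processes, and that setting TORCH_DISTRIBUTED_DEBUG before init_process_group takes effect (normally the variable is exported before launching). One rank calls dist.barrier() while the other issues a broadcast, which is exactly the divergence shown in the trace; with DETAIL enabled it fails with the same "Detected mismatch between collectives on ranks" error instead of hanging silently.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # DETAIL makes the process group verify that all ranks issue the same
    # collective; a mismatch raises instead of hanging. Normally exported
    # before launching; set here to keep the example self-contained.
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # This rank has "finished training" and waits for the others.
        dist.barrier()
    else:
        # The other rank is still inside a training step and broadcasts
        # parameters, so the collectives no longer line up across ranks.
        t = torch.zeros(514, dtype=torch.long)
        dist.broadcast(t, src=1)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```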

Expected behavior

As far as I understand, all processes should meet at `dist.barrier()` and training should finish successfully. Could you help me or point me to a fix that I can work on?

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
Shreyas-bilagi commented, Sep 19, 2022

> I have just tested my fix and concluded that it is related to what I mentioned above. Thanks for your time @sgugger. I am closing the issue.

Hello, could you please elaborate on the solution?

1 reaction
hasansalimkanmaz commented, Jun 1, 2022

I think I have found the issue: my custom model produces outputs with variable lengths, and I wasn't gathering them with the distributed_concat function because they are not torch tensors. Without gathering, each process ends up with different outputs and therefore different metrics. In addition, I am using EarlyStoppingCallback during training. Because the metrics differ across processes, one process can stop training and enter dist.barrier() while the others keep training, which causes the hang.

I haven't implemented the fix yet. Once it is done, I will confirm here and close the issue. Thanks for your time anyway.
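
The fix direction described in the comment above might look roughly like the sketch below. This is an illustrative example, not the actual change made in the project: it assumes the per-rank outputs are picklable Python objects and uses torch.distributed.all_gather_object (which, unlike Trainer's distributed_concat, does not require equal-shaped tensors) so that every rank computes its metrics from the same combined outputs.

```python
import torch.distributed as dist


def gather_across_ranks(local_outputs):
    """Collect arbitrary picklable per-rank outputs on every rank.

    distributed_concat in transformers only handles tensors, so
    variable-length / non-tensor outputs (lists of spans, dicts, ...) can be
    exchanged with all_gather_object instead. Every rank then sees the same
    combined list and computes identical metrics, which keeps
    EarlyStoppingCallback decisions in sync across processes.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return local_outputs
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_outputs)
    # Flatten the per-rank lists into one list, in the same order on every rank.
    return [item for per_rank in gathered for item in per_rank]
```

In a custom evaluation loop this would be applied to the raw predictions before compute_metrics, so the metric that EarlyStoppingCallback monitors is identical on every rank and all processes reach `dist.barrier()` together.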
