Training hangs at the very start while using deepspeed

See original GitHub issue

Environment info

  • transformers version: 4.4.0
  • base docker image: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): 2.2.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes, using deepspeed

Who can help

@stas00 for deepspeed

Information

Model I am using: LayoutLM

I need to test my LayoutLM model by training it for only 1 epoch, for testing purposes. However, training hangs at the very start without logging anything or returning an error message. When I disable deepspeed and launch my training with python -m torch.distributed.launch instead of deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1, I manage to train for 1 epoch.
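For context, here is a minimal sketch of how I build the two launch commands; the script name and the --deepspeed config path are placeholders, not the exact ones from my setup:

import torch

script = "run_layoutlm_ner.py"  # placeholder for my training script
n_gpus = torch.cuda.device_count()

# works: plain PyTorch distributed launch, DeepSpeed disabled
torch_launch = f"python -m torch.distributed.launch --nproc_per_node={n_gpus} {script}"

# hangs at the very start: DeepSpeed launcher
ds_launch = f"deepspeed --num_gpus={n_gpus} --num_nodes=1 {script} --deepspeed ds_config.json"

print(torch_launch)
print(ds_launch)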

The tasks I am working on is:

  • Token Classification

To reproduce

I think this is a general issue: training any model with deepspeed for only one epoch may result in a hanging process.
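A minimal sketch of the kind of run that hangs for me; the dummy dataset, the model checkpoint, and the DeepSpeed config path are illustrative stand-ins, not my exact setup:

import torch
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

class DummyTokenDataset(torch.utils.data.Dataset):
    # tiny stand-in dataset so the sketch is self-contained
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return {
            "input_ids": torch.ones(16, dtype=torch.long),
            "bbox": torch.zeros(16, 4, dtype=torch.long),
            "attention_mask": torch.ones(16, dtype=torch.long),
            "labels": torch.zeros(16, dtype=torch.long),
        }

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,              # a single epoch, for testing
    per_device_train_batch_size=2,
    deepspeed="ds_config.json",      # illustrative DeepSpeed config path
)
model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased")
Trainer(model=model, args=args, train_dataset=DummyTokenDataset()).train()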

Expected behavior

It should be possible to train a model for only 1 epoch, so that no time is wasted while testing.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 18 (14 by maintainers)

Top GitHub Comments

1 reaction
hasansalimkanmaz commented on Aug 13, 2021

Thanks @stas00 for your kind help. Currently, I don’t have time to dive into this issue; since I manage to run in a distributed setting without deepspeed, it is not so urgent for now. That said, I will be working on this issue in the coming weeks.

1 reaction
stas00 commented on Aug 6, 2021

So you have a syncing problem: the 2 GPUs run a barrier, which ensures they have arrived at the same point, but one of the GPUs doesn’t reach it, and so the other is stuck waiting for it.
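For illustration, a minimal sketch of what such a hang looks like with torch.distributed (the early-return condition is made up here purely to show the mechanism):

import os
import torch
import torch.distributed as dist

def main():
    # the launcher (torch.distributed.launch or deepspeed) sets RANK, WORLD_SIZE, etc.
    rank = int(os.environ["RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(rank)

    if rank == 1:
        return  # this rank bails out early and never reaches the barrier

    # the surviving rank blocks here indefinitely, which looks like a silent hang
    dist.barrier()

if __name__ == "__main__":
    main()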

Are you by chance misconfiguring the launch command? Try to hardcode 2 here:

deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1

could {torch.cuda.device_count()} be returning a different number than 2?

i.e.:

deepspeed --num_gpus=2 --num_nodes=1
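A quick way to check (just a sketch) is to print what the f-string expands to before invoking the launcher:

import torch

n = torch.cuda.device_count()
print(f"deepspeed --num_gpus={n} --num_nodes=1 ...")
# if the printed --num_gpus is not 2 (e.g. CUDA_VISIBLE_DEVICES hides a GPU),
# the launch is not the intended 2-GPU setup, which is worth ruling out
# before digging into the hang itself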