Training hangs at the very start while using deepspeed
Environment info
- transformers version: 4.4.0
- Base docker image: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
- Python version: 3.8.8
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): 2.2.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, using deepspeed
Who can help
@stas00 for deepspeed
Information
Model I am using: LayoutLM
For testing purposes, I need to train my LayoutLM model for only 1 epoch. However, training hangs at the very start without logging anything or returning an error message. When I disable deepspeed and launch my training with `python -m torch.distributed.launch` instead of `deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1`, I manage to train for 1 epoch.
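For reference, the two launch variants described above look roughly like the following. The script name, config file, and trailing arguments are placeholders, not taken from the original report; the `NUM_GPUS` line stands in for the `{torch.cuda.device_count()}` interpolation mentioned above:

```bash
# Hangs at the very start: deepspeed launcher, GPU count derived at runtime
NUM_GPUS=$(python -c "import torch; print(torch.cuda.device_count())")
deepspeed --num_gpus=${NUM_GPUS} --num_nodes=1 \
    run_layoutlm_ner.py --deepspeed ds_config.json --num_train_epochs 1

# Works: deepspeed disabled, same script launched with torch.distributed.launch
python -m torch.distributed.launch --nproc_per_node=2 \
    run_layoutlm_ner.py --num_train_epochs 1
```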
The task I am working on is:
- Token Classification
To reproduce
I think this is a general issue: training any model with deepspeed for only one epoch may result in a hanging process.
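As a sketch of what a minimal check of that claim might look like, here is a 1-epoch run of the stock token-classification example under the deepspeed launcher. The example script, dataset, and `ds_config.json` are assumptions for illustration, not part of the original report:

```bash
# Hypothetical minimal reproduction: 1-epoch token-classification run on 2 GPUs
# via the deepspeed launcher; per the report above, it hangs before any logs appear.
deepspeed --num_gpus=2 --num_nodes=1 run_ner.py \
    --model_name_or_path bert-base-cased \
    --dataset_name conll2003 \
    --output_dir /tmp/test-ner \
    --do_train \
    --num_train_epochs 1 \
    --deepspeed ds_config.json
```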
Expected behavior
It should be possible to train a model for only 1 epoch, so that no time is wasted while testing.
Thanks @stas00 for your kind help. I don't currently have time to dive into this issue; since I manage to run in a distributed setting without deepspeed, it is not urgent for now. That said, I will be working on it in the coming weeks.
So you have a syncing problem: the 2 gpus run `barrier`, which ensures they have arrived at the same point, but one of the gpus doesn't, and so the other is stuck waiting for it.

Are you by chance misconfiguring the launch command? Try to hardcode `2` here: could `{torch.cuda.device_count()}` be returning a different number than 2? i.e.:
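The snippet that followed is not preserved in this copy; presumably it was along these lines, with the GPU count hardcoded rather than computed (the script name and arguments are placeholders):

```bash
# hardcode the GPU count instead of deriving it from torch.cuda.device_count()
deepspeed --num_gpus=2 --num_nodes=1 your_training_script.py --deepspeed ds_config.json
```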