Training hangs at the very start while using deepspeed
Environment info
- transformers version: 4.4.0
- Base docker image: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
- Python version: 3.8.8
- PyTorch version (GPU?): 1.7.1 (True)
- Tensorflow version (GPU?): 2.2.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, using deepspeed
Who can help
@stas00 for deepspeed
Information
Model I am using: LayoutLM
For testing purposes, I need to train my LayoutLM model for only 1 epoch. However, training hangs at the very start without logging anything or returning an error message. When I disable deepspeed and launch my training with `python -m torch.distributed.launch` instead of `deepspeed --num_gpus={torch.cuda.device_count()} --num_nodes=1`, I manage to train for 1 epoch.
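For reference, the two launch variants described above look roughly like the following. The script name, config file, and trailing arguments are placeholders, not taken from the original report; the `NUM_GPUS` line stands in for the `{torch.cuda.device_count()}` interpolation mentioned above:

```bash
# Hangs at the very start: deepspeed launcher, GPU count derived at runtime
NUM_GPUS=$(python -c "import torch; print(torch.cuda.device_count())")
deepspeed --num_gpus=${NUM_GPUS} --num_nodes=1 \
    run_layoutlm_ner.py --deepspeed ds_config.json --num_train_epochs 1

# Works: deepspeed disabled, same script launched with torch.distributed.launch
python -m torch.distributed.launch --nproc_per_node=2 \
    run_layoutlm_ner.py --num_train_epochs 1
```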
The task I am working on is:
- Token Classification
To reproduce
I think this is a general issue: training any model with deepspeed for only one epoch may result in a hanging process.
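As a sketch of what a minimal check of that claim might look like, here is a 1-epoch run of the stock token-classification example under the deepspeed launcher. The example script, dataset, and `ds_config.json` are assumptions for illustration, not part of the original report:

```bash
# Hypothetical minimal reproduction: 1-epoch token-classification run on 2 GPUs
# via the deepspeed launcher; per the report above, it hangs before any logs appear.
deepspeed --num_gpus=2 --num_nodes=1 run_ner.py \
    --model_name_or_path bert-base-cased \
    --dataset_name conll2003 \
    --output_dir /tmp/test-ner \
    --do_train \
    --num_train_epochs 1 \
    --deepspeed ds_config.json
```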
Expected behavior
It should be possible to train a model for only 1 epoch, so that no time is wasted while testing.
Thanks @stas00 for your kind help. I don't currently have time to dive into this issue; since I manage to run in a distributed setting without deepspeed, it is not urgent for now. That said, I will be working on it in the coming weeks.
So you have a syncing problem: the 2 gpus run `barrier`, which ensures they have arrived at the same point, but one of the gpus doesn't, and so the other is stuck waiting for it.

Are you by chance misconfiguring the launch command? Try to hardcode `2` here: could `{torch.cuda.device_count()}` be returning a different number than 2? i.e.:
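The snippet that followed is not preserved in this copy; presumably it was along these lines, with the GPU count hardcoded rather than computed (the script name and arguments are placeholders):

```bash
# hardcode the GPU count instead of deriving it from torch.cuda.device_count()
deepspeed --num_gpus=2 --num_nodes=1 your_training_script.py --deepspeed ds_config.json
```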