accelerator.prepare(model) hangs during multi-node training on A100 machines.
System Info
2 servers, 2 A100 GPUs on each server
accelerate = 0.13.2
torch = 1.10.0+cu111
transformers = 4.23.1
lsb_release -a
LSB Version: core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
nvcc -V:
release 11.0, V11.0.221
nvidia-smi:
NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4
accelerate env
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: xx.xx.xx.xx
main_process_port: 9988
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
Reference code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py. For convenience, I cut out the unnecessary code fragments:
#!/usr/bin/env python
# coding=utf-8
import argparse
import json
import logging
import math
import os
import random
from itertools import chain
from pathlib import Path

import datasets
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm

import transformers
from accelerate import Accelerator, DistributedType
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from huggingface_hub import Repository
from transformers import (
    CONFIG_MAPPING,
    MODEL_MAPPING,
    AutoConfig,
    AutoModelForMaskedLM,
    SchedulerType,
)
from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry
from transformers.utils.versions import require_version

# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.22.1")

logger = get_logger(__name__)
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)


def parse_args():
    parser = argparse.ArgumentParser(description="Finetune a transformers model on a Masked Language Modeling task")
    parser.add_argument(
        "--config_name",
        type=str,
        default=None,
        help="Pretrained config name or path if not the same as model_name",
    )
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
        required=False,
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    accelerator_log_kwargs = {}
    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs)
    if accelerator.is_local_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
    accelerator.wait_for_everyone()
    if args.config_name:
        config = AutoConfig.from_pretrained(args.config_name)
    elif args.model_name_or_path:
        config = AutoConfig.from_pretrained(args.model_name_or_path)
    else:
        config = CONFIG_MAPPING[args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")
    if args.model_name_or_path:
        model = AutoModelForMaskedLM.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
        )
    else:
        logger.info("Training new model from scratch")
        model = AutoModelForMaskedLM.from_config(config)
    print('accelerator.prepare(model) start')
    model = accelerator.prepare(model)
    print('accelerator.prepare(model) end')


if __name__ == "__main__":
    main()
# run command: NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/
# bert_pretrain/config.json is {"architectures": ["BertForMaskedLM"], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 12, "num_hidden_layers": 2, "pad_token_id": 0, "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.23.1", "type_vocab_size": 2, "use_cache": true, "vocab_size": 44000}
# Note: this code only initializes a BERT model and synchronizes, so the config file could theoretically be any valid configuration
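Not part of the original script, but a minimal all-reduce check along the following lines (a sketch; it assumes the same accelerate launch invocation, so that the launcher exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK) can help tell a rendezvous/NCCL problem apart from anything in the model code:

#!/usr/bin/env python
# coding=utf-8
# Minimal rendezvous/all-reduce check (a sketch): launch it on both nodes with the
# same accelerate launch / env settings as the full script above.
import datetime
import os

import torch
import torch.distributed as dist


def main():
    # env:// rendezvous; MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE come from the launcher.
    # Note: with the NCCL backend this timeout is only enforced when
    # NCCL_ASYNC_ERROR_HANDLING=1 (or NCCL_BLOCKING_WAIT=1) is set in the environment.
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=120))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # expected value on every rank: world_size
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce ok, value={x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If this small script also hangs on one of the nodes, the problem is in process-group initialization rather than in accelerator.prepare(model) itself.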
Expected behavior
Assume there are two machines: the primary node A and the secondary node B.
A:
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/
B:
NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/
Then:
On A:
The above program will print out:
accelerator.prepare(model) start
accelerator.prepare(model) end
But on B:
The above program will print out:
accelerator.prepare(model) start
After that, the program hangs indefinitely.
I tried running without NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1, but the result is the same.
Next, when I kill the training process on server A, it outputs the following message:
"WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5732 closing signal SIGINT"
At the same time, server B outputs the following:
Traceback (most recent call last):
File "test_run_mlm_no_trainer.py", line 102, in <module>
main()
File "test_run_mlm_no_trainer.py", line 96, in main
model = accelerator.prepare(model)
File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 682, in prepare
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 682, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 556, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 721, in prepare_model
model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5510) of binary: /usr/bin/python3.7
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.0010552406311035156 seconds
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
barrier_timeout=self._exit_barrier_timeout,
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
agent_data = get_all(store, key_prefix, world_size)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
data = store.get(f"{prefix}{idx}")
RuntimeError: Broken pipe
The program hangs when it runs to dist._verify_model_across_ranks(self.process_group, parameters)
I can't find any information about this problem elsewhere, including the official documentation. Any help is greatly appreciated.
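One additional debugging step (not from the original report, just standard NCCL tooling): re-running with NCCL's debug logging enabled often shows which interface or peer the rendezvous is stuck on, for example:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/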
Maintainer reply:
Hello @biandh, it does support using one/few GPUs per node based on num_processes, i.e., it uses num_processes // num_nodes GPUs per node. I'll try to see if I face this issue.

Follow-up from @biandh:
In the above accelerate config file I set num_processes: 2. I thought it represented the number of GPUs per node, but what it really means is the total number of GPUs, so my setting was wrong!!! 😓😓 When I set num_processes: 4, the code runs successfully!
Maybe accelerate doesn't support multiple nodes with only one GPU each? I hope no one makes the same mistake as me 😂😂
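For reference, a corrected config for this 2-node, 2-GPU-per-node setup (a sketch based on the resolution above: only num_processes changes, machine_rank is 0 on node A and 1 on node B, and everything else stays as in the config at the top):
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0  # set to 1 on node B
main_process_ip: xx.xx.xx.xx
main_process_port: 9988
mixed_precision: fp16
num_machines: 2
num_processes: 4  # total number of processes (GPUs) across all machines, not per node
same_network: true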