accelerator.prepare(model) hangs during multi-node training on A100 machines.

See original GitHub issue

System Info

2 servers, with 2 A100 GPUs on each server

accelerate = 0.13.2
torch = 1.10.0+cu111
transformers = 4.23.1

lsb_release -a
LSB Version:    core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic

nvcc -V:
release 11.0, V11.0.221

nvidia-smi:
NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4

accelerate env

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: xx.xx.xx.xx
main_process_port: 9988
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
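
The config above is from the primary node (machine_rank: 0). The secondary node's config is presumably identical except for the rank field, e.g.:

machine_rank: 1  # node B; all other fields match node A's config above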

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Reference code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py (for convenience, I have cut out the unnecessary code fragments).


#!/usr/bin/env python
# coding=utf-8
import argparse

import datasets
import transformers
from accelerate import Accelerator
from accelerate.logging import get_logger
from transformers import (
    CONFIG_MAPPING,
    AutoConfig,
    AutoModelForMaskedLM,
)
from transformers.utils import check_min_version
from transformers.utils.versions import require_version


# Will raise an error if the minimal version of Transformers is not installed. Remove at your own risk.
check_min_version("4.22.1")

logger = get_logger(__name__)
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")


def parse_args():
    parser = argparse.ArgumentParser(description="Finetune a transformers model on a Masked Language Modeling task")
    parser.add_argument(
        "--config_name",
        type=str,
        default=None,
        help="Pretrained config name or path if not the same as model_name",
    )

    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
        required=False,
    )

    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    args = parser.parse_args()

    return args


def main():
    args = parse_args()
    accelerator_log_kwargs = {}

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs)
    if accelerator.is_local_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    accelerator.wait_for_everyone()

    if args.config_name:
        config = AutoConfig.from_pretrained(args.config_name)
    elif args.model_name_or_path:
        config = AutoConfig.from_pretrained(args.model_name_or_path)
    else:
        # The --model_type argument was removed in this cut-down script, so this
        # branch is only reachable if neither --config_name nor --model_name_or_path
        # is passed; pass --config_name as in the run command below.
        config = CONFIG_MAPPING[args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")

    if args.model_name_or_path:
        model = AutoModelForMaskedLM.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
        )
    else:
        logger.info("Training new model from scratch")
        model = AutoModelForMaskedLM.from_config(config)

    print('accelerator.prepare(model) start')
    model = accelerator.prepare(model)
    print('accelerator.prepare(model) end')


if __name__ == "__main__":
    main()


# run command:  NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/

# bert_pretrain/config.json is:
# {"architectures": ["BertForMaskedLM"], "attention_probs_dropout_prob": 0.1,
#  "classifier_dropout": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1,
#  "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072,
#  "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert",
#  "num_attention_heads": 12, "num_hidden_layers": 2, "pad_token_id": 0,
#  "position_embedding_type": "absolute", "torch_dtype": "float32",
#  "transformers_version": "4.23.1", "type_vocab_size": 2, "use_cache": true,
#  "vocab_size": 44000}

# Note: this script only needs to initialize a BERT model and synchronize it
# across ranks, so any valid BERT configuration should reproduce the issue.
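
A minimal sanity check (a sketch that uses only the public Accelerator attributes num_processes, process_index and local_process_index) can be launched the same way on both nodes to confirm what each rank thinks the world looks like; num_processes should report the total number of processes across all nodes, not the number per node:

# check_world.py -- illustration only; launch with the same accelerate command
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process_index={accelerator.process_index} "
    f"local_process_index={accelerator.local_process_index} "
    f"num_processes={accelerator.num_processes} "
    f"device={accelerator.device}"
)
accelerator.wait_for_everyone()  # blocks until every process reaches this point
print("all processes reached the barrier")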

Expected behavior

Assume there are 2 machines: the primary node A and the secondary node B.

A: 
  NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/

B: 
  NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/

Then:

    On A, the program prints:
        accelerator.prepare(model) start
        accelerator.prepare(model) end

    But on B, the program only prints:
        accelerator.prepare(model) start

    and then hangs indefinitely.

I also tried running without NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1, but the result is the same.
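
If NCCL itself is suspected, the same command can be re-run with more verbose logging to see where the rendezvous stalls (a sketch using the standard NCCL and PyTorch debug environment variables):

  NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/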


Next, when I kill the training process on server A, it prints the following warning:

  "WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5732 closing signal SIGINT"

At the same time, server B outputs the following traceback:

  Traceback (most recent call last):
  File "test_run_mlm_no_trainer.py", line 102, in <module>
    main()
  File "test_run_mlm_no_trainer.py", line 96, in main
    model = accelerator.prepare(model)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 682, in prepare
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 682, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 556, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 721, in prepare_model
    model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5510) of binary: /usr/bin/python3.7
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.0010552406311035156 seconds
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Broken pipe

The program hangs when it reaches dist._verify_model_across_ranks(self.process_group, parameters) inside the DistributedDataParallel constructor.

I can't find any information about this problem anywhere else, including the official documentation. Any help is greatly appreciated.
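
To isolate whether the hang comes from the process-group setup rather than from accelerator.prepare() itself, a bare torch.distributed check can be launched the same way on both nodes. This is only a sketch; it assumes the launcher exports the usual RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT variables (which accelerate launch and torchrun do):

# ddp_check.py -- illustration only
import os

import torch
import torch.distributed as dist


def main():
    # env:// initialization reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    # from the environment set up by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # blocks until every rank in the world participates
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If this also hangs with the reported config, the problem is in the distributed setup (e.g. world size or networking) rather than in accelerator.prepare() itself.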

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
pacman100 commented, Nov 2, 2022

Hello @biandh, it does support using one or a few GPUs per node based on num_processes, i.e., it uses num_processes//num_nodes GPUs per node. I'll try to see if I can reproduce this issue.

0 reactions
biandh commented, Nov 2, 2022

In the above accelerate config file I set num_processes: 2. I thought it represented the number of GPUs per node, but it actually means the total number of GPUs across all nodes, so my value was wrong! 😓😓 When I set num_processes: 4, the code runs successfully!
Maybe accelerate doesn't support multiple nodes with only one GPU each? I hope no one makes the same mistake as me. 😂😂
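
Based on that resolution, the corrected config would presumably look like this on the primary node (with machine_rank: 1 on node B and everything else unchanged):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: xx.xx.xx.xx
main_process_port: 9988
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4  # total number of processes/GPUs across all nodes, not per node
rdzv_backend: static
same_network: true
use_cpu: false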
