accelerator.prepare(model) hangs during multi-node training on A100 machines.

See original GitHub issue

System Info

2 servers, with 2 A100 GPUs on each server

accelerate = 0.13.2
torch = 1.10.0+cu111
transformers = 4.23.1

lsb_release -a
LSB Version:    core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic

nvcc -V:
release 11.0, V11.0.221

nvidia-smi:
NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4

accelerate env

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: xx.xx.xx.xx
main_process_port: 9988
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
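
The config above is from the primary node (machine_rank: 0). The secondary node's config is presumably identical except for the rank field, e.g.:

machine_rank: 1  # node B; all other fields match node A's config above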

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Reference code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py (for convenience, I have cut out the unnecessary code fragments).


#!/usr/bin/env python
# coding=utf-8
import argparse

import datasets
import transformers
from accelerate import Accelerator
from accelerate.logging import get_logger
from transformers import (
    CONFIG_MAPPING,
    AutoConfig,
    AutoModelForMaskedLM,
)
from transformers.utils import check_min_version
from transformers.utils.versions import require_version


# Will raise an error if the minimal version of Transformers is not installed. Remove at your own risk.
check_min_version("4.22.1")

logger = get_logger(__name__)
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")


def parse_args():
    parser = argparse.ArgumentParser(description="Finetune a transformers model on a Masked Language Modeling task")
    parser.add_argument(
        "--config_name",
        type=str,
        default=None,
        help="Pretrained config name or path if not the same as model_name",
    )

    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
        required=False,
    )

    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    args = parser.parse_args()

    return args


def main():
    args = parse_args()
    accelerator_log_kwargs = {}

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs)
    if accelerator.is_local_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    accelerator.wait_for_everyone()

    if args.config_name:
        config = AutoConfig.from_pretrained(args.config_name)
    elif args.model_name_or_path:
        config = AutoConfig.from_pretrained(args.model_name_or_path)
    else:
        # The --model_type argument was removed in this cut-down script, so this
        # branch is only reachable if neither --config_name nor --model_name_or_path
        # is passed; pass --config_name as in the run command below.
        config = CONFIG_MAPPING[args.model_type]()
        logger.warning("You are instantiating a new config instance from scratch.")

    if args.model_name_or_path:
        model = AutoModelForMaskedLM.from_pretrained(
            args.model_name_or_path,
            from_tf=bool(".ckpt" in args.model_name_or_path),
            config=config,
        )
    else:
        logger.info("Training new model from scratch")
        model = AutoModelForMaskedLM.from_config(config)

    print('accelerator.prepare(model) start')
    model = accelerator.prepare(model)
    print('accelerator.prepare(model) end')


if __name__ == "__main__":
    main()


# run command:  NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/

# bert_pretrain/config.json is:
# {"architectures": ["BertForMaskedLM"], "attention_probs_dropout_prob": 0.1,
#  "classifier_dropout": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1,
#  "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072,
#  "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert",
#  "num_attention_heads": 12, "num_hidden_layers": 2, "pad_token_id": 0,
#  "position_embedding_type": "absolute", "torch_dtype": "float32",
#  "transformers_version": "4.23.1", "type_vocab_size": 2, "use_cache": true,
#  "vocab_size": 44000}

# Note: this script only needs to initialize a BERT model and synchronize it
# across ranks, so any valid BERT configuration should reproduce the issue.
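
A minimal sanity check (a sketch that uses only the public Accelerator attributes num_processes, process_index and local_process_index) can be launched the same way on both nodes to confirm what each rank thinks the world looks like; num_processes should report the total number of processes across all nodes, not the number per node:

# check_world.py -- illustration only; launch with the same accelerate command
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process_index={accelerator.process_index} "
    f"local_process_index={accelerator.local_process_index} "
    f"num_processes={accelerator.num_processes} "
    f"device={accelerator.device}"
)
accelerator.wait_for_everyone()  # blocks until every process reaches this point
print("all processes reached the barrier")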

Expected behavior

Assume there are 2 machines: the primary node A and the secondary node B.

A: 
  NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/

B: 
  NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/

Then:

    On A, the program prints:
        accelerator.prepare(model) start
        accelerator.prepare(model) end

    But on B, the program only prints:
        accelerator.prepare(model) start

    and then hangs indefinitely.

I also tried running without NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1, but the result is the same.
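
If NCCL itself is suspected, the same command can be re-run with more verbose logging to see where the rendezvous stalls (a sketch using the standard NCCL and PyTorch debug environment variables):

  NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch test_run_mlm_no_trainer.py --config_name bert_pretrain/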


Next, when I kill the training process on server A, it prints the following warning:

  "WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5732 closing signal SIGINT"

At the same time, server B outputs the following traceback:

  Traceback (most recent call last):
  File "test_run_mlm_no_trainer.py", line 102, in <module>
    main()
  File "test_run_mlm_no_trainer.py", line 96, in main
    model = accelerator.prepare(model)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 682, in prepare
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 682, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 556, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py", line 721, in prepare_model
    model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: Connection reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5510) of binary: /usr/bin/python3.7
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.0010552406311035156 seconds
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Broken pipe

The program hangs when it reaches dist._verify_model_across_ranks(self.process_group, parameters) inside the DistributedDataParallel constructor.

I can't find any information about this problem anywhere else, including the official documentation. Any help is greatly appreciated.
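
To isolate whether the hang comes from the process-group setup rather than from accelerator.prepare() itself, a bare torch.distributed check can be launched the same way on both nodes. This is only a sketch; it assumes the launcher exports the usual RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT variables (which accelerate launch and torchrun do):

# ddp_check.py -- illustration only
import os

import torch
import torch.distributed as dist


def main():
    # env:// initialization reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    # from the environment set up by the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)  # blocks until every rank in the world participates
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If this also hangs with the reported config, the problem is in the distributed setup (e.g. world size or networking) rather than in accelerator.prepare() itself.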

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
pacman100 commented, Nov 2, 2022

Hello @biandh, it does support using one or a few GPUs per node based on num_processes, i.e., it uses num_processes//num_nodes GPUs per node. I'll try to see if I can reproduce this issue.

0 reactions
biandh commented, Nov 2, 2022

In the above accelerate config file I set num_processes: 2. I thought it represented the number of GPUs per node, but it actually means the total number of GPUs across all nodes, so my value was wrong! 😓😓 When I set num_processes: 4, the code runs successfully!
Maybe accelerate doesn't support multiple nodes with only one GPU each? I hope no one makes the same mistake as me. 😂😂
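
Based on that resolution, the corrected config would presumably look like this on the primary node (with machine_rank: 1 on node B and everything else unchanged):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: xx.xx.xx.xx
main_process_port: 9988
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4  # total number of processes/GPUs across all nodes, not per node
rdzv_backend: static
same_network: true
use_cpu: false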
