Multi-Node deepspeed calling runner instead of launcher
System Info
- `Accelerate` version: 0.10.0
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: DEEPSPEED
  - mixed_precision: no
  - use_cpu: False
  - num_processes: 8
  - machine_rank: 0
  - num_machines: 1
  - main_process_ip: None
  - main_process_port: None
  - main_training_function: main
  - deepspeed_config: {'deepspeed_config_file': '/path/to/deepspeed_config.json', 'zero3_init_flag': False}
  - fsdp_config: {}
deepspeed==0.6.5
DeepSpeed config (`/path/to/deepspeed_config.json`):

```json
{
  "train_batch_size": 128,
  "gradient_accumulation_steps": 1,
  "gradient_clipping": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 0.00001
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  }
}
```
Running on a Slurm HPC cluster.
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Install:
```bash
pip install accelerate
pip install deepspeed
```
Create the accelerate config:
Note: the DeepSpeed config file can be empty, as the crash happens before it is opened. Similarly, `main_process_ip` and `main_process_port` can be set to anything, as they are not used before the crash.
```json
{
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {
    "deepspeed_config_file": "/path/to/deepspeed_config.json",
    "zero3_init_flag": false
  },
  "distributed_type": "DEEPSPEED",
  "fsdp_config": {},
  "machine_rank": 0,
  "main_process_ip": "0.0.0.0",
  "main_process_port": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 2,
  "num_processes": 8,
  "use_cpu": false
}
```
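For an actual two-machine run, each node would need its own copy of this config differing only in `machine_rank`; that is the standard multi-node Accelerate convention, not something stated in the issue, and the crash reproduces from the rank-0 config alone. A hedged sketch:

```bash
# Hedged sketch: derive the second node's config by bumping machine_rank to 1.
sed 's/"machine_rank": 0/"machine_rank": 1/' \
    /path/to/accelerate_config.json > /path/to/accelerate_config_rank1.json
```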
Launch accelerate
Again, since the crash happens before the training script is actually launched, the script itself can be empty.

```bash
accelerate launch --config_file /path/to/accelerate_config.json /path/to/empty_script.py
```
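For completeness, a hedged sketch of how this might be invoked on the Slurm cluster; the node count and `srun` flags are assumptions, and the crash reproduces even from a single invocation:

```bash
# Assumed Slurm invocation: run the same launch command once per node.
srun --nodes=2 --ntasks-per-node=1 \
    accelerate launch --config_file /path/to/accelerate_config.json /path/to/empty_script.py
```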
Expected behavior
When launching a multi-node DeepSpeed training script, this code https://github.com/huggingface/accelerate/blob/86ce737d7fc94f8000dbd5e13021d0411bb4204a/src/accelerate/commands/launch.py#L312-L327 invokes the `deepspeed` runner, but the arguments it supplies are meant for the DeepSpeed launcher.
As a result, the following error appears when accelerate tries to launch deepspeed:
```text
usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE]
                 [--num_nodes NUM_NODES] [--num_gpus NUM_GPUS]
                 [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS]
                 [--force_multi] [--autotuning {tune,run}]
                 user_script ...
deepspeed: error: unrecognized arguments: --no_local_rank
```
Since accelerate is performing the same function as the deepspeed runner, I would expect accelerate to call the launcher directly on each node; instead, it appears to be calling the runner on each node (see the sketch below).
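To make the distinction concrete: the `deepspeed` entry point is the multi-node *runner* (it SSHes into each node), while `deepspeed.launcher.launch` is the per-node *launcher* that accepts flags such as `--no_local_rank`. A hedged sketch of calling the launcher directly on each node; every value below is a placeholder, not taken from the issue:

```bash
# Hedged sketch: invoke DeepSpeed's per-node launcher module directly on each node.
# --world_info expects a base64-encoded JSON map of hostnames to GPU ids;
# MACHINE_RANK etc. would come from that node's accelerate config.
python -m deepspeed.launcher.launch \
    --world_info="<base64-encoded node/GPU map>" \
    --node_rank="$MACHINE_RANK" \
    --master_addr="$MAIN_PROCESS_IP" \
    --master_port="$MAIN_PROCESS_PORT" \
    --no_local_rank \
    /path/to/train_script.py
```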
From the maintainers:

> Ok, dug more into it and we need to completely rework `accelerate launch` for `deepspeed`. So, you should use the `deepspeed` launcher for now while we fix it!

> Ah, this is different indeed! I'll have a look.
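A hedged sketch of that interim workaround, assuming a two-node setup; the hostfile path, hostnames, and port below are placeholders, while `-H/--hostfile`, `--num_nodes`, `--master_addr`, and `--master_port` all appear in the usage message above:

```bash
# Hypothetical hostfile listing each node and its GPU slot count.
cat > /path/to/hostfile <<'EOF'
node1 slots=8
node2 slots=8
EOF

# Launch with the deepspeed runner directly instead of `accelerate launch`;
# it SSHes to each node in the hostfile and starts the per-node launcher there.
deepspeed --hostfile /path/to/hostfile \
    --num_nodes 2 \
    --master_addr node1 \
    --master_port 29500 \
    /path/to/train_script.py
```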