RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.


Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

The following error is raised after finishing epoch 240 (the last one): Default process group has not been initialized, please make sure to call init_process_group.

Additionally, the classifier does not seem to learn at all: accuracy after 240 epochs is the same as in the first few steps.

Reproduction

  1. What command or script did you run?

python train.py configs/skeleton/posec3d/my_config.py --work-dir work_dirs/my_workdir --validate --test-best --gpus 1 --seed 0 --deterministic
  2. Did you make any modifications to the code or config? Did you understand what you modified?

Only the configuration changes mentioned in: https://github.com/open-mmlab/mmaction2/blob/master/configs/skeleton/posec3d/custom_dataset_training.md
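
For reference, the changes in that guide amount to pointing the config at the custom annotation files and matching the number of classes. A rough sketch of what those edits look like (the paths and the 13-class count below are assumptions drawn from this issue, not the actual my_config.py):

    # Illustrative only: the kind of fields custom_dataset_training.md tells you
    # to edit in a PoseC3D config. Paths and num_classes are placeholders.
    model = dict(cls_head=dict(num_classes=13))            # match the dataset's class count
    ann_file_train = 'data/posec3d/my_dataset_train.pkl'   # custom skeleton annotations
    ann_file_val = 'data/posec3d/my_dataset_val.pkl'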

  3. What dataset did you use?

A private dataset of real people performing several pre-defined types of repetitive actions. This dataset contains approximately 5000 samples with 13 different classes.

Environment

'tail' is not recognized as an internal or external command,
operable program or batch file.
'gcc' is not recognized as an internal or external command,
operable program or batch file.
sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3080
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
NVCC: Not Available
GCC: n/a
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.3.8
MMCV Compiler: MSVC 192829912
MMCV CUDA Compiler: 11.3
MMAction2: 0.20.0+61d7eb8

Error traceback

If applicable, paste the error traceback here.

2021-12-23 22:57:06,396 - mmaction - INFO - 
top1_acc	0.2789
top5_acc	0.7224
2021-12-23 22:57:06,396 - mmaction - INFO - Evaluating mean_class_accuracy ...
2021-12-23 22:57:06,398 - mmaction - INFO - 
mean_acc	0.0769
2021-12-23 22:57:06,398 - mmaction - INFO - Epoch(val) [240][98]	top1_acc: 0.2789, top5_acc: 0.7224, mean_class_accuracy: 0.0769
2021-12-23 22:57:08,563 - mmaction - INFO - 972 videos remain after valid thresholding
2021-12-23 22:57:08,564 - mmaction - INFO - load checkpoint from E:\mmaction2\work_dirs\autism_center\best_top1_acc_epoch_20.pth
2021-12-23 22:57:08,564 - mmaction - INFO - Use load_from_local loader
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 972/972, 1.1 task/s, elapsed: 907s, ETA:     0s
Traceback (most recent call last):
  File "E:/mmaction2/tools/train.py", line 201, in <module>
    main()
  File "E:/mmaction2/tools/train.py", line 197, in main
    meta=meta)
  File "E:\mmaction2\mmaction\apis\train.py", line 254, in train_model
    gpu_collect)
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 86, in multi_gpu_test
    results = collect_results_cpu(results, len(dataset), tmpdir)
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 129, in collect_results_cpu
    dist.barrier()
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code 1

I couldn’t identify the cause of this error, nor the reason for the low accuracy. Other skeleton-based algorithms managed to learn on this dataset.
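
From the traceback, the call chain is train_model → multi_gpu_test → collect_results_cpu → dist.barrier(), even though training was launched without a distributed launcher, so the barrier fails because no default process group exists. If that reading is correct, one unverified workaround (not an official fix) would be to create a one-rank gloo group before the --test-best evaluation runs, for example:

    # Minimal sketch of a possible workaround, not a fix merged upstream.
    # The port and the placement (e.g. near the top of tools/train.py's main())
    # are assumptions for illustration only.
    import os
    import torch.distributed as dist

    def init_single_process_group(port: int = 29500) -> None:
        """Create a one-rank default process group if none exists yet."""
        if dist.is_available() and not dist.is_initialized():
            os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
            os.environ.setdefault('MASTER_PORT', str(port))
            # 'gloo' works on Windows; this PyTorch build has USE_NCCL=OFF.
            dist.init_process_group(backend='gloo', rank=0, world_size=1)

    init_single_process_group()

With a single-rank group, dist.barrier() in collect_results_cpu() returns immediately; launching through the distributed tooling, as suggested in the comments below, is the cleaner route.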

Thanks in advance!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 17 (1 by maintainers)

Top GitHub Comments

1 reaction
kennymckormick commented, Feb 21, 2022

The problems have been fixed now. BTW, we highly recommend that users use distributed training and testing (you can use it even if you have only 1 GPU). The command for distributed training looks like: bash tools/dist_train.sh {config} {num_gpus} {other_args …}

Hi, I still hit the "Default process group has not been initialized" error when running the mmaction2_tutorial.ipynb file with the latest code. The accuracy is normal, but training raises the error at the 10th epoch.

Has this problem been fixed by PR #1459?
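
One detail worth keeping in mind when following the dist_train.sh suggestion on this machine: the environment above was built with USE_NCCL=OFF, and open-mmlab configs usually default the distributed backend to nccl, so the backend would presumably have to be switched to gloo. A hypothetical config tweak (an assumption, not part of the maintainer's reply):

    # Assumed config override: use the gloo backend, since NCCL is unavailable
    # in this Windows PyTorch build (USE_NCCL=OFF in the build settings above).
    dist_params = dict(backend='gloo')  # most configs default to 'nccl'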

1 reaction
kennymckormick commented, Jan 21, 2022

Not sure, everything seems OK. BTW, have you set img_shape and original_shape to the real video shape (height, width) for each video?

Thank you very much for your work and answers. Where should I set img_shape and original_shape in the code? Because I hit the same error running the SlowFast model on my own dataset.

Sorry, that answer is not related to your problem: setting the video shape is only required for PoseC3D models.
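
For anyone who does need those fields: as far as I understand from custom_dataset_training.md, img_shape and original_shape are per-sample keys in the skeleton annotation pickle, not config options. A hypothetical annotation entry following that format (shapes and values below are placeholders, not real data):

    import numpy as np

    # Hypothetical PoseC3D annotation entry; all values are placeholders.
    num_person, num_frames, num_keypoints = 1, 120, 17
    anno = dict(
        frame_dir='sample_0001',
        label=0,
        img_shape=(1080, 1920),       # real video (height, width)
        original_shape=(1080, 1920),  # original video (height, width)
        total_frames=num_frames,
        keypoint=np.zeros((num_person, num_frames, num_keypoints, 2),
                          dtype=np.float32),
        keypoint_score=np.zeros((num_person, num_frames, num_keypoints),
                                dtype=np.float32),
    )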
