RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.


Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

The following error is raised after finishing epoch 240 (the last one): Default process group has not been initialized, please make sure to call init_process_group.

Additionally, the classifier does not seem to learn at all: accuracy after 240 epochs is the same as in the first few steps.

Reproduction

  1. What command or script did you run?

python train.py configs/skeleton/posec3d/my_config.py --work-dir work_dirs/my_workdir --validate --test-best --gpus 1 --seed 0 --deterministic
  2. Did you make any modifications to the code or config? Did you understand what you modified?

Only the configuration changes mentioned in: https://github.com/open-mmlab/mmaction2/blob/master/configs/skeleton/posec3d/custom_dataset_training.md
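
For reference, the changes in that guide amount to pointing the config at the custom annotation files and matching the number of classes. A rough sketch of what those edits look like (the paths and the 13-class count below are assumptions drawn from this issue, not the actual my_config.py):

    # Illustrative only: the kind of fields custom_dataset_training.md tells you
    # to edit in a PoseC3D config. Paths and num_classes are placeholders.
    model = dict(cls_head=dict(num_classes=13))            # match the dataset's class count
    ann_file_train = 'data/posec3d/my_dataset_train.pkl'   # custom skeleton annotations
    ann_file_val = 'data/posec3d/my_dataset_val.pkl'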

  3. What dataset did you use?

A private dataset of real people performing several pre-defined types of repetitive actions. This dataset contains approximately 5000 samples with 13 different classes.

Environment

'tail' is not recognized as an internal or external command,
operable program or batch file.
'gcc' is not recognized as an internal or external command,
operable program or batch file.
sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0,1: NVIDIA GeForce RTX 3080
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3
NVCC: Not Available
GCC: n/a
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.5.4
MMCV: 1.3.8
MMCV Compiler: MSVC 192829912
MMCV CUDA Compiler: 11.3
MMAction2: 0.20.0+61d7eb8

Error traceback

If applicable, paste the error traceback here.

2021-12-23 22:57:06,396 - mmaction - INFO - 
top1_acc	0.2789
top5_acc	0.7224
2021-12-23 22:57:06,396 - mmaction - INFO - Evaluating mean_class_accuracy ...
2021-12-23 22:57:06,398 - mmaction - INFO - 
mean_acc	0.0769
2021-12-23 22:57:06,398 - mmaction - INFO - Epoch(val) [240][98]	top1_acc: 0.2789, top5_acc: 0.7224, mean_class_accuracy: 0.0769
2021-12-23 22:57:08,563 - mmaction - INFO - 972 videos remain after valid thresholding
2021-12-23 22:57:08,564 - mmaction - INFO - load checkpoint from E:\mmaction2\work_dirs\autism_center\best_top1_acc_epoch_20.pth
2021-12-23 22:57:08,564 - mmaction - INFO - Use load_from_local loader
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 972/972, 1.1 task/s, elapsed: 907s, ETA:     0s
Traceback (most recent call last):
  File "E:/mmaction2/tools/train.py", line 201, in <module>
    main()
  File "E:/mmaction2/tools/train.py", line 197, in main
    meta=meta)
  File "E:\mmaction2\mmaction\apis\train.py", line 254, in train_model
    gpu_collect)
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 86, in multi_gpu_test
    results = collect_results_cpu(results, len(dataset), tmpdir)
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\mmcv\engine\test.py", line 129, in collect_results_cpu
    dist.barrier()
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "C:\Users\owner\anaconda3\envs\mmlab\lib\site-packages\torch\distributed\distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Process finished with exit code 1

I couldn’t identify the cause of this error, nor the reason for the low accuracy. Other skeleton-based algorithms managed to learn on this dataset.
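
From the traceback, the call chain is train_model → multi_gpu_test → collect_results_cpu → dist.barrier(), even though training was launched without a distributed launcher, so the barrier fails because no default process group exists. If that reading is correct, one unverified workaround (not an official fix) would be to create a one-rank gloo group before the --test-best evaluation runs, for example:

    # Minimal sketch of a possible workaround, not a fix merged upstream.
    # The port and the placement (e.g. near the top of tools/train.py's main())
    # are assumptions for illustration only.
    import os
    import torch.distributed as dist

    def init_single_process_group(port: int = 29500) -> None:
        """Create a one-rank default process group if none exists yet."""
        if dist.is_available() and not dist.is_initialized():
            os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
            os.environ.setdefault('MASTER_PORT', str(port))
            # 'gloo' works on Windows; this PyTorch build has USE_NCCL=OFF.
            dist.init_process_group(backend='gloo', rank=0, world_size=1)

    init_single_process_group()

With a single-rank group, dist.barrier() in collect_results_cpu() returns immediately; launching through the distributed tooling, as suggested in the comments below, is the cleaner route.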

Thanks in advance!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 17 (1 by maintainers)

Top GitHub Comments

1 reaction
kennymckormick commented, Feb 21, 2022

The problems have been fixed now. BTW, we highly recommend that users use distributed training and testing (you can use it even if you have only 1 GPU). The command for distributed training looks like: bash tools/dist_train.sh {config} {num_gpus} {other_args …}

Hi, I still hit the "Default process group has not been initialized" error when running the mmaction2_tutorial.ipynb file with the latest code. The accuracy is normal, but training raises the error at the 10th epoch.

Has this problem been fixed by PR #1459?
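
One detail worth keeping in mind when following the dist_train.sh suggestion on this machine: the environment above was built with USE_NCCL=OFF, and open-mmlab configs usually default the distributed backend to nccl, so the backend would presumably have to be switched to gloo. A hypothetical config tweak (an assumption, not part of the maintainer's reply):

    # Assumed config override: use the gloo backend, since NCCL is unavailable
    # in this Windows PyTorch build (USE_NCCL=OFF in the build settings above).
    dist_params = dict(backend='gloo')  # most configs default to 'nccl'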

1 reaction
kennymckormick commented, Jan 21, 2022

Not sure, everything seems OK. BTW, have you set img_shape and original_shape to the real video shape (height, width) for each video?

Thank you very much for your work and answers. Where should I set img_shape and original_shape in the code? Because I hit the same error running the SlowFast model on my own dataset.

Sorry, that answer is not related to your problem: setting the video shape is only required for PoseC3D models.
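
For anyone who does need those fields: as far as I understand from custom_dataset_training.md, img_shape and original_shape are per-sample keys in the skeleton annotation pickle, not config options. A hypothetical annotation entry following that format (shapes and values below are placeholders, not real data):

    import numpy as np

    # Hypothetical PoseC3D annotation entry; all values are placeholders.
    num_person, num_frames, num_keypoints = 1, 120, 17
    anno = dict(
        frame_dir='sample_0001',
        label=0,
        img_shape=(1080, 1920),       # real video (height, width)
        original_shape=(1080, 1920),  # original video (height, width)
        total_frames=num_frames,
        keypoint=np.zeros((num_person, num_frames, num_keypoints, 2),
                          dtype=np.float32),
        keypoint_score=np.zeros((num_person, num_frames, num_keypoints),
                                dtype=np.float32),
    )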
