FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

Hi,

I’ve been running detectron2 using the tutorial Colab notebook. Today, while training on a dataset that has previously worked, I hit an error. This is the code I ran:

`from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
import os

cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("3test",)
cfg.DATASETS.TEST = ()  # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 4
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.1
cfg.SOLVER.MAX_ITER = 10000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 100
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
cfg.TEST.DETECTIONS_PER_IMAGE = 2000

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()`

I get the following error after a few iterations:

FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

I’d really appreciate any way of getting past this error.

Cheers,

Issue Analytics

Created 3 years ago
Comments: 10

Top GitHub Comments

10 reactions

MiXaiLL76 commented, Feb 5, 2021

I want to know whether you solved the problem or not?

Set the base learning rate in proportion to your batch size:

num_gpu = 1
bs = (num_gpu * 2)
cfg.SOLVER.BASE_LR = 0.02 * bs / 16  # pick a good LR
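
The snippet above scales the learning rate linearly with the total batch size, using 0.02 at a batch size of 16 as the reference point. A minimal sketch of the same arithmetic as a standalone helper follows. The function name `scaled_lr` and its parameter names are hypothetical and not part of detectron2; the reference values are taken from the snippet above.

```python
def scaled_lr(ims_per_batch, ref_lr=0.02, ref_batch=16):
    """Scale a reference learning rate linearly with the total batch size.

    Hypothetical helper for illustration; ref_lr and ref_batch mirror the
    constants 0.02 and 16 used in the comment above.
    """
    if ims_per_batch <= 0:
        raise ValueError("ims_per_batch must be positive")
    return ref_lr * ims_per_batch / ref_batch


num_gpu = 1
bs = num_gpu * 2      # 2 images per batch, as in the original post
lr = scaled_lr(bs)    # 0.02 * 2 / 16
print(lr)
```

With `IMS_PER_BATCH = 2` this gives 0.0025, so the `BASE_LR = 0.1` in the original post is 40 times larger than what this rule suggests. The result would then be assigned to `cfg.SOLVER.BASE_LR`.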

7 reactions

ppwwyyxx commented, Mar 31, 2020

You probably need a smaller learning rate.

As the issue template mentions:

If you expect the model to converge / work better, note that we do not give suggestions on how to train a new model. Only in one of the two conditions we will help with it: (1) You’re unable to reproduce the results in detectron2 model zoo. (2) It indicates a detectron2 bug.

We provide configs & models with standard academic settings and expect users to have the knowledge to choose or design appropriate models & parameters for their own tasks.
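
To illustrate why a learning rate that is too large can end in Inf/NaN values, here is a toy sketch that is unrelated to detectron2 internals: plain gradient descent on f(w) = w². The function `run_gd` and its parameters are hypothetical and chosen only for illustration. Each update multiplies w by (1 - 2·lr), so any step size above 1.0 makes |w| grow on every iteration until it overflows, while a small step size shrinks it toward zero.

```python
import math


def run_gd(lr, steps=2000, w0=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w.

    Toy example only: the gradient of w**2 is 2*w, so each step
    computes w <- w - lr * 2 * w. Stops early once w is no longer finite.
    """
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
        if not math.isfinite(w):
            break  # the iterate has overflowed; training has "diverged"
    return w


small = run_gd(lr=0.1)   # |1 - 2*0.1| = 0.8 < 1, so w shrinks toward 0
large = run_gd(lr=1.5)   # |1 - 2*1.5| = 2 > 1, so |w| doubles each step

print(math.isfinite(small), math.isfinite(large))
```

A real detection model is far more complex than this one-parameter problem, but the failure mode is analogous: once the weights overflow, every prediction computed from them is non-finite, which is the condition the error message reports.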

Top Results From Across the Web

FloatingPointError: Predicted boxes or scores contain Inf/Nan ...

Error: FloatingPointError: Predicted boxes or scores contain Inf/Nan. Training has diverged. After looking into it, the cause was that the learning rate was set too high; at the time my learning ...

Give randomCrop augmentation and loss become explode

And why it explode the loss when I gave RandomCrop ? FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

CVPR/regionclip-demo at main - Hugging Face

... if training: raise FloatingPointError( "Predicted boxes or scores contain Inf/NaN. Training has diverged." ) boxes = boxes[valid_mask] scores_per_img ...

Dealing with NaNs and infs - Stable Baselines3 - Read the Docs

During the training of a model on a given environment, it is possible that the RL model becomes completely corrupted when a NaN...

SIIM COVID19 | Kaggle

FloatingPointError : Predicted boxes or scores contain Inf/NaN. Training has diverged. [08/23 10:39:16 d2.engine.hooks]: Overall training speed: 3 iterations ...