Training error: NaN or Inf found in input tensor.

I used cityscapes_cv0_wideresnet38_nosdcaug.pth as the pretrained model and tried training on Cityscapes, just to make sure training works on that dataset. The first four epochs were successful, but at some point during the fifth epoch I got "Warning: NaN or Inf found in input tensor." The train main loss then suddenly became nan (it had been about 0.29), and training failed from then on.

Have you experienced this? How can I solve it?

Training options:

#!/usr/bin/env bash

    # Example on Cityscapes
    python -m torch.distributed.launch --nproc_per_node=2 train.py \
        --dataset cityscapes \
        --cv 0 \
        --arch network.deepv3.DeepWV3Plus \
        --snapshot ./ckpts/cityscapes_cv0_wideresnet38_nosdcaug.pth \
        --class_uniform_pct 0.5 \
        --class_uniform_tile 1024 \
        --max_cu_epoch 150 \
        --lr 0.001 \
        --lr_schedule scl-poly \
        --poly_exp 1.0 \
        --repoly 1.5  \
        --rescale 1.0 \
        --syncbn \
        --sgd \
        --crop_size 896 \
        --scale_min 0.5 \
        --scale_max 2.0 \
        --color_aug 0.25 \
        --gblur \
        --max_epoch 175 \
        --coarse_boost_classes 14,15,16,3,12,17,4 \
        --jointwtborder \
        --strict_bdr_cls 5,6,7,11,12,17,18 \
        --rlx_off_epoch 100 \
        --wt_bound 1.0 \
        --bs_mult 1 \
        --apex \
        --exp cityscapes_ft \
        --ckpt ./logs/ \
        --tb_path ./logs/

Top GitHub Comments

chipkajb commented, Dec 2, 2019

I ran into this problem as well. Lowering the learning rate sometimes helped, but not always. Sometimes, no matter how small I made the learning rate, I still ran into this issue. I got around it by just skipping the train main loss update whenever the train main loss was nan. In train.py, I replaced this line

train_main_loss.update(log_main_loss.item(), batch_pixel_size)

with this

# Leave the running average untouched when this iteration's loss is nan.
if torch.isnan(main_loss):
    logging.info("Train main loss is nan. Skipping train main loss update")
else:
    train_main_loss.update(log_main_loss.item(), batch_pixel_size)

It seems to have resolved the problem for me.
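The same idea can also live inside the meter itself, so that every caller is protected from a bad sample. The sketch below is a minimal, standalone illustration and not the repository's code: the `AverageMeter` class, its `skipped` counter, and the sample values are all assumptions made for the example. It also checks for Inf as well as nan, which is a small extension of the workaround above.

```python
import math


class AverageMeter:
    """Hypothetical running weighted average that ignores non-finite samples."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.skipped = 0  # how many NaN/Inf samples were dropped

    def update(self, val, n=1):
        # Drop NaN/Inf samples so one bad batch cannot poison the average.
        if not math.isfinite(val):
            self.skipped += 1
            return
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        # Return 0.0 before any valid sample has been recorded.
        return self.sum / self.count if self.count else 0.0


# Made-up (loss, pixel count) pairs; the middle one simulates a NaN batch.
meter = AverageMeter()
for loss, pixels in [(0.30, 100), (float("nan"), 100), (0.28, 100)]:
    meter.update(loss, pixels)
print(meter.avg, meter.skipped)
```

Note that this only keeps the logged average clean. It does not stop a nan loss from reaching the backward pass or the optimizer, so tracking `skipped` is useful for spotting whether the problem is rare or constant.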

chipkajb commented, Jul 6, 2020

@looong96 I am not sure how helpful I will be, but what do you mean when you say my solution did not work for you? Is your main loss always nan? For me, the main loss was nan very infrequently (maybe 1% of the time, I don’t remember…), so I would just skip that iteration (with the if/else block of code I gave above) and then proceed normally. I’m sure it’s not the best way to resolve the problem, but it was a quick workaround that worked for me. Judging by your new error (‘RuntimeError: Trying to backward through the graph a second time’), something else is going wrong, most likely at the point where you replace the nan loss value with the previous loss value. I was worried I would run into something like this, which is why I simply skipped any iteration that produced a nan loss.
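For anyone comparing the two approaches: by default PyTorch frees a graph's intermediate buffers once backward() has run, so reusing the previous iteration's loss tensor means backpropagating through an already-freed graph. Skipping the iteration avoids that entirely. Here is a rough, framework-agnostic sketch of a loop that skips the whole iteration. `train_epoch`, `compute_loss`, and `backward_and_step` are hypothetical names invented for this example, not functions from train.py.

```python
import math


def train_epoch(batches, compute_loss, backward_and_step):
    """Run one epoch, skipping any batch whose loss is NaN or Inf.

    compute_loss and backward_and_step are hypothetical callables standing in
    for the forward pass and for loss.backward() followed by optimizer.step().
    """
    used, skipped = 0, 0
    for batch in batches:
        loss = compute_loss(batch)
        # float() accepts a plain number or a zero-dimensional tensor.
        if not math.isfinite(float(loss)):
            # Skip the whole iteration: no backward pass, no weight update.
            skipped += 1
            continue
        backward_and_step(loss)
        used += 1
    return used, skipped


applied = []
used, skipped = train_epoch(
    batches=[0.31, float("nan"), 0.29, float("inf")],
    compute_loss=lambda b: b,          # toy "model": the batch is its own loss
    backward_and_step=applied.append,  # record which losses were stepped on
)
print(used, skipped, applied)
```

One caveat, as far as I understand it: with multi-GPU training like the command in the question, every process probably needs to make the same skip decision for a given step, because gradients are synchronized during the backward pass and a process that skips it can leave the others waiting.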