Training error: NaN or Inf found in input tensor.

I used cityscapes_cv0_wideresnet38_nosdcaug.pth as the pretrained model and tried to train on Cityscapes, just to make sure that training works on that dataset. The first 4 epochs were successful, but at some point during the fifth epoch I got "Warning: NaN or Inf found in input tensor." The train main loss then suddenly became nan (it had been about 0.29), and training failed from that point on.

Have you experienced this? How can I solve it?

Training options:

#!/usr/bin/env bash

# Example on Cityscapes
python -m torch.distributed.launch --nproc_per_node=2 train.py \
        --dataset cityscapes \
        --cv 0 \
        --arch network.deepv3.DeepWV3Plus \
        --snapshot ./ckpts/cityscapes_cv0_wideresnet38_nosdcaug.pth \
        --class_uniform_pct 0.5 \
        --class_uniform_tile 1024 \
        --max_cu_epoch 150 \
        --lr 0.001 \
        --lr_schedule scl-poly \
        --poly_exp 1.0 \
        --repoly 1.5  \
        --rescale 1.0 \
        --syncbn \
        --sgd \
        --crop_size 896 \
        --scale_min 0.5 \
        --scale_max 2.0 \
        --color_aug 0.25 \
        --gblur \
        --max_epoch 175 \
        --coarse_boost_classes 14,15,16,3,12,17,4 \
        --jointwtborder \
        --strict_bdr_cls 5,6,7,11,12,17,18 \
        --rlx_off_epoch 100 \
        --wt_bound 1.0 \
        --bs_mult 1 \
        --apex \
        --exp cityscapes_ft \
        --ckpt ./logs/ \
        --tb_path ./logs/

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
chipkajb commented, Dec 2, 2019

I ran into this problem as well. Lowering the learning rate sometimes helped, but not always: sometimes I still hit the issue no matter how small I made the learning rate. I got around it by skipping the train main loss update whenever the train main loss was nan. In train.py, I replaced this line

train_main_loss.update(log_main_loss.item(), batch_pixel_size)

with this

if torch.isnan(main_loss):
    logging.info("Train main loss is nan. Skipping train main loss update")
else:
    train_main_loss.update(log_main_loss.item(), batch_pixel_size)

It seems to have resolved the problem for me.
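
For readers who want to try the same idea outside train.py, here is a minimal pure-Python sketch of a running-average meter that ignores non-finite values. SafeAverageMeter is a hypothetical helper written for illustration, not the meter class used in the repository, and it assumes the loss has already been converted to a Python float (for example with .item()). It uses math.isfinite, so it skips Inf as well as NaN.

```python
import math


class SafeAverageMeter:
    """Running weighted average that ignores NaN/Inf samples (hypothetical helper)."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0
        self.skipped = 0

    def update(self, value, n=1):
        # math.isfinite returns False for NaN, +Inf and -Inf.
        if not math.isfinite(value):
            self.skipped += 1
            return False  # sample ignored, running average unchanged
        self.sum += value * n
        self.count += n
        return True

    @property
    def avg(self):
        # Avoid dividing by zero before the first valid sample arrives.
        return self.sum / self.count if self.count else 0.0


meter = SafeAverageMeter()
meter.update(0.29, n=4)
meter.update(float("nan"), n=4)  # ignored and counted in meter.skipped
meter.update(0.31, n=4)
print(meter.avg, meter.skipped)
```

Keeping a count of skipped samples makes it easy to see whether NaN losses are rare or whether the run has diverged.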

0 reactions
chipkajb commented, Jul 6, 2020

@looong96 I am not sure how helpful I can be, but what do you mean when you say my solution did not work for you? Is your main loss always nan? For me, the main loss was nan only very infrequently (maybe 1% of the time, I don't remember exactly), so I would just skip that iteration with the if/else block of code I gave above and then proceed normally. I'm sure it's not the best way to resolve the problem, but it was a quick workaround that worked for me.

Your new error ('RuntimeError: Trying to backward through the graph a second time') suggests that something else is going wrong. Something must be going wrong when you try to replace the nan loss value with the previous loss value. I was worried I would run into something like this, which is why I simply skipped any iteration that produced a nan loss.
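
Skipping a whole iteration on a nan loss, rather than reusing an earlier loss tensor, could look roughly like the sketch below. This is a generic PyTorch example and not the repository's train.py: train_step, the toy linear model and the poisoned batch are all hypothetical and only illustrate the idea. It also assumes a single process; in a multi-GPU run like the one in the question, every rank would presumably have to make the same skip decision, which this sketch does not handle.

```python
import torch


def train_step(model, optimizer, loss_fn, inputs, targets):
    """Run one optimization step; return the loss value, or None if skipped."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # torch.isfinite is False for NaN and +/-Inf.
    if not torch.isfinite(loss):
        # No backward() and no optimizer.step(): the weights stay untouched.
        return None
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

good_x, good_y = torch.randn(8, 3), torch.randn(8, 1)
bad_x = good_x.clone()
bad_x[0, 0] = float("nan")  # poison one sample so the loss becomes NaN

before = [p.detach().clone() for p in model.parameters()]
skipped = train_step(model, optimizer, loss_fn, bad_x, good_y)
unchanged = all(torch.equal(a, b) for a, b in zip(before, model.parameters()))
stepped = train_step(model, optimizer, loss_fn, good_x, good_y)
```

Because the bad batch never reaches backward(), each computation graph is used at most once, so this pattern should not trigger the "backward through the graph a second time" error.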
