Training error: NaN or Inf found in input tensor.
I used cityscapes_cv0_wideresnet38_nosdcaug.pth as the pretrained model and tried to train on Cityscapes just to make sure the training works. The first 4 epochs were successful, but at some point during the fifth epoch I got "Warning: NaN or Inf found in input tensor." The train main loss then suddenly became nan (it had been about 0.29) and training failed from that point on.
Did you experience this? How to solve it?
Training options:
#!/usr/bin/env bash
# Example on Cityscapes
python -m torch.distributed.launch --nproc_per_node=2 train.py \
--dataset cityscapes \
--cv 0 \
--arch network.deepv3.DeepWV3Plus \
--snapshot ./ckpts/cityscapes_cv0_wideresnet38_nosdcaug.pth \
--class_uniform_pct 0.5 \
--class_uniform_tile 1024 \
--max_cu_epoch 150 \
--lr 0.001 \
--lr_schedule scl-poly \
--poly_exp 1.0 \
--repoly 1.5 \
--rescale 1.0 \
--syncbn \
--sgd \
--crop_size 896 \
--scale_min 0.5 \
--scale_max 2.0 \
--color_aug 0.25 \
--gblur \
--max_epoch 175 \
--coarse_boost_classes 14,15,16,3,12,17,4 \
--jointwtborder \
--strict_bdr_cls 5,6,7,11,12,17,18 \
--rlx_off_epoch 100 \
--wt_bound 1.0 \
--bs_mult 1 \
--apex \
--exp cityscapes_ft \
--ckpt ./logs/ \
--tb_path ./logs/
I ran into this problem as well. Lowering the learning rate sometimes helped, but not always. Sometimes, no matter how small I made the learning rate, I still ran into this issue. I got around it by just skipping the train main loss update whenever the train main loss was nan. In train.py, I replaced this line
train_main_loss.update(log_main_loss.item(), batch_pixel_size)
with this:
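A minimal sketch of the skip (the exact block isn't preserved here), assuming the same variable names from train.py and a plain math check to detect the bad value:

import math

loss_value = log_main_loss.item()
if math.isnan(loss_value) or math.isinf(loss_value):
    # Skip the running-average update for this iteration;
    # a nan/inf value would otherwise poison train_main_loss.
    pass
else:
    train_main_loss.update(loss_value, batch_pixel_size)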
It seems to have resolved the problem for me.
@looong96 I am not sure how helpful I will be, but what do you mean when you say my solution did not work for you? Is your main loss always nan? For me, the main loss was nan only very infrequently (maybe 1% of the time, I don't remember exactly), so I would just skip that iteration (with the if/else block of code I gave above) and then proceed normally. I'm sure it's not the best way to resolve the problem, but it was a quick workaround that worked for me. Your new error ('RuntimeError: Trying to backward through the graph a second time') suggests something else is going wrong: it looks like replacing the nan loss value with the previous loss value is the problem. I was worried I would run into something like this, which is why I simply skipped any iteration that produced a nan loss.
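A minimal sketch, assuming the nan loss was being swapped for the previous iteration's loss tensor, of why backward would then be called on a graph that has already been freed:

import torch

x = torch.randn(4, requires_grad=True)
prev_loss = (x * 2).sum()
prev_loss.backward()   # first backward frees the graph behind prev_loss

# next iteration: the new loss is nan, so the old tensor is reused
prev_loss.backward()   # RuntimeError: Trying to backward through the graph a second time

Skipping the whole iteration, rather than substituting an old loss tensor, never touches the freed graph, which is why the workaround above avoids this error.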