'loss: nan' error while training with standard yolo_loss


Hi David,

I just want to report a glitch in my experiments… I am training models (my own dataset = 27,000 annotations, 1 class) with the following cmd line:

python3 train.py --model_type yolo3_mobilenetv2_lite --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --save_eval_checkpoint --batch_size 16 --eval_online --eval_epoch_interval 3 --transfer_epoch 2 --freeze_level 1 --total_epoch 20

This is just an example; I tried half a dozen combinations of backbones and heads… Out of 10 trials, I only managed to reach epoch 20 twice. In the other cases, at some point (usually around epoch 4 to 9) I get a crash with this typical message:

705/1106 [==================>...........] - ETA: 7:54 - loss: 9.8939 - location_loss: 3.5176 - confidence_loss: 4.8495 - class_loss: 0.0014Batch 705: Invalid loss, terminating training

706/1106 [==================>...........] - ETA: 7:52 - loss: nan - location_loss: nan - confidence_loss: nan - class_loss: nan
Traceback (most recent call last):
  File "train.py", line 252, in <module>
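
The "Invalid loss, terminating training" message matches the behaviour of Keras's TerminateOnNaN callback, which stops fit() as soon as the reported batch loss becomes NaN or Inf. A minimal sketch of that guard (the class name StopOnInvalidLoss and the commented-out fit() call are placeholders, not part of the repo):

```python
import numpy as np
import tensorflow as tf

class StopOnInvalidLoss(tf.keras.callbacks.Callback):
    """Roughly what tf.keras.callbacks.TerminateOnNaN does internally."""
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and (np.isnan(loss) or np.isinf(loss)):
            print('Batch %d: Invalid loss, terminating training' % batch)
            self.model.stop_training = True

# callbacks = [StopOnInvalidLoss()]   # or simply tf.keras.callbacks.TerminateOnNaN()
# model.fit(train_data, epochs=20, callbacks=callbacks)  # hypothetical model/data
```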

For the record, I work on Ubuntu 18.04 with TF 2.1, and I pulled the latest commits from your repo.

So I switched to ‘use_diou_loss=True’ and so far all is fine, with much better convergence than before. This looks to be a very helpful addition!
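
For reference, the DIoU term adds a penalty on the normalized distance between the predicted and ground-truth box centers on top of the IoU overlap. A generic sketch of the idea, assuming corner-format (x_min, y_min, x_max, y_max) boxes; this is not the repo's exact yolo3/loss.py implementation:

```python
import tensorflow as tf

def diou_loss(b_true, b_pred, eps=1e-7):
    """DIoU loss: 1 - (IoU - d^2 / c^2), where d is the distance between box
    centers and c is the diagonal of the smallest enclosing box."""
    # Intersection area
    inter_mins = tf.maximum(b_true[..., :2], b_pred[..., :2])
    inter_maxes = tf.minimum(b_true[..., 2:], b_pred[..., 2:])
    inter_wh = tf.maximum(inter_maxes - inter_mins, 0.0)
    inter_area = inter_wh[..., 0] * inter_wh[..., 1]

    # Union area
    true_wh = b_true[..., 2:] - b_true[..., :2]
    pred_wh = b_pred[..., 2:] - b_pred[..., :2]
    union_area = (true_wh[..., 0] * true_wh[..., 1]
                  + pred_wh[..., 0] * pred_wh[..., 1] - inter_area)
    iou = inter_area / (union_area + eps)

    # Squared distance between box centers
    true_center = (b_true[..., :2] + b_true[..., 2:]) / 2.0
    pred_center = (b_pred[..., :2] + b_pred[..., 2:]) / 2.0
    center_dist_sq = tf.reduce_sum(tf.square(true_center - pred_center), axis=-1)

    # Squared diagonal of the smallest enclosing box
    enclose_mins = tf.minimum(b_true[..., :2], b_pred[..., :2])
    enclose_maxes = tf.maximum(b_true[..., 2:], b_pred[..., 2:])
    enclose_wh = tf.maximum(enclose_maxes - enclose_mins, 0.0)
    enclose_diag_sq = tf.reduce_sum(tf.square(enclose_wh), axis=-1)

    diou = iou - center_dist_sq / (enclose_diag_sq + eps)
    return 1.0 - diou
```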

Gilles

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 21 (6 by maintainers)

Top GitHub Comments

1 reaction
gillmac13 commented on May 22, 2020

Hi @farhodbekshamsiyev

It has been a while, and I haven’t tried the newest YOLO v4 version, but the model type which worked best for me (1-class underwater object recognition) was clearly yolo3_spp. And since I needed speed and compactness, I found mobilenetV2_lite very effective. Out of 7 different combinations of backbones and YOLO versions, this choice is the clear winner (for my application). Since YOLO v4 also uses the SPP feature, I suppose it must be as good or better; I intend to train the combo yolo4 + mobilenetv3 very soon. Anyway, this is what I did (please note that the batch size of 16 is required because my GPU has only 8 GB of memory). In what follows, my_yolo_class.txt has only 1 class, and my images are all 416x416x3.

Precautions, before launching a training command line and to avoid crashes:

1. I switched to diou_loss (and nms_diou) by setting “use_diou_loss” to true in /yolo3/loss.py, at line 230 (?).
2. In /common/utils.py, line 20 (?), I changed the memory_limit to 7000 (see the sketch after this list).
3. I disabled Mixed Precision Training at the beginning of train.py.
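
A minimal sketch of what the memory_limit change likely corresponds to, assuming the repo's common/utils.py uses TensorFlow's virtual-device configuration (the TF 2.1-era experimental API); the exact wiring in the repo may differ:

```python
import tensorflow as tf

# Cap the first GPU at ~7000 MB so an 8 GB card keeps headroom for the
# display and the CUDA context (assumed values, matching the comment above).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=7000)])
```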

Command line: $ python3 train.py --model_type yolo3_mobilenetv2_lite_spp --annotation_file train.txt --val_annotation_file valid.txt --classes_path configs/my_yolo_class.txt --anchors_path=configs/yolo3_anchors.txt --batch_size 16 --transfer_epoch 4 --freeze_level 1 --total_epoch 40 --optimizer rmsprop --decay_type cosine

Also note that I switched from adam to rmsprop. I am not sure which of the changes from the standard training setup really helped, but since it worked for me, I am happy! With 27,000 images in my dataset, I found that after 40 epochs the total loss did not change anymore, so I stopped training at that point. The model performed pretty well on my test set, so I suppose I did it right.
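
A stand-alone sketch of what “--optimizer rmsprop --decay_type cosine” roughly boils down to, assuming a cosine-decayed learning rate fed into RMSprop; the initial learning rate and step counts below are placeholders, not the repo's defaults:

```python
import tensorflow as tf

steps_per_epoch = 1106   # taken from the progress bar above
total_epochs = 40

# Cosine-decayed learning rate driving RMSprop (TF 2.1-era API).
lr_schedule = tf.keras.experimental.CosineDecay(
    initial_learning_rate=1e-3,                   # placeholder value
    decay_steps=steps_per_epoch * total_epochs)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule)
```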

I hope it helps…

0 reactions
yakhyo commented on Jun 15, 2021

I am having the same error after almost one and a half years, @david8862. Is there any exact reason for this kind of gradient explosion? I think this NaN comes from exploding gradients, doesn’t it? It would be great if you could share your experience on it. Actually, I am using YOLOv1.
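
Exploding gradients are indeed a common cause of NaN losses. A frequently used mitigation, independent of this repo, is to clip the gradient norm in the optimizer; values below are placeholders:

```python
import tensorflow as tf

# Clip the gradient norm of every update; a standard guard against
# exploding gradients that drive the loss to NaN.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)
```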
