[Dreambooth Example] Attempting to unscale FP16 gradients.

Describe the bug

I had the training script working fine, but after updating diffusers to 0.7.2 I now get the following error:

Traceback (most recent call last):
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 938, in <module>
    main(args)
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 876, in main
    optimizer.step()
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|          | 0/800 [00:18<?, ?it/s]

Any ideas, or do I need to downgrade?
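
For context on the traceback: torch.cuda.amp.GradScaler refuses to unscale gradients that are stored in fp16 (the division by the loss scale is only numerically safe in fp32), so scaler.step() raises this ValueError whenever any parameter handed to the optimizer carries fp16 gradients. In practice that means the trainable model's weights themselves were cast to half precision. A minimal sketch that reproduces the error outside of diffusers (toy model, illustrative names, requires a CUDA device; this is not code from the dreambooth script):

import torch

# Casting the *trainable* parameters to half precision means their
# gradients will also be fp16, which GradScaler refuses to unscale.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()

scaler.scale(loss).backward()   # grads on fp16 params come out as fp16
scaler.step(optimizer)          # ValueError: Attempting to unscale FP16 gradients.
scaler.update()

Keeping the trainable model in fp32 and letting autocast handle the half-precision math (or training fully in fp32) avoids the check entirely.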

Reproduction

No response

Logs

No response

System Info

  • diffusers 0.7.2
  • python 3.7.12
  • accelerate 0.14.0

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 26 (11 by maintainers)

Top GitHub Comments

2 reactions
patil-suraj commented, Nov 21, 2022

Thanks for the detailed issue, taking a look now.

1 reaction
gadicc commented, Dec 1, 2022

Hi all, sorry for the radio silence… some time-sensitive matters snuck up on me. I hope one of the other contributors to this issue can confirm the fix; otherwise I hope to have a chance to try it out on Sunday and promise to report back after.

Thank you both @patil-suraj and @patrickvonplaten for your amazing and quick work here! (And @patil-suraj, thanks, I did indeed get Dreambooth working with fp32 too. It kind of fixed itself, but I think I had been loading one of the components with an incompatible model.)

🙏
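
For readers hitting the same error: the pattern that fixed it in the example script is to keep the trainable unet in full precision and cast only the frozen vae and text_encoder to the mixed-precision dtype. A condensed sketch of that idea (variable names follow train_dreambooth.py, but this is a from-memory outline rather than the exact merged code, and the helper name place_models is mine, not the script's):

import torch
from accelerate import Accelerator

def place_models(accelerator: Accelerator, unet, vae, text_encoder):
    # Frozen components can safely live in half precision to save memory.
    weight_dtype = torch.float32
    if accelerator.mixed_precision == "fp16":   # attribute assumed per recent accelerate releases
        weight_dtype = torch.float16
    elif accelerator.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    vae.to(accelerator.device, dtype=weight_dtype)
    text_encoder.to(accelerator.device, dtype=weight_dtype)
    # The trainable unet stays fp32, so its gradients are fp32 and
    # GradScaler can unscale them without tripping the ValueError.
    unet.to(accelerator.device)
    return weight_dtype

With this split, the scaler only ever sees fp32 gradients from the unet, while the frozen vae and text encoder still save memory in half precision.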
