[Dreambooth Example] Attempting to unscale FP16 gradients.

Describe the bug

I had the training script working fine, but after updating diffusers to 0.7.2 I now get the following error:

Traceback (most recent call last):
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 938, in <module>
    main(args)
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 876, in main
    optimizer.step()
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|          | 0/800 [00:18<?, ?it/s]

Any ideas, or do I need to downgrade?
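
For context on the traceback: torch.cuda.amp.GradScaler refuses to unscale gradients that are stored in fp16 (the division by the loss scale is only numerically safe in fp32), so scaler.step() raises this ValueError whenever any parameter handed to the optimizer carries fp16 gradients. In practice that means the trainable model's weights themselves were cast to half precision. A minimal sketch that reproduces the error outside of diffusers (toy model, illustrative names, requires a CUDA device; this is not code from the dreambooth script):

import torch

# Casting the *trainable* parameters to half precision means their
# gradients will also be fp16, which GradScaler refuses to unscale.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()

scaler.scale(loss).backward()   # grads on fp16 params come out as fp16
scaler.step(optimizer)          # ValueError: Attempting to unscale FP16 gradients.
scaler.update()

Keeping the trainable model in fp32 and letting autocast handle the half-precision math (or training fully in fp32) avoids the check entirely.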

Reproduction

No response

Logs

No response

System Info

  • diffusers 0.7.2
  • python 3.7.12
  • accelerate 0.14.0

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 26 (11 by maintainers)

Top GitHub Comments

2 reactions
patil-suraj commented, Nov 21, 2022

Thanks for the detailed issue, taking a look now.

1 reaction
gadicc commented, Dec 1, 2022

Hi all, sorry for the radio silence… some time-sensitive matters snuck up on me. I hope one of the other contributors to this issue can confirm the fix; otherwise I hope to have a chance to try it out on Sunday and promise to report back after.

Thank you both @patil-suraj and @patrickvonplaten for your amazing and quick work here! (And @patil-suraj, thanks, I did indeed get Dreambooth working with fp32 too. It kind of fixed itself, but I think I had been loading one of the components with an incompatible model.)

🙏
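
For readers hitting the same error: the pattern that fixed it in the example script is to keep the trainable unet in full precision and cast only the frozen vae and text_encoder to the mixed-precision dtype. A condensed sketch of that idea (variable names follow train_dreambooth.py, but this is a from-memory outline rather than the exact merged code, and the helper name place_models is mine, not the script's):

import torch
from accelerate import Accelerator

def place_models(accelerator: Accelerator, unet, vae, text_encoder):
    # Frozen components can safely live in half precision to save memory.
    weight_dtype = torch.float32
    if accelerator.mixed_precision == "fp16":   # attribute assumed per recent accelerate releases
        weight_dtype = torch.float16
    elif accelerator.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    vae.to(accelerator.device, dtype=weight_dtype)
    text_encoder.to(accelerator.device, dtype=weight_dtype)
    # The trainable unet stays fp32, so its gradients are fp32 and
    # GradScaler can unscale them without tripping the ValueError.
    unet.to(accelerator.device)
    return weight_dtype

With this split, the scaler only ever sees fp32 gradients from the unet, while the frozen vae and text encoder still save memory in half precision.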
