Gradient accumulation doesn't work with Accelerate's `clip_grad_norm_`

See original GitHub issue

System Info

- `Accelerate` version: 0.13.0.dev0
- Platform: Linux-5.10.133+-x86_64-with-debian-bullseye-sid
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0 (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior: you can run this Colab notebook directly to reproduce the error.

The main training method in the `Trainer` class is `train_one_epoch`:

for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
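        # clip_grad_norm_ unscales gradients; on non-sync accumulation steps the optimizer
        # never actually steps, so the next call tries to unscale_ again and raises the error below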
        self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)

This will result in the following error:

╭─────────────────────────────── Traceback (most recent call last) ───────────────────────────────╮
│ <ipython-input-21-5a5fa8902df5>:2 in <module>                                                    │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/accelerate/launchers.py:83 in notebook_launcher           │
│                                                                                                  │
│    80 │   │   │   │   print("Launching training on one GPU.")                                    │
│    81 │   │   │   else:                                                                          │
│    82 │   │   │   │   print("Launching training on one CPU.")                                    │
│ ❱  83 │   │   │   function(*args)                                                                │
│    84 │                                                                                          │
│    85 │   else:                                                                                  │
│    86 │   │   if num_processes is None:                                                          │
│ <ipython-input-20-cd919093f91a>:16 in main                                                       │
│ <ipython-input-19-44ed46a0baca>:265 in fit                                                       │
│ <ipython-input-19-44ed46a0baca>:215 in train_one_epoch                                           │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:920 in clip_grad_norm_          │
│                                                                                                  │
│    917 │   │   elif self.distributed_type == DistributedType.DEEPSPEED:                          │
│    918 │   │   │   # `accelerator.backward(loss)` is doing that automatically. Therefore, it's   │
│    919 │   │   │   return                                                                        │
│ ❱  920 │   │   self.unscale_gradients()                                                          │
│    921 │   │   torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)         │
│    922 │                                                                                         │
│    923 │   def clip_grad_value_(self, parameters, clip_value):                                   │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:904 in unscale_gradients        │
│                                                                                                  │
│    901 │   │   │   for opt in optimizer:                                                         │
│    902 │   │   │   │   while isinstance(opt, AcceleratedOptimizer):                              │
│    903 │   │   │   │   │   opt = opt.optimizer                                                   │
│ ❱  904 │   │   │   │   self.scaler.unscale_(opt)                                                 │
│    905 │                                                                                         │
│    906 │   def clip_grad_norm_(self, parameters, max_norm, norm_type=2):                         │
│    907 │   │   """                                                                               │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py:270 in unscale_             │
│                                                                                                  │
│   267 │   │   optimizer_state = self._per_optimizer_states[id(optimizer)]                        │
│   268 │   │                                                                                      │
│   269 │   │   if optimizer_state["stage"] is OptState.UNSCALED:                                  │
│ ❱ 270 │   │   │   raise RuntimeError("unscale_() has already been called on this optimizer sin   │
│   271 │   │   elif optimizer_state["stage"] is OptState.STEPPED:                                 │
│   272 │   │   │   raise RuntimeError("unscale_() is being called after step().")                 │
│   273                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: unscale_() has already been called on this optimizer since the last update().

Expected behavior

`clip_grad_norm_` works fine with `gradient_accumulation_steps=1`, but raises an error when `gradient_accumulation_steps` is set greater than 1.
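
For context, the error comes from PyTorch's AMP `GradScaler`: `unscale_()` may be called at most once per optimizer between `scaler.update()` calls. Under gradient accumulation, the optimizer step (and therefore the scaler update) is skipped on non-sync steps, so clipping on every micro-batch eventually calls `unscale_()` twice in a row. A minimal sketch outside Accelerate (assuming a CUDA device is available; the model and data here are placeholders) reproduces the same RuntimeError:

import torch

# GradScaler.unscale_() may only be called once per optimizer between
# scaler.update() calls; a second call raises the RuntimeError shown above.
model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)  # first unscale: fine, gradients can now be clipped
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.unscale_(optimizer)  # second unscale before step()/update(): RuntimeError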

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

2 reactions
Gladiator07 commented, Aug 18, 2022

Thanks, @muellerzr, that did work. However, calling `unscale_gradients` explicitly is not required, since Accelerate already does it inside `clip_grad_norm_` (source code here).

So, the final loop looks like this:

for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
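        # clip only on the step where gradients are actually synced and the optimizer will step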
        if self._accelerator.sync_gradients:
            self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)

Thanks again. Closing this issue. I love this library 😃

1 reaction
Gladiator07 commented, Aug 18, 2022

Sure, I will be happy to do it!
