Gradient accumulation doesn't work with Accelerate's `clip_grad_norm_`

See original GitHub issue

System Info

- `Accelerate` version: 0.13.0.dev0
- Platform: Linux-5.10.133+-x86_64-with-debian-bullseye-sid
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0 (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior: you can run this Colab notebook directly to reproduce the error.

The main training method in the `Trainer` class is `train_one_epoch`:

for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
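        # clip_grad_norm_ unscales gradients; on non-sync accumulation steps the optimizer
        # never actually steps, so the next call tries to unscale_ again and raises the error below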
        self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)

This will result in the following error:

╭─────────────────────────────── Traceback (most recent call last) ───────────────────────────────╮
│ <ipython-input-21-5a5fa8902df5>:2 in <module>                                                    │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/accelerate/launchers.py:83 in notebook_launcher           │
│                                                                                                  │
│    80 │   │   │   │   print("Launching training on one GPU.")                                    │
│    81 │   │   │   else:                                                                          │
│    82 │   │   │   │   print("Launching training on one CPU.")                                    │
│ ❱  83 │   │   │   function(*args)                                                                │
│    84 │                                                                                          │
│    85 │   else:                                                                                  │
│    86 │   │   if num_processes is None:                                                          │
│ <ipython-input-20-cd919093f91a>:16 in main                                                       │
│ <ipython-input-19-44ed46a0baca>:265 in fit                                                       │
│ <ipython-input-19-44ed46a0baca>:215 in train_one_epoch                                           │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:920 in clip_grad_norm_          │
│                                                                                                  │
│    917 │   │   elif self.distributed_type == DistributedType.DEEPSPEED:                          │
│    918 │   │   │   # `accelerator.backward(loss)` is doing that automatically. Therefore, it's   │
│    919 │   │   │   return                                                                        │
│ ❱  920 │   │   self.unscale_gradients()                                                          │
│    921 │   │   torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)         │
│    922 │                                                                                         │
│    923 │   def clip_grad_value_(self, parameters, clip_value):                                   │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:904 in unscale_gradients        │
│                                                                                                  │
│    901 │   │   │   for opt in optimizer:                                                         │
│    902 │   │   │   │   while isinstance(opt, AcceleratedOptimizer):                              │
│    903 │   │   │   │   │   opt = opt.optimizer                                                   │
│ ❱  904 │   │   │   │   self.scaler.unscale_(opt)                                                 │
│    905 │                                                                                         │
│    906 │   def clip_grad_norm_(self, parameters, max_norm, norm_type=2):                         │
│    907 │   │   """                                                                               │
│                                                                                                  │
│ /usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py:270 in unscale_             │
│                                                                                                  │
│   267 │   │   optimizer_state = self._per_optimizer_states[id(optimizer)]                        │
│   268 │   │                                                                                      │
│   269 │   │   if optimizer_state["stage"] is OptState.UNSCALED:                                  │
│ ❱ 270 │   │   │   raise RuntimeError("unscale_() has already been called on this optimizer sin   │
│   271 │   │   elif optimizer_state["stage"] is OptState.STEPPED:                                 │
│   272 │   │   │   raise RuntimeError("unscale_() is being called after step().")                 │
│   273                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: unscale_() has already been called on this optimizer since the last update().

Expected behavior

`clip_grad_norm_` works fine with `gradient_accumulation_steps=1`, but raises an error when `gradient_accumulation_steps` is set greater than 1.
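
For context, the error comes from PyTorch's AMP `GradScaler`: `unscale_()` may be called at most once per optimizer between `scaler.update()` calls. Under gradient accumulation, the optimizer step (and therefore the scaler update) is skipped on non-sync steps, so clipping on every micro-batch eventually calls `unscale_()` twice in a row. A minimal sketch outside Accelerate (assuming a CUDA device is available; the model and data here are placeholders) reproduces the same RuntimeError:

import torch

# GradScaler.unscale_() may only be called once per optimizer between
# scaler.update() calls; a second call raises the RuntimeError shown above.
model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)  # first unscale: fine, gradients can now be clipped
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.unscale_(optimizer)  # second unscale before step()/update(): RuntimeError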

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

2 reactions
Gladiator07 commented, Aug 18, 2022

Thanks, @muellerzr, that did work. However, calling `unscale_gradients` explicitly is not required, since Accelerate already does it inside `clip_grad_norm_` (source code here).

So, the final loop looks like this:

for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
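        # clip only on the step where gradients are actually synced and the optimizer will step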
        if self._accelerator.sync_gradients:
            self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)

Thanks again. Closing this issue. I love this library 😃

1 reaction
Gladiator07 commented, Aug 18, 2022

Sure, I will be happy to do it!
