Get "AssertionError: can only test a child process" when using distributed TPU cores via Pytorch Lightning [CLI]


Hi All,

I'm getting an error when trying to log my metrics and hyperparameters to W&B via PyTorch Lightning while running on 8 TPU cores.

I first initialize the Weights and Biases run and project using the Lightning WandbLogger class, which effectively calls wandb.init(). That works fine. But when I then run the Trainer on 8 TPU cores with the keyword argument logger=my_WandbLogger, I get the error AssertionError: can only test a child process.


Note that I tried this on a single TPU core, and it worked fine. So the problem seems to lie in the distributed (multi-process) part of the setup.

How to reproduce

This isn't my code, but someone had the same issue a while back, although I couldn't find their solution. It uses the bug-reproducer template ("The Boring Model") that PyTorch Lightning provides. Reproduction HERE.

I'm running things on Google Colab, with PyTorch Lightning version 1.2.4 (most recent) and W&B version 0.10.22 (one version behind the latest).

Here's the full error stack trace, if you're curious:

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
<ipython-input-28-6650dc1eec9a> in <module>()
      3 wbLogger = WandbLogger(project='HPA Protein Localization Single Class Subset', name='Adam-128-0.001')
      4 trainer = Trainer(logger=wbLogger, deterministic=True, tpu_cores=8, max_epochs=epochNum, replace_sampler_ddp=False)
----> 5 trainer.fit(model, trainDL, valDL)
      6 
      7 print(time.time() - t0)

6 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    148         msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    149         msg += original_trace
--> 150         raise ProcessRaisedException(msg, error_index, failed_process.pid)
    151 
    152 

ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1028, in emit
    stream.write(msg + self.terminator)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 83, in new_process
    seed_everything(int(seed))
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/seed.py", line 54, in seed_everything
    log.info(f"Global seed set to {seed}")
  File "/usr/lib/python3.7/logging/__init__.py", line 1378, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1033, in emit
    self.handleError(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 946, in handleError
    sys.stderr.write('--- Logging error ---\n')
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
    file=sys.stderr)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
    cb(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
    self._backend.interface.publish_output(name, data)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
    self._publish_output(o)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
    if self._process and not self._process.is_alive():
  File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Are there any temporary workarounds for now? I need to find a way to connect, and things are a bit time-sensitive!
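
The last frames of the trace end in Python's own multiprocessing.Process.is_alive(), which asserts that the caller is the process that created the handle. The following is a minimal standard-library sketch of that condition, independent of wandb, Lightning, and XLA. It assumes a platform where the "fork" start method is available (for example Linux), and the function names are made up for illustration.

```python
import multiprocessing as mp
import time


def _sleep_briefly():
    # Stand-in for a background helper process started by the parent.
    time.sleep(0.5)


def _probe(handle, conn):
    # Runs in a sibling process. `handle` was created by the parent, so
    # is_alive() fails its "am I the parent of this process?" assertion.
    try:
        handle.is_alive()
        conn.send("no error")
    except AssertionError as exc:
        conn.send(str(exc))
    finally:
        conn.close()


def reproduce():
    # "fork" lets the sibling inherit the handle object without pickling it.
    ctx = mp.get_context("fork")

    worker = ctx.Process(target=_sleep_briefly)
    worker.start()

    recv_end, send_end = ctx.Pipe(duplex=False)
    sibling = ctx.Process(target=_probe, args=(worker, send_end))
    sibling.start()

    message = recv_end.recv()  # what the sibling observed
    sibling.join()
    worker.join()
    return message


if __name__ == "__main__":
    print(reproduce())
```

Calling reproduce() returns the same assertion message that appears in the trace. This only illustrates the failing check; that wandb's process handle reaches the TPU worker processes in exactly this way is an assumption based on the trace, not something confirmed in the thread.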

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 7
  • Comments: 31 (11 by maintainers)

Top GitHub Comments

5 reactions
KyleGoyette commented, Jul 24, 2021

@prikmm Either the next release or the release after will have a new experimental mode of running that supports this case.

2 reactions
borisdayma commented, Mar 22, 2022

@leoleoasd I think you can just add wandb.require("service") at the top of your script.
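
A small sketch of where that call might go, assuming a wandb release that recognizes the "service" feature name (the thread does not say which versions do). The helper name enable_wandb_service is hypothetical, and the broad except is only there so the script keeps running if wandb is missing or rejects the feature.

```python
def enable_wandb_service():
    """Try to opt in to wandb's service mode; return True if the call succeeded.

    Hypothetical helper: call it before creating any WandbLogger or Trainer.
    """
    try:
        import wandb

        # The call suggested in the comment above.
        wandb.require("service")
    except Exception as exc:  # wandb not installed, or feature name not accepted
        print(f"wandb service mode not enabled: {exc}")
        return False
    return True


if __name__ == "__main__":
    enable_wandb_service()
    # ...then build the WandbLogger and Trainer as in the original post.
```

Whether this resolves the TPU case on a given setup depends on the installed wandb and Lightning versions, so treat it as something to try, not a confirmed fix.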
