Get "AssertionError: can only test a child process" when using distributed TPU cores via PyTorch Lightning [CLI]
Hi All,
I'm getting an error when trying to log my metrics and hyperparameters to W&B via PyTorch Lightning while running on 8 TPU cores.
I first initialize the Weights and Biases run and project using the Lightning WandbLogger class, which essentially calls wandb.init(). That goes fine. But when I then run the Trainer on 8 TPU cores with the keyword argument 'logger=my_WandbLogger', I get the error AssertionError: can only test a child process.

Note that I tried this on a single TPU core, and that worked fine. So it seems to be a problem with the distributed processing part of things.
How to reproduce
This isn't my code, but someone had the same issue a while back, although I couldn't find their solution. It uses the bug-reproducer template ('The Boring Model') that PyTorch Lightning provides. Reproduction HERE.
I'm running things on Google Colab, with PyTorch Lightning version 1.2.4 (most recent) and W&B version 0.10.22 (one version behind the latest).
Here's the full error stack trace, if you're curious:
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
---------------------------------------------------------------------------
ProcessRaisedException Traceback (most recent call last)
<ipython-input-28-6650dc1eec9a> in <module>()
3 wbLogger = WandbLogger(project='HPA Protein Localization Single Class Subset', name='Adam-128-0.001')
4 trainer = Trainer(logger=wbLogger, deterministic=True, tpu_cores=8, max_epochs=epochNum, replace_sampler_ddp=False)
----> 5 trainer.fit(model, trainDL, valDL)
6
7 print(time.time() - t0)
6 frames
/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
148 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
149 msg += original_trace
--> 150 raise ProcessRaisedException(msg, error_index, failed_process.pid)
151
152
ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/lib/python3.7/logging/__init__.py", line 1028, in emit
stream.write(msg + self.terminator)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 83, in new_process
seed_everything(int(seed))
File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/seed.py", line 54, in seed_everything
log.info(f"Global seed set to {seed}")
File "/usr/lib/python3.7/logging/__init__.py", line 1378, in info
self._log(INFO, msg, args, **kwargs)
File "/usr/lib/python3.7/logging/__init__.py", line 1514, in _log
self.handle(record)
File "/usr/lib/python3.7/logging/__init__.py", line 1524, in handle
self.callHandlers(record)
File "/usr/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
hdlr.handle(record)
File "/usr/lib/python3.7/logging/__init__.py", line 894, in handle
self.emit(record)
File "/usr/lib/python3.7/logging/__init__.py", line 1033, in emit
self.handleError(record)
File "/usr/lib/python3.7/logging/__init__.py", line 946, in handleError
sys.stderr.write('--- Logging error ---\n')
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 334, in _mp_start_fn
file=sys.stderr)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/redirect.py", line 100, in new_write
cb(name, data)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 796, in _console_callback
self._backend.interface.publish_output(name, data)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 187, in publish_output
self._publish_output(o)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 192, in _publish_output
self._publish(rec)
File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 517, in _publish
if self._process and not self._process.is_alive():
File "/usr/lib/python3.7/multiprocessing/process.py", line 151, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
I'm wondering if there are any temporary workarounds for now, since I need to find a way to connect to W&B and things are a bit time-sensitive!
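For anyone wondering where the assertion itself comes from: the last frame of the trace is the standard library's multiprocessing.Process.is_alive(), which refuses to run in any process other than the one that created the Process object. Reading the trace, it looks like wandb's console redirect, set up in the notebook process, ends up calling is_alive() on its backend process handle from inside a TPU worker process. Below is a minimal stdlib-only sketch of that mechanism, not wandb's or Lightning's actual code. It assumes a POSIX system (it uses os.fork) and that Python asserts are enabled; the helper name probe_from_other_process is made up for illustration.

```python
import multiprocessing as mp
import os


def probe_from_other_process(handle):
    """Hypothetical helper: fork, call handle.is_alive() in the fork,
    and report back to the caller what happened there."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # We are now in a forked process, which is NOT the process
        # that created `handle`.
        os.close(read_fd)
        try:
            try:
                handle.is_alive()
                outcome = "ok"
            except AssertionError as exc:
                outcome = f"AssertionError: {exc}"
            os.write(write_fd, outcome.encode())
        finally:
            os._exit(0)  # never fall back into the caller's code in the fork
    # Original process: read the fork's report, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        outcome = reader.read()
    os.waitpid(pid, 0)
    return outcome


# A Process handle created (but never started) in the current process.
handle = mp.Process(target=print)

# Allowed: the creating process may query its own handle.
in_parent = handle.is_alive()

# Not allowed: a different process querying the same handle.
in_fork = probe_from_other_process(handle)

print("creating process:", in_parent)
print("forked process:  ", in_fork)
```

This would also fit the single-core run working, if no extra worker processes were spawned in that case.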
@prikmm Either the next release or the release after will have a new experimental mode of running that supports this case.
@leoleoasd I think you can just add wandb.require("service") at the top of your script.
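A rough sketch of how that suggestion might be wired into the setup from the original post. This is an assumption-laden illustration, not a confirmed fix. It assumes a wandb release that actually ships the experimental service mode (the 0.10.22 in the report may well predate it) and the PyTorch Lightning 1.2-era Trainer arguments shown in the trace. The helper name train_with_wandb_service and the project/run names are placeholders.

```python
def train_with_wandb_service(model, train_dl, val_dl, max_epochs=1):
    """Hypothetical helper: opt in to wandb's service mode, then train
    on 8 TPU cores with a WandbLogger attached."""
    import wandb

    # Opt in before any run is created. In a plain script this call
    # would sit at the very top, right after `import wandb`.
    wandb.require("service")

    from pytorch_lightning import Trainer
    from pytorch_lightning.loggers import WandbLogger

    # Placeholder project and run names; substitute your own.
    wb_logger = WandbLogger(project="my-project", name="my-run")

    # Same Trainer call shape as in the original report (PL 1.2.x API).
    trainer = Trainer(logger=wb_logger, tpu_cores=8, max_epochs=max_epochs)
    trainer.fit(model, train_dl, val_dl)
    return trainer
```

The imports live inside the function only so the call order is explicit: wandb.require("service") runs before WandbLogger gets a chance to start a run.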