[CLI]: Wandb causes training to error - Dropped streaming file chunk

Describe the bug

After training for several hours and logging a lot of data points, wandb errors out with the traceback below.


import wandb
wandb.init(project="continous de rl", entity="samholt")
wandb.config.update({'learning_rate': 1e-3})

total_epochs = 10000000000

# Create model (get_model() is a project-specific helper)
model = get_model()

wandb.config.update({"model_dyn_parameters": scalar_count_of_model_parameters})

# Training loop
for epoch_i in range(total_epochs):
    wandb.log({"loss": track_loss, "epoch": epoch_i})
    wandb.watch(model)  # note: watch() is called on every iteration
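
The scalar_count_of_model_parameters value logged above comes from a project-specific helper; as a rough sketch of what it stands for, the scalar parameter count of a PyTorch model could be computed like this:

# sketch of the placeholder above: total number of scalar parameters in the model
scalar_count_of_model_parameters = sum(p.numel() for p in model.parameters())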

The above model has about 18,000-250,000 parameters, depending on the experiment. All experiments hit the following issue: wandb errors out after training for a few hours (around 3).

07:50:48,303 root INFO [epoch=0402|iter=0501] train_loss=0.04047        | s/it=0.11040
wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)
Traceback (most recent call last):
  File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 223, in <module>
    logger.info(get_NL_worldmodel(retrain=True))
  File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 166, in get_NL_worldmodel
    loss_dyn_.backward()
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 264, in <lambda>
    handle = var.register_hook(lambda grad: _callback(grad, log_track))
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 262, in _callback
    self.log_tensor_stats(grad.data, name)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 238, in log_tensor_stats
    wandb.run._log(
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1375, in _log
    self._partial_history_callback(data, step, commit)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1259, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 553, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 67, in _publish_partial_history
    self._publish(rec)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 150, in send_record_publish
    self.send_server_request(server_req)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 84, in send_server_request
    self._send_message(msg)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle 
    sent = self._sock.send(data[total_sent:])
ConnectionResetError: [Errno 104] Connection reset by peer
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle 
    sent = self._sock.send(data[total_sent:])
BrokenPipeError: [Errno 32] Broken pipe

Additional Files

debug-internal.log shows

2022-08-23 07:50:51,832 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,833 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,834 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:52,180 DEBUG   SenderThread:462244 [sender.py:send():302] send: history
2022-08-23 07:50:53,904 INFO    Thread-12 :462244 [dir_watcher.py:_on_file_modified():292] file/dir modified: /home/sam/code/continous_control/continousderl/wandb/run-20220823_040512-26dgjkcf/files/output.log
2022-08-23 07:50:55,665 DEBUG   SenderThread:462244 [sender.py:send():302] send: summary
2022-08-23 07:50:55,804 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:55,923 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history

Environment

WandB version: 0.13.1

OS: Linux Mint 20.3 (codename una)

Python version: Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] :: Anaconda, Inc. on linux

Versions of relevant libraries:

Additional Context

Please do let me know how I can solve this problem; if wandb causes training runs to crash, that makes it unusable for me.

Thank you so much for your help and your time.

Best, Sam

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
ramit-wandb commented, Aug 23, 2022

Hi @samholt!

That does not seem to be the whole debug-internal.log; would it be possible for you to share the complete log so I can have a look through it? I also wanted to know whether you are doing anything special or unusual with multiprocessing.

I’ll help investigate this and get a solution going for you.

Thanks, Ramit

0 reactions
samholt commented, Sep 17, 2022

Dear @ramit-wandb,

I fixed it by not logging so much data all the time: I removed the line wandb.watch(model) from the training loop.
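
For anyone hitting the same problem who still wants gradient logging, a minimal sketch of an alternative pattern (a toy nn.Linear stands in for the project-specific model, and the loss is a dummy): call wandb.watch once before the loop and throttle it with log_freq, so the hooks are registered a single time and gradient histograms are only sent every N logged steps.

import torch
import torch.nn as nn
import wandb

wandb.init(project="continous de rl", entity="samholt")

model = nn.Linear(10, 1)  # toy stand-in for the project-specific get_model()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Register gradient logging once, outside the loop, and only send
# gradient histograms every 1000 logged steps instead of on every call.
wandb.watch(model, log="gradients", log_freq=1000)

for epoch_i in range(10000):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()  # dummy loss standing in for track_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    wandb.log({"loss": loss.item(), "epoch": epoch_i})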

Thank you so much for the help as well,

I love Weights & Biases and consistently use it for all my work!

Keep up the great work,

Best, Sam
