[CLI]: Wandb causes training to error - Dropped streaming file chunk

Describe the bug

After training for several hours and logging a lot of data points, wandb errors out with the traceback below.


import wandb
wandb.init(project="continous de rl", entity="samholt")
wandb.config.update({'learning_rate': 1e-3})

total_epochs = 10000000000

# Create model (get_model() is a project-specific helper)
model = get_model()

wandb.config.update({"model_dyn_parameters": scalar_count_of_model_parameters})

# Training loop
for epoch_i in range(total_epochs):
    wandb.log({"loss": track_loss, "epoch": epoch_i})
    wandb.watch(model)  # note: watch() is called on every iteration
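
The scalar_count_of_model_parameters value logged above comes from a project-specific helper; as a rough sketch of what it stands for, the scalar parameter count of a PyTorch model could be computed like this:

# sketch of the placeholder above: total number of scalar parameters in the model
scalar_count_of_model_parameters = sum(p.numel() for p in model.parameters())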

The above model has about 18,000-250,000 parameters, depending on the experiment. All experiments hit the following issue: wandb errors out after training for a few hours (around 3).

07:50:48,303 root INFO [epoch=0402|iter=0501] train_loss=0.04047        | s/it=0.11040
wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)
Traceback (most recent call last):
  File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 223, in <module>
    logger.info(get_NL_worldmodel(retrain=True))
  File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 166, in get_NL_worldmodel
    loss_dyn_.backward()
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 264, in <lambda>
    handle = var.register_hook(lambda grad: _callback(grad, log_track))
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 262, in _callback
    self.log_tensor_stats(grad.data, name)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 238, in log_tensor_stats
    wandb.run._log(
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1375, in _log
    self._partial_history_callback(data, step, commit)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1259, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 553, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 67, in _publish_partial_history
    self._publish(rec)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 150, in send_record_publish
    self.send_server_request(server_req)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 84, in send_server_request
    self._send_message(msg)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle 
    sent = self._sock.send(data[total_sent:])
ConnectionResetError: [Errno 104] Connection reset by peer
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle 
    sent = self._sock.send(data[total_sent:])
BrokenPipeError: [Errno 32] Broken pipe

Additional Files

debug-internal.log shows

2022-08-23 07:50:51,832 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,833 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,834 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:52,180 DEBUG   SenderThread:462244 [sender.py:send():302] send: history
2022-08-23 07:50:53,904 INFO    Thread-12 :462244 [dir_watcher.py:_on_file_modified():292] file/dir modified: /home/sam/code/continous_control/continousderl/wandb/run-20220823_040512-26dgjkcf/files/output.log
2022-08-23 07:50:55,665 DEBUG   SenderThread:462244 [sender.py:send():302] send: summary
2022-08-23 07:50:55,804 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:55,923 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history

Environment

WandB version: 0.13.1

OS: Linux Mint 20.3 (codename una)

Python version: Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] :: Anaconda, Inc. on linux

Versions of relevant libraries:

Additional Context

Please do let me know how I can solve this problem; if wandb causes training runs to crash, that makes it unusable for me.

Thank you so much for your help and your time.

Best, Sam

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
ramit-wandb commented, Aug 23, 2022

Hi @samholt!

That does not seem to be the whole debug-internal.log; would it be possible for you to share the complete log so I can have a look through it? I also wanted to know whether you are doing anything special or unusual with multiprocessing.

I’ll help investigate this and get a solution going for you.

Thanks, Ramit

0 reactions
samholt commented, Sep 17, 2022

Dear @ramit-wandb,

I fixed it by not logging so much data all the time: I removed the line wandb.watch(model) from the training loop.
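
For anyone hitting the same problem who still wants gradient logging, a minimal sketch of an alternative pattern (a toy nn.Linear stands in for the project-specific model, and the loss is a dummy): call wandb.watch once before the loop and throttle it with log_freq, so the hooks are registered a single time and gradient histograms are only sent every N logged steps.

import torch
import torch.nn as nn
import wandb

wandb.init(project="continous de rl", entity="samholt")

model = nn.Linear(10, 1)  # toy stand-in for the project-specific get_model()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Register gradient logging once, outside the loop, and only send
# gradient histograms every 1000 logged steps instead of on every call.
wandb.watch(model, log="gradients", log_freq=1000)

for epoch_i in range(10000):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()  # dummy loss standing in for track_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    wandb.log({"loss": loss.item(), "epoch": epoch_i})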

Thank you so much for the help as well,

I love Weights & Biases and consistently use it for all my work!

Keep up the great work,

Best, Sam
