Timeout for Raylet heartbeat for I/O intensive workloads

See original GitHub issue

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

  • Ray: 0.8.7
  • Python: 3.7
  • TensorFlow: 1.4
  • OS: Ubuntu 16.04 image on K8s

Context:

We are using Ray for data loading: the Ray actor loads both images and labels off of the disk and runs some preprocessing (mostly numpy).
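
For context, the access pattern is roughly the sketch below (illustrative only; the class, method, and paths are placeholders rather than the actual RemoteFetcher code): each actor call reads an image/label pair off of disk and then does some numpy preprocessing.

import numpy as np
import ray

@ray.remote
class Fetcher:  # placeholder for the actual RemoteFetcher actor
    def fetch(self, image_path, label_path):
        # Disk-heavy part: read the raw image and label off of storage.
        image = np.fromfile(image_path, dtype=np.uint8)
        label = np.load(label_path)
        # CPU part: mostly numpy preprocessing.
        image = image.astype(np.float32) / 255.0
        return image, label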

Stack trace:

(pid=raylet) E1025 09:45:46.570907   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1025 09:45:47.035303     9   993 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=modulus.multi_task_loader.dataloader.core.data.ray_iterator, class_name=RemoteFetcher, function_name=fetch, function_hash=}, task_id=6cd1c8101e919ffa55d5f0d50100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=55d5f0d50100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=1}
e2emapnet-hanging-fix-6jnny4-0-train-9153b4-c29:9:121 [0] NCCL INFO Destroyed comm 0x7f8d080412a0 rank 0
(pid=raylet) E1025 09:45:48.027905   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.027952   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028023   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028045   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028066   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028097   837   837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) F1025 09:45:48.036741   837   837 node_manager.cc:652]  Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @     0x5614df845a3d  google::LogMessage::Fail()
(pid=raylet)     @     0x5614df846b9c  google::LogMessage::SendToLog()
(pid=raylet)     @     0x5614df845719  google::LogMessage::Flush()
(pid=raylet)     @     0x5614df845931  google::LogMessage::~LogMessage()
(pid=raylet)     @     0x5614df7fc379  ray::RayLog::~RayLog()
(pid=raylet)     @     0x5614df54f334  ray::raylet::NodeManager::NodeRemoved()
(pid=raylet)     @     0x5614df54f4ec  _ZNSt17_Function_handlerIFvRKN3ray8ClientIDERKNS0_3rpc11GcsNodeInfoEEZNS0_6raylet11NodeManager11RegisterGcsEvEUlS3_S7_E0_E9_M_invokeERKSt9_Any_dataS3_S7_
(pid=raylet)     @     0x5614df636390  ray::gcs::ServiceBasedNodeInfoAccessor::HandleNotification()
(pid=raylet)     @     0x5614df636666  _ZNSt17_Function_handlerIFvRKSsS1_EZZN3ray3gcs28ServiceBasedNodeInfoAccessor26AsyncSubscribeToNodeChangeERKSt8functionIFvRKNS3_8ClientIDERKNS3_3rpc11GcsNodeInfoEEERKS6_IFvNS3_6StatusEEEENKUlSM_E0_clESM_EUlS1_S1_E_E9_M_invokeERKSt9_Any_dataS1_S1_
(pid=raylet)     @     0x5614df640d0a  _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray3gcs13CallbackReplyEEEZNS2_9GcsPubSub24ExecuteCommandIfPossibleERKSsRNS6_7ChannelEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
(pid=raylet)     @     0x5614df6427cb  _ZN5boost4asio6detail18completion_handlerIZN3ray3gcs20RedisCallbackManager12CallbackItem8DispatchERSt10shared_ptrINS4_13CallbackReplyEEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=raylet)     @     0x5614dfb292af  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @     0x5614dfb2a7b1  boost::asio::detail::scheduler::run()
(pid=raylet)     @     0x5614dfb2b7e2  boost::asio::io_context::run()
(pid=raylet)     @     0x5614df4bbb52  main
(pid=raylet)     @     0x7fe485b9eb97  __libc_start_main
(pid=raylet)     @     0x5614df4cbf91  (unknown)
(pid=25446) F1025 09:45:49.821868 25446 25446 raylet_client.cc:106]  Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: Connection reset by peer
(pid=25446) *** Check failure stack trace: ***
(pid=25446)     @     0x7f52ca76b6cd  google::LogMessage::Fail()
(pid=25446)     @     0x7f52ca76c82c  google::LogMessage::SendToLog()
(pid=25446)     @     0x7f52ca76b3a9  google::LogMessage::Flush()
(pid=25446)     @     0x7f52ca76b5c1  google::LogMessage::~LogMessage()
(pid=25446)     @     0x7f52ca722ce9  ray::RayLog::~RayLog()
(pid=25446)     @     0x7f52ca472074  ray::raylet::RayletClient::RayletClient()
(pid=25446)     @     0x7f52ca412f30  ray::CoreWorker::CoreWorker()
(pid=25446)     @     0x7f52ca416e24  ray::CoreWorkerProcess::CreateWorker()
(pid=25446)     @     0x7f52ca417f42  ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=25446)     @     0x7f52ca4188ab  ray::CoreWorkerProcess::Initialize()
(pid=25446)     @     0x7f52ca371a7d  __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
(pid=25446)     @     0x7f52ca372d05  __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=25446)     @           0x551365  (unknown)
(pid=25446)     @           0x5a9cbc  _PyObject_FastCallKeywords
(pid=25446)     @           0x50a5c3  (unknown)
(pid=25446)     @           0x50bfb4  _PyEval_EvalFrameDefault
(pid=25446)     @           0x507d64  (unknown)
(pid=25446)     @           0x509a90  (unknown)
(pid=25446)     @           0x50a48d  (unknown)
(pid=25446)     @           0x50cd96  _PyEval_EvalFrameDefault
(pid=25446)     @           0x507d64  (unknown)
(pid=25446)     @           0x50ae13  PyEval_EvalCode
(pid=25446)     @           0x634c82  (unknown)
(pid=25446)     @           0x634d37  PyRun_FileExFlags
(pid=25446)     @           0x6384ef  PyRun_SimpleFileExFlags
(pid=25446)     @           0x639091  Py_Main
(pid=25446)     @           0x4b0d00  main
(pid=25446)     @     0x7f52cd0d1b97  __libc_start_main
(pid=25446)     @           0x5b250a  _start
(pid=25412) E1025 09:45:49.886169 25412 26225 core_worker.cc:691] Raylet failed.

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

It might be hard to reproduce, as the issue may be coupled with our storage system.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
WangTaoTheTonic commented, Oct 28, 2020

I think in this case the heartbeat from the raylet is missing. In logs from users who have reported this issue before, I’ve seen many WARNINGs from raylets that the heartbeat update was not reported on time.

From our own logs I also see a lot of delayed heartbeat reporting (those WARNINGs; in our test environment the host machines are heavily oversubscribed).

In this case I’m not sure whether the raylet really failed to report a heartbeat for 30 seconds. That’s a pretty long time, since the raylet itself is not doing much heavy work. Maybe it is disk-I/O related?

If heartbeat reporting is being stalled by the raylet’s own load, we can move it to a dedicated thread. If the raylet itself cannot get CPU cycles, I have no idea what we can do 😦
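
For reference, that 30-second window comes from the heartbeat configuration (roughly a 100 ms heartbeat period times a ~300-heartbeat timeout in this Ray version). A minimal sketch of raising it at ray.init time follows; the key names and the _internal_config argument are assumptions for Ray 0.8.x and have been renamed in later releases, so double-check them against your version.

import json
import ray

# Assumed Ray 0.8.x-era config keys (later renamed and moved to _system_config).
ray.init(
    _internal_config=json.dumps({
        # Consecutive missed heartbeats before the monitor marks a node dead
        # (default ~300, i.e. about 30 s at a 100 ms heartbeat period).
        "num_heartbeats_timeout": 600,
        # How often the raylet reports a heartbeat, in milliseconds.
        "raylet_heartbeat_timeout_milliseconds": 100,
    })
)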

0 reactions
rkooo567 commented, Oct 29, 2020

@yncxcw Actually, can you also try this?

After you start the head / worker nodes, grep for your raylet / gcs_server PIDs and run:

# For both worker / head nodes
sudo renice -n -19 [raylet_pid]
# Only for the head node
sudo renice -n -19 [gcs_server_pid]

This will give the raylet and GCS server higher OS scheduling priority. I wonder if it will alleviate the issue.
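
If you want to script that across nodes, here is a rough sketch (assumes pgrep and passwordless sudo are available; the helper name is just for illustration):

import subprocess

def renice_ray_processes(nice_value: int = -19) -> None:
    # Find the raylet / gcs_server PIDs and raise their OS scheduling priority,
    # mirroring the manual renice commands above. Needs root / CAP_SYS_NICE.
    for name in ("raylet", "gcs_server"):
        pids = subprocess.run(["pgrep", "-x", name],
                              capture_output=True, text=True).stdout.split()
        for pid in pids:
            subprocess.run(["sudo", "renice", "-n", str(nice_value), "-p", pid],
                           check=False)

renice_ray_processes()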
