Timeout for Raylet heartbeat for I/O intensive workloads
See original GitHub issue
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS):
Ray: Ray-0.8.7
Python: Python-3.7
Tensorflow: Tensorflow-1.4
OS: Ubuntu-16.04 image on K8s
Context:
We are using Ray for data loading: the Ray actor loads both images and labels off of the disk and runs some preprocessing (mostly numpy stuff), as in the sketch below.
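Roughly, the loading pattern looks like the minimal sketch below (the actor body, shapes, and paths are illustrative stand-ins, not our actual loader):

```python
import numpy as np
import ray

ray.init()

@ray.remote
class RemoteFetcher:
    """Loads one (image, label) sample off disk and preprocesses it."""
    def fetch(self, path):
        sample = np.load(path, allow_pickle=True).item()
        image = sample["image"].astype(np.float32) / 255.0  # typical numpy preprocessing
        return image, sample["label"]

# Fake on-disk samples standing in for the real dataset on the storage system.
paths = []
for i in range(8):
    p = f"/tmp/sample_{i}.npy"
    np.save(p, {"image": np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8),
                "label": i % 10})
    paths.append(p)

fetchers = [RemoteFetcher.remote() for _ in range(4)]
batch = ray.get([fetchers[i % 4].fetch.remote(p) for i, p in enumerate(paths)])
```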
Stack trace:
(pid=raylet) E1025 09:45:46.570907 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
E1025 09:45:47.035303 9 993 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=modulus.multi_task_loader.dataloader.core.data.ray_iterator, class_name=RemoteFetcher, function_name=fetch, function_hash=}, task_id=6cd1c8101e919ffa55d5f0d50100, job_id=0100, num_args=2, num_returns=2, actor_task_spec={actor_id=55d5f0d50100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=1}
e2emapnet-hanging-fix-6jnny4-0-train-9153b4-c29:9:121 [0] NCCL INFO Destroyed comm 0x7f8d080412a0 rank 0
(pid=raylet) E1025 09:45:48.027905 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.027952 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028023 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028045 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028066 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) E1025 09:45:48.028097 837 837 node_manager.cc:3078] Failed to send get core worker stats request: IOError: 14: failed to connect to all addresses
(pid=raylet) F1025 09:45:48.036741 837 837 node_manager.cc:652] Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet) @ 0x5614df845a3d google::LogMessage::Fail()
(pid=raylet) @ 0x5614df846b9c google::LogMessage::SendToLog()
(pid=raylet) @ 0x5614df845719 google::LogMessage::Flush()
(pid=raylet) @ 0x5614df845931 google::LogMessage::~LogMessage()
(pid=raylet) @ 0x5614df7fc379 ray::RayLog::~RayLog()
(pid=raylet) @ 0x5614df54f334 ray::raylet::NodeManager::NodeRemoved()
(pid=raylet) @ 0x5614df54f4ec _ZNSt17_Function_handlerIFvRKN3ray8ClientIDERKNS0_3rpc11GcsNodeInfoEEZNS0_6raylet11NodeManager11RegisterGcsEvEUlS3_S7_E0_E9_M_invokeERKSt9_Any_dataS3_S7_
(pid=raylet) @ 0x5614df636390 ray::gcs::ServiceBasedNodeInfoAccessor::HandleNotification()
(pid=raylet) @ 0x5614df636666 _ZNSt17_Function_handlerIFvRKSsS1_EZZN3ray3gcs28ServiceBasedNodeInfoAccessor26AsyncSubscribeToNodeChangeERKSt8functionIFvRKNS3_8ClientIDERKNS3_3rpc11GcsNodeInfoEEERKS6_IFvNS3_6StatusEEEENKUlSM_E0_clESM_EUlS1_S1_E_E9_M_invokeERKSt9_Any_dataS1_S1_
(pid=raylet) @ 0x5614df640d0a _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray3gcs13CallbackReplyEEEZNS2_9GcsPubSub24ExecuteCommandIfPossibleERKSsRNS6_7ChannelEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
(pid=raylet) @ 0x5614df6427cb _ZN5boost4asio6detail18completion_handlerIZN3ray3gcs20RedisCallbackManager12CallbackItem8DispatchERSt10shared_ptrINS4_13CallbackReplyEEEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
(pid=raylet) @ 0x5614dfb292af boost::asio::detail::scheduler::do_run_one()
(pid=raylet) @ 0x5614dfb2a7b1 boost::asio::detail::scheduler::run()
(pid=raylet) @ 0x5614dfb2b7e2 boost::asio::io_context::run()
(pid=raylet) @ 0x5614df4bbb52 main
(pid=raylet) @ 0x7fe485b9eb97 __libc_start_main
(pid=raylet) @ 0x5614df4cbf91 (unknown)
(pid=25446) F1025 09:45:49.821868 25446 25446 raylet_client.cc:106] Check failed: _s.ok() [RayletClient] Unable to register worker with raylet.: IOError: Connection reset by peer
(pid=25446) *** Check failure stack trace: ***
(pid=25446) @ 0x7f52ca76b6cd google::LogMessage::Fail()
(pid=25446) @ 0x7f52ca76c82c google::LogMessage::SendToLog()
(pid=25446) @ 0x7f52ca76b3a9 google::LogMessage::Flush()
(pid=25446) @ 0x7f52ca76b5c1 google::LogMessage::~LogMessage()
(pid=25446) @ 0x7f52ca722ce9 ray::RayLog::~RayLog()
(pid=25446) @ 0x7f52ca472074 ray::raylet::RayletClient::RayletClient()
(pid=25446) @ 0x7f52ca412f30 ray::CoreWorker::CoreWorker()
(pid=25446) @ 0x7f52ca416e24 ray::CoreWorkerProcess::CreateWorker()
(pid=25446) @ 0x7f52ca417f42 ray::CoreWorkerProcess::CoreWorkerProcess()
(pid=25446) @ 0x7f52ca4188ab ray::CoreWorkerProcess::Initialize()
(pid=25446) @ 0x7f52ca371a7d __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
(pid=25446) @ 0x7f52ca372d05 __pyx_tp_new_3ray_7_raylet_CoreWorker()
(pid=25446) @ 0x551365 (unknown)
(pid=25446) @ 0x5a9cbc _PyObject_FastCallKeywords
(pid=25446) @ 0x50a5c3 (unknown)
(pid=25446) @ 0x50bfb4 _PyEval_EvalFrameDefault
(pid=25446) @ 0x507d64 (unknown)
(pid=25446) @ 0x509a90 (unknown)
(pid=25446) @ 0x50a48d (unknown)
(pid=25446) @ 0x50cd96 _PyEval_EvalFrameDefault
(pid=25446) @ 0x507d64 (unknown)
(pid=25446) @ 0x50ae13 PyEval_EvalCode
(pid=25446) @ 0x634c82 (unknown)
(pid=25446) @ 0x634d37 PyRun_FileExFlags
(pid=25446) @ 0x6384ef PyRun_SimpleFileExFlags
(pid=25446) @ 0x639091 Py_Main
(pid=25446) @ 0x4b0d00 main
(pid=25446) @ 0x7f52cd0d1b97 __libc_start_main
(pid=25446) @ 0x5b250a _start
(pid=25412) E1025 09:45:49.886169 25412 26225 core_worker.cc:691] Raylet failed.
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
It might be hard to reproduce, as this might be an issue coupled with our storage system; see the rough sketch after the checklist below.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
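A rough, self-contained sketch of the kind of script that might reproduce the symptom with fake data only (the actor and path names below are made up, and it may not trigger the problem if the storage system is the real culprit):

```python
"""Attempted repro: saturate a node with blocking disk I/O from Ray actors."""
import time

import numpy as np
import ray

ray.init()

@ray.remote
class IOHammer:
    def churn(self, idx, seconds=600):
        """Continuously write and re-read large arrays to keep the disk busy."""
        path = f"/tmp/io_hammer_{idx}.npy"
        deadline = time.time() + seconds
        while time.time() < deadline:
            np.save(path, np.random.rand(2048, 2048))  # ~32 MB write
            _ = np.load(path).mean()                   # read back + numpy work
        return idx

num_cpus = int(ray.cluster_resources()["CPU"])
actors = [IOHammer.remote() for _ in range(num_cpus)]
# If the raylet is starved long enough, the driver should see the same
# "node manager has mistakenly been marked dead" failure as in the trace above.
print(ray.get([a.churn.remote(i) for i, a in enumerate(actors)]))
```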
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 12 (12 by maintainers)
From our logs I also see a lot of delayed heartbeat reporting (those WARNINGs; in our test environment the host machines are heavily oversubscribed).
In this case I'm not sure whether the raylet really failed to report a heartbeat for 30 seconds. That is a pretty long time, since the raylet itself was not under heavy load. Maybe it is disk-I/O related?
If the heartbeat reporting is being starved by the raylet's own load, we could move it to a dedicated thread, roughly as sketched below. If the raylet cannot get CPU cycles at all, I have no idea what we can do 😦
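For illustration only (the raylet is C++, and this is not Ray's actual implementation): a minimal Python sketch of that idea, where heartbeats are sent from a dedicated daemon thread so blocking I/O elsewhere cannot delay them. All names here are hypothetical.

```python
import threading
import time

HEARTBEAT_INTERVAL_S = 0.1  # hypothetical 100 ms heartbeat period

def send_heartbeat():
    # Stand-in for the real RPC that reports liveness to the monitor / GCS.
    print("heartbeat at", time.monotonic())

def heartbeat_loop(stop: threading.Event):
    # Runs on its own thread, so slow disk I/O on the main thread cannot delay it
    # (blocking I/O releases the GIL; a purely CPU-bound loop could still interfere).
    while not stop.is_set():
        send_heartbeat()
        stop.wait(HEARTBEAT_INTERVAL_S)

stop = threading.Event()
threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True).start()

# Main loop: simulate heavy blocking I/O that would otherwise starve a
# heartbeat sent from this same thread.
for _ in range(20):
    time.sleep(0.5)  # stand-in for slow disk reads / preprocessing
stop.set()
```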
@yncxcw Actually, can you also try this?
After you start the head / worker nodes, grep the raylet / gcs_server PIDs and run
This will give a higher OS scheduling priority to the raylet and GCS server. I wonder whether this will alleviate the issue.
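One way to do that from Python (a sketch only, not necessarily the command the commenter had in mind; setting a negative nice value needs root or CAP_SYS_NICE, and os.setpriority is Unix-only):

```python
import os
import subprocess

def boost_priority(process_name: str, nice_value: int = -10) -> None:
    """Lower the nice value (raise scheduling priority) of matching processes."""
    pids = subprocess.check_output(["pgrep", "-f", process_name]).split()
    for pid in pids:
        os.setpriority(os.PRIO_PROCESS, int(pid), nice_value)
        print(f"set nice={nice_value} for {process_name} pid {int(pid)}")

for name in ("raylet", "gcs_server"):
    boost_priority(name)
```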