Self-hosted runners for GitHub Actions fail very often on


Very often, the self-hosted runners fail with this message:

The self-hosted runner: Airflow Runner 32 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Example failure: https://github.com/apache/airflow/actions/runs/584691417

It happened basically every time (and in many cases more than once) over the last few pushes I’ve done.

I think we need to get to the root cause of it. I suspect it has something to do with scaling the runners in and out.

Happy to help solve it; I just need access to the logs, @ashb 😃.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

potiuk commented, Feb 21, 2021

It works much better now! Thanks! Closing it.

ashb commented, Feb 20, 2021

I have been working on this slowly. My hypothesis is that it's a race condition. While the runner is busy, it is protected from scale-in. When it finishes, the protection is removed and AWS starts terminating the instance. Before the instance actually terminates, though, the runner picks up a new job, right in time to be hard-killed.
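The hypothesised race can be replayed as a toy model. This is a minimal Python sketch, not the actual Airflow runner tooling. `Runner`, `simulate`, `drain_before_kill` and the job names are all hypothetical. The model only shows why the ordering matters: if the runner keeps accepting work after its scale-in protection is dropped, a job that arrives before the instance dies is lost.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Runner:
    """Toy model of one auto-scaled self-hosted runner (hypothetical, illustrative only)."""
    protected: bool = False        # scale-in protection flag on the instance
    accepting_jobs: bool = True    # whether the runner still picks up new work
    current_job: Optional[str] = None
    lost_jobs: List[str] = field(default_factory=list)

    def start_job(self, job: str) -> None:
        # An idle runner that still accepts work picks up the job and becomes protected.
        if self.accepting_jobs and self.current_job is None:
            self.current_job = job
            self.protected = True

    def finish_job(self) -> None:
        # Once idle, the instance loses protection and becomes a scale-in candidate.
        self.current_job = None
        self.protected = False

    def hard_kill(self) -> None:
        # The instance is terminated; an in-flight job would surface as
        # "lost communication with the server".
        if self.current_job is not None:
            self.lost_jobs.append(self.current_job)
            self.current_job = None
        self.accepting_jobs = False


def simulate(drain_before_kill: bool) -> List[str]:
    """Replay the hypothesised sequence and return the jobs that were hard-killed."""
    r = Runner()
    r.start_job("job-1")           # busy: protected from scale-in
    r.finish_job()                 # idle: protection removed
    # AWS has now chosen this unprotected instance for termination.
    if drain_before_kill:
        r.accepting_jobs = False   # runner stops taking work before it goes away
    r.start_job("job-2")           # a new job arrives inside the window
    r.hard_kill()                  # the instance actually terminates
    return r.lost_jobs
```

Here `simulate(False)` returns `["job-2"]`, which is the failure from the issue. `simulate(True)` returns an empty list because the runner stopped accepting work before the instance was killed.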

My in-progress fix is to use a lifecycle hook so that the instance is not killed instantly.
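A lifecycle hook on the `autoscaling:EC2_INSTANCE_TERMINATING` transition keeps the instance in a wait state until something calls `complete_lifecycle_action` or the hook's `HeartbeatTimeout` expires (3600 seconds by default). The Python sketch below is not the fix that was actually deployed. It assumes that some notification (for example via SQS or EventBridge) triggers `handle_termination` on the instance. The callbacks `stop_accepting_jobs` and `runner_is_busy` are hypothetical placeholders for however a given setup drains the runner. Only `complete_lifecycle_action` and its keyword arguments come from the boto3 Auto Scaling client.

```python
import time
from typing import Any, Callable, Protocol


class AutoScalingClient(Protocol):
    """The one method of a boto3 'autoscaling' client that this sketch uses."""
    def complete_lifecycle_action(self, **kwargs: Any) -> Any: ...


def handle_termination(
    client: AutoScalingClient,
    asg_name: str,
    hook_name: str,
    instance_id: str,
    stop_accepting_jobs: Callable[[], None],  # hypothetical: drain the runner
    runner_is_busy: Callable[[], bool],       # hypothetical: is a job still running?
    poll_seconds: float = 10.0,
    max_wait_seconds: float = 3600.0,         # keep at or below the hook's HeartbeatTimeout
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Drain the runner, wait for any in-flight job, then let termination proceed.

    Returns the number of seconds spent waiting.
    """
    # Step 1: close the race window so no new job can be picked up.
    stop_accepting_jobs()

    # Step 2: give a job that is already running a bounded amount of time to finish.
    waited = 0.0
    while runner_is_busy() and waited < max_wait_seconds:
        sleep(poll_seconds)
        waited += poll_seconds

    # Step 3: tell Auto Scaling it may now terminate the instance.
    client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return waited
```

In a real deployment, `client` would be `boto3.client("autoscaling")`. The hook itself would be registered with `put_lifecycle_hook(..., LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING")`. If jobs can outlast the heartbeat timeout, the loop would also need to call `record_lifecycle_action_heartbeat`. The callbacks and the sleep function are injected so that the ordering (drain first, then complete) can be tested without AWS.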


Top Results From Across the Web

  • Monitoring and troubleshooting self-hosted runners: You can monitor your self-hosted runners to view their activity and diagnose ... If you have any failing checks, you can see more...
  • [Self-hosted] job abandoned #1546 - actions/runner - GitHub: Describe the bug Since yesterday, CI jobs keep failing. I tried to re-run the previously passed changes and still failed.
  • Dealing with jobs failing with "lost communication with the ...: I think I have not yet encountered this myself, but I believe any jobs on self-hosted GitHub runners are subject to get this...
  • Checkout action randomly fails on self-hosted runner #333: This issue occurs randomly. Sometimes re-running the action fixes this. Any steps to debug the issue and find the root cause? The error...
  • Workflow failure due to runner shutdown/stoppage · Issue #2040: Since 30 July 2022, our workflow fails with the following message: "The self-hosted runner: ***** lost communication with the server. Verify the ...
