Self-hosted agents intermittently do not pick up new jobs


Having issue with YAML?

No

Having issue with Tasks?

No

Having issue with software on Hosted Agent?

No

Having generic issue with Azure-Pipelines/VSTS/TFS?

No

Have you tried troubleshooting?

Yes

Agent Version and Platform

Version of your agent? 2.175.2

OS of the machine running the agent? CentOS 7

Azure DevOps Type and Version

dev.azure.com

If dev.azure.com, what is your organization name? https://dev.azure.com/ (will provide this privately if necessary)

What’s not working?

We have a series of pipelines that all behave the same way:

First Stage

Second Stage

  • Wait for the self-hosted agent to pick up the work, based on a custom “demand” that looks for the unique agent name
  • Run some custom code on the self-hosted agent
  • Finish running custom code
  • Delete GCP VM

Agent and Worker’s Diagnostic Logs

See the following log files for an example of a successful run, and an unsuccessful run.

self-hosted-agent-log-failure.log self-hosted-agent-log-successful.log

Key differences that I’ve noticed:

The Linux kernel version printed at the top differs between the two runs. I’m not sure why, or whether it matters for this particular issue:

successful 
[2020-10-28 05:30:47Z INFO AgentProcess] RuntimeInformation: Linux 4.19.150+ #1 SMP Sat Oct 24 07:57:26 PDT 2020.

failure
[2020-10-21 23:01:22Z INFO AgentProcess] RuntimeInformation: Linux 5.4.49+ #1 SMP Sun Oct 18 19:43:35 PDT 2020.

Note that in the failure log the agent starts listening for jobs but receives no message for 30 minutes and later fails authentication with a 401, whereas in the success log the job arrives about 40 seconds after the agent starts listening:

successful
[2020-10-28 05:30:49Z INFO MessageListener] Session created.
[2020-10-28 05:30:49Z INFO Terminal] WRITE LINE: 2020-10-28 05:30:49Z: Listening for Jobs
[2020-10-28 05:30:49Z INFO JobDispatcher] Set agent/worker IPC timeout to 30 seconds.
[2020-10-28 05:31:28Z INFO RSAFileKeyManager] Loading RSA key parameters from file /azp/agent/.credentials_rsaparams
[2020-10-28 05:31:28Z INFO MessageListener] Message '1' received from session 'b2dbac0f-1ab5-45ec-ae40-c811b5d35d0d'.
[2020-10-28 05:31:28Z INFO JobDispatcher] Job request 2037 for plan e2905f74-12be-4282-8fb2-215cd5c5d3f3 job fc308004-fcdd-5de5-2151-99c66bc3b9d8 received.
[2020-10-28 05:31:28Z INFO Terminal] WRITE LINE: 2020-10-28 05:31:28Z: Running job: Build container


failure
[2020-10-21 23:01:23Z INFO RSAFileKeyManager] Loading RSA key parameters from file /azp/agent/.credentials_rsaparams
[2020-10-21 23:01:23Z INFO VisualStudioServices] AAD Correlation ID for this token request: Unknown
[2020-10-21 23:01:23Z INFO MessageListener] Session created.
[2020-10-21 23:01:23Z INFO Terminal] WRITE LINE: 2020-10-21 23:01:23Z: Listening for Jobs
[2020-10-21 23:01:23Z INFO JobDispatcher] Set agent/worker IPC timeout to 30 seconds.
[2020-10-21 23:31:24Z INFO MessageListener] No message retrieved from session 'dc2f77ad-6fdb-4a0f-b539-f0eefaef1c8d' within last 30 minutes.
[2020-10-21 23:56:26Z WARN VisualStudioServices] Authentication failed with status code 401.
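To compare runs like the two above without reading the logs by eye, the delay between "Listening for Jobs" and the first job request can be measured with a short script. This is a minimal sketch, assuming the line layout seen in the excerpts above (`[timestamp Z LEVEL Component] message`); the function names and the embedded sample lines are illustrative only, and other agent versions may log differently.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above: [YYYY-MM-DD HH:MM:SSZ LEVEL Component] message
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z (\w+) (\w+)\] (.*)$")
# Matches e.g. "Job request 2037 for plan ... received."
JOB_RECEIVED_RE = re.compile(r"^Job request \d+ .* received\.$")


def parse_line(line):
    """Split one diagnostic log line into (timestamp, level, component, message)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts, level, component, message = m.groups()
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), level, component, message


def seconds_until_first_job(lines):
    """Seconds from 'Listening for Jobs' to the first received job request.

    Returns None if the agent never logged a received job request.
    """
    listening_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _level, component, message = parsed
        if listening_at is None and "Listening for Jobs" in message:
            listening_at = ts
        elif (listening_at is not None and component == "JobDispatcher"
              and JOB_RECEIVED_RE.match(message)):
            return (ts - listening_at).total_seconds()
    return None


# Sample lines copied from the excerpts above.
SUCCESS = [
    "[2020-10-28 05:30:49Z INFO Terminal] WRITE LINE: 2020-10-28 05:30:49Z: Listening for Jobs",
    "[2020-10-28 05:31:28Z INFO JobDispatcher] Job request 2037 for plan e2905f74-12be-4282-8fb2-215cd5c5d3f3 job fc308004-fcdd-5de5-2151-99c66bc3b9d8 received.",
]
FAILURE = [
    "[2020-10-21 23:01:23Z INFO Terminal] WRITE LINE: 2020-10-21 23:01:23Z: Listening for Jobs",
    "[2020-10-21 23:31:24Z INFO MessageListener] No message retrieved from session 'dc2f77ad-6fdb-4a0f-b539-f0eefaef1c8d' within last 30 minutes.",
    "[2020-10-21 23:56:26Z WARN VisualStudioServices] Authentication failed with status code 401.",
]

if __name__ == "__main__":
    print("successful run:", seconds_until_first_job(SUCCESS))
    print("failed run:", seconds_until_first_job(FAILURE))
```

In practice the lists would be replaced by the lines of the attached log files, for example `open(path).readlines()`.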

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

anatolybolshakov commented, Apr 7, 2021

@dvmorris @KrylixZA @mk-AVA I’m closing this for now due to inactivity. Please let us know if it’s still relevant for you and provide more details so that we can investigate further.

KrylixZA commented, Feb 1, 2021

@KrylixZA Do you know the timestamp of the hung job, i.e. when it should have started but didn’t?

Relative to the logs I added above, it is between these two logged outputs:

[2021-01-29 13:11:23Z INFO JobDispatcher] Send job request message to worker for job 3bf2df57-857a-5250-2f8f-945c718af65b (30 KB).
[2021-01-29 13:11:53Z INFO JobDispatcher] Job request message sending for job 3bf2df57-857a-5250-2f8f-945c718af65b been cancelled after waiting for 30 seconds, kill running worker.

More specifically, the hang begins at exactly 13:11:23Z, when the job request message is sent to the worker.
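The same pattern can be searched for across a full agent log. The sketch below pairs each "Send job request message to worker" line with a later "been cancelled" line for the same job and reports how long the send was outstanding. It assumes the exact JobDispatcher wording quoted above; the function and variable names are hypothetical, not part of the agent.

```python
import re
from datetime import datetime

# Wording assumed from the two JobDispatcher lines quoted in this comment.
SEND_RE = re.compile(
    r"^\[(?P<ts>[\d\- :]{19})Z INFO JobDispatcher\] "
    r"Send job request message to worker for job (?P<job>[0-9a-f\-]{36})"
)
CANCEL_RE = re.compile(
    r"^\[(?P<ts>[\d\- :]{19})Z INFO JobDispatcher\] "
    r"Job request message sending for job (?P<job>[0-9a-f\-]{36}) been cancelled"
)


def _ts(text):
    """Parse the timestamp portion of a log line (without the trailing Z)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def find_stalled_sends(lines):
    """Return (job_id, send_time, seconds_waited) for each cancelled send."""
    sent = {}      # job id -> time the send was logged
    stalled = []
    for line in lines:
        line = line.strip()
        m = SEND_RE.match(line)
        if m:
            sent[m["job"]] = _ts(m["ts"])
            continue
        m = CANCEL_RE.match(line)
        if m and m["job"] in sent:
            started = sent.pop(m["job"])
            waited = (_ts(m["ts"]) - started).total_seconds()
            stalled.append((m["job"], started, waited))
    return stalled


# The two lines quoted in this comment.
SAMPLE = [
    "[2021-01-29 13:11:23Z INFO JobDispatcher] Send job request message to worker for job 3bf2df57-857a-5250-2f8f-945c718af65b (30 KB).",
    "[2021-01-29 13:11:53Z INFO JobDispatcher] Job request message sending for job 3bf2df57-857a-5250-2f8f-945c718af65b been cancelled after waiting for 30 seconds, kill running worker.",
]

if __name__ == "__main__":
    for job, started, waited in find_stalled_sends(SAMPLE):
        print(f"job {job}: send started {started}, cancelled after {waited:.0f}s")
```

A send that is later cancelled shows up with a wait equal to the agent/worker IPC timeout, which is 30 seconds in the logs quoted here.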
