AzureCLI@2 task running az webapp create-remote-connection hangs ubuntu agent due to child processes not terminating

Required Information

Type: Bug or Feature (?)

Task Name: AzureCLI@2

Environment

  • Server - Azure Pipelines
  • Agent - Hosted - ubuntu

Issue Description

AzureCLI@2 and az webapp create-remote-connection

There appear to be pitfalls in establishing the SSH SCM bridge connection from within an Azure CLI agent task, and the solutions are not obvious.

Running az webapp create-remote-connection in bash with & (background task), or in pwsh/pscore with Start-Process, just hangs the task while it waits for child processes to exit - and unless you’re familiar with the agent’s underlying process model, you’re in for a nightmare ride.

It is not obvious to me how az webapp ssh is best used, as it doesn’t appear to behave the same as plain ssh. I imagine az webapp ssh is only for opening an interactive shell session, and az webapp create-remote-connection is only for a background TCP tunnel.

Note: Not to be confused with remote-connection create which also lurks in the ‘extension reference’ documentation.

This is primarily a usability concern: when attempting SSH connections using this tool with this task (which is reasonable given its purpose), the processes spawned by az do not appear to sit well with how the task manages its process groups on the agent.

Without using setsid, or using pgrep to search for (in this instance) “webapp”, you cannot reliably terminate all the spawned processes. The spawned process that hangs the agent appears to be python3.

Such issues could be circumvented if, say, the request “Add task to run bash scripts on App Service slots” were adopted.

Is it the responsibility of this task to better manage the parent process group, such that it doesn’t hang the agent on finalizing the task?

Is there a recommended approach worth documenting for working with these types of commands in the context of AzureCLI@2 ?

Could this task be improved with some task options regarding the process model?

Troubleshooting

I used the following inline bash script to troubleshoot, querying which jobs and processes were actually running at various stages after the call to az webapp create-remote-connection. The intention was to run just a dummy helloworld.sh on a target app service: a custom docker container that has SSH enabled as per the documentation and has a custom id_agent.pub key in root's authorized_keys, to enable passwordless SSH scripting.

shopt
echo Initial processes
ps ax --forest -O pgid,tpgid
curl https://___appservicename___.azurewebsites.net/robots123345.txt
chmod 600 deploy/docker/id_agent
setsid sh -c "az webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup___ -p 2222" &
sleep 30s
mkdir -p ~/.ssh
chmod 700 ~/.ssh
ssh-keyscan -4 -p 2222 -t rsa localhost >> ~/.ssh/known_hosts
chmod 644 ~/.ssh/known_hosts
ssh -T -l root -i deploy/docker/id_agent -p 2222 localhost < deploy/scripts/helloworld.sh
echo Before killing all webapp processes
ps ax --forest -O pgid,tpgid
jobs
jobs -p
kill -- -$(jobs -p)
sleep 10s
echo After kill by PGID webapp processes
ps ax --forest -O pgid,tpgid
jobs
jobs -p
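
The fixed sleep 30s in the script above could plausibly be replaced by polling the local port until the tunnel accepts connections. The following is only a sketch: wait_for_port is a hypothetical helper name, it relies on bash's built-in /dev/tcp redirection, and it assumes the tunnel tolerates a short probe connection in the same way it tolerates ssh-keyscan.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of az or the task): poll a local TCP port
# until something accepts a connection, or give up after N attempts.
wait_for_port() {
  local port=$1 attempts=${2:-30} tried=0
  # The subshell opens and immediately closes a connection via bash's /dev/tcp.
  until (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; do
    tried=$((tried + 1))
    if [ "$tried" -ge "$attempts" ]; then
      return 1   # nothing is listening yet
    fi
    sleep 1
  done
  return 0
}

# Possible use in place of `sleep 30s` (port 2222 as in the script above):
# wait_for_port 2222 60 || { echo "tunnel did not come up" >&2; exit 1; }
```

Polling keeps the step short when the tunnel comes up quickly and fails loudly when it never does, rather than letting the later ssh call fail for a less obvious reason.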

Workaround

Execute the command kill -- -$(jobs -p), which sends SIGTERM to every process in the process group of the started background job. This relies on the job having been started with setsid, so that it has its own process group.

I also had success with kill $(pgrep -f webapp) before I analyzed the process group behaviour and decided to use setsid instead. The pgrep approach also seems to be the best solution for PowerShell Core, as I could not find cmdlets that deal with process groups.
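
As a rough sketch of this workaround with the cleanup attached to a trap, so it also runs if a later step fails: sleep 300 is a stand-in for az webapp create-remote-connection, and tunnel_pid and cleanup are hypothetical names. It assumes a non-interactive bash script, where setsid does not need to fork and the background PID therefore equals the new process group ID.

```shell
#!/usr/bin/env bash
# Stand-in for the tunnel: `sleep 300` replaces
# `az webapp create-remote-connection ...` so the sketch runs anywhere.
setsid sh -c 'sleep 300' &
tunnel_pid=$!   # assumed to equal the new PGID (non-interactive script)

cleanup() {
  # A negative PID signals every process in that process group.
  kill -- "-${tunnel_pid}" 2>/dev/null || true
}
trap cleanup EXIT

sleep 1         # give setsid a moment to create the new session
tunnel_pgid=$(ps -o pgid= -p "$tunnel_pid" | tr -d ' ')
echo "tunnel pid=${tunnel_pid} pgid=${tunnel_pgid}"

# ... the ssh steps would go here ...

cleanup
wait "$tunnel_pid" 2>/dev/null || true   # reap the terminated child
if kill -0 "$tunnel_pid" 2>/dev/null; then
  echo "tunnel still running" >&2
else
  echo "tunnel process group terminated"
fi
```

Capturing $! right after the launch avoids depending on jobs -p returning exactly one PID at the time of the kill.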

Others have worked around using expect.

Error logs

I hope these error logs may help people trace similar issues they may suffer.

Without the workaround, the task would be stuck at 100% and never finish, with the final log entry of note being either:

The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/pwsh'. This may indicate a child process inherited the STDIO streams and has not yet exited.
The STDIO streams did not close within 10 seconds of the exit event from process '/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.
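
The wording of these messages suggests the agent waits for the script's output pipes to close. That effect can be reproduced outside the agent with a small sketch in which cat stands in for the agent reading the script's output and sleep 3 stands in for the tunnel process. Redirecting the background process's STDIO is shown as one possible mitigation, but it is an assumption that has not been verified against AzureCLI@2.

```shell
#!/usr/bin/env bash
# A pipe reader (here `cat`, standing in for the agent) only sees EOF once
# every process holding the write end has exited or closed it.

start=$SECONDS
bash -c 'sleep 3 & echo "script done"' | cat
inherited_secs=$((SECONDS - start))   # cat waits for the background sleep

start=$SECONDS
bash -c 'sleep 3 >/dev/null 2>&1 </dev/null & echo "script done"' | cat
detached_secs=$((SECONDS - start))    # the sleep no longer holds the pipe

echo "inherited stdio: ${inherited_secs}s, redirected stdio: ${detached_secs}s"
```

In the first run the script exits immediately, yet the reader stays blocked for roughly the duration of the background sleep, which mirrors the "child process inherited the STDIO streams" condition in the log.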

Task logs

Here is the agent’s process stack at various stages:

Before create-remote-connection:

   PID   PGID  TPGID S TTY          TIME COMMAND
....
  1407   1407     -1 S ?        00:00:01 /opt/vsts/provisioner/provisioner
  2802   1407     -1 S ?        00:00:02  \_ /home/vsts/agents/2.174.1/bin/Agent.Listener run
  2843   1407     -1 S ?        00:00:04      \_ /home/vsts/agents/2.174.1/bin/Agent.Worker spawnclient 116 119
  3400   1407     -1 S ?        00:00:00          \_ /home/vsts/agents/2.174.1/externals/node/bin/node /home/vsts/work/_tasks/AzureCLI_46e4be58-730b-4389-8a2f-ea10b3e5e815/2.0.16/azureclitask.js
  3460   1407     -1 S ?        00:00:00              \_ /bin/bash /home/vsts/work/_temp/azureclitaskscript1600125574374.sh
  3461   1407     -1 R ?        00:00:00                  \_ ps ax --forest -O pgid,tpgid

After create-remote-connection - this is with setsid. Without setsid, the PGID would have been 1407 and there would have been no wrapping sh process:

   PID   PGID  TPGID S TTY          TIME COMMAND
....
  1407   1407     -1 S ?        00:00:02 /opt/vsts/provisioner/provisioner
  2802   1407     -1 S ?        00:00:02  \_ /home/vsts/agents/2.174.1/bin/Agent.Listener run
  2843   1407     -1 S ?        00:00:05      \_ /home/vsts/agents/2.174.1/bin/Agent.Worker spawnclient 116 119
  3400   1407     -1 S ?        00:00:00          \_ /home/vsts/agents/2.174.1/externals/node/bin/node /home/vsts/work/_tasks/AzureCLI_46e4be58-730b-4389-8a2f-ea10b3e5e815/2.0.16/azureclitask.js
  3460   1407     -1 S ?        00:00:00              \_ /bin/bash /home/vsts/work/_temp/azureclitaskscript1600125574374.sh
  3535   3535     -1 S ?        00:00:00                  \_ sh -c az webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup____ -p 2222
  3537   3535     -1 S ?        00:00:00                  |   \_ bash /usr/bin/az webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup____ -p 2222
  3538   3535     -1 S ?        00:00:01                  |       \_ /opt/az/bin/python3 -Im azure.cli webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup____ -p 2222
  3582   1407     -1 R ?        00:00:00                  \_ ps ax --forest -O pgid,tpgid

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 10 (1 by maintainers)

Top GitHub Comments

1 reaction
t-l-k commented, Sep 22, 2020

Sure, I will endeavor to do so this week. As the original pipeline has been changed to function correctly, I will recreate another pipeline using a default Microsoft docker image that already has SSH configured.

0 reactions
github-actions[bot] commented, Oct 26, 2021

This issue is stale because it has been open for 180 days with no activity. Remove the stale label or comment on the issue, otherwise this will be closed in 5 days.
