AzureCLI@2 task running az webapp create-remote-connection hangs ubuntu agent due to child processes not terminating
Required Information
Type: Bug or Feature (?)
Task Name: AzureCLI@2
Environment
- Server - Azure Pipelines
- Agent - Hosted - ubuntu
Issue Description
AzureCLI@2 and az webapp create-remote-connection
There appear to be pitfalls in attempting to establish the SSH SCM bridge connection in the context of an Azure CLI agent task, with no obvious solutions forthcoming.
Running az webapp create-remote-connection in bash with & (as a background task), or in pwsh/pscore with Start-Process, just hangs the task while it waits for child processes to exit. Unless you’re familiar with the agent’s underlying process model, you’re in for a nightmare ride.
It is not obvious to me how az webapp ssh is best used, as it doesn’t appear to behave the same as plain ssh. I imagine one is just for opening an interactive shell session (webapp ssh) and the other is just for a background TCP tunnel (webapp create-remote-connection).
Note: Not to be confused with remote-connection create, which also lurks in the ‘extension reference’ documentation.
This is primarily a usability concern: when attempting SSH connections using this tool with this task (which is reasonable given its purpose), the processes spawned by az do not appear to sit well with how the task manages its process groups on the agent.
Without using setsid, or pgrep-ing for (in this instance) “webapp”, you cannot reliably terminate all the spawned processes. The spawned process that hangs the agent appears to be python3.
Such issues could be circumvented if, say, the request “Add task to run bash scripts on App Service slots” were adopted.
Is it the responsibility of this task to better manage the parent process group, such that it doesn’t hang the agent on finalizing the task?
Is there a recommended approach worth documenting for working with these types of commands in the context of AzureCLI@2 ?
Could this task be improved with some task options regarding the process model?
Troubleshooting
I used the following inline bash script to troubleshoot what was going on. It queries which jobs and processes are actually running at various stages around the call to az webapp create-remote-connection, with the intention of running just a dummy helloworld.sh on a target app service. The target is a custom docker container that has had SSH enabled as per the documentation and has a custom id_agent.pub key in the list of authorized_keys for root, to enable passwordless SSH scripting:
shopt
echo Initial processes
ps ax --forest -O pgid,tpgid
curl https://___appservicename___.azurewebsites.net/robots123345.txt
chmod 600 deploy/docker/id_agent
setsid sh -c "az webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup___ -p 2222" &
sleep 30s
mkdir ~/.ssh
chmod 700 ~/.ssh
ssh-keyscan -4 -p 2222 -t rsa localhost >> ~/.ssh/known_hosts
chmod 644 ~/.ssh/known_hosts
ssh -T -l root -i deploy/docker/id_agent -p 2222 localhost < deploy/scripts/helloworld.sh
echo Before killing all webapp processes
ps ax --forest -O pgid,tpgid
jobs
jobs -p
kill -- -$(jobs -p)
sleep 10s
echo After kill by PGID webapp processes
ps ax --forest -O pgid,tpgid
jobs
jobs -p
Workaround
Execute the command kill -- -$(jobs -p) which SIGTERMs the processes in the process group of the started background job.
I also had success with kill $(pgrep -f webapp) before I analyzed the process group behaviour and decided to use setsid instead. This seems to be the best solution for PowerShell Core also, as I could not find cmdlets that deal with process groups.
Others have worked around using expect.
Error logs
I hope these error logs may help people trace similar issues they may suffer.
Without the workaround, the task would be stuck at 100% and never finish, with the final log entry of note being either:
The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/pwsh'. This may indicate a child process inherited the STDIO streams and has not yet exited.
The STDIO streams did not close within 10 seconds of the exit event from process '/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.
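These messages point at inherited STDIO streams, which can be checked directly on a Linux agent. The following is a diagnostic sketch under assumptions, not something from the original report: it relies on /proc being available, uses sleep as a stand-in for the tunnel process, and the helper name stdout_target is hypothetical. It compares what file descriptor 1 points to for the script itself and for two background processes, one started plainly and one with its streams redirected. Whether redirection alone avoids the hang on the agent is untested here.

```shell
#!/usr/bin/env bash
# Diagnostic sketch (Linux only): which background process shares our stdout?

# Print what file descriptor 1 of the given PID points to.
stdout_target() {
  readlink "/proc/$1/fd/1"
}

scratch=$(mktemp)            # somewhere else for the detached process to write

sleep 30 &                   # inherits the script's STDIO streams
inherited_pid=$!
sleep 30 >"$scratch" 2>&1 </dev/null &   # streams point away from the script
detached_pid=$!
sleep 1                      # let both children finish setting up

# BASHPID is the current (sub)shell in bash; fall back to $$ elsewhere.
self=${BASHPID:-$$}
own=$(stdout_target "$self")

if [ "$(stdout_target "$inherited_pid")" = "$own" ]; then
  inherited_result="shared"
else
  inherited_result="not shared"
fi

if [ "$(stdout_target "$detached_pid")" = "$own" ]; then
  detached_result="shared"
else
  detached_result="not shared"
fi

echo "plain background job:      stdout $inherited_result"
echo "redirected background job: stdout $detached_result"

# Clean up the stand-in processes and the scratch file.
kill "$inherited_pid" "$detached_pid"
rm -f "$scratch"
```

A process reported as "shared" is one that could keep the task's output stream open after the script itself has exited.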
Task logs
Here is the agent’s process stack at various stages:
Before create-remote-connection:
PID PGID TPGID S TTY TIME COMMAND
....
1407 1407 -1 S ? 00:00:01 /opt/vsts/provisioner/provisioner
2802 1407 -1 S ? 00:00:02 \_ /home/vsts/agents/2.174.1/bin/Agent.Listener run
2843 1407 -1 S ? 00:00:04 \_ /home/vsts/agents/2.174.1/bin/Agent.Worker spawnclient 116 119
3400 1407 -1 S ? 00:00:00 \_ /home/vsts/agents/2.174.1/externals/node/bin/node /home/vsts/work/_tasks/AzureCLI_46e4be58-730b-4389-8a2f-ea10b3e5e815/2.0.16/azureclitask.js
3460 1407 -1 S ? 00:00:00 \_ /bin/bash /home/vsts/work/_temp/azureclitaskscript1600125574374.sh
3461 1407 -1 R ? 00:00:00 \_ ps ax --forest -O pgid,tpgid
After create-remote-connection - this is with setsid. Without setsid, the PGID would have been 1407 and there would have been no wrapping sh process:
PID PGID TPGID S TTY TIME COMMAND
....
1407 1407 -1 S ? 00:00:02 /opt/vsts/provisioner/provisioner
2802 1407 -1 S ? 00:00:02 \_ /home/vsts/agents/2.174.1/bin/Agent.Listener run
2843 1407 -1 S ? 00:00:05 \_ /home/vsts/agents/2.174.1/bin/Agent.Worker spawnclient 116 119
3400 1407 -1 S ? 00:00:00 \_ /home/vsts/agents/2.174.1/externals/node/bin/node /home/vsts/work/_tasks/AzureCLI_46e4be58-730b-4389-8a2f-ea10b3e5e815/2.0.16/azureclitask.js
3460 1407 -1 S ? 00:00:00 \_ /bin/bash /home/vsts/work/_temp/azureclitaskscript1600125574374.sh
3535 3535 -1 S ? 00:00:00 \_ sh -c az webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup____ -p 2222
3537 3535 -1 S ? 00:00:00 | \_ bash /usr/bin/az webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup____ -p 2222
3538 3535 -1 S ? 00:00:01 | \_ /opt/az/bin/python3 -Im azure.cli webapp create-remote-connection -n ___appservicename___ -g ___resourcegroup____ -p 2222
3582 1407 -1 R ? 00:00:00 \_ ps ax --forest -O pgid,tpgid
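The difference visible in these listings, where a plain background job keeps the task's PGID while a setsid job gets its own, can be reproduced outside the pipeline. A minimal sketch, assuming a non-interactive bash script on Linux, with sleep standing in for the az command and pgid_of as a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: compare process group IDs with and without setsid.

# Print the process group ID of the given PID, without padding.
pgid_of() {
  ps -o pgid= -p "$1" | tr -d ' '
}

sleep 30 &                   # plain background job
plain_pid=$!
setsid sleep 30 &            # background job in a new session
setsid_pid=$!
sleep 1                      # let setsid create the new session first

shell_pgid=$(pgid_of "$$")
plain_pgid=$(pgid_of "$plain_pid")
new_pgid=$(pgid_of "$setsid_pid")

echo "shell PGID:      $shell_pgid"
echo "plain job PGID:  $plain_pgid"
echo "setsid job PGID: $new_pgid (PID $setsid_pid)"

# Clean up: one single PID, and one whole process group.
kill "$plain_pid"
kill -- "-$new_pgid"
```

Under these assumptions the plain job reports the same PGID as the shell, while the setsid job reports a PGID equal to its own PID, matching the 3535 group in the listing above.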
Sure, I will endeavor to do so this week. As the original pipeline has been changed to function correctly, I will recreate another pipeline using a default Microsoft docker image that already has SSH configured.
This issue is stale because it has been open for 180 days with no activity. Remove the stale label or comment on the issue otherwise this will be closed in 5 days