Stub process is unhealthy and it will be restarted
I’m getting "Stub process is unhealthy and it will be restarted" repeatedly when calling infer, after which the server restarts. I have deployed Triton server on GKE with 3 models.
The first time I infer model1 I get this error; the second and subsequent requests don't. But if I then infer model2 after getting a successful result from model1, the error pops up again, and the same happens for model3.
logs:
responses.append(self.triton_client.infer(
File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1086, in infer
raise_error_grpc(rpc_error)
File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 61, in raise_error_grpc
raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'damage_0', message: Stub process is not healthy.
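For reference, the call site looks roughly like this (a minimal sketch; the input name, shape, and model name are placeholders for my actual model configs):

import numpy as np
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(url="<triton host>:8001")
# hypothetical input name/shape; the real ones come from each model's config.pbtxt
image = np.zeros((1, 3, 800, 800), dtype=np.float32)
infer_input = grpcclient.InferInput("INPUT__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
response = triton_client.infer(model_name="damage", inputs=[infer_input])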
I’m loading 3 models using the Python backend and a custom Triton image (converted Detectron models), which I built with this Dockerfile:
FROM nvcr.io/nvidia/tritonserver:21.10-py3
RUN pip3 install torch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 && \
    pip3 install pillow
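The image is built and pushed to the registry along these lines (the registry path and tag here are just placeholders):

docker build -t gcr.io/<project>/tritonserver-custom:21.10 .
docker push gcr.io/<project>/tritonserver-custom:21.10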
Also, while running Triton server locally with Docker, I had to increase --shm-size because it errored out at the default 64 MB. On Kubernetes this is a little trickier; the workaround is an emptyDir volume with medium: Memory. My YAML looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: triton-mms
  name: triton-mms
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-mms
  template:
    metadata:
      labels:
        app: triton-mms
    spec:
      containers:
      - image: <custom triton image>
        command: ["/bin/sh", "-c"]
        args: ["tritonserver --model-repository=<gcs model repo>"]
        imagePullPolicy: IfNotPresent
        name: triton-mms
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /secret/gcp-creds.json
        resources:
          limits:
            memory: 5Gi
            nvidia.com/gpu: 1
          requests:
            memory: 5Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - name: vsecret
          mountPath: "/secret"
          readOnly: true
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "1024Mi"
      - name: vsecret
        secret:
          secretName: gcpcreds
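For comparison, the local Docker run that worked after bumping shared memory was roughly this (the model path and shm size are placeholders):

docker run --gpus all --rm --shm-size=1g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repo:/models \
  <custom triton image> tritonserver --model-repository=/models

Inside the pod, the mounted /dev/shm size can be checked with kubectl exec <pod> -- df -h /dev/shm.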
I've never faced this issue before, and I suspect it's related to shared memory, since I'd never seen that error either.
@Tabrizian we managed to get the model working in TorchScript (PyTorch backend) and no longer experience this issue.
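For anyone hitting the same thing, the conversion was along these lines (just a sketch assuming a traceable module; Detectron models generally need their own export/scripting path rather than plain tracing), with the resulting file placed at <model_repo>/<model_name>/1/model.pt and platform: "pytorch_libtorch" in config.pbtxt:

import torch

model = load_trained_model()              # hypothetical loader for the trained model
model.eval()
example = torch.zeros(1, 3, 800, 800)     # example input; adjust to the model's expected shape
traced = torch.jit.trace(model, example)
traced.save("model.pt")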
Just enlarging the memory of the container in Kubernetes will solve it.
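i.e. in the Deployment above, raise the container memory limit and/or the /dev/shm emptyDir size; the fragment below shows the idea (the exact numbers are just examples, not tested values):

        resources:
          limits:
            memory: 8Gi
            nvidia.com/gpu: 1
          requests:
            memory: 8Gi
            nvidia.com/gpu: 1
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"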