Stub process is unhealthy and it will be restarted
I’m getting "Stub process is unhealthy and it will be restarted" repeatedly when calling infer, after which the server restarts. I have deployed Triton server on GKE with 3 models.
The first time I infer model1 I get this error; the second and subsequent requests don't. But if I then infer model2 after getting a successful result from model1, the error pops up again, and the same happens for model3.
logs:
responses.append(self.triton_client.infer(
File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1086, in infer
raise_error_grpc(rpc_error)
File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 61, in raise_error_grpc
raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'damage_0', message: Stub process is not healthy.
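For reference, the call site looks roughly like this (a minimal sketch; the input name, shape, and model name are placeholders for my actual model configs):

import numpy as np
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(url="<triton host>:8001")
# hypothetical input name/shape; the real ones come from each model's config.pbtxt
image = np.zeros((1, 3, 800, 800), dtype=np.float32)
infer_input = grpcclient.InferInput("INPUT__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
response = triton_client.infer(model_name="damage", inputs=[infer_input])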
I’m loading 3 models using the Python backend and a custom Triton image (converted Detectron models), which I built with this Dockerfile:
FROM nvcr.io/nvidia/tritonserver:21.10-py3
RUN pip3 install torch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 && \
    pip3 install pillow
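The image is built and pushed to the registry along these lines (the registry path and tag here are just placeholders):

docker build -t gcr.io/<project>/tritonserver-custom:21.10 .
docker push gcr.io/<project>/tritonserver-custom:21.10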
Also, while running Triton server locally with Docker, I had to increase --shm-size because it errored out at the default 64 MB. On Kubernetes this is a little trickier; the workaround is an emptyDir volume with medium: Memory. My YAML looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: triton-mms
  name: triton-mms
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-mms
  template:
    metadata:
      labels:
        app: triton-mms
    spec:
      containers:
      - image: <custom triton image>
        command: ["/bin/sh", "-c"]
        args: ["tritonserver --model-repository=<gcs model repo>"]
        imagePullPolicy: IfNotPresent
        name: triton-mms
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /secret/gcp-creds.json
        resources:
          limits:
            memory: 5Gi
            nvidia.com/gpu: 1
          requests:
            memory: 5Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - name: vsecret
          mountPath: "/secret"
          readOnly: true
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "1024Mi"
      - name: vsecret
        secret:
          secretName: gcpcreds
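For comparison, the local Docker run that worked after bumping shared memory was roughly this (the model path and shm size are placeholders):

docker run --gpus all --rm --shm-size=1g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repo:/models \
  <custom triton image> tritonserver --model-repository=/models

Inside the pod, the mounted /dev/shm size can be checked with kubectl exec <pod> -- df -h /dev/shm.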
I've never faced this issue before, and I suspect it's related to shared memory, since I'd never seen that error either.
@Tabrizian we managed to get the model working in TorchScript (PyTorch backend) and no longer experience this issue.
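For anyone hitting the same thing, the conversion was along these lines (just a sketch assuming a traceable module; Detectron models generally need their own export/scripting path rather than plain tracing), with the resulting file placed at <model_repo>/<model_name>/1/model.pt and platform: "pytorch_libtorch" in config.pbtxt:

import torch

model = load_trained_model()              # hypothetical loader for the trained model
model.eval()
example = torch.zeros(1, 3, 800, 800)     # example input; adjust to the model's expected shape
traced = torch.jit.trace(model, example)
traced.save("model.pt")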
Just enlarging the memory of the container in Kubernetes will solve it.
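i.e. in the Deployment above, raise the container memory limit and/or the /dev/shm emptyDir size; the fragment below shows the idea (the exact numbers are just examples, not tested values):

        resources:
          limits:
            memory: 8Gi
            nvidia.com/gpu: 1
          requests:
            memory: 8Gi
            nvidia.com/gpu: 1
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"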