io.grpc.Context getting cancelled

See original GitHub issue

Describe the bug
Seeing gRPC context timeouts with 1.4.1 and 1.5.3.

Steps to reproduce
No specific recipe to share at this point, but this appears to be related to a few other issues with Google client APIs.

What did you expect to see?
gRPC context timeouts unaffected by the OpenTelemetry instrumentation.

What did you see instead?

CANCELLED: io.grpc.Context was cancelled without error.
Stack: io.grpc.Status.asRuntimeException(Status.java:535),
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478),
interceptor-for-ClientCall.Listener<RespT>: onClose()
interceptor-for-ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>: onClose()
...(truncated)...
(TracingClientInterceptor.java:139),
(DelayedClientCall.java:463),
(DelayedClientCall.java:427),
(DelayedClientCall.java:460),
(ClientCallImpl.java:553),
(ClientCallImpl.java:68),
(ClientCallImpl.java:212),
(ContextRunnable.java:37),

What version are you using?
Tried with both 1.4.1 and 1.5.3 with the same result. The issue does not happen without instrumentation.

Environment
JDK 11

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 35 (28 by maintainers)

Top GitHub Comments

6 reactions
ryandens commented, Mar 2, 2022

👋 Hi all, I was able to recreate the bug. I have a reproducer project available in this repo: ryandens/otel-grpc-context

I’m hopeful I’ll have time in the next few days to investigate and determine the source of the issue, but figured I’d give someone a starting point if they beat me to it. I was able to confirm that this error is not present in 1.1.0 but is present in 1.2.0.

3 reactions
anuraaga commented, Mar 11, 2022

After looking into this, I have a hypothesis that the instrumentation is working as intended - but in some sense, it is working too well and perhaps we do need to change it somehow.

When looking at this type of code pattern inside a gRPC handler:

class Greeter {
  ExecutorService executor = Executors.newFixedThreadPool(4);

  void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
    responseObserver.onCompleted();
    executor.execute(() -> backend.sayGoodbye());
  }
}

It is, in strict terms, “incorrect” gRPC code - the gRPC Context hasn’t been propagated into the executor. This means that the deadline is not propagated as it’s supposed to be, and if using OpenCensus (gRPC’s native instrumentation), the trace would be broken. Similarly, if using OpenTelemetry’s gRPC library instrumentation, the trace would be broken due to the lack of context propagation.

For this code to become correct, it needs to be something like Executor executor = Context.currentContextExecutor(Executors.newFixedThreadPool(4));. With the deadline correctly propagated into the thread pool, this would cause the backend request to fail with the cancellation exception, no OpenTelemetry involved. Presumably you would call Context.fork() to fix this bug in the business logic. Note that with such a change, adding OpenTelemetry library instrumentation would also just work - we sync the context between gRPC and OpenTelemetry so that people can add the library instrumentation interceptors and have applications that use gRPC context propagation work out of the box, especially to help with OpenCensus migration.
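The interplay between a context-propagating executor and fork() can be mimicked with plain JDK primitives. The sketch below is a hypothetical stand-in, not real gRPC code: a ThreadLocal flag plays the role of io.grpc.Context’s cancellation state, the propagating(...) wrapper plays Context.currentContextExecutor(...), and fork(...) plays Context.fork(). All names here are invented for illustration.

```java
import java.util.concurrent.*;

// Hypothetical stand-in for io.grpc.Context cancellation: a ThreadLocal flag
// that a wrapping executor copies into pool threads.
public class ContextPropagationDemo {
    static final ThreadLocal<Boolean> CANCELLED = ThreadLocal.withInitial(() -> false);

    // Analogue of Context.currentContextExecutor: capture the caller's flag
    // and restore it inside the pool thread before running the task.
    static Executor propagating(ExecutorService pool) {
        return task -> {
            boolean captured = CANCELLED.get();
            pool.execute(() -> {
                CANCELLED.set(captured);
                try { task.run(); } finally { CANCELLED.remove(); }
            });
        };
    }

    // Analogue of Context.fork(): run the task detached from cancellation.
    static Runnable fork(Runnable task) {
        return () -> { CANCELLED.set(false); task.run(); };
    }

    // Analogue of an outgoing backend call that observes cancellation.
    static String backendCall() {
        return CANCELLED.get() ? "CANCELLED" : "OK";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Executor ctxExec = propagating(pool);

        CANCELLED.set(true); // the server RPC has completed; its context is cancelled

        CompletableFuture<String> withCtx = new CompletableFuture<>();
        ctxExec.execute(() -> withCtx.complete(backendCall()));
        System.out.println("propagated: " + withCtx.get()); // prints "propagated: CANCELLED"

        CompletableFuture<String> forked = new CompletableFuture<>();
        ctxExec.execute(fork(() -> forked.complete(backendCall())));
        System.out.println("forked: " + forked.get()); // prints "forked: OK"

        pool.shutdown();
    }
}
```

The point of the sketch: once the context travels into the pool, the background call observes the cancelled state, exactly as described above; forking detaches the task so it survives the parent RPC completing.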

Our problem here is that the javaagent is able to automatically propagate context for threadpool executors. Because we sync OpenTelemetry and gRPC context, this means that the gRPC deadline is also automatically propagated - so code that previously did not fail due to deadline issues will now fail. The early-return pattern we see in the repro is a trivial way of reproducing such code, but even traditional synchronous gRPC calls can fail if the client used .withDeadline() and the deadline got propagated to a location where it wasn’t intended. Usually it is intended, though.
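The "previously working code now fails" effect can also be mimicked without gRPC. In this hypothetical sketch, a ThreadLocal deadline plays the role of the gRPC deadline, and instrumented(...) plays the javaagent wrapping an ExecutorService; every name is invented for illustration. Without the wrapper, the background task would see no deadline and succeed; with it, the expired deadline travels into the pool and the call fails.

```java
import java.util.concurrent.*;

// Hypothetical stand-in for javaagent executor instrumentation propagating a
// gRPC-style deadline into pool threads via a ThreadLocal.
public class DeadlinePropagationDemo {
    static final ThreadLocal<Long> DEADLINE_NANOS = new ThreadLocal<>();

    // Analogue of the javaagent wrapping ExecutorService.execute(...):
    // capture the caller's deadline and restore it in the pool thread.
    static Executor instrumented(ExecutorService pool) {
        return task -> {
            Long captured = DEADLINE_NANOS.get();
            pool.execute(() -> {
                DEADLINE_NANOS.set(captured);
                try { task.run(); } finally { DEADLINE_NANOS.remove(); }
            });
        };
    }

    // Analogue of an outgoing gRPC call that fails fast on an expired deadline.
    static String outgoingCall() {
        Long d = DEADLINE_NANOS.get();
        if (d != null && System.nanoTime() > d) return "DEADLINE_EXCEEDED";
        return "OK";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Executor exec = instrumented(pool);

        // The client used .withDeadline(); by the time the background task
        // runs, that deadline has already expired.
        DEADLINE_NANOS.set(System.nanoTime() - 1L);

        CompletableFuture<String> result = new CompletableFuture<>();
        exec.execute(() -> result.complete(outgoingCall()));
        System.out.println(result.get()); // prints "DEADLINE_EXCEEDED"

        pool.shutdown();
    }
}
```

Swap instrumented(pool) for the bare pool and the same task prints "OK" - which is exactly why adding the agent surfaces failures in code that "worked" only because its deadline was silently dropped.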

So the javaagent is making the app “better” by propagating the deadline to more of the places it should reach 😃 But we still have a rule that the javaagent can’t introduce exceptions into code that previously didn’t throw them. I think I have reasoned out a workaround for the javaagent; let’s see how it goes.

