OrleansMessageRejectionException and Orleans stream messages stuck in Azure Storage queue

See original GitHub issue

I’ve encountered this a couple of times in the last week and a half. I’ll deploy a new revision of my Orleans application, and within a couple of days silos become unavailable and messages become undeliverable on some instances. The problematic silos do not recover, and I have to restart the cluster to resolve the issue.

When one or more of the nine silos gets into this state where grain messages can’t be delivered, Orleans stream messages pushed to the queue also get stuck until I restart the cluster (Container Apps environment). The last few times this issue occurred, it seemed to follow shortly after a new release.

I’d appreciate some further guidance on tracking down the issue here.

Here are some further observations:

  • Silo running in Azure container app environment
  • Last revision was deployed 2023-07-07T20:24:08z
  • Very little load over the weekend; then on Sunday night, for no apparent reason, silos began terminating and messages could not be delivered from the Orleans storage queue.
  • Running Orleans 7.1.2
  • Silo exits with code 1
  • container logs
    • 2023-07-10T05:32:23.4393447Z
      • message: Container silo failed liveness probe, will be restarted
    • 2023-07-10T05:34:30.094215Z
      • message: Container 'silo' was terminated with exit code '1'
  • The most closely related exceptions
    • 2023-07-10T05:34:20.2993575Z Orleans.Runtime.OrleansMessageRejectionException
      • message:
      Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S100.100.0.120:11111:47972033. See InnerException
       ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 100.100.0.120:11111. Error: ConnectionRefused
         at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
         at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
         at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
         --- End of inner exception stack trace ---
         at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
         at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
         at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
      
    • 2023-07-10T05:34:23.1327681Z System.ObjectDisposedException at Orleans.Serialization.Serializers.CodecProvider.GetServiceOrCreateInstance
      • message:
      Cannot access a disposed object.
      Object name: 'IServiceProvider'.
      
    • 2023-07-10T05:37:08.2696465Z System.InvalidOperationException at Orleans.Runtime.ActivationData.StartDeactivating
      • message:
      Calling DeactivateOnIdle from within OnActivateAsync is not supported
      

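The last exception above points at a known restriction: a grain may not request its own deactivation while it is still activating. A minimal sketch of the unsupported pattern and one common workaround (the grain and interface names here are hypothetical):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Orleans;

public interface IExampleGrain : IGrainWithStringKey
{
    Task DoWork();
}

public class ExampleGrain : Grain, IExampleGrain
{
    private bool _shouldDeactivate;

    public override Task OnActivateAsync(CancellationToken cancellationToken)
    {
        // NOT supported: calling DeactivateOnIdle() here throws the
        // InvalidOperationException seen in the logs above.
        // DeactivateOnIdle();

        // Workaround: record the intent and deactivate later, once
        // activation has completed.
        _shouldDeactivate = true;
        return base.OnActivateAsync(cancellationToken);
    }

    public Task DoWork()
    {
        if (_shouldDeactivate)
        {
            DeactivateOnIdle(); // safe from within a regular grain call
            return Task.CompletedTask;
        }
        // ... normal work ...
        return Task.CompletedTask;
    }
}
```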
As of 2023-07-10T20:48:15.9391397Z I am still seeing Orleans.Runtime.OrleansMessageRejectionException, and there are 34k Orleans messages stuck in queue-1.
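A backlog like this can be inspected without dequeuing anything. A sketch using the Azure.Storage.Queues SDK, assuming the queue name from the report above and a hypothetical `connectionString` variable:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

// Check the approximate backlog and peek at stuck messages.
var queue = new QueueClient(connectionString, "queue-1");

QueueProperties props = await queue.GetPropertiesAsync();
Console.WriteLine($"Approximate message count: {props.ApproximateMessagesCount}");

// Peek leaves messages in place (no dequeue, no visibility timeout).
PeekedMessage[] peeked = await queue.PeekMessagesAsync(maxMessages: 5);
foreach (var msg in peeked)
    Console.WriteLine($"{msg.MessageId} inserted {msg.InsertedOn}");
```

Message insertion timestamps make it easy to confirm whether the stuck messages all predate the silo failures.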

Issue Analytics

  • State: open
  • Created 2 months ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
benjaminpetit commented, Aug 1, 2023

I was referring to the MessageRejectionException.

The streaming infrastructure uses some internal grains, called PubSubRendezVousGrain. Here it seems the directory is in a bad state, and the cluster isn’t able to create a new activation of the PubSubRendezVousGrain for some streams.

It would be interesting to see if you have more directory related logs.
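One way to surface those directory logs is to raise the verbosity of the directory-related categories. A sketch, assuming a generic host setup; Orleans log categories follow full type names, and the exact category names below are assumptions:

```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // ... existing silo configuration ...
    })
    .ConfigureLogging(logging =>
    {
        // Category names are assumed; adjust to match your Orleans version.
        logging.AddFilter("Orleans.Runtime.GrainDirectory", LogLevel.Debug);
        logging.AddFilter("Orleans.Runtime.Catalog", LogLevel.Debug);
    })
    .Build();
```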

Also, when you scale your cluster up, do you see some silos dying in the meantime?

0 reactions
iamsamcoder commented, Aug 18, 2023

Hi @benjaminpetit,

We continue to see these Orleans.Runtime.OrleansMessageRejectionException errors, and only in relation to Orleans.Streams.IPubSubRendezvousGrain. The issue only occurs after an automated scale-up of silo instances, and it resolves after scaling back down.

This causes delays in processing mission-critical messages. We are developing an alternative solution that migrates processes depending on Orleans streams to Azure Functions, but we are hoping to find a solution here.

Can you provide any guidance on this? Do you know of other users that have Orleans streams reliability issues with clusters that periodically scale up and down?

Thank you very much!
