OrleansMessageRejection exception and Orleans stream messages stuck in azure storage queue
I've encountered this a couple times in the last 1.5 weeks. I'll deploy a new revision of my Orleans application and within a couple days silos will become unavailable and messages will be undeliverable on some instances. The problematic silos will not recover and I have to restart the cluster to resolve this issue.
When 1 or more of 9 silos get into this state where grain messages can't be delivered, Orleans stream messages pushed to the queue also get stuck until I restart the cluster (Container Apps environment). The issue seems to have started shortly after the last deployment; the last few times it occurred, it followed shortly after a new release.
I’d appreciate some further guidance on tracking down the issue here.
Here are some further observations:
- Silo running in an Azure Container Apps environment
- Last revision was deployed 2023-07-07T20:24:08Z
- Very little load over the weekend, then Sunday night, for seemingly no reason, silos are terminating and messages can't be delivered from the Orleans storage queue
- Running Orleans 7.1.2
- Silo exits with code 1
- Container logs:
  - 2023-07-10T05:32:23.4393447Z - message: Container silo failed liveness probe, will be restarted
  - 2023-07-10T05:34:30.094215Z - message: Container 'silo' was terminated with exit code '1'
- The most closely related exceptions:
  - 2023-07-10T05:34:20.2993575Z - Orleans.Runtime.OrleansMessageRejectionException - message:
    Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S100.100.0.120:11111:47972033. See InnerException
    ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 100.100.0.120:11111. Error: ConnectionRefused
    at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
    at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
    at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
    --- End of inner exception stack trace ---
    at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
    at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
    at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
  - 2023-07-10T05:34:23.1327681Z - System.ObjectDisposedException at Orleans.Serialization.Serializers.CodecProvider.GetServiceOrCreateInstance - message:
    Cannot access a disposed object. Object name: 'IServiceProvider'.
  - 2023-07-10T05:37:08.2696465Z - System.InvalidOperationException at Orleans.Runtime.ActivationData.StartDeactivating - message:
    Calling DeactivateOnIdle from within OnActivateAsync is not supported
As of 2023-07-10T20:48:15.9391397Z I am still seeing Orleans.Runtime.OrleansMessageRejectionException and there are 34k Orleans messages stuck in queue-1.
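For reference, the silo is set up roughly like the sketch below (simplified; it assumes Azure Table clustering, and the provider name, queue name, and connection-string handling are placeholders rather than the real configuration):

```csharp
// Simplified sketch of the setup in question: an Orleans 7.x silo with
// (assumed) Azure Table clustering and Azure Queue streams.
// "AzureQueueProvider", "queue-1", and the connection string are placeholders.
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        var connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING");

        silo.UseAzureStorageClustering(options =>
                options.ConfigureTableServiceClient(connectionString))
            // PubSubStore persists the stream subscription state used by the
            // streaming pub/sub grains.
            .AddAzureTableGrainStorage("PubSubStore", options =>
                options.ConfigureTableServiceClient(connectionString))
            .AddAzureQueueStreams("AzureQueueProvider", configurator =>
            {
                configurator.ConfigureAzureQueue(ob => ob.Configure(options =>
                {
                    options.ConfigureQueueServiceClient(connectionString);
                    options.QueueNames = new List<string> { "queue-1" };
                }));
            });
    })
    .Build();

await host.RunAsync();
```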
I was referring to the MessageRejectionException. The streaming infrastructure uses some internal grains, called PubSubRendezVousGrain. Here it seems the directory is in a bad state, and the cluster isn't able to create a new activation of the PubSubRendezVousGrain for some streams. It would be interesting to see if you have more directory-related logs.
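If it helps, here is a minimal sketch of turning up the relevant log categories so directory activity shows up in your container logs (the category names are assumptions based on the Orleans namespaces; adjust them to whatever your logging backend actually reports):

```csharp
// Sketch: surface grain-directory and membership logs at Debug level.
// Category names are assumptions based on Orleans namespace names.
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // existing silo configuration goes here
    })
    .ConfigureLogging(logging =>
    {
        logging.SetMinimumLevel(LogLevel.Information);
        // Grain directory: where failed activation lookups/registrations surface.
        logging.AddFilter("Orleans.Runtime.GrainDirectory", LogLevel.Debug);
        // Membership: shows silos joining, being suspected, and being declared dead.
        logging.AddFilter("Orleans.Runtime.MembershipService", LogLevel.Debug);
    })
    .Build();

await host.RunAsync();
```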
Also, when you scale your cluster up, do you see some silos dying in the meantime?
Hi @benjaminpetit,
We continue to see these Orleans.Runtime.OrleansMessageRejectionException errors, and only in relation to the Orleans.Streams.IPubSubRendezvousGrain. It only occurs after an automated scale-up of silo instances, and it resolves after scaling back down. This causes delays in processing mission-critical messages. We are developing an alternative solution that migrates the processes depending on Orleans streams to Azure Functions, but we are hoping we can find a solution to this.
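As a stopgap while this is happening, we are considering wrapping stream publishes in a simple retry, roughly like the sketch below (the extension name, attempt count, and delay are placeholders, not something we have settled on):

```csharp
// Stopgap sketch: retry a stream publish when the cluster transiently rejects
// the message (e.g. while membership/directory settle after a scale-up).
// Method name, attempt count, and delay are placeholders.
using System;
using System.Threading.Tasks;
using Orleans.Runtime;
using Orleans.Streams;

public static class StreamPublishRetryExtensions
{
    public static async Task OnNextWithRetryAsync<T>(
        this IAsyncStream<T> stream,
        T item,
        int maxAttempts = 5,
        TimeSpan? delay = null)
    {
        var backoff = delay ?? TimeSpan.FromSeconds(2);

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await stream.OnNextAsync(item);
                return;
            }
            catch (OrleansMessageRejectionException) when (attempt < maxAttempts)
            {
                // Rejections have been transient in our case; back off and retry
                // before giving up and surfacing the failure.
                await Task.Delay(backoff);
            }
        }
    }
}
```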
Can you provide any guidance on this? Do you know of other users that have Orleans streams reliability issues with clusters that periodically scale up and down?
Thank you very much!