OrleansMessageRejectionException and Orleans stream messages stuck in Azure Storage queue

See original GitHub issue

I’ve encountered this a couple of times in the last week and a half. I’ll deploy a new revision of my Orleans application, and within a couple of days silos become unavailable and messages become undeliverable on some instances. The problematic silos do not recover, and I have to restart the cluster to resolve the issue.

When one or more of the nine silos gets into this state where grain messages can’t be delivered, Orleans stream messages pushed to the queue also get stuck until I restart the cluster (Container Apps environment). The last few times this issue occurred, it seemed to follow shortly after a new release.

I’d appreciate some further guidance on tracking down the issue here.

Here are some further observations:

  • Silo running in Azure container app environment
  • Last revision was deployed 2023-07-07T20:24:08z
  • Very little load over the weekend; then on Sunday night, for no apparent reason, silos began terminating and messages could not be delivered from the Orleans storage queue.
  • Running Orleans 7.1.2
  • Silo exits with code 1
  • container logs
    • 2023-07-10T05:32:23.4393447Z
      • message: Container silo failed liveness probe, will be restarted
    • 2023-07-10T05:34:30.094215Z
      • message: Container 'silo' was terminated with exit code '1'
  • The most closely related exceptions
    • 2023-07-10T05:34:20.2993575Z Orleans.Runtime.OrleansMessageRejectionException
      • message:
      Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S100.100.0.120:11111:47972033. See InnerException
       ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 100.100.0.120:11111. Error: ConnectionRefused
         at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
         at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
         at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
         --- End of inner exception stack trace ---
         at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 228
         at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 108
         at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 231
      
    • 2023-07-10T05:34:23.1327681Z System.ObjectDisposedException at Orleans.Serialization.Serializers.CodecProvider.GetServiceOrCreateInstance
      • message:
      Cannot access a disposed object.
      Object name: 'IServiceProvider'.
      
    • 2023-07-10T05:37:08.2696465Z System.InvalidOperationException at Orleans.Runtime.ActivationData.StartDeactivating
      • message:
      Calling DeactivateOnIdle from within OnActivateAsync is not supported
      

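The last exception above points at a known restriction: a grain may not request its own deactivation while it is still activating. A minimal sketch of the unsupported pattern and one common workaround (the grain and interface names here are hypothetical):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Orleans;

public interface IExampleGrain : IGrainWithStringKey
{
    Task DoWork();
}

public class ExampleGrain : Grain, IExampleGrain
{
    private bool _shouldDeactivate;

    public override Task OnActivateAsync(CancellationToken cancellationToken)
    {
        // NOT supported: calling DeactivateOnIdle() here throws the
        // InvalidOperationException seen in the logs above.
        // DeactivateOnIdle();

        // Workaround: record the intent and deactivate later, once
        // activation has completed.
        _shouldDeactivate = true;
        return base.OnActivateAsync(cancellationToken);
    }

    public Task DoWork()
    {
        if (_shouldDeactivate)
        {
            DeactivateOnIdle(); // safe from within a regular grain call
            return Task.CompletedTask;
        }
        // ... normal work ...
        return Task.CompletedTask;
    }
}
```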
As of 2023-07-10T20:48:15.9391397Z I am still seeing Orleans.Runtime.OrleansMessageRejectionException, and there are 34k Orleans messages stuck in queue-1.
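A backlog like this can be inspected without dequeuing anything. A sketch using the Azure.Storage.Queues SDK, assuming the queue name from the report above and a hypothetical `connectionString` variable:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

// Check the approximate backlog and peek at stuck messages.
var queue = new QueueClient(connectionString, "queue-1");

QueueProperties props = await queue.GetPropertiesAsync();
Console.WriteLine($"Approximate message count: {props.ApproximateMessagesCount}");

// Peek leaves messages in place (no dequeue, no visibility timeout).
PeekedMessage[] peeked = await queue.PeekMessagesAsync(maxMessages: 5);
foreach (var msg in peeked)
    Console.WriteLine($"{msg.MessageId} inserted {msg.InsertedOn}");
```

Message insertion timestamps make it easy to confirm whether the stuck messages all predate the silo failures.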

Issue Analytics

  • State: open
  • Created 2 months ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
benjaminpetit commented, Aug 1, 2023

I was referring to the MessageRejectionException.

The streaming infrastructure uses some internal grains, called PubSubRendezVousGrain. Here it seems the directory is in a bad state, and the cluster isn’t able to create a new activation of the PubSubRendezVousGrain for some streams.

It would be interesting to see if you have more directory related logs.
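One way to surface those directory logs is to raise the verbosity of the directory-related categories. A sketch, assuming a generic host setup; Orleans log categories follow full type names, and the exact category names below are assumptions:

```csharp
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // ... existing silo configuration ...
    })
    .ConfigureLogging(logging =>
    {
        // Category names are assumed; adjust to match your Orleans version.
        logging.AddFilter("Orleans.Runtime.GrainDirectory", LogLevel.Debug);
        logging.AddFilter("Orleans.Runtime.Catalog", LogLevel.Debug);
    })
    .Build();
```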

Also, when you scale your cluster up, do you see some silos dying in the meantime?

0 reactions
iamsamcoder commented, Aug 18, 2023

Hi @benjaminpetit,

We continue to see these Orleans.Runtime.OrleansMessageRejectionException errors, and only in relation to Orleans.Streams.IPubSubRendezvousGrain. The issue only occurs after an automated scale-up of silo instances, and it resolves after scaling back down.

This causes delays in processing mission-critical messages. We are developing an alternative solution that migrates processes depending on Orleans streams to Azure Functions, but we are hoping to find a solution here.

Can you provide any guidance on this? Do you know of other users that have Orleans streams reliability issues with clusters that periodically scale up and down?

Thank you very much!
