NpgsqlConnection ended up in a broken state causing random errors.
Steps to reproduce
On a service handling ~30 requests/s that has been running for 3 years, this happened for the first time and "fixed itself" within a minute.
The issue
We just had a weird bug, almost as if a connection had gone rogue. We saw the following errors (some of them multiple times):
Unknown message code: with values 0, 1, 12
Received backend message BindComplete while expecting CommandCompleteMessage. Please file a bug.
Received backend message BindComplete while expecting ParseCompleteMessage. Please file a bug.
Received backend message CompletedResponse while expecting BindCompleteMessage. Please file a bug.
Received backend message CompletedResponse while expecting ParseCompleteMessage. Please file a bug.
Received backend message CompletedResponse while expecting ReadyForQueryMessage. Please file a bug.
Received backend message DataRow while expecting BindCompleteMessage. Please file a bug.
Received backend message ReadyForQuery while expecting BindCompleteMessage. Please file a bug.
Received backend message ReadyForQuery while expecting CommandCompleteMessage. Please file a bug.
Received backend message RowDescription while expecting BindCompleteMessage. Please file a bug.
Received backend message RowDescription while expecting CommandCompleteMessage. Please file a bug.
Received backend message RowDescription while expecting ReadyForQueryMessage. Please file a bug.
Those errors eventually escalated to the following exceptions:
System.InvalidOperationException: This NpgsqlTransaction has completed; it is no longer usable.
System.NotSupportedException: Nested/Concurrent transactions aren't supported.
The whole episode lasted about a minute, after which the errors stopped with
Npgsql.NpgsqlException: Exception while reading from stream ---> System.IO.IOException: Unable to read data from the transport connection: Operation timed out. ---> System.Net.Sockets.SocketException: Operation timed out
being the last error. Presumably the rogue connection was closed.
We have ~30 requests per second on this service and only a handful failed, which is why I think a single connection ended up in a broken state. The DbContext is disposed correctly by Autofac, and during that dispose it should reset the connection and clean it up again, shouldn't it? Unless something went wrong within the pool, maybe?
I understand that not awaiting async requests correctly could lead to the same issue. But given that we have compile-time checks for missing awaits, run stress tests at much higher load before deploying, and have never experienced this issue before, would you have any idea whether there could be a bug in Npgsql?
We are also using automatically prepared statements on this service, but I chose to create a new issue instead of commenting on https://github.com/npgsql/npgsql/issues/2482 because I analyzed this quite thoroughly and couldn't find anything related to the auto-prepared-statement issues we are used to seeing.
I can provide detailed exception stack traces for any of the above-mentioned errors, if necessary.
Further technical details
Npgsql version: 4.0.7
PostgreSQL version: 11.1
Operating system: Linux (Microsoft ASP.NET Core based Docker image)
Other details about my project setup:
We use ASP.NET Core with EF Core and Autofac. We do use the Npgsql execution strategy to automatically retry failed transactions in case of connection issues, but as far as I could see, it was only triggered for the last exception (Operation timed out.).
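For context, the execution strategy mentioned above is typically enabled via the EF Core provider's EnableRetryOnFailure option. A minimal sketch, assuming the Npgsql.EntityFrameworkCore.PostgreSQL provider; AppDbContext and the connection-string name are placeholders, not from the original report:

```csharp
// Sketch: wiring up Npgsql's retrying execution strategy in ASP.NET Core.
// "AppDbContext" and "Default" are hypothetical names for illustration.
services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(
        Configuration.GetConnectionString("Default"),
        npgsqlOptions => npgsqlOptions.EnableRetryOnFailure(
            maxRetryCount: 3,                       // retry transient failures up to 3 times
            maxRetryDelay: TimeSpan.FromSeconds(5), // cap the back-off between attempts
            errorCodesToAdd: null)));               // no extra PostgreSQL error codes treated as transient
```

Note that the strategy only retries errors Npgsql classifies as transient, which is consistent with the observation that it fired only for the final timeout and not for the protocol-sync errors.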
Issue Analytics
- Created 4 years ago
- Comments: 10 (5 by maintainers)
There is no such thing as resetting a connection - once a connection enters the "broken" state it remains there forever (you can check this via NpgsqlConnection.FullState). After all, if we've lost protocol sync, there's no way to communicate with the server again to restore it 😃
However, Npgsql is supposed to handle this well. The moment a protocol sync (or I/O) error occurs, Npgsql automatically places the physical connection into the broken state. When you call Close() on such an NpgsqlConnection, it is not returned to the pool but discarded - and a new physical connection will automatically be created the next time you request one.
Note that you shouldn't confuse the physical connection with NpgsqlConnection, which is only a façade. You can call Open() and Close() multiple times on the same NpgsqlConnection instance and operate on a different physical connection under the hood each time. You can examine NpgsqlConnection.ProcessID to find out which backend process you're connected to - this identifies a physical connection.
Yes, log NpgsqlConnection.ProcessID as mentioned above (note that Npgsql's internal logging does so automatically). It's a good idea to do this: prepared statements are bound and tracked on the physical connection, so if you encounter an error you should definitely track what happened to that specific physical connection.
I'm going to close this now, as unfortunately there's not enough here for us to investigate a possible bug… But please post back here if you have any luck reproducing this, or open a new issue for any autoprepare problems!
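To make the distinction between the façade and the physical connection concrete, here is a minimal sketch of inspecting ProcessID and FullState, assuming Npgsql 4.x; the connection string is a placeholder:

```csharp
// Sketch: logging the backend PID and observing the broken state.
// The connection string is hypothetical.
using (var conn = new NpgsqlConnection("Host=localhost;Database=test"))
{
    conn.Open();

    // ProcessID is the PostgreSQL backend PID - it identifies the
    // physical connection, even when the NpgsqlConnection facade is reused.
    Console.WriteLine($"Backend PID: {conn.ProcessID}");

    // FullState exposes the Broken flag that the plain State property
    // does not surface. After a protocol-sync or I/O error, Broken is set.
    if (conn.FullState.HasFlag(ConnectionState.Broken))
    {
        // A broken physical connection is discarded on Close(),
        // not returned to the pool.
        Console.WriteLine("Connection is broken and will be discarded.");
    }
} // Dispose()/Close() returns a healthy physical connection to the pool
```

Logging the PID alongside application errors makes it possible to correlate all failures back to a single physical connection, as suggested above.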
@pgrm I looked into it, and you’re indeed right; there’s a bug since Npgsql 4.0 whereby when protocol sync is lost, we no longer break the connection. This is why it continues to generate errors for a long time after the first error occurred. Thanks for raising this - I’ve opened #2540 to track this, and a fix will go in for 4.0.9.
Unfortunately this doesn’t address the original problem, which is even more important - why protocol sync was lost in the first place. I hope you can continue trying to reproduce and narrow down the scenario which triggers this - it’s very important we find it and fix it.