NpgsqlConnection ended up in a broken state causing random errors.

See original GitHub issue

Steps to reproduce

On a service handling ~30 requests/s that has been running for 3 years, this happened for the first time and "fixed itself" within a minute.

The issue

We just had a weird bug, almost as if a connection had gone rogue. We saw the following errors (some of them multiple times):

Unknown message code: (seen with the values 0, 1 and 12)

Received backend message BindComplete while expecting CommandCompleteMessage. Please file a bug.
Received backend message BindComplete while expecting ParseCompleteMessage. Please file a bug.
Received backend message CompletedResponse while expecting BindCompleteMessage. Please file a bug.
Received backend message CompletedResponse while expecting ParseCompleteMessage. Please file a bug.
Received backend message CompletedResponse while expecting ReadyForQueryMessage. Please file a bug.
Received backend message DataRow while expecting BindCompleteMessage. Please file a bug.
Received backend message ReadyForQuery while expecting BindCompleteMessage. Please file a bug.
Received backend message ReadyForQuery while expecting CommandCompleteMessage. Please file a bug.
Received backend message RowDescription while expecting BindCompleteMessage. Please file a bug.
Received backend message RowDescription while expecting CommandCompleteMessage. Please file a bug.
Received backend message RowDescription while expecting ReadyForQueryMessage. Please file a bug.

Those errors eventually escalated to the following exceptions:

  • System.InvalidOperationException: This NpgsqlTransaction has completed; it is no longer usable.
  • System.NotSupportedException: Nested/Concurrent transactions aren't supported.

The whole episode lasted about a minute, after which the errors stopped with

Npgsql.NpgsqlException: Exception while reading from stream ---> System.IO.IOException: Unable to read data from the transport connection: Operation timed out. ---> System.Net.Sockets.SocketException: Operation timed out

… being the last error. Probably the rogue connection was closed.

We have ~30 requests per second on this service and only a handful failed, which is why I think a single connection ended up in a broken state. The DbContext is disposed correctly by Autofac, and during that dispose it should reset the connection and clean it up again, shouldn't it? Unless something went wrong within the pool, maybe?

I understand that not correctly awaiting async calls could lead to the same issue, but given that we have compile-time checks for missing awaits, run stress tests at much higher load before deploying, and have never experienced this issue before, would you have any idea whether there could be a bug in Npgsql itself?

We are also using automatically prepared statements on this service, but I chose to create a new issue instead of commenting on https://github.com/npgsql/npgsql/issues/2482, because I analyzed this quite thoroughly and couldn't find anything related to the auto-prepared-statement issues we are used to seeing.
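For context, automatic preparation in Npgsql is enabled through connection-string parameters. A minimal sketch of the kind of configuration in use (host, database and credentials are placeholders, and the thresholds are illustrative, not the service's actual values):

```csharp
using Npgsql;

// Hypothetical connection string enabling Npgsql's automatic preparation.
var connectionString =
    "Host=localhost;Database=mydb;Username=app;Password=secret;" +
    "Max Auto Prepare=20;" +      // max statements eligible for auto-preparation
    "Auto Prepare Min Usages=5";  // uses before a statement gets auto-prepared

using (var conn = new NpgsqlConnection(connectionString))
{
    conn.Open();
    // Queries executed repeatedly on this physical connection are now
    // candidates for automatic server-side preparation.
}
```

Auto-prepared statements are tracked per physical connection, which is why issues around them tend to follow a specific backend connection.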

I can provide detailed exception stack traces for any of the above-mentioned errors, if necessary.

Further technical details

Npgsql version: 4.0.7
PostgreSQL version: 11.1
Operating system: Linux (Microsoft ASP.NET Core based Docker image)

Other details about my project setup:

We use ASP.NET Core with EF Core and Autofac. We use the Npgsql execution strategy to automatically retry failed transactions in case of connection issues, but as far as I could see it was only triggered for the last exception (Operation timed out).
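The Npgsql execution strategy mentioned above is typically wired up when registering the DbContext. A sketch, assuming EF Core with the Npgsql provider (`AppDbContext` and the retry parameters are illustrative):

```csharp
using System;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

// Sketch: enabling the Npgsql retrying execution strategy for a
// hypothetical AppDbContext registered in an ASP.NET Core service
// collection. Retry count and delay are illustrative defaults.
services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(connectionString, npgsqlOptions =>
        npgsqlOptions.EnableRetryOnFailure(
            maxRetryCount: 3,
            maxRetryDelay: TimeSpan.FromSeconds(5),
            errorCodesToAdd: null)));
```

With this in place, EF Core retries operations that fail with transient errors; as noted above, only the final timeout exception was classified as transient here, while the protocol-sync errors were not retried.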

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions
roji commented, Jun 12, 2019

Resetting the connection doesn't repair the protocol sync? Is there anything I could do, if I catch such an exception, to throw away the connection completely? I had a quick look but couldn't find a public API that would override the pooling setup; perhaps I missed something.

There's no such thing as resetting a connection - once a connection enters the "broken" state it remains there forever (you can check this via NpgsqlConnection.FullState). After all, if we've lost protocol sync, there's no way to communicate with the server again to restore it 😃
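The broken state can be detected, though never repaired, via `FullState`. A minimal sketch (the helper name is hypothetical):

```csharp
using System.Data;
using Npgsql;

static class ConnectionHelpers
{
    // Sketch: once a connection loses protocol sync it is flagged Broken
    // and stays that way. Closing it discards the underlying physical
    // connection rather than returning it to the pool.
    public static void EnsureUsable(NpgsqlConnection conn)
    {
        if (conn.FullState.HasFlag(ConnectionState.Broken))
        {
            conn.Close(); // discards the broken physical connection
            conn.Open();  // obtains a fresh physical connection from the pool
        }
    }
}
```

Note that, as described below, Npgsql performs this discard-on-close automatically for broken connections, so explicit handling like this is only needed for diagnostics or eager recovery.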

However, Npgsql is supposed to handle this well. The moment that a protocol sync (or I/O) error occurs, Npgsql automatically places the physical connection into the broken state. When you call Close() on such an NpgsqlConnection, it is not returned to the pool, but rather discarded - and a new physical connection will automatically be created as necessary the next time you request one.

Note that you shouldn’t confuse the physical connection with NpgsqlConnection, which is only a façade. Thus, you can call Open() and Close() multiple times on the same instance of NpgsqlConnection, and operate with a different physical connection under the hood each time. You can examine NpgsqlConnection.ProcessID to find out which backend process you’re connected to - this identifies a physical connection.

I was wondering, if I could somehow log the currently used connection (id) for a given request

Yes, log NpgsqlConnection.ProcessID as mentioned above (note that Npgsql’s internal logging does so automatically). It’s a good idea to do this: prepared statements are bound and tracked on the physical connection, so if you encounter an error you should definitely track what happened to that specific physical connection.
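A sketch of logging the backend process ID to correlate application errors with a specific physical connection (the helper and its console output are placeholders for whatever logging framework is in use):

```csharp
using System;
using Npgsql;

static class BackendLogging
{
    // Sketch: ProcessID identifies the PostgreSQL backend process, and
    // therefore the physical connection, currently backing this
    // NpgsqlConnection. Logging it per request lets you tie a burst of
    // errors back to one physical connection.
    public static void LogBackend(NpgsqlConnection conn, string requestId)
    {
        Console.WriteLine($"Request {requestId} using backend PID {conn.ProcessID}");
    }
}
```

On the server side, the same PID appears in PostgreSQL's own logs, so the two can be cross-referenced.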

I’m going to close this now, as unfortunately there’s not enough for us to investigate a possible bug… But please post back here if you have any luck reproducing this, or open a new issue for any autoprepare problems!

1 reaction
roji commented, Jul 22, 2019

@pgrm I looked into it, and you’re indeed right; there’s a bug since Npgsql 4.0 whereby when protocol sync is lost, we no longer break the connection. This is why it continues to generate errors for a long time after the first error occurred. Thanks for raising this - I’ve opened #2540 to track this, and a fix will go in for 4.0.9.

Unfortunately this doesn’t address the original problem, which is even more important - why protocol sync was lost in the first place. I hope you can continue trying to reproduce and narrow down the scenario which triggers this - it’s very important we find it and fix it.
