IcingaDB daemon crashes when it can't reach the psql database. #620

log1-c · 2023-07-24T12:11:56Z

We are running Postgres as a backend for IcingaDB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga Masters we use pgbouncer and haproxy for the connection to the database.
pgbouncer listens for the connections from icinga and haproxy is setup to present on of the servers on localhost to pgbouncer.

We experienced some crashes of the daemon with error messages similar to #577
After some testing we noticed the following:

If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:

Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597b190: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597acd0: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 pooler error: server conn crashed?
Jul 17 15:42:50 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:34438 login attempt: db=icingadb user=icingadb tls=no
Jul 17 15:42:50 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 new connection to server (from 127.0.0.1:43752)
Jul 17 15:42:52 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=2s)
Jul 17 15:42:52 ma02 pgbouncer[875]: stats: 0 xacts/s, 1 queries/s, in 556 B/s, out 75 B/s, xact 37939 us, query 9738 us, wait 0 us
Jul 17 15:42:53 ma02 icingadb[205409]: pq: unexpected message 'E'; expected ReadyForQuery#012can't perform "INSERT INTO \"state_history\" (\"state_type\", \"check_source\", \"id\", \"previous_soft_state\", \"max_check_attempts\", \"previous_hard_state\", \"check_attempt\", \"scheduling_source\", \"endpoint_id\", \"host_id\", \"object_type\", \"service_id\", \"event_time\", \"soft_state\", \"hard_state\", \"output\", \"long_output\", \"environment_id\") VALUES (:state_type,:check_source,:id,:previous_soft_state,:max_check_attempts,:previous_hard_state,:check_attempt,:scheduling_source,:endpoint_id,:host_id,:object_type,:service_id,:event_time,:soft_state,:hard_state,:output,:long_output,:environment_id) ON CONFLICT ON CONSTRAINT pk_state_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""#012github.com/icinga/icingadb/internal.CantPerformQuery#012#011github.com/icinga/icingadb/internal/internal.go:30#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:394#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Failed with result 'exit-code'.

Without a config deploy at switch-over time the error message changes to:

Jul 17 15:48:06 ma02 icingadb[348949]: pq: cannot use serializable mode in a hot standby#012can't start transaction#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:253#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:128#012sync.(*Once).doSlow#012#011sync/once.go:74#012sync.(*Once).Do#012#011sync/once.go:65#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:126#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:218#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:335#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:250#012runtime.goexit#012#011runtime/asm_amd64.s:1594

Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.

The text was updated successfully, but these errors were encountered:

julianbrost · 2023-07-24T15:11:05Z

pq: unexpected message 'E'; expected ReadyForQuery

There's an issue in our database client driver: lib/pq#478. It would be interesting to know which error is hiding in there, so this will need further debugging.

I've also found an interesting comment in another project that claims (I haven't verified this myself but it sounds plausible) that the PostgreSQL wire protocol allows multiple error messages being sent: cockroachdb/cockroach#24149 (comment). So this sounds like it's quite possible this is an issue with the client, not the database cluster.

pq: cannot use serializable mode in a hot standby

According to the PostgreSQL documentation:

When the hot_standby parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written.

The Icinga DB daemon will always write to the database, which obviously can't work on a hot standby server. Could it be the case that connections are routed to the wrong server? Or is something like all servers are running as hot standby for a short time during a failover operation something that might happen?

log1-c · 2023-07-31T12:11:35Z

I have tested what happens during a switch-over:
Patroni node status:
Leader changes to "unknown" for a short time, new leader changes from "replica" to "leader".

hot_standby is necessary for patroni to work (zalando/patroni#1811 (comment)), so it can't be disabled.

So I'm not sure what to test/debug further.

But I think the IcingaDB daemon should not crash, when it can't connect to the database, but rather cache it's queries, like the IDO feature does/did.

julianbrost · 2023-07-31T12:59:46Z

For the "cannot use serializable mode in a hot standby" part of the issue, treating this error as retryable (similar to how we do for other server is starting/shutting down errors already), but from a quick online search, it looks like error code 0A000 (feature_not_supported) is used for that, so something quite generic (comparing error messages is always ugly, they better don't change ever).

log1-c · 2023-12-19T12:59:19Z

Follow up to this:

Due to the changes mentioned in the linked lib/pq issue the error message is now more specific:

Dec 17 20:18:19 msd-ic-ma02 icingadb[1674]: Starting history sync
Dec 17 20:18:20 msd-ic-ma02 icingadb[1674]: pq: cannot execute INSERT in a read-only transaction
                                            can't perform "INSERT INTO \"state_history\" 
...

We now "tuned" our haproxy config, so that it is faster in switching after detecting a non-functioning connection.
Since then the daemon did not crash on leading node switches.

Al2Klimov · 2023-12-19T15:06:21Z

Colleagues, we could re-try 25006 (read_only_sql_transaction) if we wish, once lib/pq#1136 has been merged. Shall we?

@log1-c Please could you add that code to https://github.com/Icinga/icingadb/blob/v1.1.1/pkg/retry/retry.go#L161-L175 locally and report what happens?

log1-c changed the title ~~IcingaDB daemon crashes when it can't reach the database.~~ IcingaDB daemon crashes when it can't reach the psql database. Jul 31, 2023

Al2Klimov self-assigned this Aug 9, 2023

Al2Klimov mentioned this issue Sep 26, 2023

conn#readReadyForQuery(): on error (E) message throw that specific error lib/pq#1136

Open

log1-c closed this as completed Dec 19, 2023

Al2Klimov removed their assignment Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IcingaDB daemon crashes when it can't reach the psql database. #620

IcingaDB daemon crashes when it can't reach the psql database. #620

log1-c commented Jul 24, 2023

julianbrost commented Jul 24, 2023

log1-c commented Jul 31, 2023

julianbrost commented Jul 31, 2023

log1-c commented Dec 19, 2023

Al2Klimov commented Dec 19, 2023

IcingaDB daemon crashes when it can't reach the psql database. #620

IcingaDB daemon crashes when it can't reach the psql database. #620

Comments

log1-c commented Jul 24, 2023

julianbrost commented Jul 24, 2023

log1-c commented Jul 31, 2023

julianbrost commented Jul 31, 2023

log1-c commented Dec 19, 2023

Al2Klimov commented Dec 19, 2023