IcingaDB daemon crashes when it can't reach the psql database. #620

Closed
log1-c opened this issue Jul 24, 2023 · 5 comments

@log1-c

log1-c commented Jul 24, 2023

We are running Postgres as a backend for IcingaDB. Postgres is running as a Patroni cluster with 3 nodes. On the Icinga Masters we use pgbouncer and haproxy for the connection to the database.
pgbouncer listens for the connections from Icinga, and haproxy is set up to present one of the servers on localhost to pgbouncer.

We experienced some crashes of the daemon with error messages similar to those in #577.
After some testing we noticed the following:

If a config deployment (or presumably any action that triggers a reload of the icinga2 service) happens at the time of a switch-over of the leading node, we get the "insert into" error:

Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597b190: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 closing because: server conn crashed? (age=25138s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659739a0: icingadb/icingadb@[::1]:56192 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597acd0: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 closing because: server conn crashed? (age=1499s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d4659734e0: icingadb/icingadb@[::1]:45092 pooler error: server conn crashed?
Jul 17 15:42:49 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 closing because: server conn crashed? (age=4173s)
Jul 17 15:42:49 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:35834 pooler error: server conn crashed?
Jul 17 15:42:50 ma02 pgbouncer[875]: C-0x64d465973020: icingadb/icingadb@[::1]:34438 login attempt: db=icingadb user=icingadb tls=no
Jul 17 15:42:50 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 new connection to server (from 127.0.0.1:43752)
Jul 17 15:42:52 ma02 pgbouncer[875]: S-0x64d46597aa70: icingadb/icingadb@127.0.0.1:6432 closing because: server conn crashed? (age=2s)
Jul 17 15:42:52 ma02 pgbouncer[875]: stats: 0 xacts/s, 1 queries/s, in 556 B/s, out 75 B/s, xact 37939 us, query 9738 us, wait 0 us
Jul 17 15:42:53 ma02 icingadb[205409]: pq: unexpected message 'E'; expected ReadyForQuery#012can't perform "INSERT INTO \"state_history\" (\"state_type\", \"check_source\", \"id\", \"previous_soft_state\", \"max_check_attempts\", \"previous_hard_state\", \"check_attempt\", \"scheduling_source\", \"endpoint_id\", \"host_id\", \"object_type\", \"service_id\", \"event_time\", \"soft_state\", \"hard_state\", \"output\", \"long_output\", \"environment_id\") VALUES (:state_type,:check_source,:id,:previous_soft_state,:max_check_attempts,:previous_hard_state,:check_attempt,:scheduling_source,:endpoint_id,:host_id,:object_type,:service_id,:event_time,:soft_state,:hard_state,:output,:long_output,:environment_id) ON CONFLICT ON CONSTRAINT pk_state_history DO UPDATE SET \"id\" = EXCLUDED.\"id\""#012github.com/icinga/icingadb/internal.CantPerformQuery#012#011github.com/icinga/icingadb/internal/internal.go:30#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:394#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*DB).NamedBulkExec.func1.1.1#012#011github.com/icinga/icingadb/pkg/icingadb/db.go:389#012golang.org/x/sync/errgroup.(*Group).Go.func1#012#011golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57#012runtime.goexit#012#011runtime/asm_amd64.s:1594
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 17 15:42:53 ma02 systemd[1]: icingadb.service: Failed with result 'exit-code'.

Without a config deploy at switch-over time the error message changes to:

Jul 17 15:48:06 ma02 icingadb[348949]: pq: cannot use serializable mode in a hot standby#012can't start transaction#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:253#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:45#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012can't retry#012github.com/icinga/icingadb/pkg/retry.WithBackoff#012#011github.com/icinga/icingadb/pkg/retry/retry.go:64#012github.com/icinga/icingadb/pkg/icingadb.(*HA).realize#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:245#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:211#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA aborted#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:128#012sync.(*Once).doSlow#012#011sync/once.go:74#012sync.(*Once).Do#012#011sync/once.go:65#012github.com/icinga/icingadb/pkg/icingadb.(*HA).abort#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:126#012github.com/icinga/icingadb/pkg/icingadb.(*HA).controller#012#011github.com/icinga/icingadb/pkg/icingadb/ha.go:218#012runtime.goexit#012#011runtime/asm_amd64.s:1594#012HA exited with an error#012main.run#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:335#012main.main#012#011github.com/icinga/icingadb/cmd/icingadb/main.go:37#012runtime.main#012#011runtime/proc.go:250#012runtime.goexit#012#011runtime/asm_amd64.s:1594

Looks like the icingadb daemon doesn't like it if the db cluster isn't available for even a short time.

@julianbrost
Contributor

pq: unexpected message 'E'; expected ReadyForQuery

There's an issue in our database client driver: lib/pq#478. It would be interesting to know which error is hiding in there, so this will need further debugging.

I've also found an interesting comment in another project that claims (I haven't verified this myself, but it sounds plausible) that the PostgreSQL wire protocol allows multiple error messages to be sent: cockroachdb/cockroach#24149 (comment). So it seems quite possible that this is an issue with the client, not the database cluster.

pq: cannot use serializable mode in a hot standby

According to the PostgreSQL documentation:

When the hot_standby parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written.

The Icinga DB daemon will always write to the database, which obviously can't work on a hot standby server. Could it be the case that connections are routed to the wrong server? Or could all servers be running as hot standbys for a short time during a failover operation?

@log1-c
Author

log1-c commented Jul 31, 2023

I have tested what happens during a switch-over:
Patroni node status:
Leader changes to "unknown" for a short time, new leader changes from "replica" to "leader".

hot_standby is necessary for Patroni to work (zalando/patroni#1811 (comment)), so it can't be disabled.

So I'm not sure what to test/debug further.

But I think the IcingaDB daemon should not crash when it can't connect to the database, but rather cache its queries, like the IDO feature does/did.

@log1-c log1-c changed the title IcingaDB daemon crashes when it can't reach the database. IcingaDB daemon crashes when it can't reach the psql database. Jul 31, 2023
@julianbrost
Contributor

For the "cannot use serializable mode in a hot standby" part of the issue, we could treat this error as retryable (similar to what we already do for other "server is starting up/shutting down" errors). However, from a quick online search, it looks like error code 0A000 (feature_not_supported) is used for that, which is quite generic, so we would also have to compare the error message (comparing error messages is always ugly; they had better never change).

@log1-c
Author

log1-c commented Dec 19, 2023

Follow up to this:

Due to the changes mentioned in the linked lib/pq issue the error message is now more specific:

Dec 17 20:18:19 msd-ic-ma02 icingadb[1674]: Starting history sync
Dec 17 20:18:20 msd-ic-ma02 icingadb[1674]: pq: cannot execute INSERT in a read-only transaction
                                            can't perform "INSERT INTO \"state_history\" 
...

We have now "tuned" our haproxy config so that it switches faster after detecting a non-functioning connection.
Since then, the daemon has not crashed on leader switches.

@log1-c log1-c closed this as completed Dec 19, 2023
@Al2Klimov
Member

Colleagues, we could re-try 25006 (read_only_sql_transaction) if we wish, once lib/pq#1136 has been merged. Shall we?

@log1-c Please could you add that code to https://github.com/Icinga/icingadb/blob/v1.1.1/pkg/retry/retry.go#L161-L175 locally and report what happens?

@Al2Klimov Al2Klimov removed their assignment Dec 19, 2023