Write checkable create & delete sla history events #566

Open
wants to merge 16 commits into base: main

yhabteab
Member

@yhabteab yhabteab commented Feb 23, 2023

Why do we need this?

Currently, SLA history events are generated only for state changes and for downtime start and end events of checkables. If a checkable is created once and never deleted, this is sufficient. However, if you delete a host, recreate it a couple of days later, and generate SLA reports for it at the end of the week, the result can vary depending on the state the host had before it was deleted. To make SLA reports as accurate as possible, we decided to track checkable creation and deletion times on top of the existing information. Since Icinga 2 doesn't really know when an object has been deleted (at least not in a simple way), this PR takes care of it.

Icinga DB doesn't know when an object has been deleted either, though. It just takes the time at which the delete event for that object arrived and writes it to the new table. This means that if you delete checkables while Icinga DB is stopped, the events Icinga DB writes after it is started again won't reflect the actual delete/create time. There is no better way to handle this gracefully, though.

Config sync

To avoid additional DB queries for determining host IDs during the initial config sync of services, this PR introduces a dedicated Fingerprint interface implementation for the Service type. This makes it possible to pre-select all host IDs from the database while computing the config delta.

Icinga DB could also be stopped (or crash due to system errors) during the config dump. In that case, some checkables might be created in or removed from the regular Icinga DB tables without any events being written to the new table. To avoid such inconsistencies, the SlaLifecycle queries are executed first, and only then are the checkables passed on using the on-success mechanism. (This also applies to the runtime upsert and delete events.)
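The ordering described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: `writeLog`, `syncCheckables` and the `onSuccess` channel are hypothetical names, and the database writes are replaced by log entries.

```go
package main

import (
	"fmt"
	"sync"
)

// writeLog records the order in which the two (simulated) tables are written.
type writeLog struct {
	mu      sync.Mutex
	entries []string
}

func (l *writeLog) add(entry string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, entry)
}

// syncCheckables writes the lifecycle row for each ID first and forwards the ID
// to the writer of the regular table only after that write has succeeded.
func syncCheckables(ids []string, wl *writeLog) {
	onSuccess := make(chan string) // IDs whose lifecycle row has been written

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for id := range onSuccess {
			wl.add("checkable:" + id) // stands in for the regular table write
		}
	}()

	for _, id := range ids {
		wl.add("lifecycle:" + id) // stands in for the SLA lifecycle query
		onSuccess <- id
	}
	close(onSuccess)
	wg.Wait()
}

func main() {
	wl := &writeLog{}
	syncCheckables([]string{"host-a", "host-b"}, wl)
	for _, entry := range wl.entries {
		fmt.Println(entry)
	}
}
```

If the process stops between the two writes, the worst case in this sketch is a lifecycle row without a matching checkable, never the other way around.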

Implementation

The new table sla_history_lifecycle has a primary key over (id, delete_time), where delete_time=0 means "not deleted yet" (the column has to be NOT NULL because it is part of the primary key). id is basically an object identifier (a hash over env + host + service IDs). This ensures that there can be only one row per object stating that the object is currently alive in Icinga 2.
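To illustrate how such rows can be read, here is a minimal sketch that treats each row as one lifetime of an object. The `lifetime` struct, its `CreateTime` field and the `aliveAt` helper are assumptions made for this example; only id and delete_time are taken from the description above.

```go
package main

import "fmt"

// lifetime is a hypothetical in-memory view of one sla_history_lifecycle row.
type lifetime struct {
	ID         string
	CreateTime int64 // assumed column holding the creation timestamp
	DeleteTime int64 // 0 means "not deleted yet"
}

// aliveAt reports whether the object with the given ID existed at time t
// according to the given rows.
func aliveAt(rows []lifetime, id string, t int64) bool {
	for _, r := range rows {
		if r.ID != id || t < r.CreateTime {
			continue
		}
		if r.DeleteTime == 0 || t < r.DeleteTime {
			return true
		}
	}
	return false
}

func main() {
	rows := []lifetime{
		{ID: "host-a", CreateTime: 100, DeleteTime: 200}, // first lifetime, already deleted
		{ID: "host-a", CreateTime: 300, DeleteTime: 0},   // recreated, still alive
	}
	// Prints: true false true
	fmt.Println(aliveAt(rows, "host-a", 150), aliveAt(rows, "host-a", 250), aliveAt(rows, "host-a", 350))
}
```

With this view, a report could ignore the gap between 200 and 300 instead of extending the last known state across it.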

Initial sync

Create

Performs a simple INSERT operation with PK=(id, delete_time=0) (TODO: this should probably have some "on duplicate key ignore" in case the sync was interrupted after writing the lifecycle but before actually inserting the object).

Update

Nothing to be done here (object existed before and continues to exist).

Delete

  1. Performs an UPDATE setting delete_time = now (i.e. updates the PK of the row) marking the alive row for the object as deleted (if it already exists).
  2. Additionally, it performs an INSERT with ignore for duplicate keys with the same timestamp. So in case there was no row to be updated, it will now be inserted (otherwise, this query is a no-op). This is especially important for objects that were created before this feature became available.

Runtime updates

Upsert

Performs an INSERT with ignore for duplicate keys for both create and update events (these look identical in the runtime update stream). If the object is already marked as alive in sla_history_lifecycle, this does nothing; otherwise, it marks the object as created now (this includes the case where an object that was created before this feature was enabled is updated).

Delete

Does basically the same as delete during initial sync.

@cla-bot cla-bot bot added the cla/signed label Feb 23, 2023
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 3 times, most recently from 2853ab4 to 87e94ac Compare February 24, 2023 09:14
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 1cc1f09 to 80a76e5 Compare February 27, 2023 12:00
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 5 times, most recently from 2fc3663 to c160354 Compare March 2, 2023 12:54
@yhabteab yhabteab requested a review from Al2Klimov March 2, 2023 12:56
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 3 times, most recently from 2a4824a to 2717c61 Compare March 2, 2023 14:41
@yhabteab yhabteab requested a review from Al2Klimov March 2, 2023 14:42
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 93d6f3f to 69bc4ad Compare March 2, 2023 16:37
@yhabteab yhabteab requested review from Al2Klimov and removed request for Al2Klimov March 2, 2023 16:39
@julianbrost
Contributor

I just had an idea for what we could call that type of SLA history, after we didn't really come up with a good name for it initially: lifecycle

@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 4 times, most recently from 753dba4 to f1878aa Compare March 3, 2023 09:47
Member

@Al2Klimov Al2Klimov left a comment

Please don’t force-push for now.

@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from da02b9e to aeb8469 Compare June 13, 2023 10:42
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from 1c48f08 to af9db71 Compare June 13, 2023 12:35
pkg/icingadb/runtime_updates.go
Comment on lines +166 to +175
// - Start another goroutine that consumes from `deleteEntities` concurrently. When the current sync subject is
// of type checkable, this performs SLA lifecycle updates matching the checkable's ID and `delete_time` 0. When
// there is no tracked `created_at` event for a given checkable, this update is essentially a no-op, but
// the entities are nonetheless forwarded to the next one, `updatedSlaLifeCycles`.
//
// - This stage is a no-op for all SLA lifecycle entries that already have `created_at` and `deleted_at` DB
// records, as all duplicate key errors are ignored with the `INSERT ... IGNORE ON ERROR` mechanism. Nevertheless,
// this stage also forwards all entities to the next one. This way we don't need to retrieve data from
// the sla_lifecycle table to check whether a `created_at` event has already been recorded for any
// given checkable.
Contributor

That part is also relevant for the initial sync, so my first candidate for outsourcing into a common function would have been the whole SLA lifecycle deletion logic, i.e. the UPDATE query and the following INSERT IGNORE query as a fallback. But it's probably best not to change anything here right now; I'll have another look at this to see if I can come up with a more concrete idea of how to do that.

@julianbrost julianbrost added this to the 1.2.0 milestone Jun 14, 2023
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 5000240 to 4566ba4 Compare June 15, 2023 17:54
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from 4566ba4 to 086f23a Compare June 16, 2023 08:12
@julianbrost
Contributor

Now that I'm taking a fresh look at this after some time, I'm wondering: what was the reason for updating rows instead of inserting separate rows for create and delete events?

@julianbrost julianbrost modified the milestones: 1.1.1, 1.2.0 Jul 27, 2023
@julianbrost
Contributor

I've updated the PR description with a summary of what queries are performed when (section "Implementation").

> Now that I'm taking a fresh look at this after some time, I'm wondering: what was the reason for updating rows instead of inserting separate rows for create and delete events?

This also sort of answered that question: having the delete time be part of the primary key prevents duplicate rows from being inserted for runtime updates.

@Al2Klimov
Member

Admittedly not the most beautiful concept, but it solves problems we'd have with 1 event = 1 row 👍

@julianbrost julianbrost removed this from the 1.2.0 milestone Mar 5, 2024
Labels
cla/signed enhancement New feature or request
3 participants