Write checkable create & delete sla history events #566
I just had an idea for what we could call that type of SLA history, since we didn't really come up with a good name for this initially: lifecycle
Please don’t force-push for now.
// - Start another goroutine that consumes from `deleteEntities` concurrently. When the current sync subject is
//   of type checkable, this performs SLA lifecycle updates matching the checkable's id and `delete_time` 0. When
//   there is no tracked `created_at` event for a given checkable, this update is essentially a no-op, but it
//   forwards the entities nonetheless to the next stage, `updatedSlaLifeCycles`.
//
// - This stage is a no-op for all SLA lifecycles that already have `created_at` and `deleted_at` db records;
//   all duplicate key errors are ignored with the `INSERT ... IGNORE ON ERROR` mechanism. Nevertheless,
//   this stage also forwards all entities to the next one. This way we don't need to retrieve data from
//   the sla_lifecycle table to check whether a `created_at` event has already been recorded for any
//   given checkable.
That part is also relevant for the initial sync, so my first candidate for outsourcing into a common function would have been the whole SLA lifecycle deletion logic, i.e. the UPDATE query and the following INSERT IGNORE query as a fallback. But probably don't change anything here right now; I'll have another look at this if I can come up with a more concrete idea of how to do that.
Now that I'm taking a fresh look at this after some time, I'm wondering: what was the reason for updating rows instead of inserting separate rows for create and delete events?
I've updated the PR description with a summary of what queries are performed when (section "Implementation").
This also gave sort of an answer to this question: having that database structure, with the delete time being part of the primary key, prevents duplicate rows from being inserted for runtime updates.
Admittedly not the most beautiful concept, but it solves problems we'd have with 1 event = 1 row 👍
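The "delete time as part of the primary key" idea from this discussion can be illustrated with a minimal in-memory sketch, assuming hypothetical names (`lifecycleKey`, `insertIgnore`) rather than the actual schema code:

```go
package main

import "fmt"

// lifecycleKey mirrors the composite primary key (id, delete_time),
// where deleteTime == 0 means "not deleted yet".
type lifecycleKey struct {
	id         string
	deleteTime int64
}

// insertIgnore mimics an INSERT that ignores duplicate key errors:
// it reports whether a new row was actually inserted.
func insertIgnore(table map[lifecycleKey]int64, k lifecycleKey, createTime int64) bool {
	if _, ok := table[k]; ok {
		return false // duplicate key: ignored, no second alive row
	}
	table[k] = createTime
	return true
}

func main() {
	table := map[lifecycleKey]int64{}
	alive := lifecycleKey{id: "host1", deleteTime: 0}

	fmt.Println(insertIgnore(table, alive, 100)) // inserted: object is now alive
	fmt.Println(insertIgnore(table, alive, 200)) // runtime update: duplicate key, ignored

	// Deleting moves the row to delete_time = now, freeing up the
	// (id, 0) key, so a re-created object gets a fresh alive row.
	delete(table, alive)
	table[lifecycleKey{id: "host1", deleteTime: 150}] = 100
	fmt.Println(insertIgnore(table, alive, 300)) // inserted: new lifecycle
}
```

Because `(id, delete_time=0)` can exist at most once, repeated runtime upserts collapse into a no-op instead of producing one row per event.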
Why do we need this?
Currently we generate SLA history events only when there are e.g. state change or downtime start/end events for the checkables. Under some circumstances (if a checkable is created once and never deleted) this should be sufficient. However, when you e.g. delete a host, re-create it a couple of days later, and want to generate SLA reports for this host at the end of the week, the result can vary depending on which state the host had before it was deleted. In order to generate SLA reports as accurately as possible, we decided to track the checkable creation and deletion time on top of the existing info. And since Icinga 2 doesn't really know when an object has been deleted (at least not in a simple way), this PR takes care of it.
Though, Icinga DB doesn't know when an object has been deleted either; it just takes the time the delete event for that object arrived and puts it into the new table. This means that when you delete checkables while Icinga DB is stopped, the events Icinga DB writes after it is started won't reflect the actual delete/create time. Still, there is no better way to handle this gracefully.
Config sync
To avoid additional DB queries during the initial config sync for determining the host ids of services, this PR introduces its own `Fingerprint` interface implementation for the `Service` type. This makes it possible to pre-select all host ids from the database while computing the config delta.

Icinga DB could also be stopped (or crash due to system errors) during the config dump, which could cause some checkables to be created in or removed from the regular Icinga DB tables while no events are written to the new table. To avoid such inconsistencies, the `SlaLifecycle` queries are executed first, and only then are the checkables passed on using the on-success mechanism. (This also applies to the runtime events `upsert` & `delete`.)

Implementation
The new table `sla_history_lifecycle` has a primary key over `(id, delete_time)`, where `delete_time=0` means "not deleted yet" (the column has to be `NOT NULL` due to being included in the primary key). `id` is basically an object identifier (a hash over env + host + service IDs). This ensures that there can only be one row per object stating that the object is currently alive in Icinga 2.

Initial sync
Create

Performs a simple `INSERT` operation with `PK=(id, delete_time=0)` (TODO: this should probably have some "on duplicate key ignore" in case the sync was interrupted after writing the lifecycle but before actually inserting the object).

Update
Nothing to be done here (object existed before and continues to exist).
Delete

An `UPDATE` setting `delete_time = now` (i.e. updating the PK of the row), which marks the alive row for the object as deleted (if such a row exists). Then an `INSERT` that ignores duplicate keys with the same timestamp. So in case there was no row to be updated, it will now be inserted (otherwise, this query is a no-op). This is especially important for the case where objects were created before this feature became available.

Runtime updates
Upsert

Performs an `INSERT` that ignores duplicate keys for both create and update events (these look identical in the runtime update stream). If the object is already marked as alive in `sla_history_lifecycle`, this does nothing; otherwise it marks the object as created now (including when an object that was created before this feature was enabled is updated).

Delete

Does basically the same as delete during the initial sync.
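The two-step delete described above (an `UPDATE` moving the alive row, then an `INSERT` ignoring duplicates as a fallback) can be sketched against an in-memory stand-in for `sla_history_lifecycle`. All names here are illustrative, not the actual implementation:

```go
package main

import "fmt"

// row is a stand-in for a sla_history_lifecycle record;
// deleteTime == 0 means the object is currently alive.
type row struct {
	id                     string
	createTime, deleteTime int64
}

// markDeleted first tries the UPDATE path: move the alive row
// (delete_time = 0) to delete_time = now. If no alive row exists (e.g. the
// object predates this feature), the INSERT-ignore fallback writes a row,
// unless a row with the same (id, now) key is already present.
func markDeleted(table []row, id string, now int64) []row {
	for i := range table {
		if table[i].id == id && table[i].deleteTime == 0 {
			table[i].deleteTime = now // UPDATE path: PK changes
			return table
		}
	}
	for _, r := range table { // INSERT ... duplicate keys ignored
		if r.id == id && r.deleteTime == now {
			return table // duplicate key: no-op
		}
	}
	return append(table, row{id: id, deleteTime: now})
}

func main() {
	table := []row{{id: "host1", createTime: 100, deleteTime: 0}}
	table = markDeleted(table, "host1", 500) // UPDATE path
	table = markDeleted(table, "host2", 500) // fallback INSERT path
	for _, r := range table {
		fmt.Printf("%s deleted at %d\n", r.id, r.deleteTime)
	}
}
```

The fallback `INSERT` is what keeps deletions of objects created before this feature existed from being silently lost.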