Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle template errors better #136

Conversation

JustinKuli
Copy link
Member

@JustinKuli JustinKuli commented Apr 23, 2024

It appears that "special" compliance events related to template-errors can potentially be superseded by later events coming from a policy controller. Specifically, if a template is valid and gets created, but is later edited to have an error, the original template is not removed. Because undesired side-effects could be caused by removing a template (eg PruneObjectBehavior in a ConfigurationPolicy), it would be risky to "fix" this by removing the template in this situation. Instead, this PR proposes that the template-sync always re-reconcile after the status is updated, which allows it to correct the "late" event by emitting a new template-error.

The increased number of reconciles does increase the possibility of duplicate compliance events. Accordingly, some tests have been relaxed, and the duplicate detection process now refreshes the policy to try and decrease the probability. It also fixes a bug that would cause a large number of duplicated events in this scheme, because the event was being recorded on the wrong item in the history.

This PR also handles hub-template errors here, instead of in individual policy controllers.

Refs:

@mprahl
Copy link
Member

mprahl commented Apr 24, 2024

@JustinKuli a bit of an odd thought, but in a template error situation where the current configuration policy has pruneObjectBehavior, could the template-sync disable pruneObjectBehavior and then delete it so that a template error situation always results in the policy template being deleted?

@JustinKuli
Copy link
Member Author

JustinKuli commented Apr 24, 2024

@JustinKuli a bit of an odd thought, but in a template error situation where the current configuration policy has pruneObjectBehavior, could the template-sync disable pruneObjectBehavior and then delete it so that a template error situation always results in the policy template being deleted?

@mprahl it could... It would potentially lose the "createdByPolicy" information (which I think we already document as unreliable), and my instinct is that the process would be unreliable.

@mprahl
Copy link
Member

mprahl commented Apr 24, 2024

@JustinKuli a bit of an odd thought, but in a template error situation where the current configuration policy has pruneObjectBehavior, could the template-sync disable pruneObjectBehavior and then delete it so that a template error situation always results in the policy template being deleted?

@mprahl it could... It would potentially lose the "createdByPolicy" information (which I think we already document as unreliable), and my instinct is that the process would be unreliable.

What would be unreliable? I think an update and then delete would be reliable since the config-policy-controller won't notice the deletion until it has refreshed its cache and thus seen the update.

I'm thinking that it's probably the more expected behavior to remove the previous template rather than having an in between state where old clusters have the old version and new clusters don't have a version of the policy template.

The plan is to add template support to more policy types. Currently it
is restricted to ConfigurationPolicies. This removes that restriction
and reports hub template errors that would be added by the propagator,
so that each policy controller does not need to duplicate the logic to
emit those events themselves.

Refs:
 - https://issues.redhat.com/browse/ACM-10858

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
The compliance events emitted by the template-sync controller should
have priority over any created by the policy controllers. For example,
in the past we had issues where a "late" compliance event from a
controller would override the Pending status of a policy, and there are
other template-error statuses that are emitted by the template-sync
which could have the same situation. To ensure that a template-sync event
is always emitted after a different compliance event, this controller
should reconcile in more situations.

Note that it might seem better if the template-sync controller removed
the templates that have had an error, but in the case of
ConfigurationPolicy, removing it can cause undesirable side-effects due
to PruneObjectBehavior.

Refs:
 - https://issues.redhat.com/browse/ACM-10858

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
When adding a template-error to the status for a template that can't be
decoded, the status-sync was incorrectly appending to the details array
twice. So for something that would be called "template-0", the 0th item
in the details array would be almost empty, and the 1st item would have
all of the status events. This was causing some events to be emitted
repeatedly, since the template-sync logic to prevent this was looking at
the 0th item, which would never get the event.

This commit also cleans up some awkward pointer gymnastics.

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
This should help reduce some duplication of status events, but it might
not fully prevent it.

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
With the template-sync now reconciling more often (there are no longer
any cases where the reconcile can be skipped), some event duplication
might be unavoidable.

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
@JustinKuli
Copy link
Member Author

@mprahl I would prefer to more thoroughly discuss separately whether (and how) to remove templates during an error case, rather than hold this PR for the decision. I think the behavior of this PR is a step forward.

Copy link

openshift-ci bot commented Apr 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JustinKuli, mprahl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants