-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Managed Infrastructure Maintenance Operator - Milestone 1 #3571
base: master
Are you sure you want to change the base?
Conversation
Please rebase pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've started a review, and reached my ingestion limit. I'll keep reviewing later.
@@ -228,6 +231,7 @@ require ( | |||
github.com/robfig/cron v1.2.0 // indirect | |||
github.com/rogpeppe/go-internal v1.12.0 // indirect | |||
github.com/russross/blackfriday v1.6.0 // indirect | |||
github.com/russross/blackfriday/v2 v2.1.0 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why we need a markdown library in our go.mod?
// Licensed under the Apache License 2.0. | ||
|
||
type MaintenanceManifestDocuments struct { | ||
Count int `json:"_count,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a Count
? Is that not the same as len(MaintenanceManifestDocuments)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's returned by the cosmosdb API
) | ||
|
||
const ( | ||
MaintenanceManifestQueryForCluster = `SELECT * FROM MaintenanceManifests doc WHERE doc.maintenanceManifest.state IN ("Pending") AND doc.clusterID = @clusterID` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is select *
any less performant than select <fields>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it's json, probably not
|
||
type MaintenanceManifests interface { | ||
Create(context.Context, *api.MaintenanceManifestDocument) (*api.MaintenanceManifestDocument, error) | ||
GetByClusterID(context.Context, string, string) (cosmosdb.MaintenanceManifestDocumentIterator, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the fact that the manifests also include a ResourceId
I think we should also have a get based on that.
pkg/database/mimo.go
Outdated
PatchWithLease(context.Context, string, string, MaintenanceManifestDocumentMutator) (*api.MaintenanceManifestDocument, error) | ||
Lease(context.Context, string, string) (*api.MaintenanceManifestDocument, error) | ||
EndLease(context.Context, string, string, api.MaintenanceManifestState, *string) (*api.MaintenanceManifestDocument, error) | ||
Dequeue(ctx context.Context, clusterID string, id string) (*api.MaintenanceManifestDocument, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the equivalent of cancelling a MaintenanceSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dequeue is part of the logic of claiming it as being worked on -- see the openshiftcluster one for where this is from
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict { | ||
err.StatusCode = http.StatusPreconditionFailed | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we overwriting the http status condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like to me it's because of line 143. We're saying that in case of a conflict we want to change it to a status that will have the cosmosdb Retry function retry the request. If this is the case I think commenting this would be helpful in-case the functionality of functions that use the cosmosdb Retry function change in the future.
return c.c.Get(ctx, clusterID, id, nil) | ||
} | ||
|
||
// QueueLength returns maintenanceManifests un-queued document count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is a little confusing. I think we need to work a little more on definitional language here.
e.g.:
- Scheduled
- Queued
- Pending
- etc
And what these statuses might all mean in practice.
It seems to me like QueueLength is returning the list of MaintenanceSets that are pending delivery to the actuator?
pkg/database/mimo.go
Outdated
func (c *maintenanceManifests) Patch(ctx context.Context, clusterID string, id string, f MaintenanceManifestDocumentMutator) (*api.MaintenanceManifestDocument, error) { | ||
var doc *api.MaintenanceManifestDocument | ||
|
||
err := cosmosdb.RetryOnPreconditionFailed(func() (err error) { | ||
doc, err = c.Get(ctx, clusterID, id) | ||
if err != nil { | ||
return | ||
} | ||
|
||
err = f(doc) | ||
if err != nil { | ||
return | ||
} | ||
|
||
doc, err = c.c.Replace(ctx, doc.ClusterID, doc, nil) | ||
return | ||
}) | ||
|
||
return doc, err | ||
} | ||
|
||
func (c *maintenanceManifests) patch(ctx context.Context, clusterID string, id string, f MaintenanceManifestDocumentMutator, options *cosmosdb.Options) (*api.MaintenanceManifestDocument, error) { | ||
var doc *api.MaintenanceManifestDocument | ||
|
||
err := cosmosdb.RetryOnPreconditionFailed(func() (err error) { | ||
doc, err = c.Get(ctx, clusterID, id) | ||
if err != nil { | ||
return | ||
} | ||
|
||
err = f(doc) | ||
if err != nil { | ||
return | ||
} | ||
|
||
doc, err = c.update(ctx, doc, options) | ||
return | ||
}) | ||
|
||
return doc, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally these two functions are almost identical. What's the purpose of c.update
vs c.c.Replace
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like c.update
calls c.c.Replace
appears that this is pretty much duplicate functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so only difference is the ability to pass-in cosmosdb functions. I think these two should be mostly combined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've managed to deduplicate this now
return c.c.ChangeFeed(nil) | ||
} | ||
|
||
func (c *maintenanceManifests) GetByClusterID(ctx context.Context, clusterID string, continuation string) (cosmosdb.MaintenanceManifestDocumentIterator, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the type of continuation
be defined as not a string
but some other type for the purposes of understanding what values are acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's an opaque string, since it's part of the cosmosdb API
pkg/database/mimo.go
Outdated
PatchWithLease(context.Context, string, string, MaintenanceManifestDocumentMutator) (*api.MaintenanceManifestDocument, error) | ||
Lease(context.Context, string, string) (*api.MaintenanceManifestDocument, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear from the implementation what the purpose of Lease and PatchWithLease are supposed to be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've cleaned these up a bit and added comments
/azp run ci, e2e |
Azure Pipelines successfully started running 2 pipeline(s). |
Please rebase pull request. |
Which issue this PR addresses:
Part of https://issues.redhat.com/browse/ARO-4895.
What this PR does / why we need it:
This PR is the initial feature branch for the MIMO M1 milestone.
Q&A
What should I use this PR for?
This PR is for people to understand the general state of MIMO, how it works, and how they can implement tasks in it. It is not meant to merge as-is.
Where should I start looking?
pkg/mimo/tasks/example/task.go
if you want to see how the developer experience is.pkg/mimo/actuator/manager.go
for the service logic.Looks great, when can I have it?
See https://issues.redhat.com/browse/ARO-4895 for all the things we need before we can roll this out. Namely, we need to do some porting of the infrastructure away from being solely on the VMSSes before we can roll this out.
What is final, and what is subject to change?
The runtime logic in
service.go
is very subject to change. Right now, it doesn't use changefeeds for a complicated reason (because I want to see if we could get away with using the official CosmosDB SDK as a test...) and I tried to make it so that it would handle a lot of clusters if it had to, but not get ahead of myself with the scaling of the actuator. (There's several approaches we could take -- this is more like the Monitor, but we could use the backend of OpenShiftClusters style + dequeueing. However, since we don't want two tasks for the same cluster worked on at the same time, the same dequeue style doesn't seem very efficient. In the interim, I'm approaching it from the angle of the monitor where there's the 255 or so buckets, but foregoing the auto bucket balancing and instead just statically distributing them based on something like a StatefulSet instance ID, relying on the buckets to be relatively statistically spread out. I'm not sure if that's even a good idea, but I can worry about it after vacation, and people can chime in with ideas.)The CLI uses urfave/cli/v2 instead of Cobra because of the better support for environment vars as selectively-required arguments than cobra+viper. This might be worth standardising on, or moving to cobra and dealing with a bit more manual boilerplate.
The idea of the TaskFunc and TaskContext is relatively final, although the arguments/params are not. The idea of the separation is seen in
pkg/mimo/tasks/example/task_test.go
where one can ignore the rest of MIMO and just focus on the task implementation and test environment. This, hopefully, makes the business logic fairly portable in case of some other better solution in the future.This PR is very big!
Yes, since it consists of not only the database, CLI, task, and other machinery for MIMO, but likely other improvements/code cleanups/etc. Anything that is not part of the new functionality (e.g. refactors) are planned to be split out as and where they make sense to do so.
Wow your commit history sucks
git rebase
will end up putting those skeletons in the closet, don't worry :)Test plan for issue:
Some unit tests are included. Proper QA will need to occur.
Is there any documentation that needs to be updated for this PR?
Yes, see https://issues.redhat.com/browse/ARO-4895 .
How do you know this will function as expected in production?
Telemetry, monitoring, and documentation will need to be fleshed out. See https://issues.redhat.com/browse/ARO-4895 for details.