Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operator should check all instances versions before upgrading itself #485

Open
yaronp68 opened this issue Nov 23, 2020 · 10 comments
Open

Operator should check all instances versions before upgrading itself #485

yaronp68 opened this issue Nov 23, 2020 · 10 comments
Labels
design-doc-needed Design doc required before implementation enhancement New feature or request never-stale Issue or PR marked to never go stale

Comments

@yaronp68
Copy link

As a user I want to make sure I will be able to upgrade all my clusters through the operator. To prevent such upgrade issues, the operator should not upgrade itself unless all instances are in the latest minor or major the operator supports before upgrade. In case one or more instances are not upgraded the operator should not upgrade and issue a warning. The user can override this with some flag in the yaml to ignore upgrade path problems or similar

@ansd
Copy link
Member

ansd commented Nov 23, 2020

... such that we comply with

However, note that only upgrading from one minor to the next minor or major is supported. To upgrade from e.g. 3.6.16 to 3.8.7, it is necessary to upgrade to 3.7.28 first.

https://www.rabbitmq.com/feature-flags.html#version-compatibility

@MirahImage
Copy link
Member

A key point with this is that the validation needs to happen before the operator is upgraded, warning when attempting to apply YAML to upgrade the operator version. This feels like a good case for a validating webhook for the operator itself, if that's possible?

@mkuratczyk
Copy link
Collaborator

I wonder under what circumstances this would be actually used. The quote from RabbitMQ docs is not necessarily an argument for a check like this - we might as well allow the operator to upgrade but then prevent a (hypothetical) 3.8->3.10 cluster upgrade. For me that's more flexible and works better when the person responsible for the Operator's deployment and the owner of the RabbitMQ cluster are different folks on different teams. Now, how would the Operator know that 3.8->3.10 is attempted is less clear - since images can be relocated and renamed, relying on image tags can be tricky (although we could make that an explicit requirement for relocated images).

Another option is something we discussed a few months back, which is to add metadata to the OCI images. When the image property changes in the RabbitmqCluster definition, Operator could query the metadata of the new image from the OCI Registry to find version information (version is already available in that metadata; you can use docker inspect rabbitmq to see what's there). A nice bonus here would be that the image could contain information about the minimum operator version required so that if there is a breaking change in the image/RabbitMQ, we could make sure that even an old Operator would behave correctly (it wouldn't know how to handle the breaking change itself but it could refuse to upgrade to that version). Hopefully we would never need that feature if we just prevented skipping minors but it did happen in the past (there was a 3.6.x patch release that had a breaking change).

@MirahImage MirahImage added the enhancement New feature or request label Dec 1, 2020
@github-actions
Copy link

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

@github-actions github-actions bot added the stale Issue or PR with long period of inactivity label Jan 31, 2021
@MirahImage MirahImage added never-stale Issue or PR marked to never go stale and removed stale Issue or PR with long period of inactivity labels Feb 2, 2021
@mkuratczyk mkuratczyk added the design-doc-needed Design doc required before implementation label Oct 1, 2021
@kaushiksrinivas
Copy link

@MirahImage @yaronp68 @mkuratczyk
Is there any update on this discussion ?. Code browsing reconcilers in the operator, we do not see version compatibility of rabbitmq being considered and statefulset is just updated with the latest update.

Documentation page of operator highlights this, Scaling up and automated rolling upgrades of RabbitMQ clusters.
Would it be better to consider version compatibility checks during rolling upgrade attempts ?

Also, we do not see features like maintenance mode, rebalancing queues during scaling cluster etc. being used. Are these not in the scope of the operator yet ?

@mkuratczyk
Copy link
Collaborator

Hi.

Rebalancing and maintenance mode are used:
https://github.com/rabbitmq/cluster-operator/blob/main/api/v1beta1/rabbitmqcluster_types.go#L76-L80
https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/statefulset.go#L620

As for version version checks, we keep coming back to this discussion but it's not implemented yet. Have you run into a case where an upgrade failed because of no such checks?

The problem is ultimately with the flexibility of containers. Since we allow the image value to be set arbitrarily, we can't really tell what version will be deployed after the image field is changed (it is currently possible to set the image to something like rmq:prod with no version info at all). As in my previous comment - we could make a specific tag structure a requirement or query the registry first to check image metadata (again, we would need to require relocated images to still contain specific keys/values in the metadata) but for the time being we don't do that.

@kaushiksrinivas
Copy link

@mkuratczyk
No, we have not run into the issue so far. But we were just examining the upgrade procedures considering a lot of documented incompatible versions in rabbitmq. Is there any mitigation plans in place suppose such an incompatible version upgrade is done and quorum is disrupted ?

@mkuratczyk
Copy link
Collaborator

Can you give some examples of documented incompatible versions in rabbitmq?
This issues is here because we realize the day will likely come when we need to implement that. However, the Operator was developed with RabbitMQ 3.8 in mind and since then 3.9 was released. 3.8->3.9 upgrade should be pretty straightforward - there was not need to implement this check.

@discostur
Copy link

I think this should be handled different - a common way would be to set a "cluster version" in your rabbitmqcluster.rabbitmq.com CRD. The user should not be able to set a custom docker image url / tag. The operator should then watch the CRD and if you increase your cluster version from 3.8 -> 3.9 is should trigger a rolling upgrade of the existing containers (one by one).

@mkuratczyk
Copy link
Collaborator

Thanks for your input. This is one of the options we are considering. However, it's not without drawbacks. Many large users relocate images to their internal registries. As I mentioned in the comments above, this leads to potentially unpredictable images names. At the very list, the Operator would need it's own configuration to define the registry. Likely, it would also need some configuration options for tags or a requirement not to change the tags when relocating the image. We are deferring work on this issue, hoping that a pattern will emerge in the Kubernetes community so that we can follow it. For the time being, we offer maximum flexibility to enable different use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design-doc-needed Design doc required before implementation enhancement New feature or request never-stale Issue or PR marked to never go stale
Projects
None yet
Development

No branches or pull requests

6 participants