Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: ADR for kURL Add-on operator #4488

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rrpolanco
Copy link
Contributor

What this PR does / why we need it:

A proposal for managing kURL add-ons via a Kubernetes operaor.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Steps to reproduce

Does this PR introduce a user-facing change?


Does this PR require documentation?

@rrpolanco rrpolanco added type::feature An enhancement to an existing add on or feature type::docs Improvements or additions to documentation labels May 12, 2023
- It will distribution agnostic. I.e. it will not be tightly coupled to the kURL Kubernetes distribution.
- It will manage application add-ons _only_ and _exclude_ managing CRI and CNI resources.
- It will install and confiure the KOTS admin console.
- It can be installed on an existing [kURL](https://kurl.sh/) cluster with only Kubernetes, containerd and flannel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this aspect is still uncertain at this point. If we aim to ensure backwards compatibility with the existing infrastructure, could potentially add more complexities. This approach would require testing a combination of the old and new components to ensure compatibility and identify any potential challenges or conflicts that may arise.

IMO to reduce the complexities I think we could be targeting that this new solution would be tested and be used to target a fresh instance of vanilla Kubernetes, such as k0s, kind, or minikube and ensure that it can install/manage a small set of the AddOns only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the ADR mentions backward compatibility. The phrasing seems correct: as long as the kURL installation has enabled only Kubernetes, Containerd and Flannel it should work. On that regard a kURL with Kubernetes, Containerd and Flannel is not different from k0s, kind or minikube.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got this objective from @emosbaugh add-on operator proposal doc. I think the point here is that it should work on a kURL Kubernetes minimal distro.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricardomaraschini and @rrpolanco,

If we intend to use kURL as a foundation, it is crucial that we clearly define our objectives and the limitations we are willing to accept. We should strive for a clear and concise definition to avoid complications associated with ensuring backward compatibility.

Otherwise, we will expend the scope and be comitted with a lot of complexities.


### The Controllers

The operator will have at minimum one controller which ensures the `Installer` CR is reconciled with the cluster periodically. The `Installer` controller (main controller) will delegate responsiblity for managing add-ons to other controllers. The reconcilation for the main controller will consist of determining the add-ons to be managed from the `Installer` CR and creating a controller for each add-on (addon-on controller).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot make a controller create a another controller. I think what you want to say would be like

Suggested change
The operator will have at minimum one controller which ensures the `Installer` CR is reconciled with the cluster periodically. The `Installer` controller (main controller) will delegate responsiblity for managing add-ons to other controllers. The reconcilation for the main controller will consist of determining the add-ons to be managed from the `Installer` CR and creating a controller for each add-on (addon-on controller).
The solution involves implementing a set of controllers, the Installer controller, to periodically reconcile the Installer custom resource (CR) with the cluster. The Installer controller will not only handle the reconciliation process but also create custom resources (CRs) to represent the add-ons and processes, such as Storage Migrations, within the cluster.
During reconciliation, the Installer controller will determine the add-ons and processes specified in the Installer CR. It will then create separate CRs to represent each add-on or process. For example, if there is a need to perform a Storage Migration, the Installer controller will create a specific CR to represent that process.
In addition to creating the CRs, the Installer controller will delegate the management of each CR to dedicated controllers. These controllers will be responsible for monitoring and maintaining the desired state of their associated CRs, following the concept of single responsibility.
By adopting this approach, the operator can ensure that the Installer controller creates and manages the necessary CRs, representing add-ons and processes like Storage Migrations, while allowing separate controllers to handle the management and enforcement of the desired state for each specific component.


- How to allow full configuration of add-ons for advanced vendors/users?
- Do we use an existing feature rich Helm controller like the one provided by [Flux](https://fluxcd.io/flux/components/helm/) or build our own?
- How to structure the add-on to adhere to SOLID principles?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind:

Our proposed solution seems to be aligned with the SOLID principles by adopting a one custom resource (CR) and one controller per component approach. This design ensures that each component has clear separation and single responsibility, promoting modularity and maintainability throughout the codebase.

However, it's either important to be aware that our solution can still be seen as an "umbrella operator" since it encompasses multiple solutions within a single project. As we introduce more AddOns or components, it's crucial to be aware of the potential complexities that may arise, including performance issues and others bad side effects.

To address these concerns, in my POV we need keep in mind the requirement to properly encapsulate the logic of each component. By encapsulating the functionality effectively, we create the potential for future decoupling of AddOns and/or Processes into separate sub-projects (other Operators, for example) if necessary. This modular design approach provides the flexibility to manage each component independently, enabling scalability and adaptability as our solution evolves.

What do you think about it? Do you think that address your concern over this item?
If not, could you share your concerns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rrpolanco I want to emphasize my opinion that adopting the Flux approach to address our requirements, such as smooth upgrades, migrations, and specific automations, would lead to a situation where we would need to manipulate the responsibilities of the Flux Operator. This would result in having two operators overlapping their responsibilities, which goes against the SOLID principles.

I have made efforts to highlight this point in various places throughout this ADR, as well as in our communication channels and other documentation.

c/c @ricardomaraschini

## Open Questions

- How to allow full configuration of add-ons for advanced vendors/users?
- Do we use an existing feature rich Helm controller like the one provided by [Flux](https://fluxcd.io/flux/components/helm/) or build our own?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Do we use an existing feature rich Helm controller like the one provided by [Flux](https://fluxcd.io/flux/components/helm/) or build our own?
- Do we use an existing feature rich Helm controller like the one provided by [Flux](https://fluxcd.io/flux/components/helm/) or should we use pure Helm and do not add extra generic third-party solutions as dependency to manage the charts?

I think we never will build Flux because the only functionality that we need is manage Helm Charts itself right.

- Chart values override\
- Pull charts from Helm repository directly

_Develop our own Homebrew Helm Controller_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_Develop our own Homebrew Helm Controller_
_Using pure Helm without add any third-party as dependency_

I can not agree with (Develop our own Homebrew Helm Controller). Even though the Helm functionality we require is limited to managing the Helm charts themselves (installing, uninstalling, and upgrading), which currently consists of just one file in the prototype, all the other requirements still remain in place if we choose to use Flux. If we need to pull charts from helm repo would not it either provided by helm lib as dependency as a feature?

Indeed I am not convinced yet if we adopt Flux it will remove 100% the need of we interact directly within Helm and that brings concerns in POV. Because we will end up in scenarios where we need to interact with Helm directly to for example automate bugs, error, and fix that was not sorted out by the generic flux logic which would bring a lot of complexities regards race conditions for example.

Therefore, it seems for me that the proposals solutions here are: Use Flux to manage HelmCharts (install/uninstall/upgrade) vs using pure Helm as dependency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like helm does not update crds

https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations

and thus flux has added logic and policies around crds

https://github.com/fluxcd/helm-controller/blob/main/internal/runner/runner.go#L229C31-L230

I cant say what other complexities we will find when trying to implement this ourselves and thus it may be best to offload this to flux

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally flux has built in test support for validating releases. This could be beneficial

https://github.com/fluxcd/helm-controller/blob/main/internal/controllers/helmrelease_controller.go#L436

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emosbaugh,

Flux is who manages and that is the whole issue.
It means we are NOT allow to prevent the solution be uninstalled since we are NOT responsible for manage it.
Therefore, see that HelmController is generic and flux never will allow us to do our specific checks prior allow or not solution get uninstalled. Example: is all PVC migrates? Have you other object store installed and etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these comments I am only listing the benefits of using flux. I understand you point about uninstalling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore, see that HelmController is generic and flux never will allow us to do our specific checks prior allow or not solution get uninstalled. Example: is all PVC migrates? Have you other object store installed and etc.

I don't understand why we can't perform checks like PVC migrations and object store migrations before uninstalling via Flux. Flux won't remove the object store until we tell it too, why can't we do our migrations, verifications, and then tell flux to remove it when we're ready?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-sanders @emosbaugh @ricardomaraschini @rrpolanco @laverya:

The primary issue with this approach, as per SOLID principles, particularly the Single Responsibility Principle (SRP) and the Liskov Substitution Principle (LSP), is that each class (or in this case, operator) should have one, and only one, reason to change, and that subclasses (or dependent services) must be substitutable for their base classes without altering the correctness of the program.

If we were to start manipulating Persistent Volume Claims (PVCs), or any resources managed by Flux as part of its reconciliation loop, it essentially means that our custom operator would be taking on more than one responsibility. This would violate the SRP, as our operator should only be concerned with its unique tasks, while Flux is tasked with managing the Helm charts.

In addition, by performing operations on resources managed by Flux before Flux has completed its own tasks, we're violating the LSP. Flux and our operator can be seen as base and derived classes, respectively. By interfering with Flux's responsibilities, we risk introducing errors into the system, because our operator would no longer be a drop-in replacement for Flux.

In a real-world scenario, imagine a race where both operators want to update or modify the PVCs. If we migrate and validate the PVCs independently and then allow Flux to remove them, there's a chance Flux could overwrite our changes in its reconciliation loop. This can lead to inconsistencies in the state of resources, potentially causing conflicts, data loss, or system instability. In essence, any changes we make could be nullified by Flux, wasting resources and leading to an endless cycle of modifications. Therefore, it is critical to respect each operator's boundaries and allow them to manage their designated resources independently.


Proposed

## Open Questions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following would be the questions regarding Flux that I have in mind and would be required to answer before we move forward in this direction.

Automation Limitations
Despite its automation capabilities, are there any aspects of the deployment process that Flux may not be able to automate effectively, requiring manual intervention or additional tooling.

  • Therefore, will adopting Flux truly eliminate the need to use Helm as a dependency for functions such as installation, upgrade, uninstallation, or checking the status of a release?
  • Or will we end up with both implementations, adding unnecessary complexities?
  • If we need to automate those manual interventions, what complexities related to race conditions and concurrency issues could we potentially face?

(limitation) Unable to properly automate the operations for each specific AddOn solution. The controller is generic and may not work effectively with all HelmCharts. By implementing our own solution per Addon, we can automate solutions based on the specific requirements of each Addon. This allows us to address specific needs and requirements more accurately, as a generic implementation will not perform in the same way. It's important to note that a generic automation for a HelmChart will never be able to handle specific issues, workarounds, or tailor solutions that are commonly done in KURL for what we ship.
(high risk of failure) It appears that Flux will NOT eliminate the requirement for direct interaction with Helm. Consequently, this will necessitate handling complexities, such as race conditions, that are computationally intensive and challenging to manage.

Flux as a Dependency:

  • Can Flux be effectively used as a dependency within the Operator? Is it possible to create a Flux CR from the Operator and interact with Flux's internal functionality? Or are not they allowing importing their APIs? Is their code implementation exported for usage or not?

NOTE: We would need to import: https://github.com/fluxcd/helm-controller/tree/main/api/v2beta1

What happens when we need bug fixes and RFEs if we adopt flux?

  • If we face issues within the project, will Flux provide support for such cases, and if so, what areas do they support?
  • In the event that we need a feature enhancement or bug fix in Flux to address our specific needs, how will that process work?

NOTE: It has a lot of Open issues only related to the HelmRelease which will need to use if we move in this direction. This implies that we would need to find workarounds or take responsibility for maintaining the Flux project ourselves. This requires time and effort, and there is a risk that our contributions may not be accepted, preventing us from releasing solutions to address our needs. As a result, we would find ourselves in a difficult position, having to implement the same functionalities as pure Helm to work around issues and deal with a lot of additional complexities hard do deal with, all only because we adopted Flux.

Flux Features

  • Which specific features provided by Flux would be advantageous to adopt in our project?

IMO: None so far was found.

  • What are our requirements that would be fulfilled by adopting Flux? In other words, what functionalities would we not need to implement separately due to Flux integration?

IMO: None so far was found.

  • Are there any scenarios where using Flux could create limitations or challenges in managing HelmCharts or maintaining control over them?

Yes. Following a few of them:

  • Limitations to properly use finalizers and prevent the Helm solution be uninstalled without the required operations that would need to get done occurs prior that More info
  • There are limitations in controlling Helm to effectively address and automate various requirements. In numerous scenarios, direct usage of Helm becomes necessary, such as when working around an issue or automating a specific need. However, this can lead to challenging situations, as we may encounter concurrency and race condition issues that are difficult to handle

Complexity:

  • While FluxCD provides a powerful feature set, configuring and using it can be complex. How much effort will be required to integrate and become proficient in Helm + Flux?
  • Considering the extensive features offered by Flux, do we truly need all of them, or would only a subset be sufficient? Will the integration justify the learning curve and potential limitations?
  • How complex will it be to test the solution with this dependency in local development and CI environments?

Compatibility:

  • Which Kubernetes versions are supported by Flux? Are there any limitations?
  • What are the supported Helm versions and the extent of Helm features that Flux can handle?
  • Can Flux handle Helm hooks effectively?
  • How does Flux manage its release lifecycle, and what are the support policies?
  • Have Flux any limitation regards what is provided by Helm?


_Develop our own Homebrew Helm Controller_

This approach affords us with a controller that meets our specific needs and avoids any third party software dependency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the advantages here are:

Suggested change
This approach affords us with a controller that meets our specific needs and avoids any third party software dependency.
- **Simplified Architecture**: By not incorporating Flux, we can simplify the architecture of our Operator. It reduces the number of components and dependencies involved, streamlining the overall system design.
> By adopting Flux, there is uncertainty about whether it completely eliminates the need to use Helm as a dependency. This uncertainty arises from the possibility of ending up with a design solution that requires both Flux and Helm to be used together. In such a scenario, the advantages of using Flux as a standalone solution may be limited if Helm is still required for certain functionalities. Indeed, if the design solution requires both Flux and Helm as dependencies, additional complexities, including race conditions and concurrency issues, may arise. Managing the interactions and synchronization between Flux and Helm components can introduce challenges related to concurrent modifications, conflicting actions, and maintaining the desired state of the deployment.
- **Reduced Learning Curve: Utilizing Helm directly as a dependency means we can leverage existing knowledge and experience with Helm and we do not need to either Flux experts. This reduces the learning curve of our team, as they can continue using familiar Helm concepts and workflows.
- **Fine-Grained Control:** Managing Helm directly allows we to have fine-grained control over the deployment process. We can customize and adjust Helm commands, flags, and options according to our specific requirements without the abstraction layer introduced by Flux.
- **Flexibility in Tooling:** Without Flux, we have the flexibility to choose and integrate other tools and solutions into our deployment pipeline as needed. We are not limited to the capabilities and features provided solely by Flux.
- **Direct Helm Features:** Helm offers a rich set of features, such as release management, versioning, and dependency resolution. Utilizing Helm directly allows us to leverage these features to their fullest extent, tailoring them to our specific use cases.
- **Compatibility:** Helm provides support to be consumed as lib and has a large user community that does. Using Helm directly ensures compatibility with the broader Helm ecosystem and the ability to easily integrate HelmCharts from various sources.
- **Community Support:** Helm has an active and supportive community, providing resources, documentation, and community-driven enhancements. We can rely on the Helm community for assistance, knowledge-sharing, and issue resolution.
- **Simplified Development Environment**: Using only Helm as dependecy Developers can deploy the operator without any extra configuration and pre-setup. The only requirement would be to have a kubernetes cluster and Golang reducing the learning curve and enabling faster development iterations. Same aspect seems to be applied to the CI/CD and to implement tests for the solution.

- Stable [HelmRelease API](https://fluxcd.io/flux/components/helm/api/)
- Helm release [dependencies](https://fluxcd.io/flux/components/helm/helmreleases/#helmrelease-dependencies)
- Support for Helm install, upgrade, test, rollback and uninstall
- Performs Helm uninstall on `HelmRelease` CR removal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, the fact that Flux uninstalls the HelmChart when a flux HelmRelease is deleted is expected but not an advantage in our case and I think it should be considered a drawback. This scenario exemplifies the limitations and complexities that may arise when adopting Flux, potentially restricting our control over our solutions according to our specific needs. It restricts our ability to control the deletion process and ensure that essential operations are completed before uninstalling the solution.

TL'DR': WHY?
The fundamental purpose of our Operator is to establish a control-loop that triggers whenever a change occurs. Its primary objective is to ensure, through a consistent and idempotent set of operations, that the desired state matches the current state. This control-loop serves as a mechanism for continuously monitoring and maintaining the system in line with our defined configurations and expectations.

For instance, if a user mistakenly attempts to delete an AddOn CR, we invoke a finalizer function that ensures the AddOn CR is only deleted once all necessary operations, such as uninstalling the HelmChart, have been successfully completed. You can refer to this example in our prototype here.

However, when using Flux, if a user accidentally deletes the HelmRelease CR that Flux uses for managing Helm releases, it will uninstall the HelmChart solution through the FluxHelmController. In such a scenario, our control-loop can only recreate the Flux HelmRelease but cannot prevent this situation or guarantee that required backups or migrations, for example, have occurred, as we could if we relied solely on Helm. Unless would we be able to say to Flux that it can only uninstall the HelmRelease when a flag that it can be done by adding some info into its HelmRelease CR to prevent this scenario. Is that possible with Flux?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide an example when it's beneficial to not delete the CR? I can't think of a situation and it would be helpful to understand why you see it beneficial to write custom deletion code ourself.

I'd like us to focus on specific issues and benefits. If this is all based on potential limitations that's not sufficient to make the decision. Do we see limitations or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-sanders

The question isn't about choosing not to delete the Custom Resource (CR), but about ensuring that all necessary precautions for a safe uninstallation are undertaken before deleting a CR, which represents an Add-On and its corresponding solution in the cluster.

For example, consider a scenario where the OpenEBS Kurl/Installer CR is deleted manually or via any other solution, whether by mistake or intentionally. Before proceeding with the Helm uninstallation of OpenEBS, we could guarantee that an alternative storage solution is already in place and that the volume data has been successfully migrated to this new solution. Otherwise, the OpenEBS CR will not get deleted unless you explicitly force it manually. Furthermore, we could for example ensure that backup solution should be employed to store a copy of the data. This will allow data recovery via support, if necessary.

In essence, the focus is not on whether or not to delete the CR but rather on ensuring that all the required safety checks and procedures are completed prior to its deletion.

Same is valid if someone try to delete the namespace where the AddON CR leaves. By using finalizers the namespace cannot get deleted prior all resources on it be first get deleted. To better understand you can check the how finalizers work in the k8s docs: https://kubernetes.io/docs/concepts/overview/working-with-objects/finalizers/ and see how the k8s api and us building an Operator would leverage in the cascading delation: https://kubernetes.io/docs/tasks/administer-cluster/use-cascading-deletion/

_Use the Helm Controller provided by Flux_

The [Flux Helm Controller](https://fluxcd.io/flux/components/helm/helmreleases/) affords us the following benefits:
- Stable [HelmRelease API](https://fluxcd.io/flux/components/helm/api/)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by a "stable API"? Since it is labeled as v2beta1, it doesn't appear to be stable. Being in the beta phase implies that there could potentially be breaking changes. Therefore, it is possible for breaking changes to occur.

That is a great point for we check, how often does Flux introduce breaking changes in their releases and if that should or not be an concern for us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this. I got it mixed up with the GitRepository API. But the point about relying on a beta API still stands.


The [Flux Helm Controller](https://fluxcd.io/flux/components/helm/helmreleases/) affords us the following benefits:
- Stable [HelmRelease API](https://fluxcd.io/flux/components/helm/api/)
- Helm release [dependencies](https://fluxcd.io/flux/components/helm/helmreleases/#helmrelease-dependencies)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed a cool feature, but it appears for me to be more suitable for working within tooling environments rather than building an Operator and our use case. I would think that would be very nice if we have only a tooling.

Consider the trade-off between to create a HelmRelease CR with dependencies versus directly invoking Helm to install the dependencies before deploying the HelmChart.

By using pure Helm we have greater control over the order and state of the dependencies before adding our HelmChart. However, with a HelmController managing the dependencies, we may face uncertainties about its reliability. In the event of a problem with a dependency, it raises concerns about how we can work around such issues. If we attempt to interact with Helm directly to address potential pitfalls, it can introduce complexities such as race conditions that are challenging to handle effectively in any project.

IHMO: In summary, while the HelmController feature offers convenience within tooling environments, it may not provide the same level of reliability and control as using Helm directly when we are building an Operator. We need to carefully consider the implications, potential workarounds, and complexities that may arise when managing dependencies through a HelmController in our specific project context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, using another tool does mean loss of some control. It also means less effort to implement and maintain going forward. Our decision needs to be based on both and just stating that it's more precise without addressing the implementation and maintenance costs is not addressing the full story. Our dev time is limited, loosing precise control in favor of not having to implement and maintain is one of our goals moving to a new implementation.

My concern here is, why would we want to manage dependencies explicitly that doesn't sound like an advantage to me unless you have examples of bad decisions flux has made doing this that we need to implement in another way.

The [Flux Helm Controller](https://fluxcd.io/flux/components/helm/helmreleases/) affords us the following benefits:
- Stable [HelmRelease API](https://fluxcd.io/flux/components/helm/api/)
- Helm release [dependencies](https://fluxcd.io/flux/components/helm/helmreleases/#helmrelease-dependencies)
- Support for Helm install, upgrade, test, rollback and uninstall
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our prototype implementation here, we have demonstrated the ability to interact with Helm for installing, uninstalling, upgrading, retrieving release information, and converting the release status for a HelmChart. This functionality is achieved with pure Helm, utilizing a single file with just a few lines of code. Since the logic is still handled by the Helm dependency, the implementation remains concise.

For example, with pure Helm we would not only call Helm dep as the following code to do a rollback:

func helmRollback(releaseName string, version string) error {
	// Set up the Helm action configuration
	settings := cli.New()
	actionConfig := new(action.Configuration)
	err := actionConfig.Init(settings.RESTClientGetter(), settings.Namespace(), os.Getenv("HELM_DRIVER"), debug)

	if err != nil {
		return fmt.Errorf("failed to initialize Helm action configuration: %w", err)
	}

	// Create the Helm rollback action
	rollbackAction := action.NewRollback(actionConfig)
	rollbackAction.Version = version

	// Perform the rollback operation
	_, err = rollbackAction.Run(releaseName)

	if err != nil {
		return fmt.Errorf("failed to perform rollback operation: %w", err)
	}

	return nil
}

So, do you see anything that would be difficult for us to achieve by calling Helm directly, and could Flux provide it for us? Am I overlooking any complexities that Flux handles in these scenarios that I haven't considered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anything valuable we find in the helmrelease_controller.go i assume we would have to build ourselves

https://github.com/fluxcd/helm-controller/blob/main/internal/controllers/helmrelease_controller.go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emosbaugh We'll need a controller in any scenario to manage statuses, finalizers, and so on. Flux, while a robust solution, is generic and therefore doesn't address our specific needs and requirements.

Our objective is highly targeted, and it's important to remember that we can't attempt to update or delete any resource created and managed by Flux. Doing so would trigger the control-loop which could lead to unintended consequences. In many scenarios, this could result in Flux re-executing all operations to ensure its processes are maintained, as that's what a control-loop is designed to do. Consequently, the issue we tried to resolve might resurface. Alternatively, we could encounter race or concurrency conditions, as both we and Flux would be trying to update/change the same resources simultaneously.

Flux isn't tailored to cater to specific needs, like what we're seeking here, nor is it designed for the kind of integration and usage we're proposing. Its main function is to provide a tool capable of automating the process of installing, updating, or uninstalling a HelmChart, without considering any additional requirements. For example, it wouldn't accommodate a scenario where Kots is configured with N, implying that X must be present. In situations where you try to uninstall X without Y, it won't work. Flux doesn't account for these kinds of complexities.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use flux, I suggest we never modify the Kubernetes resources themselves and instead modify values through the HelmChart CR. If we ever had to we can use the suspend property of the CRD.

Flux is a reconciliation loop. If one chart has a precondition it will fail until that precondition is met and then succeed.

I feel the beauty in the flux helm controller is it is built to do one thing only and that is manage the lifecycle of helm charts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anything valuable we find in the helmrelease_controller.go i assume we would have to build ourselves

https://github.com/fluxcd/helm-controller/blob/main/internal/controllers/helmrelease_controller.go

I think this is a really insightful comment. Has anyone reviewed the controller? I'm curious how much of this controller do we think we need to implement for a robust solution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emosbaugh,

If we use flux, I suggest we never modify the Kubernetes resources themselves and instead modify values through the HelmChart CR. If we ever had to we can use the suspend property of the CRD

If we choose to adopt Flux, it would imply that we could no longer directly install, delete, or update HelmCharts or Custom Resource Definitions (CRDs) in our controllers. Additionally, we would not be able to manage any resources created and managed by Flux during the installation of these HelmCharts.

However, it's important to note that our business requirements necessitate certain actions. These include ensuring seamless migrations, implementing workarounds for specific issues, and automating steps that are presently manual and only applicable within our particular operational context to guarantee smooth upgrades. Consequently, it's crucial that we retain control over these processes. While Flux is a powerful tool, it doesn't cater to the kind of specific use-cases that we often need to address. Therefore, we need a solution that accommodates our unique needs while providing comprehensive control over our resources.

If we try to manipulated the same resources then you are proposing a solution that violates the SOLID principle and suggests the use of two overlapping Operators, it would introduce potential challenges related to race conditions and concurrency. It's important to note that having two instances of the same Operator is not feasible due to these reasons, and the idea of having two separate Operator solutions responsible for the same tasks clearly goes against the Single Responsibility Principle, as per SOLID principles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chris-sanders,

I've checked a little the helmrelease_controller.go file from FluxCD's Helm Controller. Then, for me although it contains a wide range of features, not all shows that are relevant to our specific needs, and crucially, it doesn't address the unique requirements that we have. As a result, our implementation to meet our business requirements—striving for full automation to minimize human intervention, and utilizing a reduced, opinionated specification—would IHMO not mirror Flux's approach.

If the concern is about the potential effort involved in developing a similar solution, could we not consider Flux's implementation as a reference point? If our paths do align—which I currently do not foresee see too much beyond mainly installing/uninstalling/deleting HelmCharts—Flux's code could serve as a useful guide to expedite our progress. WDTY?

Following a couple of points.

Flux reconciliation logic is not the same approach proposed in the prototype

The approach of our prototype using Helm for reconciliation significantly differs from Flux's diff-based approach (see here). Instead of relying on diffing the current cluster state against the desired state, our prototype takes a different approach as proposed here.

In our prototype, we proposed directly handling reconciliation based on the desired state specified through user input into the Installer/Kurl CR, our defaults, and specific HelmChart versions. Unlike Flux's Diff() function, which shows to calculates the changes needed to reconcile the current state of Kubernetes resources with the state defined in the Helm release, our custom Operator avoids this diffing process.

The diff approach shows that can bring for example:

  • False Positives: Diffing relies on comparing the current cluster state with the desired state defined in configuration files. However, this comparison may result in false positives, indicating differences that are not actual drift. This can lead to unnecessary and potentially disruptive corrective actions.
  • Continuous Drift: Depending solely on diff-based drift detection and correction may result in a continuous cycle of corrections, causing unnecessary system churn and instability.
  • Overwriting Manual Interventions and Customizations: The diff approach shows that cannot consider manual changes or configurations made to the cluster, leading to unintended consequences when the system attempts to automatically correct these changes. This can result in disruptions or loss of customized configurations that could not be addressed via the chart values for example.

Managing Dependencies and Pre-requirements/Conditions

In our custom solution using pure Helm, we prioritize validating that all pre-requisites for installing, uninstalling, or deleting an AddOn are satisfied for each specific scenario. We don't need to rely on a generic mode, which might not comprehensively cover our needs. For instance, while handling dependencies, Flux merely checks if a HelmRelease is installed and declares success (see here).

When we were doing our prototype and demo, we checked that there are circumstances where shows that a HelmRelease might report that a solution is deployed when, in fact, it isn't fully functional or installed properly. In our controller, we wouldn't manage things this way. Instead, we can prioritize validating if ANY specific requirements in line with our business rules and customer demands - are not just deployed, but also HEALTHY and aligned with our particular opniated spec configurations, before performing any next step operation. This degree of detailed health-check is beyond Flux's capabilities for any pre-condition addon or dependency. As a result, our custom solution could provides superior assurance and control and grow according with the problems faced by the users.Therefore the custom approach has the potential to:

  • mitigate scenarios where an attempted solution installation results in an unhealthy cluster state, often necessitating manual intervention—a common issue with the current KURL and a pain point for our users.
  • moreover, we would be able to effectively monitor and report status conditions, indicating when pre-requisite conditions are not met, as proposed here. Thus, our approach using pure Helm enhances stability, user experience, and transparency of system status which is not achievable by adopting Flux.
  • prevent scenarios where an attempted solution installation could lead to an unhealthy cluster state --common user feedback and request. This minimizes the need for manual intervention, a significant issue with the current KURL and a recurrent user concern.

Tracking the current state of the solutions

As previously emphasized, Flux is a generic solution and, consequently, lacks the capacity to comprehensively verify whether each specific Addon installation or upgrade has been successfully completed and is functioning healthily, particularly considering the unique intricacies of each specific solution.

In contrast, by retaining full control over our operations, we could perform granular checks on pre-requisites for each specific Addon, ensuring their individual compatibility and readiness within our specific system. This ties into the approach proposed here. That can enables us to gain granular visibility into system health, pinpoint any issues, identify their causes, and rectify them effectively.

Moreover, as demonstrated in our prototype, using pure Helm enables us to retrieve the status from Helm and convert it into our custom Status conditional structure with just a few lines of code.

Remove the complexities to deal and parse Flux logs

In contrast, Flux would require us as discussed yesterday and shared for those that have been working with in many scenarios parse LOGS and not only monitor its CRS statuses, introducing a new layer of complexity. In POV, it would necessitate significantly more code implementation and maintenance.

Let's now speak about FLUX finalizers implementation

The Flux Controller handles the finalizer, a method that prevents deletion of a Custom Resource (CR) representing the solution on the cluster or the corresponding namespace where the resource is installed. It operates by simply verifying whether Helm has reported that the Chart was uninstalled, as evidenced here.

On the other hand, when using our custom Operator with pure Helm, we have the flexibility to perform additional operations prior to deletion. This allows us to prevent unintended consequences such as data loss, which is discussed in greater detail here. The use of Helm gives us more granular control over the lifecycle of our resources, a level of detail not provided by Flux's approach.

- Helm release [dependencies](https://fluxcd.io/flux/components/helm/helmreleases/#helmrelease-dependencies)
- Support for Helm install, upgrade, test, rollback and uninstall
- Performs Helm uninstall on `HelmRelease` CR removal
- Chart values override\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See that with pure Helm we just need to pass the chart values to install/upgrade and we achieve the same, see here in the prototype.

So, we could either receive the input from the Installer CR and just pass as map for it, i.e.:

	// Override chart values
	chart.Values = valueOverrides

	// Perform the install or upgrade operation
	if releaseExists(releaseName, actionConfig) {
		_, err = upgradeAction.Run(releaseName, chart)
	} else {
		_, err = installAction.Run(chart, nil)
	}

So, I either do not see it as a advantage. Am I overlooking any complexities that Flux handles in these scenarios that I haven't considered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good example where the prototype implementation is only a few lines but flux has a much more comprehensive implementation. What's the difference here, is the flux method better and addressing thing we haven't yet addressed or is it more complicated but functionally the same for our use case?

A second question, it looks like in the above prototype code, we replace the chart values with valuesOverrides. Is that true? The intention would be to mergeOverride, IE only override values that are set not replace all values with a new set of values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I agree with @chris-sanders.

@emosbaugh what are the circumstances that we need to composeValues? Could you share why do you think that we need to do it?

- Support for Helm install, upgrade, test, rollback and uninstall
- Performs Helm uninstall on `HelmRelease` CR removal
- Chart values override\
- Pull charts from Helm repository directly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems nice but shows either very easy to do with pure Helm, following the Golang code to do it:

import (
	"fmt"

	"helm.sh/helm/v3/pkg/getter"
	"helm.sh/helm/v3/pkg/repo"
)

func pullChart(repository, chartName, chartVersion, destination string) error {
	// Create a new repository object
	r := &repo.ChartRepository{
		Name: repository,
		URL:  repository,
	}

	// Set up the repository
	settings := cli.New()
	repoFile := settings.RepositoryConfig
	chartPath := repoFile.Cache

	// Load the repository index
	index, err := r.DownloadIndexFile(chartPath)
	if err != nil {
		return fmt.Errorf("failed to download repository index: %w", err)
	}

	// Find the chart in the repository index
	chartVersion, err := index.Get(chartName, chartVersion)
	if err != nil {
		return fmt.Errorf("failed to find chart %s: %w", chartName, err)
	}

	// Initialize the downloader
	downloader := &downloader.ChartDownloader{
		Out:              os.Stdout,
		RepositoryConfig: repoFile,
		RepositoryCache:  chartPath,
		Getters:          getter.All(settings),
	}

	// Download the chart
	filename, err := downloader.DownloadTo(chartVersion, destination, false)
	if err != nil {
		return fmt.Errorf("failed to download chart: %w", err)
	}

	fmt.Printf("Chart downloaded to: %s\n", filename)

	return nil
}

Am I overlooking any complexities that Flux handles in these scenarios that I haven't considered?

persistence:
enabled: true
replicas: 1
mode: standalone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I like this one. We can only:

	var data map[string]interface{}

	err := yaml.Unmarshal([]byte(yamlData), &data)
	if err != nil {
		fmt.Printf("Failed to parse YAML: %v", err)
		return
	}

And then we will have the map[string]interface{} variable to install, upgrade the HelmChart with pure Helm either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll need to merge this config with whatever our default config is, too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to reconcile the recently proposed versioning scheme with this API.

- It will install and confiure the KOTS admin console.
- It can be installed on an existing [kURL](https://kurl.sh/) cluster with only Kubernetes, containerd and flannel.
- The installation mechanism for add-ons will be through the use of Helm Charts.
- By default the add-on configuration will be opinionated by us but we will enable the end-user to override the configuration for maximum flexibility. Vendors and users will be able to override the configuration using a standard [Helm values](https://helm.sh/docs/chart_template_guide/values_files/) file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need either call down the complexities that we will have within just for we are all aware of:

  • a) Users might add invalid data we will need to validate those inputs
  • b) Upper version of HelmCharts might need new values or no longer work with values informed previously which means that is a scenario that we need to handle into upgrades
  • c) Some AddOns configuration might not work well with other AddOns configuration, that is something that is very complex to prevent or test out. We either will not able to test all possible values combinations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Let me add one thing: we are doing a lot of effort to ensure we can guarantee seamless cluster upgrades and IMHO to expose values.yaml would render the same situation we have nowadays. We won't be able to test all possible combinations during upgrades.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can foresee the pitfalls with allowing users to pass/override the values for an add-on. However, there is a need to allow end-users to override certain add-on configs for e.g. Rook (filters for OSD devices) and containerd (config for GPU support) to name a few. We can either i) restrict what configs can be overidden or ii) allow unmitigated access to override values but with the caveat that users "proceed at their own risk".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rrpolanco, both @ricardomaraschini and I seems agree with your perspective We can either i) restrict what configs can be overidden or ii) allow unmitigated access to override values but with the caveat that users "proceed at their own risk"..

Considering the inherent risks and the specific business requirements that kURL aims to address - such as facilitating seamless upgrades, automating workarounds, and minimizing complexities for users - it's crucial to underscore the potential challenges that arise when users are permitted to add any chart value.

This flexibility comes with its own set of complexities: verifying the validity of these values, ensuring compatibility between various solutions, and handling possible conflicts. For instance, accommodating both OpenEBS and Rook in the same cluster with any given configuration could pose challenges and that we cannot test all possible combinations. Furthermore, we need to address what happens during upgrades when a particular value is no longer valid.

Balancing user freedom and system stability is a delicate act, and it's paramount that we consider these factors to avoid complicating the user experience or introducing instability to the system.

Then, I think we can propose this solution and call out all complexities and problems inherent to this approach as add an alternative option where we limit the possible options within the pros and cons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kURL cannot gatekeep add-on configuration or else we will fall into the same scenario with the current kURL offering which is that we have to release a new feature every time a vendor/end-user wants to configure something outside of the normal use case. E.g. configuring GPU support for containerd.
I think a reasonable approach is to document our opinionated (and supported) configuration (values.yaml) for each add-on and allow vendors/users to overwrite them with the warning that their applied config may not be supported (i.e. they can proceed but at their own risk).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am totally OK with that.
I just think that in the ADR we should explicitly describe the limitations and potential problems. Just that. we need to keep it write down.


## Consequences

- A kURL CLI will be needed to install the operator dependencies (if any)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To install the Operator we only need the k8s resources (which can be shipped in a single file and used via raw) and the Operator image.

Therefore, besides be very cool and not so hard we have a CLI, we have a amazing prototype for that already is not be necessary to have a CLI to provide the Operator, see some options:

We can do such as Prometheus Operator does

Also, they allow download from the release page, see: https://github.com/prometheus-operator/prometheus-operator/releases/download/v0.65.1/bundle.yaml

  • We can either have a small shell script to just ship the Operator
  • We can either provide a tar with the instructions to untar and install

This approach affords us with a controller that meets our specific needs and avoids any third party software dependency.



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add the image here? WDYT? Or any other image/diagram if you be able to do a better one :-)

Screenshot 2023-04-28 at 10 05 22

Suggested change
<img width="776" alt="Screenshot 2023-04-28 at 10 05 22" src="https://user-images.githubusercontent.com/7708031/235105356-2d499fcd-8089-44f8-9a90-77547d6cb5cc.png">


_Use the Helm Controller provided by Flux_

The [Flux Helm Controller](https://fluxcd.io/flux/components/helm/helmreleases/) affords us the following benefits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One advantage I see with Flux is that it provides a code implementation for reconciling HelmCharts and includes watches and finalizers, as demonstrated in their helmrelease_controller.go file on GitHub.

However, it comes with the limitation that we cannot modify it according to our specific needs. The controller is generic and may not work effectively with all HelmCharts. By implementing our own solution per Addon, we can automate solutions based on the specific requirements of each Addon. This allows us to address specific needs and requirements more accurately, as a generic implementation will not perform in the same way. It's important to note that a generic automation for a HelmChart will never be able to handle specific issues, workarounds, or tailor solutions that are commonly done in KURL for what we ship.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the controller does have a dependsOn property

https://fluxcd.io/flux/components/helm/helmreleases/#helmrelease-dependencies

i can see us taking advantage of this in conjunction with helm tests for creating pre-requirement checks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important to note that a generic automation for a HelmChart will never be able to handle specific issues, workarounds, or tailor solutions that are commonly done in KURL for what we ship

Can you provide an example of what you mean by this? Right now it's just a high level specific > generic and I'm not clear that's true in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-sanders,

Can you provide an example of what you mean by this? Right now it's just a high level specific > generic and I'm not clear that's true in this case.

Flux, being a generic solution, is engineered to manage a wide variety of scenarios and use cases, offering a standardized methodology. However, in KURL, our goal is to automate the process and resolve specific scenarios, ensuring our users can install and upgrade AddOns seamlessly. These operations are highly specific to the AddOns and their respective versions.

So, to better elucidate my concerns, here are a couple of examples:

Automating Workarounds to Handle Bugs and Issues

In the following KURL code, we delete resources as a workaround to certain issues: Link to KURL code

However, with Flux, situations like these could trigger the Flux control loop to run again, potentially overwriting our changes since it's designed to ensure the HelmChart is properly installed. Another risk is that if both we and Flux try to update and monitor the same resources, we could encounter race conditions.

Upgrading Versions and Handling Breaking Changes

Upgrading AddOn solutions can lead to breaking changes. Flux, being a generic solution, may not be equipped to handle these breaking changes that require a more nuanced approach. For instance, we sometimes need to scale down the rook operator to implement changes.

I hope that clarifies my concerns.

_Develop our own Homebrew Helm Controller_

This approach affords us with a controller that meets our specific needs and avoids any third party software dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a sub item here with: https://github.com/replicatedhq/kURL/pull/4488/files#r1192957945 (i.e. Long Term considerations )


The add-on controller will be responsible for
- verifying the requirements for the add-on are met
- managing the add-on itself
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- managing the add-on itself
- managing the add-on itself and automate the specific needs for each AddOn solution

- It will be developed using [SOLID](https://en.wikipedia.org/wiki/SOLID) principles and operator best practices.
- It will distribution agnostic. I.e. it will not be tightly coupled to the kURL Kubernetes distribution.
- It will manage application add-ons _only_ and _exclude_ managing CRI and CNI resources.
- It will install and confiure the KOTS admin console.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo here in configure.

- It will install and confiure the KOTS admin console.
- It can be installed on an existing [kURL](https://kurl.sh/) cluster with only Kubernetes, containerd and flannel.
- The installation mechanism for add-ons will be through the use of Helm Charts.
- By default the add-on configuration will be opinionated by us but we will enable the end-user to override the configuration for maximum flexibility. Vendors and users will be able to override the configuration using a standard [Helm values](https://helm.sh/docs/chart_template_guide/values_files/) file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Let me add one thing: we are doing a lot of effort to ensure we can guarantee seamless cluster upgrades and IMHO to expose values.yaml would render the same situation we have nowadays. We won't be able to test all possible combinations during upgrades.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type::docs Improvements or additions to documentation type::feature An enhancement to an existing add on or feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants