
Allow bundle persistence in cloud native environments #4782

Closed
FredrikAppelros opened this issue Jun 15, 2022 · 25 comments · Fixed by #4786

Comments

@FredrikAppelros
Contributor

FredrikAppelros commented Jun 15, 2022

What part of OPA would you like to see improved?

Make the feature of persisting activated bundles to disk usable in cloud environments.

Describe the ideal solution

While a great feature, the ability to persist activated bundles to disk is not usable in a cloud environment where multiple instances of OPA may be trying to write to the same shared storage, as the current implementation is not synchronized. This could be solved by taking an advisory file lock on the destination file. The lock would be propagated through file systems like NFS, ensuring that only one client is able to write to the file at a time.

Describe a "Good Enough" solution

The ideal solution is trivial enough as is.

Additional Context

To fix this, one would first need to choose a file-locking library, as there is no such facility in the Go standard library. One candidate is this module, which exposes some higher-level functions based on Go's underlying internal implementation.
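For illustration only, here is a minimal sketch of the advisory-lock idea using the Unix-only syscall.Flock wrapper (the third-party module mentioned above provides a portable, higher-level API over the same flock(2) primitive). The path and function name are placeholders, not OPA's actual code:

```go
package main

import (
	"os"
	"syscall"
)

// writeWithLock takes an exclusive advisory lock on the destination file
// before truncating and writing it, so cooperating processes serialize
// their writes. Unix-only; a portable build would use a locking module.
func writeWithLock(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		return err
	}
	defer f.Close()

	// Block until we hold the exclusive lock on this file descriptor.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	// Only one cooperating process at a time reaches this point.
	if err := f.Truncate(0); err != nil {
		return err
	}
	_, err = f.Write(data)
	return err
}

func main() {
	// Hypothetical destination path, for demonstration only.
	_ = writeWithLock("/data/opa/bundles/bundle.tar.gz", []byte("example"))
}
```

Whether such a lock is actually honored across machines depends on the shared file system and its mount options, which is one reason the unique-temp-file approach discussed later in this thread avoids locks entirely.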

@FredrikAppelros
Contributor Author

I believe one could simply replace the call here with this implementation instead.

@Joffref
Contributor

Joffref commented Jun 15, 2022

Do you mean something like this: Joffref@cf2f652

I didn't try it, but just give me some quick feedback if you have time to do so! 💯

@srenatus
Contributor

Thanks for the detailed feature request.

OPA really isn't designed to share a file system with other instances. So while it might work to introduce locking, what about other things like disk persistence, or some future file-based decision logger plugin? Also, having multiple instances share a directory isn't really tested at all.

I think your best bet is to avoid this. Give each OPA instance its own directory.

What is it that you've been trying to achieve here? I don't suppose it's about disk usage, is it? I'm afraid OPA wouldn't watch the persistence dir for updates anyway, so this wouldn't work as a synchronization method... 🤔

@Joffref
Contributor

Joffref commented Jun 16, 2022

Maybe this tool is suited to your use case.
It's designed to manage an OPA cluster, providing bundle consistency, and it can aggregate data from multiple data sources into your OPA.

@FredrikAppelros
Contributor Author

Initially, I thought that could be a solution, yes, but then I looked a bit closer and realized that it would not be enough, since the current persistence mechanism tries to atomically update the persisted bundle file using rename() (or rather renameat()). The scenario that I am afraid might happen is this type of interleaved access:

p1: open and trunc tmp file
p1: write to tmp file
p2: open and trunc tmp file
p1: rename tmp file to dst
p2: write to tmp file
p2: rename tmp file to dst

I believe this could lead to the destination file being both truncated and actively written to, making the atomicity of rename() moot.
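For concreteness, here is a rough Go approximation of the fixed-name temp-file pattern described above (not OPA's actual source; paths and names are illustrative). The O_TRUNC open on a path shared by all instances is what permits the interleaving shown in the sequence:

```go
package bundlepersist

import (
	"os"
	"path/filepath"
)

// persistBundleShared mimics the racy pattern: every instance opens and
// truncates the *same* temporary path before renaming it over the
// destination, so a second writer can wipe data a first writer is about
// to rename into place.
func persistBundleShared(dir string, raw []byte) error {
	tmp := filepath.Join(dir, "bundle.tar.gz.tmp") // identical for all instances

	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
	if err != nil {
		return err
	}
	if _, err := f.Write(raw); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// rename() is atomic, but only for whatever bytes ended up in tmp.
	return os.Rename(tmp, filepath.Join(dir, "bundle.tar.gz"))
}
```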

As I see it, there are two possible solutions to this with some different pros and cons:

Solution 1

Avoid using any file locks, but change the implementation so that each instance operates on its own unique temporary file. This means that each instance can write to its own file without the risk of truncation or corruption, and we then rely on the fact that rename() is an atomic operation. Since we are storing the temporary file in the same directory as the persisted bundle (and hence within the same file system), the guarantee of atomicity should still hold, as long as the file system is mounted properly. A sketch of this approach follows the pros and cons below.

Pros:

  • No file locks needed and hence no new dependencies

Cons:

  • May increase storage requirements, as up to n + 1 times the bundle size must be held when n instances are downloading the bundle at once
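A minimal sketch of Solution 1, assuming illustrative file names and directory layout rather than OPA's actual ones: each instance writes to its own unique temporary file in the destination directory and then relies on the atomicity of rename():

```go
package bundlepersist

import (
	"os"
	"path/filepath"
)

// persistBundleUnique gives every process its own temporary file, so no
// instance can truncate data another instance is still writing; the final
// rename within the same file system is atomic, and the last writer wins.
func persistBundleUnique(dir string, raw []byte) error {
	tmp, err := os.CreateTemp(dir, "bundle-*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()

	if _, err := tmp.Write(raw); err != nil {
		tmp.Close()
		os.Remove(tmpName)
		return err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmpName)
		return err
	}

	// Same directory (and file system) as the destination, so the rename
	// cannot degrade into a copy and remains atomic.
	return os.Rename(tmpName, filepath.Join(dir, "bundle.tar.gz"))
}
```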

Solution 2

Introduce a file lock around the temporary file for the entire process of opening, truncating, writing to, and renaming the file.

Pros:

  • The same storage requirements as before apply (2 times the bundle size)

Cons:

  • Requires a file lock during the entire process of updating the persisted bundle file, which may affect performance negatively

@FredrikAppelros
Contributor Author

Thanks for the suggestions on some alternative solutions! The goal here is to make OPA more robust so that it can actually utilize the bundle persistence feature in a cloud environment on its own without any other supporting services.

I know that this can be solved by building tools around OPA, rather than having this supported inside OPA itself (which is actually what we are doing today already). However, this particular feature is something that I believe can easily be supported by OPA with some minimal changes. I also think it is in line with OPA being touted as cloud native.

I mean, OPA is a CNCF graduated project and the tagline is literally:

Policy-based control for cloud native environments

@FredrikAppelros
Contributor Author

Since storage is usually pretty cheap, I would actually opt for solution 1 as outlined above. If you agree with my reasoning for this change, I can send a PR your way.

@anderseknert
Member

I also think it is in line with OPA being touted as cloud native.
I mean, OPA is a CNCF graduated project and the tagline is literally:
Policy-based control for cloud native environments

So, whether a project may be called "cloud native" or not basically comes down to their willingness to support shared instance state using a distributed file system protocol from 1984?

@FredrikAppelros
Contributor Author

@anderseknert: I am not trying to say anything of the sort. I am merely trying to contribute to OPA and make it more robust in cloud environments.

If you deploy OPA as it is currently implemented in Kubernetes with a common cloud storage driver and try to use the bundle persistence feature, the end result may be corrupt bundles in your persistent storage.

I have already offered to help fix this and can send you a PR, if you are interested.

@FredrikAppelros
Contributor Author

@srenatus: Perhaps I should elaborate that this is not intended to be used as some sort of way to synchronize the bundles across the instances, but rather as a disaster recovery mechanism, as I believe the feature originally was intended.

FredrikAppelros added a commit to FredrikAppelros/opa that referenced this issue Jun 16, 2022
In order to use the feature to persist activated bundles to disk in a
cloud environment with shared storage, e.g. Kubernetes with the Amazon
EFS storage driver, each instance of OPA needs to either synchronize
their access to the temporary file using advisory file locks, or use
unique temporary files. If not, then the following situation may occur:

p1: open and trunc tmp file
p1: write to tmp file
p2: open and trunc tmp file
p1: rename tmp file to dst
p2: write to tmp file
p2: rename tmp file to dst

This may then lead to the persisted bundle being truncated or corrupted.

Here the approach of using unique temporary files is chosen because it
avoids the overhead of introducing file locks, and the additional
dependency since Go lacks any such mechanisms in the standard library.

This solution should avoid truncated or corrupt bundles as `rename()` is
guaranteed to be atomic, even in file systems like NFS.

Fixes: open-policy-agent#4782
Signed-off-by: Fredrik Appelros <fredrik.appelros@sinch.com>
@srenatus
Contributor

@srenatus: Perhaps I should elaborate that this is not intended to be used as some sort of way to synchronize the bundles across the instances, but rather as a disaster recovery mechanism, as I believe the feature originally was intended.

Thanks, all context is good. Could you elaborate on how you'd like to leverage the DR mechanism? I think bundle persistence was intended to allow a restarting OPA instance to quickly pick up where it left off before, without having to refetch bundles. Would this not be possible without sharing a file system? 🤔

@FredrikAppelros
Contributor Author

FredrikAppelros commented Jun 17, 2022

@srenatus: I am thinking it would be leveraged as designed today, just in a cloud environment where multiple instances are deployed to make the application resilient to failures. It is not so much that it is not possible without sharing a file system, but rather a matter of how persistent storage works in most cloud environments.

If you take Kubernetes as an example, you provide access to persistent storage through PersistentVolumeClaims (PVCs), which in turn use an underlying storage driver to provision the actual persistent volume. These drivers typically provision volumes using some form of shared storage solution, e.g. Amazon EBS or EFS. Since any pods running in the cluster are seen as disposable, they may generally move between nodes freely (within their affinities), and hence shared storage is the only solution that makes sense in this type of deployment.

One could argue that a local storage class could be used for this; however, that would decrease durability and make management more difficult, as this method does not support dynamic provisioning.

Imagine that you have a large cluster where nodes are swapped in and out dynamically, making it impossible to manage this statically, and you want to provide durability to the application so that OPA is able to start up and activate the last known bundle without any dependent services. That is what this fix is trying to address.

@ashutosh-narkar
Member

@FredrikAppelros you mentioned the below as a potential solution and that is what your PR seems to do as well.

Avoid using any file locks, but change the implementation so that each instance operates on its own unique temporary file.

So if the intent is that each OPA instance uses a unique temporary file to avoid corruption, do you think configuring each OPA instance with a unique persistence directory would be enough to resolve any conflicts? Currently, when you start OPA, you can set the persistence_directory where a persisted bundle is stored. If there are multiple instances of OPA, each configured with a unique persistence directory, this should work in a cloud environment, correct?

@FredrikAppelros
Contributor Author

Currently, when you start OPA, you can set the persistence_directory where a persisted bundle is stored. If there are multiple instances of OPA, each configured with a unique persistence directory, this should work in a cloud environment, correct?

@ashutosh-narkar: In theory, yes, that should solve the issue, but again, in practice, given how Kubernetes works, it would not be possible to manage. You do not configure each pod on its own, as pods are managed transparently by a ReplicaSet, which in turn may be affected by e.g. your auto-scaling policies. Given the dynamic nature of a cloud environment, this needs to be done without manual configuration.

@srenatus
Contributor

💭 I think what I don't understand is the assumption that the persistence directory is something that needs to be taken care of at all. I would think that keeping it in the local ephemeral file system is fine. When an OPA instance goes away, so does its persistence directory; when a new one is spun up, it'll have a new persistence directory. The mechanism for doing that in an automatic way is bundles, since the instances will fetch their bundles on startup. But I'm no cloud architect, maybe this isn't a feasible approach for your setup. Why is it not?

@FredrikAppelros
Contributor Author

@srenatus: So the issue with keeping it in the node-local ephemeral storage is the fact that pods are disposable, as I previously mentioned. There is no guarantee that the new pod will be spun up on the same node and hence it might not be able to access the persisted bundle. If this happens, we have a dependency on an external bundle service in order to start up new instances, which is exactly what I believe the bundle persistence feature aims to avoid.

As I have tried to explain before, the goal here is making OPA more robust in a cloud environment, ensuring that it can continue operating on cached data in the event of a failure.

@ashutosh-narkar
Member

Given the dynamic nature of a cloud environment, this needs to be done without manual configuration.

In your PR, you're using a unique name for the temp bundle file. If you could dynamically set the persistence_directory, would this be a solution?

@srenatus
Contributor

I think the point is to keep the deterministic persistence directory, to be able to start new instances with a pre-filled persistence directory.

To keep the orchestration needs down for this, they're using a shared file system. What goes wrong with the current code is concurrent writes to that file system.

@srenatus
Contributor

If this happens, we have a dependency on an external bundle service in order to start up new instances, which is exactly what I believe the bundle persistence feature aims to avoid.

I'd think the bundle service is critical anyways, for proper functioning... But yeah I think that's the crux here, isn't it?

@FredrikAppelros
Contributor Author

@srenatus: Exactly, right now the bundle service is critical, but it doesn't have to be. One can usually live with some stale authorization data rather than no data at all.

Have I misunderstood what the bundle persistence feature is about? I would imagine this is exactly the scenario it is meant to guard against, only it was not designed to function with shared storage. In your documentation, it says the following about this feature:

This allows OPA to start with the most recently activated bundle in case OPA cannot communicate with the bundle server.

@srenatus
Contributor

Have I misunderstood what the bundle persistence feature is about?

I'd like to let @ashutosh-narkar take this one. I think your rename approach in the PR is a decent way to go.

I'm still somewhat hesitant about sharing the persistence directory... it's used by the OCI bundle downloader, too, and I don't know if those code paths are happy with having multiple instances share a directory. (I don't know if they use renames when writing to that dir, or if they lock something; or if they even have to.) -- @carabasdaniel do you know if the oci directory can be shared among concurrent instances of OPA?

@FredrikAppelros
Contributor Author

I think the associated PR for this issue is becoming a bit stale. Is there anything I can do to speed up the process?

@srenatus
Contributor

I think it boils down to this: there's an unsupported deployment scenario, having multiple OPA instances share a persistence dir. If that's done, the fix you've proposed is what's necessary to ensure smooth sailing when using bundles.

I'm OK merging your PR, since it clearly unblocks you at a not-too-high cost. What I'm hesitant about is signing off on the deployment scenario. I'm wondering if merging the PR would be understood as supporting it, increasing headaches all round 🤔

@ashutosh-narkar what's your take on this?

@ashutosh-narkar
Member

I think the PR on its own is good to go. OPA tries to load and activate persisted bundles on a best-effort basis, and OPA would ideally always look to get the bundles from the server. The deployment scenario is something the OPA operator decides when choosing how to run OPA, so this change in itself (i.e. using a unique name for the temp file) can be separated from the discussion about how OPA gets deployed and should only be looked at from the point of view of how OPA names temp files going forward. Merging the PR while adhering to the best-effort persistence guarantees, and not implying that we support some advanced or numerous deployment models, would be the way to go imo.

srenatus pushed a commit that referenced this issue Jul 12, 2022
In order to use the feature to persist activated bundles to disk in a
cloud environment with shared storage, e.g. Kubernetes with the Amazon
EFS storage driver, each instance of OPA needs to either synchronize
their access to the temporary file using advisory file locks, or use
unique temporary files. If not, then the following situation may occur:

p1: open and trunc tmp file
p1: write to tmp file
p2: open and trunc tmp file
p1: rename tmp file to dst
p2: write to tmp file
p2: rename tmp file to dst

This may then lead to the persisted bundle being truncated or corrupted.

Here the approach of using unique temporary files is chosen because it
avoids the overhead of introducing file locks, and the additional
dependency since Go lacks any such mechanisms in the standard library.

This solution should avoid truncated or corrupt bundles as `rename()` is
guaranteed to be atomic, even in file systems like NFS.

Fixes: #4782

Signed-off-by: Fredrik Appelros <fredrik.appelros@sinch.com>
@srenatus
Contributor

👏 thank you for your contribution @FredrikAppelros
