
pre-allocate storage for metadata json files, see containers/podman#13967 #1480

Open
wants to merge 1 commit into main

Conversation

@chilikk commented Jan 25, 2023

This change seems to fix the scenario described in containers/podman#17198

I would appreciate advice on writing a test case, though. I would imagine that the delete_container test case would be a good starting point, but I lack knowledge of the test environment and do not know whether it is feasible to fill the filesystem, or to initialize and mount a new filesystem.

…3967

Keeping a temporary file of at least the same size as the target file
for atomic writes helps reduce the probability of running out of space
when deleting entities from corresponding metadata in a disk full
scenario. This strategy is applied to writing containers.json,
layers.json, images.json and mountpoints.json.

Signed-off-by: Denys Knertser <denys@netze.io>
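[Editor's note] To make the strategy concrete, here is a minimal, hypothetical sketch of the idea; the names writeFileAtomic and reserveFile are illustrative and not taken from the patch. After every successful atomic write, a spare file of the same size is kept next to the target, so that a subsequent, smaller rewrite can still complete on a full filesystem.

package main

import "os"

// writeFileAtomic writes data to a temporary file and renames it over target;
// the rename itself needs no additional space.
func writeFileAtomic(target string, data []byte, perm os.FileMode) error {
	tmp := target + ".tmp"
	if err := os.WriteFile(tmp, data, perm); err != nil {
		return err
	}
	if err := os.Rename(tmp, target); err != nil {
		os.Remove(tmp)
		return err
	}
	// Keep a spare file of the same size so the next write has room.
	return reserveFile(target+".pre-alloc", int64(len(data)), perm)
}

// reserveFile creates a file of the requested size. Truncate may only create
// a sparse file on some filesystems; a real implementation would use
// fallocate(2) on Linux to make sure the blocks are actually allocated.
func reserveFile(path string, size int64, perm os.FileMode) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, perm)
	if err != nil {
		return err
	}
	defer f.Close()
	return f.Truncate(size)
}

func main() {
	_ = writeFileAtomic("containers.json", []byte("[]"), 0o600)
}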
@rhatdan (Member) commented Jan 26, 2023

@mtrmac (Collaborator) left a comment

My initial impression is that this is way too much complexity to handle a desperate situation that could be much more reliably handled by something like deleting a file in /var/log. There may well be several other places that require extra space to free large objects, and without a specific feature test to ensure that this keeps working, we are quite likely to add more of them without ever noticing.

I guess all of this code is likely to more frequently break something in unexpected / untested situations, than to help in situations it is intended for. E.g. the non-Linux path is almost certainly not going to be sufficiently tested, and the behavior difference is likely to be missed by future readers. Admittedly that is a very vague and completely unquantified feeling.

So this is not quite a “no” … but it is almost one.

random := ""
// need predictable name when pre-allocated
if !opts.PreAllocate {
random = strconv.FormatUint(rand.Uint64(), 36)
@mtrmac (Collaborator):

This is not a sufficient replacement for CreateTemp. I’m not immediately sure if this is called from contexts where that matters.
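[Editor's note] For illustration only (not the PR's code), this is the difference the reviewer is pointing at: os.CreateTemp atomically creates a file with an unpredictable name, whereas a fixed, predictable name gives up that collision avoidance and is only safe where the caller fully controls the directory. The helper name openWriteTemp is hypothetical.

package main

import (
	"os"
	"path/filepath"
)

// openWriteTemp contrasts the two approaches; dir is assumed to be a directory
// fully controlled by the store (no hostile writers).
func openWriteTemp(dir string, preAllocate bool) (*os.File, error) {
	if !preAllocate {
		// Unpredictable random name, created atomically by CreateTemp.
		return os.CreateTemp(dir, ".tmp-containers.json-")
	}
	// Predictable fixed name, so a pre-allocated spare file can be reused,
	// at the cost of CreateTemp's collision avoidance.
	return os.OpenFile(filepath.Join(dir, ".tmp-containers.json"),
		os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
}

func main() {
	if f, err := openWriteTemp(os.TempDir(), false); err == nil {
		f.Close()
		os.Remove(f.Name())
	}
}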

@@ -17,6 +19,9 @@ type AtomicFileWriterOptions struct {
	// On successful return from Close() this is set to the mtime of the
	// newly written file.
	ModTime time.Time
	// Whenever an atomic file is successfully written create a temporary
	// file of the same size in order to pre-allocate storage for the next write
	PreAllocate bool
@mtrmac (Collaborator):

The PreAllocate and !PreAllocate code paths differ so much, and very importantly they have different security implications (PreAllocate must not be used in directories with hostile writers, like */tmp), that I don’t think it makes much sense for them to use the same object at all.

The comparatively small parts of the code that can be shared between the two can be shared another way (e.g. using a shared helper function).
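[Editor's note] A rough sketch of what such a split could look like, with hypothetical type and function names rather than a concrete proposal: two separate writer types, each with its own creation and security assumptions, sharing only a small commit helper.

package ioutils // hypothetical layout, for illustration only

import "os"

// atomicFileWriter keeps today's behaviour: a CreateTemp-generated name,
// safe even in directories with other writers.
type atomicFileWriter struct {
	tmp    *os.File
	target string
}

// preallocatedFileWriter uses a fixed temporary name and a kept-around spare
// file; only valid in private, externally locked directories.
type preallocatedFileWriter struct {
	tmp    *os.File
	target string
}

// commitTempFile is the kind of small shared helper both writers could use:
// flush the finished temporary file and rename it over the target.
func commitTempFile(tmp *os.File, target string) error {
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), target)
}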

@mtrmac (Collaborator):

PreAllocate must not be used in directories with hostile writers

Not just that, PreAllocate is also only correct if there is some external-to-pkg/ioutils locking that ensures serialization of uses of AtomicFileWriter. (That’s fine for the container/image/layer stores, we have locks there. But it’s a new responsibility for callers of this code, and it needs to be very clearly documented.)
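[Editor's note] For example, the option's doc comment would have to spell out something along these lines (hypothetical wording, not proposed text):

// PreAllocate keeps a spare file of the same size next to the target after
// every successful write, so that the next write can succeed even on a full
// filesystem.
//
// Callers MUST serialize all uses of AtomicFileWriter for a given target via
// locking external to pkg/ioutils, and MUST NOT enable this option for files
// in directories writable by untrusted users (e.g. */tmp).
PreAllocate bool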

@chilikk (Author) commented Jan 26, 2023

My initial impression is that this is way too much complexity to handle a desperate situation that could be much more reliably handled by something like deleting a file in /var/log. There may well be several other places that require extra space to free large objects, and without a specific feature test to ensure that this keeps working, we are quite likely to add more of them without ever noticing.

I guess all of this code is likely to more frequently break something in unexpected / untested situations, than to help in situations it is intended for. E.g. the non-Linux path is almost certainly not going to be sufficiently tested, and the behavior difference is likely to be missed by future readers. Admittedly that is a very vague and completely unquantified feeling.

So this is not quite a “no” … but it is almost one.

I can totally see your point regarding the complexity.

However, I do think that being able to remove specific containers/images once the disk is full is a useful feature, compared to the discussed alternative of performing a system reset that drops all containers with their state, volumes, etc. It might not matter that much in a desktop environment (and could be fixed by freeing some less important files, as suggested, because the containers storage is often located on the root partition), but in a production environment it is desirable to limit the number of affected services.

From what I have been able to gather so far, Podman needs additional space on podman rm when (1) updating the Bolt database and (2) updating containers.json, layers.json and mountpoints.json. Assuming /var/lib/containers is already located on a dedicated partition, it is possible to mitigate (1) by setting up a dedicated partition for /var/lib/containers/storage/libpod. However, such a solution is not applicable to (2) because the *.json files are located in the same directory as the actual data they describe.

Moreover, an atomic write of the *.json files, as currently implemented, requires roughly as much free space as the size of the original file, which can easily reach hundreds of kilobytes even with a handful of containers. The space required for this operation is therefore much larger than for a Bolt DB update, so when the disk is nearly full, an out-of-space failure is much more likely during the metadata *.json atomic writes than during the BoltDB update.

By keeping a temporary file of the same size around, we can guarantee that an atomic write of a smaller file (which is the case with podman rm) will succeed with regard to disk space. We cannot be sure that a temporary file of the same size can always be allocated at write time, so this is not a bullet-proof solution, but it still greatly reduces the probability of ending up in a situation where the containers and all their data must be re-created from scratch.

So, while I agree that one would normally try to avoid the complexity, I think it might be justified if it solves the actual problem. Running out of disk space might not be a frequently encountered scenario, but when it happens it can be very frustrating.

@giuseppe (Member) commented

My initial impression is that this is way too much complexity to handle a desperate situation that could be much more reliably handled by something like deleting a file in /var/log. There may well be several other places that require extra space to free large objects, and without a specific feature test to ensure that this keeps working, we are quite likely to add more of them without ever noticing.

I guess all of this code is likely to more frequently break something in unexpected / untested situations, than to help in situations it is intended for. E.g. the non-Linux path is almost certainly not going to be sufficiently tested, and the behavior difference is likely to be missed by future readers. Admittedly that is a very vague and completely unquantified feeling.

So this is not quite a “no” … but it is almost one.

I agree with @mtrmac. There are so many other places where this could go wrong that I am not sure it is worth adding all this complexity, which is difficult to test, just for one instance.

@mtrmac (Collaborator) commented Jan 27, 2023

Or maybe … if we are serious about this, there really needs to be an end-to-end Podman test that ensures the actually desired functionality (removing a container when completely out of disk space) keeps working.

(I realize that this is a bit circular — we need this PR, and maybe other changes, before that test can be added. So, that means either doing all that work, and then discussing whether it is worth maintaining it, or discussing that now, with some hypothetical guess at how much work that is.)

Labels: none yet
Projects: none yet
4 participants