Being-saved images create errors up the stack #595

Open
nalind opened this issue Apr 13, 2020 · 20 comments

@nalind
Member

nalind commented Apr 13, 2020

Currently, to commit an image in the image library's containers-storage transport, we create and populate layers, create the image record, and then save the non-layers under the directory that matches that image record's ID. The manifest and config blobs are among those non-layers.

This creates a small window where there's an image record without a manifest and/or config blob, and if another thread tries to read information about the image during that window, it will get unexpected errors, and those errors percolating up the stack is a problem.

Option 1: add an incomplete flag that we use for screening out values that we'd otherwise return in an Images() list, and clean them up when we get a write lock, as we do for incomplete layers. Won't work because we don't keep the lock during the commit operation, so things can get cleaned up from under us during the window.
Option 2: screen out values from the Images() list if they don't include a manifest (a rough sketch of this screening follows the list). The commit logic doesn't need to be changed, because it gets the ID when it creates the image record and uses that ID to save the manifest for the image. Downsides: if we crash, the phantom image record could stay there forever, since it would be filtered out for everyone who didn't know its ID. The screening would also need to check for a config blob, which requires knowledge that lives in containers/image (schema 1 images don't have one, and the logic for determining what kind of manifest we have is in containers/image), which we can't use as a dependency.
Option 3: if we fail to read an image's manifest, spin and try again. Downside: hacky, and we can only guess how long we should spin before giving up and reporting an error.
Option 4: expose a way to manipulate Flags in the image record, add a flag that containers/image will set on works in progress, and teach everyone to skip over images that have the flag set. Downside: kind of sucks for everyone.
Option 5: ???
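
Roughly, the screening in option 2 could look like the following sketch, written from the caller's side (assuming the existing Images() and ImageBigData() calls and a big-data key named "manifest"; the helper name is made up, and the real change would live inside Images() itself):

package main

import "github.com/containers/storage"

// completeImages is a hypothetical helper that hides image records which do
// not yet have a manifest recorded, approximating option 2 from outside the
// library.
func completeImages(store storage.Store) ([]storage.Image, error) {
	all, err := store.Images()
	if err != nil {
		return nil, err
	}
	complete := make([]storage.Image, 0, len(all))
	for _, img := range all {
		// An image that is still being committed has a record but no
		// manifest yet; skip it so callers never see the half-written state.
		if _, err := store.ImageBigData(img.ID, "manifest"); err != nil {
			continue
		}
		complete = append(complete, img)
	}
	return complete, nil
}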

@rhatdan
Member

rhatdan commented Apr 13, 2020

@TomSweeneyRedHat
Member

For option 1, could we add a lock to the commit operation to avoid the issue you're concerned about?

@mrunalp
Contributor

mrunalp commented Apr 13, 2020

I like option 2. Can we run some kind of gc from daemons like CRI-O periodically to remove the phantom records?
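
A hedged sketch of what such a periodic GC pass might look like, again assuming a "manifest" big-data key; the helper name and the grace-period parameter are made up:

package main

import (
	"time"

	"github.com/containers/storage"
)

// pruneIncompleteImages is a hypothetical GC pass a daemon like CRI-O could
// run periodically: delete image records that still have no manifest after a
// grace period, on the assumption that their commit crashed partway through.
func pruneIncompleteImages(store storage.Store, gracePeriod time.Duration) error {
	images, err := store.Images()
	if err != nil {
		return err
	}
	for _, img := range images {
		if _, err := store.ImageBigData(img.ID, "manifest"); err == nil {
			continue // manifest present, image is complete
		}
		if time.Since(img.Created) < gracePeriod {
			continue // may still be mid-commit, leave it alone
		}
		if _, err := store.DeleteImage(img.ID, true); err != nil {
			return err
		}
	}
	return nil
}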

@giuseppe
Member

could we write these files before the image is recorded? We'd still need to have some kind of GC if these files are written but the image wasn't recorded.

Another option would be to use a staging area, similarly to OSTree, so we can write these files to a temporary (but guessable) location, e.g.: /var/lib/containers/storage/overlay-images/staging/IMAGE-ID/manifest

When we look up the manifest, we'd need to do something like:

// Look in the final location first, then fall back to the staging area.
f, err := os.Open(filepath.Join(storageRoot, "overlay-images", imageID, "manifest"))
if err != nil && os.IsNotExist(err) {
	f, err = os.Open(filepath.Join(storageRoot, "overlay-images", "staging", imageID, "manifest"))
	// The manifest may have been moved out of staging between the two opens,
	// so retry the final location once more before giving up.
	if err != nil && os.IsNotExist(err) {
		f, err = os.Open(filepath.Join(storageRoot, "overlay-images", imageID, "manifest"))
	}
}
if err != nil {
	return err
}
defer f.Close()

The advantage is that at startup we could just rm -rf $STORAGE/overlay-images/staging
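
A rough sketch of that staging flow, with made-up paths and function name; the image's final directory is assumed to already exist once the image record has been created:

package main

import (
	"os"
	"path/filepath"
)

// stageManifest writes the manifest into a staging directory first and only
// moves it into the image's directory once the record is ready, so readers
// either find the staged copy or the complete one, never a partial write.
func stageManifest(storageRoot, imageID string, manifest []byte) error {
	stagingDir := filepath.Join(storageRoot, "overlay-images", "staging", imageID)
	finalDir := filepath.Join(storageRoot, "overlay-images", imageID)
	if err := os.MkdirAll(stagingDir, 0o700); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(stagingDir, "manifest"), manifest, 0o600); err != nil {
		return err
	}
	// Publish atomically once the image record exists.
	return os.Rename(filepath.Join(stagingDir, "manifest"), filepath.Join(finalDir, "manifest"))
}

// At startup, leftovers from crashed commits can simply be removed:
//	_ = os.RemoveAll(filepath.Join(storageRoot, "overlay-images", "staging"))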

@vrothberg
Member

I wish we did not have this race window at all, but I guess that's a natural consequence of sharing data across processes. I am not yet entirely sold on the proposals, so I want to throw in another idea.

The problem reminds me of the blob-copy detection mechanism we have been working on in c/image for a while, mainly because there, too, we need to wait until processing of data with a unique ID has finished. Since we have a unique ID, we can generate image-specific lockfiles ad hoc. This makes me wish we could keep a lock on the image until we have finished committing, so that others can block on it. Our lockfile implementation allows recursive locking, so we could wire that in without deadlocking ourselves.

Garbage collection would work implicitly. Let's assume process P1 commits image Img. Process P2 runs GetImage(...), which blocks as process P1 owns the image-specific lock of Img. P1 dies for some reason and leaves Img in a state where the non-layer data (e.g., the manifest) is not committed to the storage. As soon as P1 dies, P2 will be unblocked and can access the image. P2 will find out that the manifest is missing. This clearly indicates that something went south while committing, and P2 can delete the image. It would be race-free, as P2 still owns the lock (and the API would use recursive locking, so a process can lock multiple times).

The challenge would be that CreateImage would have to return a handle to the lock (or we could wire that in via the options). The handle would then unlock the image-specific lock:

img, handle, err := storage.NewImage(....)
defer handle.Unlock()

I may have overlooked a detail but it sounds sweet at the moment.
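
A very rough sketch of the shape this could take, assuming the pkg/lockfile API (GetLockfile returning a locker with Lock/Unlock); the lock path and the commit callback are hypothetical:

package main

import (
	"path/filepath"

	"github.com/containers/storage/pkg/lockfile"
)

// commitWithImageLock holds a per-image lock for the whole commit. Readers
// that look up the same image ID would take the same lock and therefore block
// until the commit finishes, or until the committing process dies; on waking
// up they can check for a missing manifest and garbage-collect the record.
func commitWithImageLock(storageRoot, imageID string, commit func() error) error {
	lock, err := lockfile.GetLockfile(filepath.Join(storageRoot, "image-"+imageID+".lock"))
	if err != nil {
		return err
	}
	lock.Lock()
	defer lock.Unlock()
	return commit()
}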

@mtrmac
Collaborator

mtrmac commented Apr 14, 2020

This creates a small window where there's an image record without a manifest and/or config blob, and if another thread tries to read information about the image during that window, it will get unexpected errors, and those errors percolating up the stack is a problem.

Do I understand right that this is primarily a problem for “list all images” users (and users who guess the ID in advance and use it directly)? For users that refer to a single image by name, we could easily enough defer the SetNames to only happen at the end, when the image is complete, and until then the image might just as well not exist. But the “list all images” users need to know, and they tend to hard-code a fairly high amount of knowledge about the storage structure.


Overall, I fairly strongly prefer “option 1”-like behavior, because leaving around incomplete images that are never automatically deleted “kind of sucks for everyone”: things like image {prune, tree, history} do need to include the incomplete images in the list, if only to account for the current storage use correctly.

As for the locking, is that so infeasible? It probably is with the current API, but a
CreateImageAtomically(expectedID, topLayer, bigData map[string][]byte, names []string) seems plausible at first glance.
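
Spelled out, the proposal might look something like the sketch below; nothing here exists in c/storage, it is only a guess at the shape of the call, and the big-data keys and image name are examples:

package main

import "github.com/containers/storage"

// AtomicImageCreator spells out the proposed addition as an interface; the
// intent is that the record, all of its big data (manifest, config,
// signatures, ...), and its names are written under a single write lock, so
// readers never observe a record without its manifest.
type AtomicImageCreator interface {
	CreateImageAtomically(expectedID, topLayer string, bigData map[string][]byte, names []string) (*storage.Image, error)
}

// commitExample shows how a commit might use it; the key names and the image
// name are purely illustrative.
func commitExample(store AtomicImageCreator, id, topLayer string, manifest, config []byte) (*storage.Image, error) {
	return store.CreateImageAtomically(id, topLayer,
		map[string][]byte{
			"manifest": manifest,
			"config":   config,
		},
		[]string{"example.com/some/image:latest"},
	)
}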

@giuseppe
Member

As for the locking, is that so infeasible? It probably is with the current API, but a
CreateImageAtomically(expectedID, topLayer, bigData map[string][]byte, names []string) seems plausible at first glance.

if a new API is acceptable, I like this solution

@rhatdan
Member

rhatdan commented Aug 3, 2020

@nalind @mtrmac @giuseppe @vrothberg @saschagrunert This issue seems to have been dropped; we seemed to be coming to a decent solution, but then everyone went off to their own priorities and this was never fixed.
Do we have consensus on a fix, namely adding a new API call? Can someone implement this new API call?

@vrothberg
Member

CreateImageAtomically(expectedID, topLayer, bigData map[string][]byte, names []string) sounds good to me.

As for implementing it, I suggest creating a card internally unless @saschagrunert has time to tackle it.

@saschagrunert
Member

I think I can look into it in the next few weeks; if someone else is faster than me, feel free to assign this one :)

@rhatdan
Member

rhatdan commented Oct 7, 2020

@nalind @saschagrunert Any progress on this?

@saschagrunert
Member

Not so far, unfortunately.

@rhatdan
Member

rhatdan commented Jun 14, 2021

@nalind @saschagrunert Any progress on this?

@cevich
Member

cevich commented Jun 22, 2022

Checking: we're not sure, but Miloslav suggested I may be seeing this issue when building multi-arch images (using buildah) in Cirrus-cron jobs. For whatever reason, it tends to happen once or twice a week on the 'hello' image build, like in this example. The error message reported is:

loading manifest for target platform: reading manifest for image
instance "sha256:481...a73": error locating item named
"manifest-sha256:481...a73" for image with ID "713...24e" (consider
removing the image to resolve the issue): file does not exist

Does this seem related or do I need to go find another tree to bark at?

@mtrmac
Collaborator

mtrmac commented Jun 22, 2022

When pulling/importing images to named tags, we only add the tag after the image is “complete”.

In the meantime, though, it can still be looked up by manifest digest (after that manifest is recorded into the store), or by image ID (even before any manifest is uploaded, and the image ID is deterministic for pulled images — not for built images).

So, it is possible to:

  • refer to the image using a manifest digest of the top-level manifest list. This digest reference finds an image, even though it is possibly not yet tagged. The top-level manifest list is successfully read.
  • The top-level manifest list refers to a per-arch manifest (“reading manifest for image instance” points at this being the case).
  • The per-arch manifest might not yet be recorded in c/storage, and reading it fails like the above.

This might be applicable if there are two processes: one is pulling the image, and another is referring to it (using a known top-level digest?) and finds the not-yet-complete image.

One clearly problematic thing the storageImageDestination.Commit code is doing is that it first records the top-level manifest list, if any, and only afterwards records the per-arch manifest. So, there’s a window where a second process could see the top-level manifest list but not the per-arch manifest.

All of this is somewhat a special case of the larger discussion — it would be easy to switch the two manifest writes (and we should do that), but similarly the image doesn’t yet have signatures recorded. In principle, all of this should be one locked write.
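
A hedged illustration of just that ordering fix; setBigData stands in for whatever storageImageDestination.Commit actually uses, and the key names follow the "manifest-<digest>" pattern visible in the error message earlier in the thread:

package main

// recordManifests writes the per-arch manifest before the top-level manifest
// list, so a concurrent reader that can already see the list is guaranteed to
// also find the instance it points to. This does not solve the larger
// problem; signatures and the rest still want a single locked write.
func recordManifests(setBigData func(key string, data []byte) error, instanceDigest string, instanceManifest, manifestList []byte) error {
	if err := setBigData("manifest-"+instanceDigest, instanceManifest); err != nil {
		return err
	}
	return setBigData("manifest", manifestList)
}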


I don’t immediately see how this relates to the log, though — doesn’t this show the build completely succeeding, and then some larger, non-parallel operation failing? I can’t trivially see the failing command or even attribute it to a specific script. But overall this looks a bit as if the parallel build just produced a broken image, somehow… and this failure is not racing against anything else running concurrently, is it?

I’m afraid I’m not familiar with the Buildah/Podman multi-arch-manifest code.

@cevich
Member

cevich commented Jun 22, 2022

This might be applicable if there are two processes: one is pulling the image, and another is referring to it (using a known top-level digest?) and finds the not-yet-complete image.

For instance, when building a multi-arch image, and giving the --jobs argument, eh?
(I mention this, because it's precisely what we do in Cirrus to build these images).

This also strikes me as maybe related to containers/buildah#3710 and/or containers/buildah#3791

@cevich
Member

cevich commented Jun 22, 2022

I’m afraid I’m not familiar with the Buildah/Podman multi-arch-manifest code.

The scripts are more or less a wrapper around buildah --layers --force-rm --jobs="$PARALLEL_JOBS" --platform="$platforms" --manifest="$_fqin" "$CONTEXT", with some extra validation to check that every arch is present in the final manifest list. Assuming that is successful, there may or may not be some additional tagging of the manifest list, followed by pushing it up to a registry.

and this failure is not racing against anything else running concurrently, is it?

No, the only thing running on the machine with any intentional concurrency is the buildah build..., everything else is serial (except maybe the push?). In any case, by my eye, the error I posted seems to appear right after the parallel build, but maybe before gluing together the manifest list? Certainly before the scripts try to verify the contents and push.

@mtrmac
Collaborator

mtrmac commented Jun 23, 2022

For instance, when building a multi-arch image, and giving the --jobs argument, eh?

(I don’t really know what I’m looking at) My reading of the log is that buildah build (completely) succeeded, then Login Succeeded! looks like another successful command from some script, and Executing mod-command: tag_version.sh also suggests that something after buildah build is happening. So this seems more likely to be something in the tagging or pushing step. (The cause is still pretty likely to be something about the build, but “a build successfully creates an inconsistent multi-arch image” and “we are concurrently seeing an in-progress partial image being pulled” are somewhat distinct, and this issue is more of the latter.)

@cevich
Member

cevich commented Jun 23, 2022

Ahh, yes, comparing the output and my scripts, your analysis is correct. That error in the logs appears to originate from a buildah images command, which seems really odd. But that's the only buildah command between build completion, mod-command completion, and the attempt to push. Since it failed, an empty string was returned and we get the "No FQIN(s) to be pushed." message. In any case, you're probably correct, and this is either unrelated to this issue, or at best well after the build completed (for good or ill).

@mtrmac
Collaborator

mtrmac commented Jun 24, 2022

Looking at the multi-platform build, apart from the almost-certainly-irrelevant containers/buildah#4079, the top-level manifest is being built synchronously after the individual instances, so that should not allow this kind of inconsistency. (OTOH I never read how Buildah/Podman deal with manifest list objects, and from what I’ve seen today I’m not sure how the failing instance lookup ever works, so I am certainly missing something important about the code base.)

So I think it’s a different bug, or at least it’s likely enough to be different that it seems worthwhile discussing in a separate (initially Buildah) GitHub issue instead of here.
