
In-process exclusion of readers and writers #1346

Merged (13 commits) on Jan 5, 2023

Conversation

@mtrmac (Collaborator) commented Sep 15, 2022:

EDIT: See the description in #1346 (comment) instead.

To see what fixing #1332 would be like. This is on top of, and includes, #1345.

The goal is to protect in-process readers against concurrent writes to memory by Load. Let’s discuss the theory and the performance of that in #1332; this PR is an attempt to draft what the fix could look like.

(Most of this PR could be merged, and might be useful, without actually changing the locking scheme; but without that change, I don’t think we have a reason to make and review all of these code changes.)

Right now, locking is open-coded all over the >3000 lines of storage.store:

func (s *store) ListLayerBigData(id string) ([]string, error) {
        lstore, err := s.LayerStore()
        if err != nil {
                return nil, err
        }
        lstores, err := s.ROLayerStores()
        if err != nil {
                return nil, err
        }
        foundLayer := false
        for _, s := range append([]ROLayerStore{lstore}, lstores...) {
                store := s
                store.RLock()
                defer store.Unlock()
                if err := store.ReloadIfChanged(); err != nil {
                        return nil, err
                }
                data, err := store.BigDataNames(id)
                if err == nil {
                        return data, nil
                }
                if store.Exists(id) {
                        foundLayer = true
                }
        }
        if foundLayer {
                return nil, fmt.Errorf("locating big data for layer with ID %q: %w", id, os.ErrNotExist)
        }
        return nil, fmt.Errorf("locating layer with ID %q: %w", id, ErrLayerUnknown)
}

and adding extra code to each of these sites to deal with the in-process lock would be a tedious nightmare with too many opportunities to overlook something.

So, this PR proposes consolidating almost all of these repeated snippets into helpers that call closures: layerReadAccess and layerWriteAccess at the level of a layerStore, and, at an even higher level (especially for reads), store.writeToLayerStore and readAllLayerStores[T](*store).

The same function can now be:

func (s *store) ListLayerBigData(id string) ([]string, error) {
	foundLayer := false
	if data, err, done := readAllLayerStores(s, func(store roLayerStore, token layerReadToken) ([]string, error, bool) {
		data, err := store.bigDataNames(token, id)
		if err == nil {
			return data, nil, true
		}
		if store.exists(token, id) {
			foundLayer = true
		}
		return nil, nil, false
	}); done {
		return data, err
	}
	if foundLayer {
		return nil, fmt.Errorf("locating big data for layer with ID %q: %w", id, os.ErrNotExist)
	}
	return nil, fmt.Errorf("locating layer with ID %q: %w", id, ErrLayerUnknown)
}

which is < 2/3 of the original size.

At least for the read side, it seems to me that any variant without Go 1.18 generics would need 2 or 3 more tedious lines at every call site. That’s not a show-stopper, but let’s discuss whether we can do it this way. (A notable downside is that the Go 1.18 requirement might not allow us to backport the locking changes to any earlier version … but then with the size of the locking changes, more than almost all of store.go, we might not want to do that anyway?)

After having the closures, we can now quite cheaply add layerReadToken and layerWriteToken types that serve as markers that the layer store’s methods are called with the correct lock held. That’s not nearly as good as safe language-level support (making the lock-protected fields outright unreachable without holding a lock), but it’s something.
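To make the shape of these helpers concrete, here is a minimal, self-contained sketch of the pattern; roStore, readToken, and readAllStores are illustrative stand-ins, not the PR’s actual types or signatures:

```go
package main

import (
	"fmt"
	"sync"
)

// readToken is a zero-sized marker proving the closure runs under a read lock.
type readToken struct{}

// roStore is a stand-in for the real read-only store interface.
type roStore struct {
	mu    sync.RWMutex
	items map[string][]string
}

// bigDataNames requires a readToken, i.e. a caller that holds the read lock.
func (s *roStore) bigDataNames(_ readToken, id string) ([]string, bool) {
	names, ok := s.items[id]
	return names, ok
}

// readAllStores locks each store for reading, invokes fn with a token, and
// stops as soon as fn reports that it is done.
func readAllStores[T any](stores []*roStore, fn func(*roStore, readToken) (T, bool)) (T, bool) {
	for _, s := range stores {
		s.mu.RLock()
		res, done := fn(s, readToken{})
		s.mu.RUnlock()
		if done {
			return res, true
		}
	}
	var zero T
	return zero, false
}

func main() {
	stores := []*roStore{
		{items: map[string][]string{}},
		{items: map[string][]string{"layer1": {"manifest", "config"}}},
	}
	names, ok := readAllStores(stores, func(s *roStore, tok readToken) ([]string, bool) {
		return s.bigDataNames(tok, "layer1")
	})
	fmt.Println(ok, names) // prints: true [manifest config]
}
```

The real helpers additionally reload on-disk state after locking and propagate errors; only the generic early-exit structure is shown here.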


This does not actually show handling of the harder cases, like CreateContainer, where we might need to lock several stores. I’m confident that can be done — we can always do locking manually instead of through the closures; but I didn’t want to spend time on this without having basic consensus on the approach at all.

Also, there seem to be quite a few opportunities for small-scale optimization, typically to check for ErrLayerUnknown from a call instead of calling Exists before/after the main query. After a few false starts, this PR tries very hard not to touch that logic, because that would balloon the scope of this work to a completely unmanageable, impossible-to-review size.


See the individual commit messages for details.

I’d very much appreciate

  • opinions on whether this style is reasonable and acceptable overall
  • any suggestions to make the code cleaner. (I never thought I’d say this but I found myself thinking that this would probably be significantly easier to do, and even easier to prove correct, in C++, due to tuples, variadic templates and const.)

@mtrmac (Collaborator, author) commented Sep 15, 2022:

⚠️ Do not merge as is: On top of the upgrade to Go 1.18, this outright disables all of golangci-lint, because it reports so many warnings, many of them at least a bit valid, and fixing them is not the point of this PR.

@mtrmac (Collaborator, author) commented Sep 15, 2022:

After having the closures, we can now quite cheaply add layerReadToken and layerWriteToken types that serve as markers that the layer store’s methods are called with the correct lock held.

External callers of layerStore could probably be protected more easily by being provided an “open with lock” / “operations allowed when locked for reading” / “operations allowed when locked for writing” interface, making it impossible to call the wrong methods without a cast.

But that would not protect internal calls within layerStore methods, because regardless of which interface they implement, they would be methods on the underlying object.

Also, the “Token” objects, unlike interface pointers, could in principle carry state to allow diagnosing their misuse (they don’t currently; they are zero-sized structs). Interface pointers are ~stateless.
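As a hypothetical illustration of such diagnostic state (not something this PR implements), a debug build could record the issuing store in the token and detect a token being used against the wrong store:

```go
package main

import "fmt"

type layerStore struct {
	name string
}

// layerReadToken is normally an empty struct; this hypothetical debug variant
// records the store that issued it so misuse can be detected at run time.
type layerReadToken struct {
	issuer *layerStore // would be absent (zero-sized) in a release build
}

// checkToken panics if a token issued by one store is used against another.
func (r *layerStore) checkToken(t layerReadToken) {
	if t.issuer != nil && t.issuer != r {
		panic(fmt.Sprintf("token issued by %q used on %q", t.issuer.name, r.name))
	}
}

// exists is a stub method guarded by the token check.
func (r *layerStore) exists(t layerReadToken, id string) bool {
	r.checkToken(t)
	return false // stub: a real store would consult its state here
}

func main() {
	a := &layerStore{name: "a"}
	t := layerReadToken{issuer: a}
	a.exists(t, "x") // token matches its issuer: no panic
	fmt.Println("token check passed")
}
```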

@mtrmac (Collaborator, author) commented Sep 15, 2022:

For the record, the empty …Token structs are optimized out and cause no extra machine code to be generated, at least in toy examples.

@mtrmac force-pushed the process-lock branch 2 times, most recently from f18a6d6 to b618ff3 on October 7, 2022
@mtrmac force-pushed the process-lock branch 2 times, most recently from a5e7ef9 to 988704d on October 10, 2022
@mtrmac (Collaborator, author) commented Oct 10, 2022:

Per the #1332 discussion, to see whether this locking has acceptable performance, essentially all of it has to be built first.

So, as much as I like the tokens and the extra bit of safety, that’s a bit too much churn.

So, I’ll be trying:

  • To reduce boilerplate in store as much as possible, at least for the simple cases — to decrease the size of the code and the number of lines that need to be touched by any other changes.
  • To replace {,R,Un}Lock on the C/I/L stores, plus a manual ReloadIfChanged, with higher-level {start,stop}{Reading,Writing} calls, so that callers no longer manage the inter-process locks directly and instead express intent. That will also allow us to do cleanup in start* for #1136 (“Crash resiliency: deleting incomplete layers doesn’t reliably happen”): even if the intended operation is a “read”, the stores will have the freedom to lock for writing (or to loop obtaining and releasing a lock, if necessary). (The token approach is nice when it matches, but many of the important write code paths would require significant restructuring of the code or the locking scheme, and that’s an explicit non-goal of this effort.)
  • Both of the above could be an improvement in every situation, at least decreasing code size. But we can then change what the locks do, just by modifying the very few start*/stop*/Load methods.
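A rough sketch of what the intent-expressing API could look like. Method and field names are assumptions based on the description above; the inter-process lockfile is stood in by a sync.RWMutex, and the upgrade to a write lock that a real reload would need is elided:

```go
package main

import (
	"fmt"
	"sync"
)

// layerStore sketches the intent-based API: callers say startReading /
// stopReading instead of juggling RLock + ReloadIfChanged themselves.
type layerStore struct {
	lockfile sync.RWMutex // stand-in for the real inter-process lockfile
	modified bool         // stand-in for "on-disk state changed since last load"
}

// reloadIfChanged is a stub for re-reading on-disk state when it changed.
// A real implementation would have to upgrade to a write lock to reload.
func (r *layerStore) reloadIfChanged() error {
	if r.modified {
		r.modified = false // pretend we reloaded the on-disk data
	}
	return nil
}

// startReading obtains the read lock and refreshes in-memory state; on
// success the caller must eventually call stopReading.
func (r *layerStore) startReading() error {
	r.lockfile.RLock()
	if err := r.reloadIfChanged(); err != nil {
		r.lockfile.RUnlock()
		return err
	}
	return nil
}

func (r *layerStore) stopReading() {
	r.lockfile.RUnlock()
}

func main() {
	r := &layerStore{modified: true}
	if err := r.startReading(); err != nil {
		panic(err)
	}
	defer r.stopReading()
	fmt.Println("reading with lock held and state refreshed")
}
```

Because all callers go through these few entry points, what the locks actually do can later be changed in one place.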

@mtrmac force-pushed the process-lock branch 4 times, most recently from 3a9c14a to 309fcfa on October 13, 2022
@mtrmac (Collaborator, author) commented Oct 13, 2022:

I think this is now broadly representative of where I want to get — but I realize the size is completely unmanageable.

Basically the only part worth looking at right now is 309fcfa (“PROTOTYPE: Turn loadMut into inProcessLock”) — that is a very rough sketch of the RFC part; the rest is just infrastructure.

@mtrmac (Collaborator, author) commented Oct 13, 2022:

Next steps are to merge #1345 , then I have a set of thematic branches lined up for PRs.

@mtrmac force-pushed the process-lock branch 2 times, most recently from 744008a to 2025168 on October 13, 2022
@vrothberg (Member) commented:

This is on my todo list but it's a massive PR that requires some uninterrupted time to review which is rare at the moment.

@mtrmac (Collaborator, author) commented Oct 14, 2022:

I’m afraid this PR grew as I was learning things… but basically all of it is trivial infrastructure to just centralize the locking decisions (and I’m filing discovered parts that aren’t infrastructure as separate PRs). I’m also filing that infrastructure as a set of smaller PRs.

The thing I’d like most feedback on is #1346 (comment) — both WRT correctness and performance. (And that’s also the thing that is least finished.)

@rhatdan (Member) commented Oct 14, 2022:

@mtrmac needs a rebase.

@mtrmac force-pushed the process-lock branch 3 times, most recently from e19014a to e4ea7dd on October 18, 2022
@mtrmac (Collaborator, author) commented Dec 13, 2022:

Yes, I think this is ready for review (or possible CRI-O benchmarking).

@rhatdan (Member) commented Dec 14, 2022:

@mrunalp @haircommander PTAL


@vrothberg (Member) left a review:
Impressive work. LGTM


inProcessLock sync.RWMutex // Can _only_ be obtained with lockfile held.
// The following fields can only be read/written with read/write ownership of inProcessLock, respectively.
// Almost all users should use startReading() or startWriting().
@vrothberg (Member) commented:
Absolutely non-blocking and just an idea. The Image struct in libimage has some fields that may be cached. I think you (@mtrmac) suggested to create an embedded struct to highlight the caching nature of these fields (see https://github.com/containers/common/blob/main/libimage/image.go#L38). Cached fields are always accessed via (*Image).cached.field which IMO makes the code much more expressive and comprehensible.

Would it help to add the fields that require locking in a similar embedded struct? (*containerStore).requiresLock.containers?
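A minimal sketch of that suggestion; the field and method names (requiresLock, add, lookup) are hypothetical, chosen only to show how every access would spell out the locking requirement:

```go
package main

import (
	"fmt"
	"sync"
)

type Container struct{ ID string }

// containerStore groups the fields that may only be touched with
// inProcessLock held in an embedded struct, so every access reads as
// c.requiresLock.containers, making the requirement visible at the call site.
type containerStore struct {
	inProcessLock sync.RWMutex
	requiresLock  struct {
		containers []*Container
		byID       map[string]*Container
	}
}

func (c *containerStore) add(ctr *Container) {
	c.inProcessLock.Lock()
	defer c.inProcessLock.Unlock()
	if c.requiresLock.byID == nil {
		c.requiresLock.byID = map[string]*Container{}
	}
	c.requiresLock.containers = append(c.requiresLock.containers, ctr)
	c.requiresLock.byID[ctr.ID] = ctr
}

func (c *containerStore) lookup(id string) (*Container, bool) {
	c.inProcessLock.RLock()
	defer c.inProcessLock.RUnlock()
	ctr, ok := c.requiresLock.byID[id]
	return ctr, ok
}

func main() {
	s := &containerStore{}
	s.add(&Container{ID: "c1"})
	if ctr, ok := s.lookup("c1"); ok {
		fmt.Println("found", ctr.ID) // prints: found c1
	}
}
```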

@mtrmac (Collaborator, author) replied:
Ooh, that did not occur to me at all, and it’s definitely worth doing.

@mtrmac (Collaborator, author) replied on Jan 3, 2023:

I have prototyped this, and it does seem valuable — but it’s a lot of churn.

E.g. on containers.go, the current PR is +80-14 non-whitespace non-comment changes (including some changes with comments), and moving the fields around would be about +51-45. Also, in addition to the C/I/L stores, the top-level store would benefit from the same updates, but store.go is mostly out of scope of this PR.

Overall, I’d prefer to do it in a separate PR a bit later: it would help maintenance but it should not have immediately user-visible impact, which makes it harder for me to prioritize right now — and the churn so far has already been disrupting Nalin’s work, so coordinating a bit better seems warranted.

I have added the suggestion to #1389; it might well be worth tracking as a higher-priority issue separate from the somewhat desperate #1389.

@vrothberg (Member) replied:
Yes, let's do that in a separate PR. Thank you!

layers.go (outdated diff):
// calls to c/storage API.
//
// Users that need to handle concurrent mount/unmount attempts should not access this
// field at all, and should only user the path returned by .Mount() (and that’s only
@vrothberg (Member) commented:
s/only users/only use/

@mtrmac (Collaborator, author) replied:
Fixed, thanks!

@giuseppe (Member) left a review:
LGTM (after the comments already pointed out)

@haircommander (Contributor) commented:

I have tasked @wgahnagl with running some performance tests to see how this works. I can follow up here when we have results

@wgahnagl commented Dec 16, 2022:

I have tasked @wgahnagl with running some performance tests to see how this works. I can follow up here when we have results

I've run the kube-burner high-density stress test against this PR, and it's maxing out at 500 pods, which is normal! 👍

@mtrmac (Collaborator, author) commented Dec 16, 2022:

@wgahnagl Thank you so much.

@rhatdan (Member) commented Dec 17, 2022:

@mtrmac Could you fix the comments and then we can merge and release a new containers/storage.

@rhatdan (Member) commented Jan 3, 2023:

@mtrmac Could you fix the comments and then we can merge and release a new containers/storage.

Commit messages:

Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of basing this on the exclusivity of loading via loadMut (which is AFAICS correct),
use a very traditional read-write lock to protect the fields of containerStore.
This is primarily so that all three stores use the same design.

Also explicitly document how concurrent access to fields of containerStore
is managed.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of basing this on the exclusivity of loading via loadMut (which is AFAICS correct),
use a very traditional read-write lock to protect the fields of imageStore.
This is primarily so that all three stores use the same design.

Also explicitly document how concurrent access to fields of imageStore
is managed.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of reading that value, releasing the mount lock,
and then unmounting, provide a "conditional" unmount mode.
And use that in the existing "loop unmounting" code.

That's at least safer against concurrent processes unmounting
the same layer. But the callers that try to "really unmount"
the layer in a loop are still possibly racing against other processes
trying to mount the layer in the meantime.

I'm not quite sure that we need the "conditional" parameter as an
explicit choice; it seems fairly likely that Umount() should just fail
with ErrLayerNotMounted for all !force callers. I chose to use the flag
to be conservative WRT possible unknown constraints.

Similarly, it's not very clear to me that the unmount loops need to exist;
maybe they should just be unmount(force=true, conditional=true).

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
We can't safely do that because the read-only callers don't allow us
to write to layerStore state.

Luckily, with the recent changes to Mounted, we don't really need to
reload in those places.

Also, fairly extensively document the locking design or implications
for users.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
This should be fixed, it just seems too hard to do without
breaking API (and performance).

So, just be clear about that to warn future readers.

It's tracked in containers#1379 .

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of basing this on the exclusivity of loading via loadMut (which was incorrect
because, contrary to the original design, the r.layerspathModified
check in r.Modified() could trigger during the lifetime of a read lock),
use a very traditional read-write lock to protect the fields of layerStore.

Also explicitly document how concurrent access to fields of layerStore
is managed.

Note that for the btrfs and zfs graph drivers, Diff() can trigger
Mount() and unmount() in a way that violates the locking design.
That's not fixed in this PR.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
In that case, we can just get read locks, confirm that nothing has changed,
and continue; no need for any serialization on exclusively holding
loadMut / inProcessLock.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
In that case, we can just get read locks, confirm that nothing has changed,
and continue; no need for any serialization on exclusively holding
loadMut / inProcessLock.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
In that case, we can just get read locks, confirm that nothing has changed,
and continue; no need for any serialization on exclusively holding
loadMut / inProcessLock.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
mtrmac added a commit to mtrmac/libpod that referenced this pull request on Jan 3, 2023
mtrmac added a commit to mtrmac/cri-o that referenced this pull request on Jan 4, 2023
@vrothberg (Member) left a review:
LGTM

@rhatdan (Member) commented Jan 5, 2023:

LGTM

@rhatdan merged commit fc91849 into containers:main on Jan 5, 2023.
@mtrmac deleted the process-lock branch on January 5, 2023.