
In-process exclusion of readers and writers #1346

Merged (13 commits) on Jan 5, 2023

Conversation

@mtrmac (Collaborator) commented Sep 15, 2022:

EDIT: See the description in #1346 (comment) instead.

To see what fixing #1332 would be like. This is on top of, and includes, #1345.

The goal is to protect in-process readers against concurrent writes to memory by Load. Let’s discuss the theory and the performance of that in #1332; this PR is an attempt to draft what the fix could look like.

(Most of this PR could be merged, and might be useful, without actually changing the locking scheme; but without that change, I don’t think we have a reason to make and review all of these code changes.)

Right now, locking is open-coded all over the >3000 lines of storage.store:

func (s *store) ListLayerBigData(id string) ([]string, error) {
        lstore, err := s.LayerStore()
        if err != nil {
                return nil, err
        }
        lstores, err := s.ROLayerStores()
        if err != nil {
                return nil, err
        }
        foundLayer := false
        for _, s := range append([]ROLayerStore{lstore}, lstores...) {
                store := s
                store.RLock()
                defer store.Unlock()
                if err := store.ReloadIfChanged(); err != nil {
                        return nil, err
                }
                data, err := store.BigDataNames(id)
                if err == nil {
                        return data, nil
                }
                if store.Exists(id) {
                        foundLayer = true
                }
        }
        if foundLayer {
                return nil, fmt.Errorf("locating big data for layer with ID %q: %w", id, os.ErrNotExist)
        }
        return nil, fmt.Errorf("locating layer with ID %q: %w", id, ErrLayerUnknown)
}

and adding extra code to each of these sites to deal with the in-process lock would be a tedious nightmare with too many opportunities to overlook something.

So, this PR proposes consolidating almost all of these repeated snippets into helpers that call closures: layerReadAccess and layerWriteAccess at the level of a layerStore, and, at an even higher level (especially for reads), store.writeToLayerStore and readAllLayerStores[T](*store).

The same function can now be:

func (s *store) ListLayerBigData(id string) ([]string, error) {
	foundLayer := false
	if data, err, done := readAllLayerStores(s, func(store roLayerStore, token layerReadToken) ([]string, error, bool) {
		data, err := store.bigDataNames(token, id)
		if err == nil {
			return data, nil, true
		}
		if store.exists(token, id) {
			foundLayer = true
		}
		return nil, nil, false
	}); done {
		return data, err
	}
	if foundLayer {
		return nil, fmt.Errorf("locating big data for layer with ID %q: %w", id, os.ErrNotExist)
	}
	return nil, fmt.Errorf("locating layer with ID %q: %w", id, ErrLayerUnknown)
}

which is < 2/3 of the original size.

At least for the read side, it seems to me that any variant without Go 1.18 generics would need 2 or 3 more tedious lines at every call site. That’s not a show-stopper, but let’s discuss whether we can do it this way. (A notable downside is that the Go 1.18 requirement might not allow us to backport the locking changes to any earlier version … but then with the size of the locking changes, more than almost all of store.go, we might not want to do that anyway?)

After having the closures, we can now quite cheaply add layerReadToken and layerWriteToken types that serve as markers that the layer store’s methods are called with the correct lock held. That’s not nearly as good as safe language-level support (making the lock-protected fields outright unreachable without holding a lock), but it’s something.
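To make the shape of these helpers concrete, here is a minimal, self-contained sketch of the pattern; roStore, readToken, and readAllStores are illustrative stand-ins, not the PR’s actual types or signatures:

```go
package main

import (
	"fmt"
	"sync"
)

// readToken is a zero-sized marker proving the closure runs under a read lock.
type readToken struct{}

// roStore is a stand-in for the real read-only store interface.
type roStore struct {
	mu    sync.RWMutex
	items map[string][]string
}

// bigDataNames requires a readToken, i.e. a caller that holds the read lock.
func (s *roStore) bigDataNames(_ readToken, id string) ([]string, bool) {
	names, ok := s.items[id]
	return names, ok
}

// readAllStores locks each store for reading, invokes fn with a token, and
// stops as soon as fn reports that it is done.
func readAllStores[T any](stores []*roStore, fn func(*roStore, readToken) (T, bool)) (T, bool) {
	for _, s := range stores {
		s.mu.RLock()
		res, done := fn(s, readToken{})
		s.mu.RUnlock()
		if done {
			return res, true
		}
	}
	var zero T
	return zero, false
}

func main() {
	stores := []*roStore{
		{items: map[string][]string{}},
		{items: map[string][]string{"layer1": {"manifest", "config"}}},
	}
	names, ok := readAllStores(stores, func(s *roStore, tok readToken) ([]string, bool) {
		return s.bigDataNames(tok, "layer1")
	})
	fmt.Println(ok, names) // prints: true [manifest config]
}
```

The real helpers additionally reload on-disk state after locking and propagate errors; only the generic early-exit structure is shown here.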


This does not actually show handling of the harder cases, like CreateContainer, where we might need to lock several stores. I’m confident that can be done — we can always do locking manually instead of through the closures; but I didn’t want to spend time on this without having basic consensus on the approach at all.

Also, there seem to be quite a few opportunities for small-scale optimization, typically to check for ErrLayerUnknown from a call instead of calling Exists before/after the main query. After a few false starts, this PR tries very hard not to touch that logic, because that would balloon the scope of this work to a completely unmanageable, impossible-to-review size.


See the individual commit messages for details.

I’d very much appreciate

  • opinions on whether this style is reasonable and acceptable overall
  • any suggestions to make the code cleaner. (I never thought I’d say this but I found myself thinking that this would probably be significantly easier to do, and even easier to prove correct, in C++, due to tuples, variadic templates and const.)

@mtrmac (Collaborator, author) commented Sep 15, 2022:

⚠️ Do not merge as is: On top of the upgrade to Go 1.18, this outright disables all of golangci-lint, because it reports so many warnings, many of them at least a bit valid, and fixing them is not the point of this PR.

@mtrmac (Collaborator, author) commented Sep 15, 2022:

After having the closures, we can now quite cheaply add layerReadToken and layerWriteToken types that serve as markers that the layer store’s methods are called with the correct lock held.

External callers of layerStore could probably be protected more easily by being provided an “open with lock” / “operations allowed when locked for reading” / “operations allowed when locked for writing” interface, making it impossible to call the wrong methods without a cast.

But that would not protect internal calls within layerStore methods, because regardless of which interface they implement, they would be methods on the underlying object.

Also, the “Token” objects, unlike interface pointers, could in principle carry state to allow diagnosing their misuse (they don’t currently; they are zero-sized structs). Interface pointers are ~stateless.
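As a hypothetical illustration of such diagnostic state (not something this PR implements), a debug build could record the issuing store in the token and detect a token being used against the wrong store:

```go
package main

import "fmt"

type layerStore struct {
	name string
}

// layerReadToken is normally an empty struct; this hypothetical debug variant
// records the store that issued it so misuse can be detected at run time.
type layerReadToken struct {
	issuer *layerStore // would be absent (zero-sized) in a release build
}

// checkToken panics if a token issued by one store is used against another.
func (r *layerStore) checkToken(t layerReadToken) {
	if t.issuer != nil && t.issuer != r {
		panic(fmt.Sprintf("token issued by %q used on %q", t.issuer.name, r.name))
	}
}

// exists is a stub method guarded by the token check.
func (r *layerStore) exists(t layerReadToken, id string) bool {
	r.checkToken(t)
	return false // stub: a real store would consult its state here
}

func main() {
	a := &layerStore{name: "a"}
	t := layerReadToken{issuer: a}
	a.exists(t, "x") // token matches its issuer: no panic
	fmt.Println("token check passed")
}
```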

@mtrmac (Collaborator, author) commented Sep 15, 2022:

For the record, the empty …Token structs are optimized out and cause no extra machine code to be generated, at least in toy examples.

@mtrmac force-pushed the process-lock branch 2 times, most recently from f18a6d6 to b618ff3 on October 7, 2022
@mtrmac force-pushed the process-lock branch 2 times, most recently from a5e7ef9 to 988704d on October 10, 2022
@mtrmac (Collaborator, author) commented Oct 10, 2022:

Per the #1332 discussion, to see whether this locking has acceptable performance, essentially all of it has to be built first.

So, as much as I like the tokens and the extra bit of safety, that’s a bit too much churn.

So, I’ll be trying:

  • To reduce boilerplate in store as much as possible, at least for the simple cases — to decrease the size of the code and the number of lines that need to be touched by any other changes.
  • To replace {,R,Un}Lock on the C/I/L stores, plus a manual ReloadIfChanged, with higher-level {start,stop}{Reading,Writing} calls, so that callers no longer manage the inter-process locks directly and instead express intent. That will also allow us to do cleanup in start* for #1136 (“Crash resiliency: deleting incomplete layers doesn’t reliably happen”): even if the intended operation is a “read”, the stores will have the freedom to lock for writing (or to loop obtaining and releasing a lock, if necessary). (The token approach is nice when it matches, but many of the important write code paths would require significant restructuring of the code or the locking scheme, and that’s an explicit non-goal of this effort.)
  • Both of the above could be an improvement in every situation, at least decreasing code size. But we can then change what the locks do, just by modifying the very few start*/stop*/Load methods.
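A rough sketch of what the intent-expressing API could look like. Method and field names are assumptions based on the description above; the inter-process lockfile is stood in by a sync.RWMutex, and the upgrade to a write lock that a real reload would need is elided:

```go
package main

import (
	"fmt"
	"sync"
)

// layerStore sketches the intent-based API: callers say startReading /
// stopReading instead of juggling RLock + ReloadIfChanged themselves.
type layerStore struct {
	lockfile sync.RWMutex // stand-in for the real inter-process lockfile
	modified bool         // stand-in for "on-disk state changed since last load"
}

// reloadIfChanged is a stub for re-reading on-disk state when it changed.
// A real implementation would have to upgrade to a write lock to reload.
func (r *layerStore) reloadIfChanged() error {
	if r.modified {
		r.modified = false // pretend we reloaded the on-disk data
	}
	return nil
}

// startReading obtains the read lock and refreshes in-memory state; on
// success the caller must eventually call stopReading.
func (r *layerStore) startReading() error {
	r.lockfile.RLock()
	if err := r.reloadIfChanged(); err != nil {
		r.lockfile.RUnlock()
		return err
	}
	return nil
}

func (r *layerStore) stopReading() {
	r.lockfile.RUnlock()
}

func main() {
	r := &layerStore{modified: true}
	if err := r.startReading(); err != nil {
		panic(err)
	}
	defer r.stopReading()
	fmt.Println("reading with lock held and state refreshed")
}
```

Because all callers go through these few entry points, what the locks actually do can later be changed in one place.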

@mtrmac force-pushed the process-lock branch 4 times, most recently from 3a9c14a to 309fcfa on October 13, 2022
@mtrmac (Collaborator, author) commented Oct 13, 2022:

I think this is now broadly representative of where I want to get — but I realize the size is completely unmanageable.

Basically the only part worth looking at right now is 309fcfa (“PROTOTYPE: Turn loadMut into inProcessLock”) — that is a very rough sketch of the RFC part; the rest is just infrastructure.

@mtrmac (Collaborator, author) commented Oct 13, 2022:

Next steps are to merge #1345 , then I have a set of thematic branches lined up for PRs.

@mtrmac force-pushed the process-lock branch 2 times, most recently from 744008a to 2025168 on October 13, 2022
@vrothberg (Member) commented:

This is on my todo list but it's a massive PR that requires some uninterrupted time to review which is rare at the moment.

@mtrmac (Collaborator, author) commented Oct 14, 2022:

I’m afraid this PR grew as I was learning things… but basically all of it is trivial infrastructure to just centralize the locking decisions (and I’m filing discovered parts that aren’t infrastructure as separate PRs). I’m also filing that infrastructure as a set of smaller PRs.

The thing I’d like most feedback on is #1346 (comment) — both WRT correctness and performance. (And that’s also the thing that is least finished.)

@rhatdan (Member) commented Oct 14, 2022:

@mtrmac needs a rebase.

@mtrmac force-pushed the process-lock branch 3 times, most recently from e19014a to e4ea7dd on October 18, 2022
@mtrmac (Collaborator, author) commented Dec 13, 2022:

Yes, I think this is ready for review (or possible CRI-O benchmarking).

@rhatdan (Member) commented Dec 14, 2022:

@mrunalp @haircommander PTAL


@vrothberg (Member) left a review:
Impressive work. LGTM


inProcessLock sync.RWMutex // Can _only_ be obtained with lockfile held.
// The following fields can only be read/written with read/write ownership of inProcessLock, respectively.
// Almost all users should use startReading() or startWriting().
@vrothberg (Member) commented:
Absolutely non-blocking and just an idea. The Image struct in libimage has some fields that may be cached. I think you (@mtrmac) suggested to create an embedded struct to highlight the caching nature of these fields (see https://github.com/containers/common/blob/main/libimage/image.go#L38). Cached fields are always accessed via (*Image).cached.field which IMO makes the code much more expressive and comprehensible.

Would it help to add the fields that require locking in a similar embedded struct? (*containerStore).requiresLock.containers?
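A minimal sketch of that suggestion; the field and method names (requiresLock, add, lookup) are hypothetical, chosen only to show how every access would spell out the locking requirement:

```go
package main

import (
	"fmt"
	"sync"
)

type Container struct{ ID string }

// containerStore groups the fields that may only be touched with
// inProcessLock held in an embedded struct, so every access reads as
// c.requiresLock.containers, making the requirement visible at the call site.
type containerStore struct {
	inProcessLock sync.RWMutex
	requiresLock  struct {
		containers []*Container
		byID       map[string]*Container
	}
}

func (c *containerStore) add(ctr *Container) {
	c.inProcessLock.Lock()
	defer c.inProcessLock.Unlock()
	if c.requiresLock.byID == nil {
		c.requiresLock.byID = map[string]*Container{}
	}
	c.requiresLock.containers = append(c.requiresLock.containers, ctr)
	c.requiresLock.byID[ctr.ID] = ctr
}

func (c *containerStore) lookup(id string) (*Container, bool) {
	c.inProcessLock.RLock()
	defer c.inProcessLock.RUnlock()
	ctr, ok := c.requiresLock.byID[id]
	return ctr, ok
}

func main() {
	s := &containerStore{}
	s.add(&Container{ID: "c1"})
	if ctr, ok := s.lookup("c1"); ok {
		fmt.Println("found", ctr.ID) // prints: found c1
	}
}
```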

@mtrmac (Collaborator, author) replied:
Ooh, that did not occur to me at all, and it’s definitely worth doing.

@mtrmac (Collaborator, author) replied on Jan 3, 2023:

I have prototyped this, and it does seem valuable — but it’s a lot of churn.

E.g. on containers.go, the current PR is +80-14 non-whitespace non-comment changes (including some changes with comments), and moving the fields around would be about +51-45. Also, in addition to the C/I/L stores, the top-level store would benefit from the same updates, but store.go is mostly out of scope of this PR.

Overall, I’d prefer to do it in a separate PR a bit later: it would help maintenance but it should not have immediately user-visible impact, which makes it harder for me to prioritize right now — and the churn so far has already been disrupting Nalin’s work, so coordinating a bit better seems warranted.

I have added the suggestion to #1389; it might well be worth tracking as a higher-priority issue separate from the somewhat desperate #1389.

@vrothberg (Member) replied:
Yes, let's do that in a separate PR. Thank you!

layers.go (outdated diff):
// calls to c/storage API.
//
// Users that need to handle concurrent mount/unmount attempts should not access this
// field at all, and should only user the path returned by .Mount() (and that’s only
@vrothberg (Member) commented:
s/only users/only use/

@mtrmac (Collaborator, author) replied:
Fixed, thanks!

@giuseppe (Member) left a review:
LGTM (after the comments already pointed out)

@haircommander (Contributor) commented:

I have tasked @wgahnagl with running some performance tests to see how this works. I can follow up here when we have results

@wgahnagl commented Dec 16, 2022:

I have tasked @wgahnagl with running some performance tests to see how this works. I can follow up here when we have results

I've run the kube-burner high-density stress test against this PR, and it's maxing out at 500 pods, which is normal! 👍

@mtrmac (Collaborator, author) commented Dec 16, 2022:

@wgahnagl Thank you so much.

@rhatdan (Member) commented Dec 17, 2022:

@mtrmac Could you fix the comments and then we can merge and release a new containers/storage.

@rhatdan (Member) commented Jan 3, 2023:

@mtrmac Could you fix the comments and then we can merge and release a new containers/storage.

Commit messages:

Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of basing this on the exclusivity of loading via loadMut (which is AFAICS correct),
use a very traditional read-write lock to protect the fields of containerStore.
This is primarily so that all three stores use the same design.

Also explicitly document how concurrent access to fields of containerStore
is managed.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of basing this on the exclusivity of loading via loadMut (which is AFAICS correct),
use a very traditional read-write lock to protect the fields of imageStore.
This is primarily so that all three stores use the same design.

Also explicitly document how concurrent access to fields of imageStore
is managed.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of reading that value, releasing the mount lock,
and then unmounting, provide a "conditional" unmount mode.
And use that in the existing "loop unmounting" code.

That's at least safer against concurrent processes unmounting
the same layer. But the callers that try to "really unmount"
the layer in a loop are still possibly racing against other processes
trying to mount the layer in the meantime.

I'm not quite sure that we need the "conditional" parameter as an
explicit choice; it seems fairly likely that Umount() should just fail
with ErrLayerNotMounted for all !force callers. I chose to use the flag
to be conservative WRT possible unknown constraints.

Similarly, it's not very clear to me that the unmount loops need to exist;
maybe they should just be unmount(force=true, conditional=true).

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
We can't safely do that because the read-only callers don't allow us
to write to layerStore state.

Luckily, with the recent changes to Mounted, we don't really need to
reload in those places.

Also, fairly extensively document the locking design or implications
for users.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
This should be fixed, it just seems too hard to do without
breaking API (and performance).

So, just be clear about that to warn future readers.

It's tracked in containers#1379 .

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Should not change behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Instead of basing this on the exclusivity of loading via loadMut (which was incorrect
because, contrary to the original design, the r.layerspathModified
check in r.Modified() could trigger during the lifetime of a read lock),
use a very traditional read-write lock to protect the fields of layerStore.

Also explicitly document how concurrent access to fields of layerStore
is managed.

Note that for the btrfs and zfs graph drivers, Diff() can trigger
Mount() and unmount() in a way that violates the locking design.
That's not fixed in this PR.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
In that case, we can just get read locks, confirm that nothing has changed,
and continue; no need for any serialization on exclusively holding
loadMut / inProcessLock.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
In that case, we can just get read locks, confirm that nothing has changed,
and continue; no need for any serialization on exclusively holding
loadMut / inProcessLock.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
In that case, we can just get read locks, confirm that nothing has changed,
and continue; no need for any serialization on exclusively holding
loadMut / inProcessLock.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
mtrmac added a commit to mtrmac/libpod that referenced this pull request on Jan 3, 2023
mtrmac added a commit to mtrmac/cri-o that referenced this pull request on Jan 4, 2023
@vrothberg (Member) left a review:
LGTM

@rhatdan (Member) commented Jan 5, 2023:

LGTM

@rhatdan merged commit fc91849 into containers:main on Jan 5, 2023.
@mtrmac deleted the process-lock branch on January 5, 2023.