Perform file sync outside of lock on Commit #10128

xinyangge-db · 2024-04-24T21:10:33Z

This PR mitigates lock contention in which PullImage holds the metadata lock while flushing in-flight writes to disk. This can block concurrent CreateContainer and PullImage calls for an extended period.

k8s-ci-robot · 2024-04-24T21:10:43Z

Hi @xinyangge-db. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

xinyangge-db · 2024-04-24T21:15:16Z

@dmcgowan Do you mind taking a look? I saw you touched similar files several years ago.

AkihiroSuda · 2024-04-24T23:10:17Z

/ok-to-test

dmcgowan · 2024-04-25T13:37:28Z

core/content/content.go

@@ -163,6 +163,9 @@ type Writer interface {

 	// Truncate updates the size of the target blob
 	Truncate(size int64) error
+
+	// Sync flushes the in-flight writes to the disk (when applicable)
+	Sync() error


This interface should not change. Sync() error can either be added to a new interface here or defined next to where it needs to be called. New and optional functions should just use a type assertion and be called if available. In this case only the filesystem writer would need a Sync function.

Eventually, we need to call namespacedWriter.w.fp.Sync() from namespacedWriter.Commit() to actually flush the writes. But these are all private members of the class hidden behind the interface, so I don't see a clean way to do it without modifying the interface. Can you elaborate on how to expose namespacedWriter.w.fp.Sync() just for the filesystem writer (i.e., the local.writer class itself is private to namespacedWriter)?

Oh, are you suggesting that we check the actual type (behind the interface) dynamically and then call the Sync function when it is a filesystem writer?

Yes

Somewhere during namespacedWriter Commit before the transaction you could have something like

if syncer, ok := nw.w.(Syncer); ok { syncer.Sync() }

Done. I will test the change.

Tested and it works. Thanks for reviewing the PR.
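
For readers following this thread, here is a minimal sketch of the optional-interface pattern suggested above. It is not the merged code: the Syncer definition, the fileWriter and memWriter types, and the syncIfSupported helper are all hypothetical stand-ins, and Writer is reduced to a single method for brevity.

```go
package main

import "fmt"

// Writer stands in for the existing content writer interface.
// It is left unchanged (reduced to one method for this sketch).
type Writer interface {
	Write(p []byte) (int, error)
}

// Syncer is an optional interface (assumed definition) that only
// writers backed by a real file need to implement.
type Syncer interface {
	Sync() error
}

// fileWriter is a hypothetical filesystem-backed writer.
type fileWriter struct{ synced bool }

func (w *fileWriter) Write(p []byte) (int, error) { return len(p), nil }

// Sync would flush in-flight writes to disk; here it only records the call.
func (w *fileWriter) Sync() error { w.synced = true; return nil }

// memWriter is a hypothetical writer with nothing to flush.
type memWriter struct{}

func (memWriter) Write(p []byte) (int, error) { return len(p), nil }

// syncIfSupported calls Sync only when w also implements Syncer.
// It reports whether Sync was called, plus any error from it.
func syncIfSupported(w Writer) (bool, error) {
	if s, ok := w.(Syncer); ok {
		return true, s.Sync()
	}
	return false, nil
}

func main() {
	for _, w := range []Writer{&fileWriter{}, memWriter{}} {
		called, err := syncIfSupported(w)
		fmt.Printf("%T: sync called=%v err=%v\n", w, called, err)
	}
}
```

The type assertion keeps the shared interface stable: writers that have nothing to flush do not need a no-op Sync method, and callers do not need to know the concrete writer type.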

fuweid · 2024-04-25T16:18:45Z

core/metadata/content.go

+	// from taking too long (10s+) while holding the metadata database lock as in the following
+	// `update` transaction.  We intentionally ignore any error on Sync() because it will be
+	// handled by the subsequent `fp.Sync` anyway.
+	nw.Sync()


The idea looks good. However, as far as I know, fsync or fdatasync can also commit unrelated dirty pages.
When you do it outside, it could impact other sync syscall, like NewContainer, because bbolt always need to call fdatasync for commit. Since the content store uses the same filesystem as the metadata database, if the sync takes 10 seconds, the metadata commit could take longer as well. If you can provide some performance test result, it would be better. Thanks

REF: https://lwn.net/Articles/842385/

@fuweid Thanks for reviewing the PR!

When you do it outside, it could impact other sync syscall

This applies regardless of where we do the sync, right? The downside of doing the sync under an exclusive transaction/lock is that it blocks concurrent operations that don't demand heavy I/O like CreateContainer.

because bbolt always need to call fdatasync for commit

That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

If you can provide some performance test result, it would be better.

In our production environment, we pull and start around 10 containers upon VM booting. We observed a high variance in CreateContainer gRPCs, and sometimes a container creation can be blocked by over a minute. With this change, the latency of CreateContainer becomes both stable and negligible (e.g., in a second or two).

Thanks for the comment.

This applies regardless of where we do the sync, right?

Yes. I was saying that CreateContainer can end up committing some of the dirty pages created by the content writer.
According to the ext4 fast commit article, which pages get committed is effectively random.

That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

Please checkout filesystem behaviour I mentioned in the REF. Both metadata database and image layers are in the same block device by default. They can impact each other.

In our production environment, we pull and start around 10 containers upon VM booting. We observed a high variance in CreateContainer gRPCs, and sometimes a container creation can be blocked by over a minute. With this change, the latency of CreateContainer becomes both stable and negligible (e.g., in a second or two).

That's why I want to see the measured improvement. It would make us more confident in taking this commit.

Makes sense. We've seen a substantial improvement on the CreateContainer latency in our setup.

Please checkout filesystem behaviour I mentioned in the REF. Both metadata database and image layers are in the same block device by default. They can impact each other.

@fuweid My understanding is that even if this interference exists today, we are still better off moving the sync outside of the lock, because syncing under the lock is almost guaranteed to block other calls whenever there is a large image layer (e.g., GB+). There's also a chance that the interference issue you referenced will be addressed by the Linux kernel community in the future, and then we will enjoy a free ride here :)
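
The ordering debated in this thread can be illustrated with a small sketch. It assumes a hypothetical store guarded by a single mutex and a hypothetical blob whose sync method only simulates a flush; none of these names come from containerd. The point is only the sequence: flush before taking the lock, then sync again under the lock, where there should be little or nothing left to write.

```go
package main

import (
	"fmt"
	"sync"
)

// blob is a hypothetical stand-in for an in-flight content write.
type blob struct {
	dirty              bool // true while data is not yet flushed
	slowFlushUnderLock bool // set if the expensive flush ran while the lock was held
}

// sync simulates flushing dirty data. The flush is only expensive
// when there is dirty data left to write.
func (b *blob) sync(lockHeld bool) {
	if b.dirty {
		if lockHeld {
			b.slowFlushUnderLock = true
		}
		b.dirty = false
	}
}

// store is a hypothetical stand-in for a metadata store with one lock.
type store struct {
	mu        sync.Mutex
	committed int
}

// commit flushes the blob before taking the lock, then repeats the
// sync under the lock, where it should find nothing left to write.
func (s *store) commit(b *blob) {
	b.sync(false) // slow part, done without blocking other callers

	s.mu.Lock()
	defer s.mu.Unlock()
	b.sync(true) // final sync inside the critical section; cheap by now
	s.committed++
}

func main() {
	s := &store{}
	b := &blob{dirty: true}
	s.commit(b)
	fmt.Println("slow flush under lock:", b.slowFlushUnderLock)
	fmt.Println("committed:", s.committed)
}
```

This sketch does not model the concern raised above that syncs on the same block device can still delay one another; it only shows that the expensive flush no longer happens while the lock is held.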

fuweid · 2024-04-26T05:03:01Z

f0a06a8 "Address Derek's comments" ... FAIL
- PASS - commit does not have any whitespace errors
- FAIL - does not have a valid DCO
- PASS - commit subject is 72 characters or less! yay

Please squash the commits and sign it off. Thanks.
The change looks good to me.

Signed-off-by: Xinyang Ge <xinyang.ge@databricks.com>

xinyangge-db · 2024-04-26T12:43:07Z

f0a06a8 "Address Derek's comments" ... FAIL
- PASS - commit does not have any whitespace errors
- FAIL - does not have a valid DCO
- PASS - commit subject is 72 characters or less! yay

Please squash the commits and sign it off. Thanks.
The change looks good to me.

@fuweid Done.

xinyangge-db · 2024-04-30T08:34:03Z

@fuweid @dmcgowan Could you kindly take another look and let me know if there are other issues to address in the PR?

k8s-ci-robot added needs-ok-to-test size/M labels Apr 24, 2024

xinyangge-db force-pushed the lockless_sync branch 2 times, most recently from cf2d7cc to b97946e Compare April 24, 2024 21:29

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Apr 24, 2024

dmcgowan reviewed Apr 25, 2024

fuweid reviewed Apr 25, 2024

k8s-ci-robot added size/S and removed size/M labels Apr 25, 2024

Perform file sync outside of lock on Commit

4167416

Signed-off-by: Xinyang Ge <xinyang.ge@databricks.com>

xinyangge-db force-pushed the lockless_sync branch from f0a06a8 to 4167416 Compare April 26, 2024 12:42

xinyangge-db requested review from fuweid and dmcgowan April 28, 2024 23:25

AkihiroSuda approved these changes Apr 30, 2024

mxpv approved these changes May 1, 2024

mxpv added this pull request to the merge queue May 1, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 1, 2024

mxpv added this pull request to the merge queue May 1, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 1, 2024

mxpv added this pull request to the merge queue May 1, 2024

Merged via the queue into containerd:main with commit 2ec82c4 May 1, 2024
47 checks passed

fuweid mentioned this pull request May 2, 2024

RFC - [content-store] Commit should check first fsync() error #10158

Closed
