Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform file sync outside of lock on Commit #10128

Merged
merged 1 commit into from May 1, 2024

Conversation

xinyangge-db
Copy link
Contributor

This PR mitigates a lock contention where PullImage holds the metadata lock while flushing in-flight writes to the disk. This can block concurrent CreateContainer and PullImage calls for an extended period of time.

@k8s-ci-robot
Copy link

Hi @xinyangge-db. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xinyangge-db
Copy link
Contributor Author

@dmcgowan Do you mind taking a look? I saw you touched similar files several years ago.

@xinyangge-db xinyangge-db force-pushed the lockless_sync branch 2 times, most recently from cf2d7cc to b97946e Compare April 24, 2024 21:29
@AkihiroSuda
Copy link
Member

/ok-to-test

@@ -163,6 +163,9 @@ type Writer interface {

// Truncate updates the size of the target blob
Truncate(size int64) error

// Sync flushes the in-flight writes to the disk (when applicable)
Sync() error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface should not change, Sync() error can either be added to a new interface here or defined next to where it needs to be called. For new and optional functions, this should just use a type assert and called if available. In this case only the filesystem writer would need a Sync function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, we need to call namespacedWriter.w.fp.Sync() from the namedspacedWriter.Commit() to actually flush the writes. But these are all private members of the class hiding behind the interface, so I don't see a clean way to do it without modifying the interface. Can you elaborate how to expose namespacedWriter.w.fp.Sync() just for the filesystem writer (i.e., the local.writer class itself is private to namespacedWriter)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, do you suggest to check the actual type (behind the interface) dynamically and then call the Sync function when it is a filesystem writer?

Yes

Somewhere during namespacedWriter Commit before the transaction you could have something like

if syncer, ok := nw.w.(Syncer); ok {
  syncer.Sync()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I will test the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and it works. Thanks for reviewing the PR.

// from taking too long (10s+) while holding the metadata database lock as in the following
// `update` transaction. We intentionally ignore any error on Sync() because it will be
// handled by the subsequent `fp.Sync` anyway.
nw.Sync()
Copy link
Member

@fuweid fuweid Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea looks good. However, as far as I know, fsync or fdatasync could commit unrelated dirty page.
When you do it outside, it could impact other sync syscall, like NewContainer, because bbolt always need to call fdatasync for commit. Since the content store is using the same filesystem with metadata, if it takes 10 seconds, metadata could take longer as well. If you can provide some performance test result, it would be better. Thanks

REF: https://lwn.net/Articles/842385/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fuweid Thanks for reviewing the PR!

When you do it outside, it could impact other sync syscall

This applies regardless of where we do the sync, right? The downside of doing the sync under an exclusive transaction/lock is that it blocks concurrent operations that don't demand heavy I/O like CreateContainer.

because bbolt always need to call fdatasync for commit

That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

If you can provide some performance test result, it would be better.

In our production environment, we pull and start around 10 containers upon VM booting. We observed a high variance in CreateContainer gRPCs, and sometimes a container creation can be blocked by over a minute. With this change, the latency of CreateContainer becomes both stable and negligible (e.g., in a second or two).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment.

This applies regardless of where we do the sync, right?

Yes. I was saying that the CreateContainer can commit part of dirty pages created by content writer.
It's random based on the ext4 fast commit article.

That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

Please checkout filesystem behaviour I mentioned in the REF. Both metadata database and image layers are in the same block device by default. They can impact each other.

In our production environment, we pull and start around 10 containers upon VM booting. We observed a high variance in CreateContainer gRPCs, and sometimes a container creation can be blocked by over a minute. With this change, the latency of CreateContainer becomes both stable and negligible (e.g., in a second or two).

That's why I want to see that improvement result. It will be more confident to get this commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. We've seen a substantial improvement on the CreateContainer latency in our setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please checkout filesystem behaviour I mentioned in the REF. Both metadata database and image layers are in the same block device by default. They can impact each other.

@fuweid My understanding is that even if this exists today, we are still better off moving the sync outside of the lock which is almost guaranteed to be blocking should there be a large image layer (e.g., GB+). And there's also a chance the interference issue you referenced will be addressed by the linux kernel community in the future, and then we will enjoy a free ride here :)

@fuweid
Copy link
Member

fuweid commented Apr 26, 2024

f0a06a8 "Address Derek's comments" ... FAIL
- PASS - commit does not have any whitespace errors
- FAIL - does not have a valid DCO
- PASS - commit subject is 72 characters or less! yay

Please squash the commits and sign it off. Thanks.
The change looks good to me.

Signed-off-by: Xinyang Ge <xinyang.ge@databricks.com>
@xinyangge-db
Copy link
Contributor Author

f0a06a8 "Address Derek's comments" ... FAIL - PASS - commit does not have any whitespace errors - FAIL - does not have a valid DCO - PASS - commit subject is 72 characters or less! yay

Please squash the commits and sign it off. Thanks. The change looks good to me.

@fuweid Done.

@xinyangge-db
Copy link
Contributor Author

@fuweid @dmcgowan Could you kindly take another look and let me know if there are other issues to address in the PR?

@mxpv mxpv added this pull request to the merge queue May 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 1, 2024
@mxpv mxpv added this pull request to the merge queue May 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 1, 2024
@mxpv mxpv added this pull request to the merge queue May 1, 2024
Merged via the queue into containerd:main with commit 2ec82c4 May 1, 2024
47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants