Perform file sync outside of lock on Commit #10128
Conversation
Hi @xinyangge-db. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@dmcgowan Do you mind taking a look? I saw you touched similar files several years ago.
Force-pushed from cf2d7cc to b97946e
/ok-to-test
```diff
@@ -163,6 +163,9 @@ type Writer interface {

 	// Truncate updates the size of the target blob
 	Truncate(size int64) error
+
+	// Sync flushes the in-flight writes to the disk (when applicable)
+	Sync() error
```
This interface should not change. `Sync() error` can either be added to a new interface here or defined next to where it needs to be called. For new and optional functions, this should just use a type assertion and be called if available. In this case, only the filesystem writer would need a Sync function.
Eventually, we need to call `namespacedWriter.w.fp.Sync()` from `namespacedWriter.Commit()` to actually flush the writes. But these are all private members of the class hiding behind the interface, so I don't see a clean way to do it without modifying the interface. Can you elaborate on how to expose `namespacedWriter.w.fp.Sync()` just for the filesystem writer (i.e., the `local.writer` class itself is private to `namespacedWriter`)?
Oh, do you suggest checking the actual type (behind the interface) dynamically and then calling the Sync function when it is a filesystem writer?
Yes. Somewhere during `namespacedWriter.Commit`, before the transaction, you could have something like:

```go
if syncer, ok := nw.w.(Syncer); ok {
	syncer.Sync()
}
```
Done. I will test the change.
Tested and it works. Thanks for reviewing the PR.
```go
// from taking too long (10s+) while holding the metadata database lock as in the following
// `update` transaction. We intentionally ignore any error on Sync() because it will be
// handled by the subsequent `fp.Sync` anyway.
nw.Sync()
```
The idea looks good. However, as far as I know, fsync or fdatasync can commit unrelated dirty pages. When you do it outside, it can impact other sync syscalls, like the one behind NewContainer, because bbolt always needs to call fdatasync on commit. Since the content store uses the same filesystem as the metadata database, if the content sync takes 10 seconds, the metadata commit could take longer as well. If you can provide some performance test results, that would be better. Thanks
@fuweid Thanks for reviewing the PR!

> When you do it outside, it could impact other sync syscall

This applies regardless of where we do the sync, right? The downside of doing the sync under an exclusive transaction/lock is that it blocks concurrent operations that don't demand heavy I/O, like `CreateContainer`.
> because bbolt always need to call fdatasync for commit

That `fdatasync` call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.
> If you can provide some performance test result, it would be better.

In our production environment, we pull and start around 10 containers when a VM boots. We observed high variance in `CreateContainer` gRPC latency, and sometimes a container creation can be blocked for over a minute. With this change, the latency of `CreateContainer` becomes both stable and negligible (e.g., a second or two).
Thanks for the comment.

> This applies regardless of where we do the sync, right?

Yes. I was saying that `CreateContainer` can commit part of the dirty pages created by the content writer. It's random, per the ext4 fast-commit article.

> That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

Please check out the filesystem behaviour I mentioned in the REF. Both the metadata database and the image layers live on the same block device by default, so they can impact each other.

> In our production environment, we pull and start around 10 containers when a VM boots. We observed high variance in `CreateContainer` gRPC latency, and sometimes a container creation can be blocked for over a minute. With this change, the latency of `CreateContainer` becomes both stable and negligible (e.g., a second or two).

That's why I want to see those improvement results. They would give us more confidence to take this commit.
Makes sense. We've seen a substantial improvement in `CreateContainer` latency in our setup.
> Please check out the filesystem behaviour I mentioned in the REF. Both the metadata database and the image layers live on the same block device by default, so they can impact each other.

@fuweid My understanding is that even if this interference exists today, we are still better off moving the sync outside of the lock, which is almost guaranteed to block whenever there is a large image layer (e.g., GB+). And there's also a chance the interference issue you referenced will be addressed by the Linux kernel community in the future, and then we will get a free ride here :)
f0a06a8 "Address Derek's comments" ... FAIL

Please squash the commits and sign them off. Thanks.
Signed-off-by: Xinyang Ge <xinyang.ge@databricks.com>
Force-pushed from f0a06a8 to 4167416
This PR mitigates lock contention where `PullImage` holds the metadata lock while flushing in-flight writes to disk. This can block concurrent `CreateContainer` and `PullImage` calls for an extended period of time.