Perform file sync outside of lock on Commit #10128

Merged 1 commit on May 1, 2024
5 changes: 5 additions & 0 deletions core/content/content.go
@@ -165,6 +165,11 @@ type Writer interface {
Truncate(size int64) error
}

type Syncer interface {
// Sync flushes the in-flight writes to the disk (when applicable)
Sync() error
Member

This interface should not change. Sync() error can either be added to a new interface here or defined next to where it needs to be called. For new and optional functions, this should just use a type assertion and be called if available. In this case only the filesystem writer would need a Sync function.

Contributor Author

Eventually, we need to call namespacedWriter.w.fp.Sync() from namespacedWriter.Commit() to actually flush the writes. But these are all private members of the type hiding behind the interface, so I don't see a clean way to do it without modifying the interface. Can you elaborate on how to expose namespacedWriter.w.fp.Sync() just for the filesystem writer (i.e., the local.writer type itself is private to namespacedWriter)?

Member

> Oh, do you suggest checking the actual type (behind the interface) dynamically and then calling the Sync function when it is a filesystem writer?

Yes

Somewhere during namespacedWriter.Commit, before the transaction, you could have something like:

if syncer, ok := nw.w.(Syncer); ok {
  syncer.Sync()
}

Contributor Author

Done. I will test the change.

Contributor Author

Tested and it works. Thanks for reviewing the PR.

}

// Opt is used to alter the mutable properties of content
type Opt func(*Info) error

14 changes: 14 additions & 0 deletions core/metadata/content.go
@@ -574,6 +574,13 @@ func (nw *namespacedWriter) Commit(ctx context.Context, size int64, expected dig

var innerErr error

// We pre-sync the in-flight writes to the disk. This avoids the subsequent fp.Sync() call
// (https://github.com/containerd/containerd/blob/c4c3c6ea568ce0cfbcf754863abadeea37d77c8f/plugins/content/local/writer.go#L95)
// from taking too long (10s+) while holding the metadata database lock as in the following
// `update` transaction. We intentionally ignore any error on Sync() because it will be
// handled by the subsequent `fp.Sync` anyway.
nw.Sync()
Member (@fuweid), Apr 25, 2024

The idea looks good. However, as far as I know, fsync or fdatasync could commit unrelated dirty pages. When you do it outside, it could impact other sync syscalls, like NewContainer, because bbolt always needs to call fdatasync for commit. Since the content store is using the same filesystem as the metadata, if it takes 10 seconds, the metadata could take longer as well. If you can provide some performance test results, it would be better. Thanks.

REF: https://lwn.net/Articles/842385/
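
As an aside, here is a minimal Go sketch of how one might probe the entanglement described in the REF — dirty a large "layer" file, then time a small fsync on the same filesystem. Everything here is hypothetical (paths, sizes, file names are not from the PR); run it on the filesystem that actually backs /var/lib/containerd for a representative result.

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Stands in for a large image layer being written by the content store.
	big, err := os.CreateTemp(".", "layer-*.blob")
	if err != nil {
		panic(err)
	}
	defer os.Remove(big.Name())

	// Stands in for the small bbolt metadata file.
	small, err := os.CreateTemp(".", "meta-*.db")
	if err != nil {
		panic(err)
	}
	defer os.Remove(small.Name())

	// Dirty ~1 GiB of page cache through the "layer" file without syncing it.
	buf := make([]byte, 1<<20)
	for i := 0; i < 1024; i++ {
		if _, err := big.Write(buf); err != nil {
			panic(err)
		}
	}

	// Time a tiny write + fsync on the "metadata" file while the large
	// file's dirty pages are still pending writeback.
	start := time.Now()
	if _, err := small.Write([]byte("commit")); err != nil {
		panic(err)
	}
	if err := small.Sync(); err != nil {
		panic(err)
	}
	fmt.Printf("small fsync with a dirty neighbor took %v\n", time.Since(start))
}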

Contributor Author

@fuweid Thanks for reviewing the PR!

> When you do it outside, it could impact other sync syscalls

This applies regardless of where we do the sync, right? The downside of doing the sync under an exclusive transaction/lock is that it blocks concurrent operations that don't demand heavy I/O, like CreateContainer.

> because bbolt always needs to call fdatasync for commit

That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

> If you can provide some performance test results, it would be better.

In our production environment, we pull and start around 10 containers upon VM booting. We observed a high variance in CreateContainer gRPCs, and sometimes a container creation can be blocked by over a minute. With this change, the latency of CreateContainer becomes both stable and negligible (e.g., in a second or two).

Member

Thanks for the comment.

> This applies regardless of where we do the sync, right?

Yes. I was saying that CreateContainer can commit part of the dirty pages created by the content writer. It's random, per the ext4 fast-commit article.

> That fdatasync call is on the metadata database, not on the image layers (which can be gigabytes of data). So the degree of lock contention is dramatically different.

Please check out the filesystem behaviour I mentioned in the REF. Both the metadata database and the image layers are on the same block device by default. They can impact each other.

> In our production environment, we pull and start around 10 containers upon VM booting. We observed a high variance in CreateContainer gRPCs, and sometimes a container creation can be blocked by over a minute. With this change, the latency of CreateContainer becomes both stable and negligible (e.g., in a second or two).

That's why I want to see the improvement results. It will give us more confidence to take this commit.

Contributor Author

Makes sense. We've seen a substantial improvement in CreateContainer latency in our setup.

Contributor Author

> Please check out the filesystem behaviour I mentioned in the REF. Both the metadata database and the image layers are on the same block device by default. They can impact each other.

@fuweid My understanding is that even if this exists today, we are still better off moving the sync outside of the lock, which is almost guaranteed to block whenever there is a large image layer (e.g., GB+). And there's also a chance that the interference issue you referenced will be addressed by the Linux kernel community in the future, and then we will enjoy a free ride here :)


if err := update(ctx, nw.db, func(tx *bolt.Tx) error {
dgst, err := nw.commit(ctx, tx, size, expected, opts...)
if err != nil {
@@ -599,6 +606,13 @@ func (nw *namespacedWriter) Commit(ctx context.Context, size int64, expected dig
return innerErr
}

func (nw *namespacedWriter) Sync() error {
if syncer, ok := nw.w.(content.Syncer); ok {
return syncer.Sync()
}
return nil
}

func (nw *namespacedWriter) commit(ctx context.Context, tx *bolt.Tx, size int64, expected digest.Digest, opts ...content.Opt) (digest.Digest, error) {
var base content.Info
for _, opt := range opts {
8 changes: 8 additions & 0 deletions plugins/content/local/writer.go
@@ -206,3 +206,11 @@ func (w *writer) Truncate(size int64) error {
}
return w.fp.Truncate(0)
}

func (w *writer) Sync() error {
if w.fp != nil {
return w.fp.Sync()
}

return nil
}
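
For readers tracing the three hunks above, here is a condensed, self-contained sketch of the optional-interface pattern they implement. Names are simplified from the PR, and memWriter is a hypothetical writer with nothing to flush; this illustrates the mechanism rather than reproducing containerd's actual types.

package main

import "fmt"

// Syncer mirrors the optional interface added in core/content/content.go.
type Syncer interface {
	Sync() error
}

// fileWriter stands in for the local content writer that owns a file handle.
type fileWriter struct{}

func (fileWriter) Sync() error {
	fmt.Println("fsync on the content file")
	return nil
}

// memWriter stands in for a writer with no durable backing; it implements no Sync.
type memWriter struct{}

// preSync is the pattern used in namespacedWriter.Sync: flush only if the
// inner writer opts in by implementing Syncer; everyone else is a no-op.
func preSync(w interface{}) error {
	if s, ok := w.(Syncer); ok {
		return s.Sync()
	}
	return nil
}

func main() {
	_ = preSync(fileWriter{}) // prints: fsync on the content file
	_ = preSync(memWriter{})  // silently skipped
}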