Fix for #22951 #22953
```diff
@@ -113,6 +113,9 @@ class BatchLoads<DestinationT, ElementT>
   // If user triggering is supplied, we will trigger the file write after this many records are
   // written.
   static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
+  // If user triggering is supplied, we will trigger the file write after this many bytes are
+  // written.
+  static final long FILE_TRIGGERING_BYTE_COUNT = 100 * (1L << 20); // 100MiB
```
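As a side note on the constant's value (my annotation, not part of the review): `1L << 20` is one mebibyte, so the expression evaluates to 104,857,600 bytes. A minimal check, using a hypothetical class name for illustration:

```java
public class TriggeringByteCount {
  // Same expression as in the diff above: 100 * (1L << 20) bytes.
  // 1L << 20 == 1_048_576 (1 MiB), so the constant is 104_857_600 (100 MiB).
  static final long FILE_TRIGGERING_BYTE_COUNT = 100 * (1L << 20);
}
```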
It looks like we already have a memory limit for writing of 20 parallel writers with 64 MB buffers. Should we limit this triggering to 64 MB as well so that it fits in one chunk? CC: @reuvenlax Any suggestions here?

@lukecwik Having the same limit as the buffer actually makes sense to me, but can you direct me towards where I might find that limit? I can see it in the comments for

The default comes from this constant: The user can override the default using this pipeline option: Line 91 in b2a6f46
done in

@lukecwik So either the triggering byte count should be x% smaller than the 64 MB default, or

Sorry, didn't see this comment, but I agree that we should change GroupIntoBatches to ensure that if we add an element that would make it go over the limit, we would flush a batch. Pseudo-code would be like:
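The pseudo-code itself did not survive the page capture. The following is a minimal standalone sketch of the flush-before-overflow rule the comment describes: if adding an element would push the batch over the byte limit, flush the current batch first. This is plain Java, not Beam's actual GroupIntoBatches implementation; the class name and the use of string length as a stand-in for an element's encoded size are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class ByteLimitedBatcher {
  private final long byteLimit;
  private final List<String> current = new ArrayList<>();
  private final List<List<String>> flushed = new ArrayList<>();
  private long currentBytes = 0;

  public ByteLimitedBatcher(long byteLimit) {
    this.byteLimit = byteLimit;
  }

  public void add(String element) {
    long size = element.length(); // stand-in for the element's encoded byte size
    // If this element would push the batch over the limit, flush the batch first.
    if (!current.isEmpty() && currentBytes + size > byteLimit) {
      flush();
    }
    current.add(element);
    currentBytes += size;
    // A single element at or over the limit still forms its own batch.
    if (currentBytes >= byteLimit) {
      flush();
    }
  }

  public void flush() {
    if (!current.isEmpty()) {
      flushed.add(new ArrayList<>(current));
      current.clear();
      currentBytes = 0;
    }
  }

  public List<List<String>> batches() {
    return flushed;
  }
}
```

With a limit of 10 bytes, adding "aaaa", "bbbb", then "ccccc" flushes the first two as one batch (since 8 + 5 would exceed the limit) and leaves "ccccc" buffered until the next flush, so no batch exceeds the limit.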
I used a different algorithm, but IMO it stays close to the original concept of the transform now.
```java
  // If using auto-sharding for unbounded data, we batch the records before triggering file write
  // to avoid generating too many small files.
```
```diff
@@ -647,6 +650,7 @@ PCollection<WriteBundlesToFiles.Result<DestinationT>> writeDynamicallyShardedFil
     return input
         .apply(
             GroupIntoBatches.<DestinationT, ElementT>ofSize(FILE_TRIGGERING_RECORD_COUNT)
+                .withByteSize(FILE_TRIGGERING_BYTE_COUNT)
                 .withMaxBufferingDuration(maxBufferingDuration)
                 .withShardedKey())
         .setCoder(
```
It looks like you're adding support for GroupIntoBatches to limit on count and byte size at the same time.
Can you add tests that cover this new scenario to:
see 040b744 (#22953) and b1b732c (#22953)