Fix for #22951 #22953
Conversation
R: @lukecwik
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
fyi, more than likely it requires spotless, used github.dev for it
irrelevant test failure with "Java Tests / Java Wordcount Direct Runner (windows-latest) (pull_request)"
Run Java PreCommit
@@ -113,6 +113,9 @@
// If user triggering is supplied, we will trigger the file write after this many records are
// written.
static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
// If user triggering is supplied, we will trigger the file write after this many bytes are
// written.
static final long FILE_TRIGGERING_BYTE_COUNT = 100 * (1L << 20); // 100MiB
It looks like we already have a memory limit for writing: 20 parallel writers with 64 MB buffers. Should we limit this triggering to 64 MB as well so that it fits in one chunk?
CC: @reuvenlax Any suggestions here?
@lukecwik Having the same limit as the buffer actually makes sense to me, but can you direct me towards where I might find that limit? I can see it in the comments for DEFAULT_MAX_NUM_WRITERS_PER_BUNDLE, but instead of hardcoding 64MB here as well, I would rather reference the original limit directly.
The default comes from this constant:
https://www.javadoc.io/static/com.google.cloud.bigdataoss/util/1.9.17/com/google/cloud/hadoop/util/AsyncWriteChannelOptions.html#UPLOAD_CHUNK_SIZE_DEFAULT
The user can override the default using this pipeline option:
Line 91 in b2a6f46: Integer getGcsUploadBufferSizeBytes();
done in bcd4ba9 (#22953)
@lukecwik
On second thought, I think there is a problem with using this 64MB default. We only flush the batch inside GroupIntoBatches once storedBatchSizeBytes is greater than or equal to the limit. So if we make the limit 64MB, more than likely we will flush just a bit more than 64MB, so we won't fit into the 64MB buffer.
So either the triggering byte count should be x% smaller than the 64MB default, or GroupIntoBatches has to be modified so that if the current element would make it go over the byte size limit, the batch is fired without that element being added to it first. The second seems like a better solution, but I assume doing the storedBatchSizeBytes.read() sooner would have a performance impact.
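A tiny stand-alone illustration of the overshoot (the element sizes are hypothetical, and this is not the actual GroupIntoBatches code): if the batch is only flushed once the running total reaches the limit, the flushed batch necessarily ends up over the limit, so it no longer fits a buffer of exactly that size.

```java
public class FlushOvershootDemo {
    public static void main(String[] args) {
        long limitBytes = 64L * (1 << 20);  // 64 MiB, the GCS upload chunk default
        long elementBytes = 5L * (1 << 20); // hypothetical 5 MiB elements
        long stored = 0;
        // Flush-after-add: keep adding until the running total reaches the limit.
        while (stored < limitBytes) {
            stored += elementBytes;
        }
        // 13 elements x 5 MiB = 65 MiB, which is over the 64 MiB limit.
        System.out.println("flushed " + stored + " bytes with a limit of " + limitBytes);
    }
}
```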
Sorry, didn't see this comment, but I agree that we should change GroupIntoBatches to ensure that if we add an element that would make it go over the limit, we flush the batch first.
Pseudo-code would be like:
byteSize = measure(obj)
if (byteSize >= byteSizeLimit) {
  output obj as a single-element batch
  continue
}
if (byteSize + previousNumBytes > byteSizeLimit) {
  output all buffered elements as a batch
}
add obj to batch
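The pseudo-code above could look something like this as a runnable stand-alone sketch (the class and names here are illustrative, not Beam code). Note that the element that triggers a flush goes into the next batch rather than being dropped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Illustrative sketch of byte-limited batching; not Beam's GroupIntoBatches.
public class ByteLimitedBatcher {
    private final long byteSizeLimit;
    private final ToLongFunction<String> weigher;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private final List<List<String>> emitted = new ArrayList<>();

    public ByteLimitedBatcher(long byteSizeLimit, ToLongFunction<String> weigher) {
        this.byteSizeLimit = byteSizeLimit;
        this.weigher = weigher;
    }

    public void add(String obj) {
        long byteSize = weigher.applyAsLong(obj);
        if (byteSize >= byteSizeLimit) {
            // Oversized element: emit it as its own single-element batch.
            emitted.add(List.of(obj));
            return;
        }
        if (byteSize + bufferedBytes > byteSizeLimit) {
            // Flush first so the buffered batch stays under the limit.
            flush();
        }
        buffer.add(obj);
        bufferedBytes += byteSize;
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            emitted.add(new ArrayList<>(buffer));
            buffer.clear();
            bufferedBytes = 0;
        }
    }

    public List<List<String>> batches() { return emitted; }

    public static void main(String[] args) {
        ByteLimitedBatcher b = new ByteLimitedBatcher(10, s -> (long) s.length());
        for (String s : new String[] {"aaaa", "bbbb", "cccc", "dddddddddddd"}) {
            b.add(s);
        }
        b.flush();
        // "aaaa"+"bbbb" = 8 bytes fits; "cccc" would make it 12 > 10, so the first
        // batch flushes; the 12-byte element becomes its own batch; "cccc" is last.
        System.out.println(b.batches());
    }
}
```

With this shape, no multi-element batch ever exceeds the byte limit, which is the stricter contract discussed below.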
I used a different algo, but IMO it stays close to the original concept of the transform now.
@@ -117,6 +117,11 @@
*/
@AutoValue
public abstract static class BatchingParams<InputT> implements Serializable {
It looks like you're adding support for GroupIntoBatches to limit on count and byte size at the same time.
Can you add tests that cover this new scenario to:
- GroupIntoBatchesTest
- GroupIntoBatchesTranslationTest
see 040b744 (#22953) and b1b732c (#22953)
Codecov Report
@@ Coverage Diff @@
## master #22953 +/- ##
==========================================
+ Coverage 73.45% 73.59% +0.14%
==========================================
Files 714 716 +2
Lines 96497 95282 -1215
==========================================
- Hits 70886 70127 -759
+ Misses 24289 23859 -430
+ Partials 1322 1296 -26
private boolean checkBatchSizes(Iterable<KV<String, Iterable<String>>> listToCheck) {
  for (KV<String, Iterable<String>> element : listToCheck) {
    if (Iterables.size(element.getValue()) != BATCH_SIZE) {
I did notice that it's != and not > here, but the test is still valid with > (we have 10 elements and a batch size of 5, so it can't be anything but 5, and we check the batch count at the end with EVEN_NUM_ELEMENTS / BATCH_SIZE).
I would say that this previous test was too strict and your update makes sense. GroupIntoBatches ensures that the batches aren't bigger than BATCH_SIZE elements.
Unfortunately I think the GroupIntoBatches specification is too loose, since it uses words like "Aim to create batches". It would be great if we could make it a strict guarantee, for example that batches will never be bigger than the element count limit, or that they will never be bigger than the byte size limit (except for the case where a single element is bigger than the byte size limit and will show up in its own group). I wouldn't try to solve this here, but it would make sense to file a bug and/or follow-up PR to make this explicit.
Actually, I think I did solve that. I mean, apart from the inaccuracy of the weigher.
}

// fire them all at once
TestStream<KV<String, String>> stream = streamBuilder.advanceWatermarkToInfinity();
@lukecwik is there any other, simpler way (so not TestStream + TimestampedValue) to guarantee the order of the elements that I missed?
Using TestStream is a way to ensure that the output is produced and processed in a specific order, ensuring exact output conditions. It makes it easier to write pipeline-level integration tests for exact scenarios.
As the other tests have done, the other option is to use PAssert with a custom matcher that passes for any valid combination of outputs. It is difficult to have a meaningful test for cases where runner-determined re-ordering can produce lots of different valid combinations of output. Typically one just writes a check to make sure that certain properties are satisfied. For GroupIntoBatches with both limits, you would ensure that if you take out the largest element from the group, then the byte size is less than the limit, and also ensure that the batch element count is never greater than the specified element count.
These properties would ensure that GroupIntoBatches was honoring the contract, but a naive implementation could choose to group inefficiently (e.g. each batch is one element) which would still be valid.
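A stand-alone sketch of such a property check (the class and names are illustrative, not the actual Beam test code): a batch passes if its element count is within the count limit and its byte size, after removing the largest element, is under the byte limit.

```java
import java.util.List;

public class BatchPropertyCheck {
    // Valid iff: count <= maxCount, and byte size minus the largest element < byteLimit.
    // This admits oversized single-element batches, matching the relaxed contract.
    public static boolean isValidBatch(List<String> batch, int maxCount, long byteLimit) {
        if (batch.size() > maxCount) {
            return false;
        }
        long total = 0;
        long largest = 0;
        for (String e : batch) {
            total += e.length();
            largest = Math.max(largest, e.length());
        }
        return total - largest < byteLimit;
    }

    public static void main(String[] args) {
        // Two 4-byte elements: removing the largest leaves 4 < 6, so it is valid.
        System.out.println(isValidBatch(List.of("aaaa", "bbbb"), 5, 6)); // true
        // Too many elements for maxCount = 1.
        System.out.println(isValidBatch(List.of("aaaa", "bbbb"), 1, 6)); // false
        // An oversized single element is still a valid batch on its own.
        System.out.println(isValidBatch(List.of("aaaaaaaaaa"), 5, 6));   // true
    }
}
```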
Run Java PreCommit
Note to self: the same happens with https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java
@lukecwik Can I use the same PipelineOptions there (in WriteFiles) as well? Does it use the same network layer?
Sorry for the long wait.
@@ -267,20 +225,9 @@ public void testWithShardedKeyInGlobalWindow() {
PAssert.that("Incorrect batch size in one or more elements", collection)
We should move the comment just above into checkBatchSizes:
// Since with default sharding, the number of subshards of a key is nondeterministic, create
// a large number of input elements and a small batch size and check there is no batch larger
// than the specified size.
Erhm, isn't this comment only valid for .withShardedKey()?
})
public void testMultipleLimitsAtOnceInGlobalWindowBatchSizeCountAndBatchSizeByteSize() {
// with using only one of the limits the result would be only 2 batches,
// if we have 3 both limit works
Suggested change:
- // if we have 3 both limit works
+ // if we have 3 both limits are exercised
done in 6abe4cd (#22953)
.map(s -> KV.of("key", s))
.collect(Collectors.toList());

// to ensure ordered firing
Suggested change:
- // to ensure ordered firing
+ // to ensure ordered processing
done in 6abe4cd (#22953)
.advanceWatermarkTo(Instant.EPOCH);

long offset = 0L;
for (KV<String, String> kv : dataToUse) {
We should advance the watermark on each element to ensure that it is processed in order. If we advance the watermark only at the end then all the elements can be processed in parallel and there is no guarantee that the elements will be processed in the order that they were added.
Won't the different/increasing timestamps already guarantee that?
You're right, each addElements call is its own batch that needs to be processed.
Sort of, the issue is that the person might be writing to a different file system that isn't GCS. If you had a way to check the filesystem, then you could apply the GCS limit. On the other hand, it might make sense to use it anyway.
It would be great if we could update GroupIntoBatches to honor the byte size limit for batches with more than one element and output batches of exactly one element if that one is too big. This could be done by changing the logic within GroupIntoBatches to measure the size first and only add the element to the batch if adding it would not make it go over. Some pseudo code:
We could update the javadoc contract to be stricter as well with this change, and it would solve the GroupIntoBatches causing GCS buffer overflow problem, since batches would try to stay under 64 MiB unless there is a large element.
The class wasn't even available as a dependency (for a good reason), so I just hardcoded 64MB there.
@@ -424,13 +465,40 @@ public void processElement(
BoundedWindow window,
OutputReceiver<KV<K, Iterable<InputT>>> receiver) {
LOG.debug("*** BATCH *** Add element for window {} ", window);
if (shouldCareAboutWeight()) {
  final long elementWeight = weigher.apply(element.getValue());
  if (elementWeight + storedBatchSizeBytes.read() > batchSizeBytes) {
This defeats the readLater optimization, since you're eagerly reading the value here (meaning there's also no point in the readLater below). You should add readLaters for minBufferedTs (if needed) and storedBatchSize earlier in the function.
TBH I wasn't sure how the read()/readLater() implementation works, e.g. if we read a value once, will it be cached for the whole duration, or will it be fetched again? But I assumed readLater(), as every prefetching method should be, is already optimized to be a no-op for already-present values, so the worst case is an unnecessary no-op call.
So to sum things up, does that mean that a value returned by a read() call will always be available from that point forward? Anyway, I modified the PR/code to reflect this.
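For illustration, here is a minimal stand-alone mock of the read()/readLater() semantics being discussed (this is not Beam's actual ReadableState implementation, and the class name is made up): readLater() starts an asynchronous fetch and is a no-op once a fetch is cached or in flight, while read() blocks on the pending fetch, after which the value stays available without another backend read.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Mock of a state cell with Beam-style read()/readLater() semantics.
public class PrefetchingStateCell<T> {
    private final Supplier<T> backend;           // simulated slow state backend
    private CompletableFuture<T> pending = null; // in-flight or completed fetch
    public int backendReads = 0;                 // instrumentation for this sketch

    public PrefetchingStateCell(Supplier<T> backend) {
        this.backend = backend;
    }

    // Start fetching in the background; a no-op if a fetch is already in flight or done.
    public PrefetchingStateCell<T> readLater() {
        if (pending == null) {
            pending = CompletableFuture.supplyAsync(() -> {
                backendReads++;
                return backend.get();
            });
        }
        return this;
    }

    // Block for the value; once fetched, repeated reads hit the cached future.
    public T read() {
        readLater();
        return pending.join();
    }

    public static void main(String[] args) {
        PrefetchingStateCell<Long> storedBatchSizeBytes = new PrefetchingStateCell<>(() -> 42L);
        storedBatchSizeBytes.readLater();         // prefetch early in processElement
        long size = storedBatchSizeBytes.read();  // later: likely already fetched
        long again = storedBatchSizeBytes.read(); // cached, no second backend read
        System.out.println(size + " " + again + " backendReads=" + storedBatchSizeBytes.backendReads);
    }
}
```

This is why moving the readLater calls earlier in the function helps: the fetch overlaps with other work instead of blocking at the first read().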
Waiting for #20819
Run Java PreCommit
Run Java_Examples_Dataflow PreCommit
Run Dataflow ValidatesRunner
Run Dataflow Streaming ValidatesRunner
Run Java Dataflow V2 ValidatesRunner
Run Java Dataflow V2 ValidatesRunner Streaming
Run Java PreCommit
Note that I cloned this PR, added this patch, and opened up a new PR. If the Dataflow tests there pass, I intend to merge it and close this one.
#24463 containing this plus a fix was merged.
Closes #22951
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Choose reviewer(s) and mention them in a comment (R: @username).
- [ ] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- [ ] Update CHANGES.md with noteworthy changes.
- [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.