Fix for #22951 #22953
Conversation
R: @lukecwik
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
fyi, more than likely it requires spotless, used github.dev for it
irrelevant test failure with "Java Tests / Java Wordcount Direct Runner (windows-latest) (pull_request)"
Run Java PreCommit
@@ -113,6 +113,9 @@
// If user triggering is supplied, we will trigger the file write after this many records are
// written.
static final int FILE_TRIGGERING_RECORD_COUNT = 500000;
// If user triggering is supplied, we will trigger the file write after this many bytes are
// written.
static final long FILE_TRIGGERING_BYTE_COUNT = 100 * (1L << 20); // 100MiB
It looks like we already have a memory limit for writing: 20 parallel writers with 64 MB buffers. Should we limit this triggering to 64 MB as well so that it fits in one chunk?
CC: @reuvenlax Any suggestions here?
@lukecwik Having the same limit as the buffer actually makes sense to me, but can you direct me towards where I might find that limit? I can see it in the comments for DEFAULT_MAX_NUM_WRITERS_PER_BUNDLE, but instead of hardcoding 64MB here as well, I would rather reference the original limit directly.
The default comes from this constant:
https://www.javadoc.io/static/com.google.cloud.bigdataoss/util/1.9.17/com/google/cloud/hadoop/util/AsyncWriteChannelOptions.html#UPLOAD_CHUNK_SIZE_DEFAULT
The user can override the default using this pipeline option:
Line 91 in b2a6f46: Integer getGcsUploadBufferSizeBytes();
done in bcd4ba9 (#22953)
@lukecwik
On second thought, I think there is a problem with using this 64MB default. We only flush the batch inside GroupIntoBatches once storedBatchSizeBytes is greater than or equal to the limit. So if we make the limit 64MB, more than likely we will flush just a bit more than 64MB, so we won't fit into the 64MB buffer.
So either the triggering byte count should be x% smaller than the 64MB default, or GroupIntoBatches has to be modified so that if the current element would make it go over the byte size limit, the batch is fired without that element being added to it first. The second seems like a better solution, but I assume doing the storedBatchSizeBytes.read() sooner would have a performance impact.
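A tiny stand-alone illustration of the overshoot (the element sizes are hypothetical, and this is not the actual GroupIntoBatches code): if the batch is only flushed once the running total reaches the limit, the flushed batch necessarily ends up over the limit, so it no longer fits a buffer of exactly that size.

```java
public class FlushOvershootDemo {
    public static void main(String[] args) {
        long limitBytes = 64L * (1 << 20);  // 64 MiB, the GCS upload chunk default
        long elementBytes = 5L * (1 << 20); // hypothetical 5 MiB elements
        long stored = 0;
        // Flush-after-add: keep adding until the running total reaches the limit.
        while (stored < limitBytes) {
            stored += elementBytes;
        }
        // 13 elements x 5 MiB = 65 MiB, which is over the 64 MiB limit.
        System.out.println("flushed " + stored + " bytes with a limit of " + limitBytes);
    }
}
```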
Sorry, didn't see this comment, but I agree that we should change GroupIntoBatches to ensure that if we add an element that would make it go over the limit, we flush the batch first.
Pseudo-code would be like:
byteSize = measure(obj)
if (byteSize >= byteSizeLimit) {
  output obj as a single-element batch
  continue
}
if (byteSize + previousNumBytes > byteSizeLimit) {
  output all buffered elements as a batch
}
add obj to batch
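The pseudo-code above could look something like this as a runnable stand-alone sketch (the class and names here are illustrative, not Beam code). Note that the element that triggers a flush goes into the next batch rather than being dropped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Illustrative sketch of byte-limited batching; not Beam's GroupIntoBatches.
public class ByteLimitedBatcher {
    private final long byteSizeLimit;
    private final ToLongFunction<String> weigher;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private final List<List<String>> emitted = new ArrayList<>();

    public ByteLimitedBatcher(long byteSizeLimit, ToLongFunction<String> weigher) {
        this.byteSizeLimit = byteSizeLimit;
        this.weigher = weigher;
    }

    public void add(String obj) {
        long byteSize = weigher.applyAsLong(obj);
        if (byteSize >= byteSizeLimit) {
            // Oversized element: emit it as its own single-element batch.
            emitted.add(List.of(obj));
            return;
        }
        if (byteSize + bufferedBytes > byteSizeLimit) {
            // Flush first so the buffered batch stays under the limit.
            flush();
        }
        buffer.add(obj);
        bufferedBytes += byteSize;
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            emitted.add(new ArrayList<>(buffer));
            buffer.clear();
            bufferedBytes = 0;
        }
    }

    public List<List<String>> batches() { return emitted; }

    public static void main(String[] args) {
        ByteLimitedBatcher b = new ByteLimitedBatcher(10, s -> (long) s.length());
        for (String s : new String[] {"aaaa", "bbbb", "cccc", "dddddddddddd"}) {
            b.add(s);
        }
        b.flush();
        // "aaaa"+"bbbb" = 8 bytes fits; "cccc" would make it 12 > 10, so the first
        // batch flushes; the 12-byte element becomes its own batch; "cccc" is last.
        System.out.println(b.batches());
    }
}
```

With this shape, no multi-element batch ever exceeds the byte limit, which is the stricter contract discussed below.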
I used a different algo, but IMO it stays close to the original concept of the transform now.
@@ -117,6 +117,11 @@
*/
@AutoValue
public abstract static class BatchingParams<InputT> implements Serializable {
It looks like you're adding support for GroupIntoBatches to limit on count and byte size at the same time.
Can you add tests that cover this new scenario to:
- GroupIntoBatchesTest
- GroupIntoBatchesTranslationTest
see 040b744 (#22953) and b1b732c (#22953)
Codecov Report
@@ Coverage Diff @@
## master #22953 +/- ##
==========================================
+ Coverage 73.45% 73.59% +0.14%
==========================================
Files 714 716 +2
Lines 96497 95282 -1215
==========================================
- Hits 70886 70127 -759
+ Misses 24289 23859 -430
+ Partials 1322 1296 -26
private boolean checkBatchSizes(Iterable<KV<String, Iterable<String>>> listToCheck) {
  for (KV<String, Iterable<String>> element : listToCheck) {
    if (Iterables.size(element.getValue()) != BATCH_SIZE) {
I did notice that it's != and not > here, but the test is still valid with > (we have 10 elements and a batch size of 5, so it can't be anything but 5, and we check the batch count at the end with EVEN_NUM_ELEMENTS / BATCH_SIZE).
I would say that this previous test was too strict and your update makes sense. GroupIntoBatches ensures that the batches aren't bigger than BATCH_SIZE elements.
Unfortunately I think the GroupIntoBatches specification is too loose, since it uses words like "Aim to create batches". It would be great if we could make it a strict guarantee, for example that batches will never be bigger than the element count limit, or that they will never be bigger than the byte size limit (except for the case where a single element is bigger than the byte size limit and will show up in its own group). I wouldn't try to solve this here, but it would make sense to file a bug and/or follow-up PR to make this explicit.
Actually, I think I did solve that. I mean, apart from the inaccuracy of the weigher.
}

// fire them all at once
TestStream<KV<String, String>> stream = streamBuilder.advanceWatermarkToInfinity();
@lukecwik is there any other, simpler way (so not TestStream + TimestampedValue) to guarantee the order of the elements that I missed?
Using TestStream is a way to ensure that the output is produced and processed in a specific order, ensuring exact output conditions. It makes it easier to write pipeline-level integration tests for exact scenarios.
As the other tests have done, the other option is to use PAssert with a custom matcher that passes for any valid combination of outputs. It is difficult to have a meaningful test for cases where runner-determined re-ordering can produce lots of different valid combinations of output. Typically one just writes a check to make sure that certain properties are satisfied. For GroupIntoBatches with both limits, you would ensure that if you take out the largest element from the group, then the byte size is less than the limit, and also ensure that the batch element count is never greater than the specified element count.
These properties would ensure that GroupIntoBatches was honoring the contract, but a naive implementation could choose to group inefficiently (e.g. each batch is one element) which would still be valid.
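A stand-alone sketch of such a property check (the class and names are illustrative, not the actual Beam test code): a batch passes if its element count is within the count limit and its byte size, after removing the largest element, is under the byte limit.

```java
import java.util.List;

public class BatchPropertyCheck {
    // Valid iff: count <= maxCount, and byte size minus the largest element < byteLimit.
    // This admits oversized single-element batches, matching the relaxed contract.
    public static boolean isValidBatch(List<String> batch, int maxCount, long byteLimit) {
        if (batch.size() > maxCount) {
            return false;
        }
        long total = 0;
        long largest = 0;
        for (String e : batch) {
            total += e.length();
            largest = Math.max(largest, e.length());
        }
        return total - largest < byteLimit;
    }

    public static void main(String[] args) {
        // Two 4-byte elements: removing the largest leaves 4 < 6, so it is valid.
        System.out.println(isValidBatch(List.of("aaaa", "bbbb"), 5, 6)); // true
        // Too many elements for maxCount = 1.
        System.out.println(isValidBatch(List.of("aaaa", "bbbb"), 1, 6)); // false
        // An oversized single element is still a valid batch on its own.
        System.out.println(isValidBatch(List.of("aaaaaaaaaa"), 5, 6));   // true
    }
}
```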
Run Java PreCommit
Note to self: the same happens with https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java
@lukecwik Can I use the same PipelineOptions there (in WriteFiles) as well? Does it use the same network layer?
Sorry for the long wait.
@@ -267,20 +225,9 @@ public void testWithShardedKeyInGlobalWindow() {
PAssert.that("Incorrect batch size in one or more elements", collection)
We should move the comment just above into checkBatchSizes:
// Since with default sharding, the number of subshards of a key is nondeterministic, create
// a large number of input elements and a small batch size and check there is no batch larger
// than the specified size.
Erhm, isn't this comment only valid for .withShardedKey()?
})
public void testMultipleLimitsAtOnceInGlobalWindowBatchSizeCountAndBatchSizeByteSize() {
// with using only one of the limits the result would be only 2 batches,
// if we have 3 both limit works
Suggested change:
- // if we have 3 both limit works
+ // if we have 3 both limits are exercised
done in 6abe4cd (#22953)
.map(s -> KV.of("key", s))
.collect(Collectors.toList());

// to ensure ordered firing
Suggested change:
- // to ensure ordered firing
+ // to ensure ordered processing
done in 6abe4cd (#22953)
.advanceWatermarkTo(Instant.EPOCH);

long offset = 0L;
for (KV<String, String> kv : dataToUse) {
We should advance the watermark on each element to ensure that it is processed in order. If we advance the watermark only at the end then all the elements can be processed in parallel and there is no guarantee that the elements will be processed in the order that they were added.
Won't the different/increasing timestamps already guarantee that?
You're right, each addElements call is its own batch that needs to be processed.
Sort of, the issue is that the person might be writing to a different file system that isn't GCS. If you had a way to check the filesystem, then you could apply the GCS limit. On the other hand, it might make sense to use it anyway.
It would be great if we could update GroupIntoBatches to honor the byte size limit for batches with more than one element and output batches of exactly one element if that one is too big. This could be done by changing the logic within GroupIntoBatches to measure the size first and only add the element to the batch if adding it would not make it go over. Some pseudo code:
We could update the javadoc contract to be stricter as well with this change, and it would solve the GroupIntoBatches causing GCS buffer overflow problem, since batches would try to stay under 64 MiB unless there is a large element.
The class wasn't even available as a dependency (for a good reason), so I just hardcoded 64MB there.
@@ -424,13 +465,40 @@ public void processElement(
BoundedWindow window,
OutputReceiver<KV<K, Iterable<InputT>>> receiver) {
LOG.debug("*** BATCH *** Add element for window {} ", window);
if (shouldCareAboutWeight()) {
  final long elementWeight = weigher.apply(element.getValue());
  if (elementWeight + storedBatchSizeBytes.read() > batchSizeBytes) {
This defeats the readLater optimization, since you're eagerly reading the value here (meaning there's also no point in the readLater below). You should add readLaters for minBufferedTs (if needed) and storedBatchSize earlier in the function.
TBH I wasn't sure how the read()/readLater() implementation works, e.g. if we read a value once, will it be cached for the whole duration, or will it be fetched again? But I assumed readLater(), as every prefetching method should be, is already optimized to be a no-op for already-present values, so the worst case is an unnecessary no-op call.
So to sum things up, does that mean that a value returned by a read() call will always be available from that point forward? Anyway, I modified the PR/code to reflect this.
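For illustration, here is a minimal stand-alone mock of the read()/readLater() semantics being discussed (this is not Beam's actual ReadableState implementation, and the class name is made up): readLater() starts an asynchronous fetch and is a no-op once a fetch is cached or in flight, while read() blocks on the pending fetch, after which the value stays available without another backend read.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Mock of a state cell with Beam-style read()/readLater() semantics.
public class PrefetchingStateCell<T> {
    private final Supplier<T> backend;           // simulated slow state backend
    private CompletableFuture<T> pending = null; // in-flight or completed fetch
    public int backendReads = 0;                 // instrumentation for this sketch

    public PrefetchingStateCell(Supplier<T> backend) {
        this.backend = backend;
    }

    // Start fetching in the background; a no-op if a fetch is already in flight or done.
    public PrefetchingStateCell<T> readLater() {
        if (pending == null) {
            pending = CompletableFuture.supplyAsync(() -> {
                backendReads++;
                return backend.get();
            });
        }
        return this;
    }

    // Block for the value; once fetched, repeated reads hit the cached future.
    public T read() {
        readLater();
        return pending.join();
    }

    public static void main(String[] args) {
        PrefetchingStateCell<Long> storedBatchSizeBytes = new PrefetchingStateCell<>(() -> 42L);
        storedBatchSizeBytes.readLater();         // prefetch early in processElement
        long size = storedBatchSizeBytes.read();  // later: likely already fetched
        long again = storedBatchSizeBytes.read(); // cached, no second backend read
        System.out.println(size + " " + again + " backendReads=" + storedBatchSizeBytes.backendReads);
    }
}
```

This is why moving the readLater calls earlier in the function helps: the fetch overlaps with other work instead of blocking at the first read().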
Waiting for #20819
Run Java PreCommit
Run Java_Examples_Dataflow PreCommit
Run Dataflow ValidatesRunner
Run Dataflow Streaming ValidatesRunner
Run Java Dataflow V2 ValidatesRunner
Run Java Dataflow V2 ValidatesRunner Streaming
Run Java PreCommit
Note that I cloned this PR, added this patch, and opened up a new PR. If the Dataflow tests there pass, I intend to merge it and close this one.
#24463 containing this plus a fix was merged.
Closes #22951
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Choose reviewer(s) and mention them in a comment (R: @username).
- [ ] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- [ ] Update CHANGES.md with noteworthy changes.
- [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.