
fix: Properly handle records with multiple rows in batching #1647

Merged · 17 commits merged into main from fix/batch-by-rows on May 15, 2024

Conversation

@candiduslynx (Contributor) commented Apr 22, 2024


github-actions bot commented Apr 22, 2024

⏱️ Benchmark results

Comparing with dc31c3a

  • Glob-8 ns/op: 94.5 ⬇️ 2.71% decrease vs. dc31c3a

Co-authored-by: Kemal <223029+disq@users.noreply.github.com>
@candiduslynx candiduslynx requested a review from disq April 22, 2024 17:49
@github-actions github-actions bot added fix and removed fix labels Apr 23, 2024
@candiduslynx candiduslynx requested a review from disq April 23, 2024 09:50
@candiduslynx (Contributor, Author) commented May 11, 2024

TODO: Consider making a slicer to handle batching by both rows and size. It should produce several records when a newly appended record carries too much data (e.g. the batch size is 5, but we receive a record with 100 rows).

This is complicated by the fact that different batchers use different message types, so an overhaul might be required. A rough sketch of the idea follows.
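A minimal sketch of such a slicer, assuming the Apache Arrow Go API (`arrow.Record.NewSlice`); the function name, package name, and module version are illustrative, not part of this PR:

```go
// Hypothetical slicer sketch: split one Arrow record into chunks of at
// most maxRows rows each, so the batcher can emit several records when
// a single input record alone exceeds the batch size.
package writers

import "github.com/apache/arrow/go/v16/arrow"

// sliceRecord returns zero-copy row slices of r, each with at most
// maxRows rows. NewSlice shares the underlying buffers, so this does
// not copy column data.
func sliceRecord(r arrow.Record, maxRows int64) []arrow.Record {
	total := r.NumRows()
	if total <= maxRows {
		return []arrow.Record{r}
	}
	out := make([]arrow.Record, 0, (total+maxRows-1)/maxRows)
	for start := int64(0); start < total; start += maxRows {
		end := start + maxRows
		if end > total {
			end = total
		}
		out = append(out, r.NewSlice(start, end))
	}
	return out
}
```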

@github-actions github-actions bot added fix and removed fix labels May 12, 2024
@erezrokah (Contributor) left a comment
Added a few comments; the main ones are:

  1. I don't think we need the abstraction over int64, as it's basically wrapping `+=`, `>`, and `Get`.
  2. This seems to change the logic to flush after the limit has been reached instead of before. I don't think that's an issue for limiting by rows (it works the same for single-row records, and was already broken for multi-row records), but it can be an issue for limiting by bytes. Regardless, it's probably best to flush before going over the limit (see the sketch below).

Also, per your comment in #1647 (comment): if we roll out multi-row records, this becomes an issue for batching both by rows and by bytes, whereas with single-row records we're less likely to exceed the batch size.
I'm not sure what to do about that, or how slicing would impact the performance improvements we get from multi-row records.
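A hedged sketch of the flush-before-exceeding check from point 2; the `batcher` type, its fields, and the flush hook are assumptions for illustration, not the SDK's actual implementation:

```go
// Illustrative sketch: flush *before* a batch would exceed its limits,
// so an emitted batch never goes over rowLimit or byteLimit.
package writers

type batcher struct {
	rows, bytes         int64  // running totals for the current batch
	rowLimit, byteLimit int64  // configured batch limits
	flush               func() // emits the current batch downstream
}

// add accounts for one incoming record. If appending it would push the
// batch over either limit, the current (non-empty) batch is flushed
// first. Note a single record that by itself exceeds a limit still
// forms an oversized batch of its own; handling that case is what the
// slicer discussed earlier in this thread is for.
func (b *batcher) add(recRows, recBytes int64) {
	if b.rows > 0 && (b.rows+recRows > b.rowLimit || b.bytes+recBytes > b.byteLimit) {
		b.flush()
		b.rows, b.bytes = 0, 0
	}
	b.rows += recRows
	b.bytes += recBytes
}
```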

Review threads (all outdated and resolved):

  • writers/cap.go
  • writers/batchwriter/batchwriter.go
  • writers/streamingbatchwriter/streamingbatchwriter.go
  • writers/mixedbatchwriter/mixedbatchwriter.go
@kodiakhq kodiakhq bot merged commit 926a7fc into main May 15, 2024
8 checks passed
@kodiakhq kodiakhq bot deleted the fix/batch-by-rows branch May 15, 2024 14:32
kodiakhq bot pushed a commit that referenced this pull request May 15, 2024
🤖 I have created a release *beep* *boop*
---


## [4.42.1](v4.42.0...v4.42.1) (2024-05-15)


### Bug Fixes

* Correct error message on Read failure ([#1680](#1680)) ([dc31c3a](dc31c3a))
* Properly handle records with multiple rows in batching ([#1647](#1647)) ([926a7fc](926a7fc))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).