Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

take kernel that works across multiple RecordBatches #1523

Closed
alamb opened this issue Apr 5, 2022 · 3 comments · Fixed by #2838
Closed

take kernel that works across multiple RecordBatches #1523

alamb opened this issue Apr 5, 2022 · 3 comments · Fixed by #2838
Assignees
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog performance

Comments

@alamb
Copy link
Contributor

alamb commented Apr 5, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For several operations in data processing, it is important to be able to select some subset (for sorting or filtering)

For example, the current take kernel works like this:

┌─────────────────┐      ┌─────────┐                              ┌─────────────────┐
│        A        │      │    0    │                              │        A        │
├─────────────────┤      ├─────────┤                              ├─────────────────┤
│        D        │      │    2    │                              │        B        │
├─────────────────┤      ├─────────┤   take(values, indicies)     ├─────────────────┤
│        B        │      │    3    │ ─────────────────────────▶   │        C        │
├─────────────────┤      ├─────────┤                              ├─────────────────┤
│        C        │      │    1    │                              │        D        │
├─────────────────┤      └─────────┘                              └─────────────────┘
│        E        │                                                                  
└─────────────────┘                                                                  
   values array            indicies array                              result        
                                                                                     

In DataFusion, our operators get multiple record batches at a time, and we would like to do stuff like sort them without first combining into a single record batch. For example:

┌─────────────────┐                                                        
│        A        │                                                        
├─────────────────┤                                                        
│        D        │                                     ┌─────────────────┐
└─────────────────┘                                     │        A        │
  values array 0                                        ├─────────────────┤
                                                        │        B        │
                                     ?                  ├─────────────────┤
                                                        │        C        │
┌─────────────────┐     ─────────────────────────▶      ├─────────────────┤
│        B        │                                     │        D        │
├─────────────────┤                                     └─────────────────┘
│        C        │                                                        
├─────────────────┤                                                        
│        E        │                                      desired result    
└─────────────────┘                                                        
  values array 1                                                           

Describe the solution you'd like

I would like a function something like batch_take that takes a vector of RecordBatches and a list of (record_batch_index, offset_in_the_record_batch) tuples and produces the resulting array, like:

┌─────────────────┐      ┌─────────┐                                  ┌─────────────────┐
│        A        │      │ (0, 0)  │        batch_take(               │        A        │
├─────────────────┤      ├─────────┤          [values0, values1],     ├─────────────────┤
│        D        │      │ (1, 0)  │          batch_indicies          │        B        │
└─────────────────┘      ├─────────┤        )                         ├─────────────────┤
  values array 0         │ (1, 1)  │      ─────────────────────────▶  │        C        │
                         ├─────────┤                                  ├─────────────────┤
                         │ (1, 0)  │                                  │        D        │
                         └─────────┘                                  └─────────────────┘
┌─────────────────┐                                                                      
│        B        │                                                                      
├─────────────────┤   batch_indicies                                       result        
│        C        │        array                                                         
├─────────────────┤                                                                      
│        E        │                                                                      
└─────────────────┘                                                                      
  values array 1                                                                                                                                                                                                                    

Overtime I would expect these to become optimized in the same way as we have optimized the take kernel

This will come up in Grouping and Join operators as well.

Describe alternatives you've considered
There are two more features that @yjshen added in apache/datafusion#2132 that we might contemplate:

  1. Take a list of (record_batch_index, offset_in_the_record_batch, num_records) to optimize the common case of copying multiple rows from each source batch.
  2. Provide an iterator interface so that the results can be formed a batch at a time, rather than in one large array

Additional context
This came up while @yjshen was implementing a more memory efficient sort in DataFusion: apache/datafusion#2132 and suggested by @Dandandan apache/datafusion#2132 (comment)

We can probably move a bunch of the implementation from that PR to this one.

@alamb alamb added enhancement Any new improvement worthy of a entry in the changelog arrow Changes to the arrow crate performance labels Apr 5, 2022
@Ted-Jiang
Copy link
Member

Great explanation, I am interested in this, may i have a try 😁
If this is in your plan, i am glad to see your implementation

@alamb
Copy link
Contributor Author

alamb commented May 17, 2022

Hi @Ted-Jiang -- thanks! I don't have any implementations at the moment. It may be interesting to at the other linked PRs to this ticket

@Ted-Jiang
Copy link
Member

Hi @Ted-Jiang -- thanks! I don't have any implementations at the moment. It may be interesting to at the other linked PRs to this ticket

Sure, have found some interesting info.

@tustvold tustvold self-assigned this Oct 6, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022
tustvold added a commit that referenced this issue Oct 13, 2022
* Add interleave kernel (#1523)

* RAT

* Review feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants