`take` kernel that works across multiple `RecordBatch`es #1523

alamb · 2022-04-05T13:07:08Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For several operations in data processing, it is important to be able to select some subset of rows (for example, for sorting or filtering).

For example, the current take kernel works like this:

┌─────────────────┐      ┌─────────┐                              ┌─────────────────┐
│        A        │      │    0    │                              │        A        │
├─────────────────┤      ├─────────┤                              ├─────────────────┤
│        D        │      │    2    │                              │        B        │
├─────────────────┤      ├─────────┤   take(values, indices)      ├─────────────────┤
│        B        │      │    3    │ ─────────────────────────▶   │        C        │
├─────────────────┤      ├─────────┤                              ├─────────────────┤
│        C        │      │    1    │                              │        D        │
├─────────────────┤      └─────────┘                              └─────────────────┘
│        E        │                                                                  
└─────────────────┘                                                                  
   values array            indices array                               result
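
To make the gather semantics in the diagram concrete, here is a minimal sketch on plain slices. It is not the Arrow `take` kernel or its signature; `take_sketch` is a hypothetical name, and a `Vec` stands in for an Arrow array.

```rust
/// Hypothetical stand-in for `take`: gather `values[i]` for each `i` in
/// `indices`, preserving the order of `indices`.
/// Returns `None` if any index is out of bounds.
fn take_sketch<T: Clone>(values: &[T], indices: &[usize]) -> Option<Vec<T>> {
    indices
        .iter()
        .map(|&i| values.get(i).cloned())
        .collect()
}
```

With values `[A, D, B, C, E]` and indices `[0, 2, 3, 1]`, this yields `[A, B, C, D]`, matching the diagram.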

In DataFusion, our operators receive multiple record batches at a time, and we would like to perform operations such as sorting on them without first combining them into a single record batch. For example:

┌─────────────────┐                                                        
│        A        │                                                        
├─────────────────┤                                                        
│        D        │                                     ┌─────────────────┐
└─────────────────┘                                     │        A        │
  values array 0                                        ├─────────────────┤
                                                        │        B        │
                                     ?                  ├─────────────────┤
                                                        │        C        │
┌─────────────────┐     ─────────────────────────▶      ├─────────────────┤
│        B        │                                     │        D        │
├─────────────────┤                                     └─────────────────┘
│        C        │                                                        
├─────────────────┤                                                        
│        E        │                                      desired result    
└─────────────────┘                                                        
  values array 1

Describe the solution you'd like

I would like a function, something like `batch_take`, that takes a vector of `RecordBatch`es and a list of `(record_batch_index, offset_in_the_record_batch)` tuples and produces the resulting array, like:

┌─────────────────┐      ┌─────────┐                                  ┌─────────────────┐
│        A        │      │ (0, 0)  │        batch_take(               │        A        │
├─────────────────┤      ├─────────┤          [values0, values1],     ├─────────────────┤
│        D        │      │ (1, 0)  │          batch_indices           │        B        │
└─────────────────┘      ├─────────┤        )                         ├─────────────────┤
  values array 0         │ (1, 1)  │      ─────────────────────────▶  │        C        │
                         ├─────────┤                                  ├─────────────────┤
                         │ (0, 1)  │                                  │        D        │
                         └─────────┘                                  └─────────────────┘
┌─────────────────┐                                                                      
│        B        │                                                                      
├─────────────────┤   batch_indices                                        result        
│        C        │        array                                                         
├─────────────────┤                                                                      
│        E        │                                                                      
└─────────────────┘                                                                      
  values array 1
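
A rough sketch of the proposed semantics, again on plain slices rather than Arrow arrays. The name `batch_take_sketch` and its signature are assumptions for illustration only; the kernel that eventually closed this issue is called `interleave`, and its actual API is not reproduced here.

```rust
/// Hypothetical stand-in for the proposed `batch_take`: each index is a
/// `(batch_index, row_index)` pair selecting one row from one source batch.
/// Returns `None` if a batch index or row index is out of bounds.
fn batch_take_sketch<T: Clone>(
    batches: &[&[T]],
    indices: &[(usize, usize)],
) -> Option<Vec<T>> {
    indices
        .iter()
        .map(|&(batch, row)| batches.get(batch)?.get(row).cloned())
        .collect()
}
```

With batches `[A, D]` and `[B, C, E]`, the indices `(0, 0), (1, 0), (1, 1), (0, 1)` produce `[A, B, C, D]`, with no intermediate concatenation of the inputs.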

Over time, I would expect these to become optimized in the same way as we have optimized the `take` kernel.

This will come up in Grouping and Join operators as well.

Describe alternatives you've considered
There are two more features that @yjshen added in apache/datafusion#2132 that we might contemplate:

Take a list of (record_batch_index, offset_in_the_record_batch, num_records) to optimize the common case of copying multiple rows from each source batch.
Provide an iterator interface so that the results can be formed a batch at a time, rather than in one large array.
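
The two variations above could look roughly like the following. This is a sketch under stated assumptions, not the code from apache/datafusion#2132: the function names, the `chunk_rows` parameter, and the use of plain slices in place of record batches are all hypothetical.

```rust
/// Hypothetical run-based variant: each entry is `(batch_index, offset, num_records)`
/// and copies a contiguous run of rows in one step.
/// Returns `None` if a run falls outside its source batch.
fn batch_take_runs<T: Clone>(
    batches: &[&[T]],
    runs: &[(usize, usize, usize)],
) -> Option<Vec<T>> {
    let mut out = Vec::new();
    for &(batch, offset, len) in runs {
        let end = offset.checked_add(len)?;
        let rows = batches.get(batch)?.get(offset..end)?;
        // One bulk copy per run instead of one lookup per row.
        out.extend_from_slice(rows);
    }
    Some(out)
}

/// Hypothetical iterator variant: yields the result in pieces of at most
/// `chunk_rows` rows, so the full output never has to exist at once.
/// Each item is `None` if that chunk contained an out-of-bounds index.
fn batch_take_chunks<'a, T: Clone>(
    batches: &'a [&'a [T]],
    indices: &'a [(usize, usize)],
    chunk_rows: usize,
) -> impl Iterator<Item = Option<Vec<T>>> + 'a {
    // `chunks(0)` would panic, so treat 0 as 1.
    indices.chunks(chunk_rows.max(1)).map(move |chunk| {
        chunk
            .iter()
            .map(|&(b, r)| batches.get(b)?.get(r).cloned())
            .collect::<Option<Vec<T>>>()
    })
}
```

The run-based form pays off when the index list contains long stretches from the same batch, as is common after a merge of already-sorted inputs; the iterator form bounds the size of each output piece.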

Additional context
This came up while @yjshen was implementing a more memory-efficient sort in DataFusion (apache/datafusion#2132) and was suggested by @Dandandan in apache/datafusion#2132 (comment).

We can probably move a bunch of the implementation from that PR to this one.

Ted-Jiang · 2022-05-17T07:08:11Z

Great explanation. I am interested in this; may I have a try? 😁
If this is already in your plan, I would be glad to see your implementation.

alamb · 2022-05-17T10:52:25Z

Hi @Ted-Jiang -- thanks! I don't have any implementations at the moment. It may be interesting to look at the other PRs linked to this ticket.

Ted-Jiang · 2022-05-17T12:36:25Z

> Hi @Ted-Jiang -- thanks! I don't have any implementations at the moment. It may be interesting to look at the other PRs linked to this ticket.

Sure, have found some interesting info.

alamb added the enhancement, arrow, and performance labels Apr 5, 2022

This was referenced Apr 5, 2022

Reduce SortExec memory usage by void constructing single huge batch apache/datafusion#2132

Merged

Add a diagram to take kernel documentation #1524

Merged

tustvold self-assigned this Oct 6, 2022

This was referenced Oct 6, 2022

Deprecate MutableArrayData #2832

Closed

"Optimize" Dictionary contents in DictionaryArray / concat_batches #506

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022

Add interleave kernel (apache#1523)

fdb7806

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 6, 2022

Add interleave kernel (apache#1523)

d9dc4ad

tustvold mentioned this issue Oct 6, 2022

Add interleave kernel (#1523) #2838

Merged

tustvold closed this as completed in #2838 Oct 13, 2022

tustvold added a commit that referenced this issue Oct 13, 2022

Add interleave kernel (#1523) (#2838)

fa1d079

* Add interleave kernel (#1523) * RAT * Review feedback
