Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add interleave kernel (#1523) #2838

Merged
merged 5 commits into from Oct 13, 2022
Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Oct 6, 2022

Which issue does this PR close?

Closes #1523

Rationale for this change

When implementing sorts, joins, etc... it is fairly common to need a method to select values from multiple source arrays and interleave them into a single output array.

I opted to call this interleave instead of take_multi, etc... as it has a non-trivially different signature, I think serves a sufficiently different use-case, and it isn't immediately obvious that the multi in take_multi refers to multiple source arrays, and not something different.

What changes are included in this PR?

Adds an interleave kernel that allows interleaving values from multiple source arrays. This is an MVP implementation, largely lifted from SortPreservingMerge in DataFusion. Future PRs will look to optimise it

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 7, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @tustvold

I had some small tests / doc suggestions, but the code looks very nice 👌

use arrow_schema::ArrowError;

///
/// Takes elements by index from a list of [`Array`], creating a new [`Array`] from those values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ for the ascii art (lol though I am biased)

cc @Dandandan and @yjshen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You created it for the original ticket 😂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, nothing like a little self praise of my monodraw skillz to lighten up the review process

/// values array 1
/// ```
///
/// For selecting values by index from a single array see [compute::take](crate::compute::take)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

///
/// For selecting values by index from a single array see [compute::take](crate::compute::take)
///
/// # Panics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function appears to return an error (rather than panic) in these two cases

Comment on lines 70 to 71
"It is not possible to interleave arrays of different data types."
.to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"It is not possible to interleave arrays of different data types."
.to_string(),
format!("It is not possible to interleave arrays of different data types ({:?} and {:?})"
array.data_type(), data_type)

return Ok(new_empty_array(data_type));
}

// TODO: Add specialized implementations (#1523)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the specialized implementation should be tracked by a different ticket than the initial kernel, right?

}

if indices.is_empty() {
return Ok(new_empty_array(data_type));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fn test_primitive_empty() {
let a = Int32Array::from_iter_values([1, 2, 3, 4]);
let v = interleave(&[&a], &[]).unwrap();
assert!(v.is_empty());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert!(v.is_empty());
assert!(v.is_empty());
assert!(v.data_type(), DataType::Int32);

let b = Int32Array::from_iter_values([5, 6, 7]);
let c = Int32Array::from_iter_values([8, 9, 10]);
let values =
interleave(&[&a, &b, &c], &[(0, 3), (0, 3), (2, 2), (2, 0), (1, 1)]).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for checking repeated indexes (0,3) and (0,3)

@alamb alamb changed the title Add interleave kernel (#1523) Add interleave kernel (#1523) Oct 12, 2022
return Err(ArrowError::InvalidArgumentError(
"interleave requires input of at least one array".to_string(),
));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also return an error for single array case and suggest to use compute::take?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vacillated a bit on this, I think given the concat kernel which makes even less sense to be called on a single array, doesn't error in this case - I would be inclined to leave it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is nicer to support handling a single input (mostly as a convenience for downstream users so they don't have to special case their code for len() = 1 case)

@tustvold
Copy link
Contributor Author

Parquet failure is unrelated

@tustvold tustvold merged commit fa1d079 into apache:master Oct 13, 2022
@ursabot
Copy link

ursabot commented Oct 13, 2022

Benchmark runs are scheduled for baseline = 8adebca and contender = fa1d079. fa1d079 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

take kernel that works across multiple RecordBatches
4 participants