Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DictionaryArray in temporal kernels #2623

Merged
merged 7 commits into from Sep 7, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Sep 1, 2022

Which issue does this PR close?

Closes #2622.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 1, 2022
Comment on lines +33 to +42
$iter.into_iter().for_each(|value| {
if let Some(value) = value {
match $using(value) {
Some(dt) => $builder.append_value($convert(dt.$extract_fn())),
None => $builder.append_null(),
}
} else {
$builder.append_null();
}
}
})
Copy link
Member Author

@viirya viirya Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change basically rewrites the macro using ArrayAccessor API. Otherwise the logic is the same.

Previously the macro can call value_as_datetime or value_as_datetime_with_tz APIs on temporal array, but now we need to call as_datetime on the value instead.

@viirya
Copy link
Member Author

viirya commented Sep 2, 2022

cc @sunchao

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could keep the signature the same, but provide a more generic version with a suffix, e.g. _generic. Otherwise this is an ergonomic step-back for the case of a non-dictionary encoded array, which is likely by far the common case

|h| h
)
} else {
// No timezone available. Calling `to_string` on the datatime value simply.
extract_component_from_array!(array, builder, to_string, value_as_datetime, |h| h)
let iter = ArrayIter::new(array);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this ArrayIter needed? The macro calls into_iter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea

$builder.append_null();
} else {
match $array.$using(i) {
($iter:ident, $builder:ident, $extract_fn:ident, $using:expr, $convert:expr) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a really hard time following this macro, I really wonder if we can't replace it with generics or at the very least simplify it... Something for a future PR I suspect

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. It'd be better to improve the readability here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I will find some time trying to simplify it. 😅

pub fn year<T, A: ArrayAccessor<Item = T::Native>>(array: A) -> Result<Int32Array>
where
T: ArrowTemporalType + ArrowNumericType,
T::Native: ArrowNativeType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be implied by the definition of ArrowPrimitiveType

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

{
match array.data_type().clone() {
DataType::Dictionary(_, value_type) => {
num_days_from_monday_internal::<T, A>(array, value_type.as_ref().clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this clone really necessary? Can't num_days_from_monday_internal take &DataType

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, &DataType is okay. Removed clone.

@@ -516,7 +832,7 @@ mod tests {
let a: PrimitiveArray<Date64Type> =
vec![Some(1514764800000), None, Some(1550636625000)].into();

let b = hour(&a).unwrap();
let b = hour::<Date64Type, _>(&a).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loss of automatic type hinting is a bit unfortunate, it feels like a step-back ergonomically for the common case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored.

@viirya
Copy link
Member Author

viirya commented Sep 2, 2022

Perhaps we could keep the signature the same, but provide a more generic version with a suffix, e.g. _generic. Otherwise this is an ergonomic step-back for the case of a non-dictionary encoded array, which is likely by far the common case

Yea, I was hesitating between adding _generic version and replacing existing kernel names. Ideally I'd like to provide the kernels of same names but are generic version. But as you seen, it comes along with the loss of automatic type hinting in the end.

I will add _generic suffix to these new kernels and add back the original ones with same signature.

@viirya
Copy link
Member Author

viirya commented Sep 4, 2022

Any more comments? @sunchao @tustvold

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK. Do you know whether arrow-rs follows Gregorian calendar or Proleptic Gregorian calendar in terms of handling timestamp prior to 1582-10-15?

$builder.append_null();
} else {
match $array.$using(i) {
($iter:ident, $builder:ident, $extract_fn:ident, $using:expr, $convert:expr) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. It'd be better to improve the readability here.

}

/// Extracts the hours of a given temporal array as an array of integers
pub fn hour_generic<T, A: ArrayAccessor<Item = T::Native>>(array: A) -> Result<Int32Array>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need this hour_generic method? I feel we just need hour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #2623 (comment). I began with a generic version of hour to replace existing one. But it will be loss of automatic type hinting as we always need to specify at least T when calling the generic hour. Seems a step back ergonomically for the common case.

So currently I provide a generic version and keep original hour signature untouched.

builder,
to_string,
|value| as_datetime::<T>(
<i64 as From<<T as ArrowPrimitiveType>::Native>>::from(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can just use <i64 as From<_>>::from(value) and let type inference do the work

@@ -171,335 +172,747 @@ pub fn using_chrono_tz_and_utc_naive_date_time(
.ok()
}

/// Extracts the hours of a given temporal array as an array of integers
/// Extracts the hours of a given temporal primitive array as an array of integers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add comments explaining what exactly are the returned integers, are they within range [0, 24)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, [0, 23].

}

Ok(b.finish())
}

/// Extracts the quarter of a given temporal array as an array of integers
/// Extracts the quarter of a given temporal primitive array as an array of integers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: what do the returned integers represent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the quarter within the range of [1, 4].

@viirya
Copy link
Member Author

viirya commented Sep 6, 2022

Looks OK. Do you know whether arrow-rs follows Gregorian calendar or Proleptic Gregorian calendar in terms of handling timestamp prior to 1582-10-15?

It is not explicitly mentioned in this crate. But as datetime operation in arrow-rs uses chrono crate which follows ISO 8601. I think it follows Proleptic Gregorian calendar.

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except minor nit on documentation

}

Ok(b.finish())
}

/// Extracts the month of a given temporal array as an array of integers
/// Extracts the month of a given temporal primitive array as an array of integers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is the returned integers 0-based or 1-based? might add a comment. We should follow the methods like num_days_from_monday which have some nice documentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's 1-based. I will add a comment.

@viirya viirya merged commit c8bf1ca into apache:master Sep 7, 2022
@viirya
Copy link
Member Author

viirya commented Sep 7, 2022

Thanks for review.

@ursabot
Copy link

ursabot commented Sep 7, 2022

Benchmark runs are scheduled for baseline = d73d78f and contender = c8bf1ca. c8bf1ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support DictionaryArray in temporal kernels
4 participants