Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Expr.interpolate_by #16313

Merged
merged 11 commits into from
May 22, 2024
Merged

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented May 18, 2024

closes #9616

This implements linear interpolation based on another column

It doesn't not assume the data to be sorted by by (i.e. it already sorts on the user's behalf)

Follow-ups:

  • allow for missing values in by column (this needs doing for rolling_*_by too)
  • support 'nearest' interpolation too

Copy link

codecov bot commented May 18, 2024

Codecov Report

Attention: Patch coverage is 97.77778% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 81.39%. Comparing base (7cf55e0) to head (07d3b41).
Report is 25 commits behind head on main.

Current head 07d3b41 differs from pull request most recent head 7d0cfd3

Please upload reports for the commit 7d0cfd3 to get more accurate results.

Files Patch % Lines
...ops/src/series/ops/interpolation/interpolate_by.rs 98.68% 3 Missing ⚠️
crates/polars-plan/src/dsl/function_expr/mod.rs 50.00% 2 Missing ⚠️
py-polars/src/lazyframe/visitor/expr_nodes.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16313      +/-   ##
==========================================
+ Coverage   81.37%   81.39%   +0.02%     
==========================================
  Files        1403     1405       +2     
  Lines      183729   183996     +267     
  Branches     2954     2954              
==========================================
+ Hits       149501   149763     +262     
- Misses      33717    33722       +5     
  Partials      511      511              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MarcoGorelli MarcoGorelli force-pushed the interpolate-by branch 3 times, most recently from 0a85fdf to c47b24b Compare May 19, 2024 13:26
Comment on lines -6 to -7
#[cfg(feature = "interpolate")]
mod interpolate;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm moving this out of chunked_array and into series, as the public function operates on (and returns) a Series

@MarcoGorelli MarcoGorelli changed the title wip: feat: add Expr.interpolate_by feat: add Expr.interpolate_by May 19, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 19, 2024
@ritchie46
Copy link
Member

It doesn't not assume the data to be sorted by by (i.e. it already sorts on the user's behalf)

Does data need to be sorted? We could interpolate any arbitrary function right? 🤔

@MarcoGorelli
Copy link
Collaborator Author

Does data need to be sorted? We could interpolate any arbitrary function right? 🤔

For linear interpolation based on another column, I'd say it needs to be sorted - if you consider the image I posted here, then you'd expect to get those particular interpolated points, regardless of the order you presented the data in

All I meant was that, as with ewm_mean_by (and, hopefully, rolling_*_by #16249), if the algorithm requires sorted data, then it takes care of it on behalf of the user

@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 20, 2024 07:51
@MarcoGorelli MarcoGorelli marked this pull request as draft May 20, 2024 07:54
@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 20, 2024 07:59
@ritchie46
Copy link
Member

Oh, yes. Make sense. The by column is in this case the x in f(x). I understand. 👌

}
}

pub fn interpolate_by(s: &Series, by: &Series, assume_sorted: bool) -> PolarsResult<Series> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter can be removed and we can use the is_sorted check.


// Fill out with first.
let mut out = Vec::with_capacity(chunked_arr.len());
let mut iter = chunked_arr.into_iter().enumerate().skip(first);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunked_arr.iter is faster as it doesn't box the allocator.


let mut low_val = None;
let mut low_idx = None;
loop {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loop match next() -> None -> break can be written as:

while let Some(val) = iter.next() {


}

low_idx = Some(idx);
},
Some((_idx, None)) => {
loop {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while let ...

let mut validity = MutableBitmap::with_capacity(chunked_arr.len());
validity.extend_constant(chunked_arr.len(), true);

for i in 0..first {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do these without bound checks if we can have cases where first is large.

interpolation_branch(
low_val.expect("We started iterating at `first`"),
high,
&by_values[low_idx.unwrap()..=high_idx],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be unchecked, it is in the hot loop.

let last = ca_sorted.last_non_null().unwrap() + 1;

let mut out = zeroed_vec(ca_sorted.len());
let mut iter = ca_sorted.into_iter().enumerate().skip(first);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca.iter() is the faster variant.


let mut low_val = None;
let mut low_idx = None;
loop {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while let..

let mut out = zeroed_vec(ca_sorted.len());
let mut iter = ca_sorted.into_iter().enumerate().skip(first);

let mut low_val = None;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

validity.extend_constant(ca_sorted.len(), true);

for i in 0..first {
let out_idx = unsafe { sorting_indices.get_unchecked(i) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same..

@MarcoGorelli MarcoGorelli marked this pull request as draft May 20, 2024 14:59
@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 22, 2024 09:39
@ritchie46 ritchie46 merged commit 5429aba into pola-rs:main May 22, 2024
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Interpolate based on other column
2 participants