Interpolate based on other column #9616

MarcoGorelli · 2023-06-29T09:28:01Z

Problem description

Based on https://stackoverflow.com/questions/76557773/interpolate-based-on-datetimes

Say we start with

import polars as pl

df = pl.DataFrame(
    {
        "ts": [1, 2.01, 2.00333, 4.2],
        "value": [1, None, None, 3],
    }
)

and we want to interpolate the missing values. The missing values are at very irregular intervals, so there's no prospect of using upsample.

From https://stackoverflow.com/a/76564046/4451315 and https://stackoverflow.com/a/76564321/4451315, it's possible to use numpy's interp, or scipy's interpolate.interp1d.

It might be nice to do this without numpy/scipy though, e.g. with one of the following:

df.interpolate(by='ts')
df.interpolate_by('ts')
df.select(pl.col('value').interpolate_by('ts'))

pandas equivalent:

In [23]: df = pd.DataFrame(
    ...:     {
    ...:         "ts": [1, 2.01, 2.00333, 4.2],
    ...:         "value": [1, None, None, 3],
    ...:     }
    ...: )

In [24]: df.set_index('ts').interpolate(method='index')
Out[24]:
            value
ts
1.00000  1.000000
2.01000  1.631250
2.00333  1.627081
4.20000  3.000000
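To make explicit what the output above computes, here is a minimal pure-Python sketch of linear interpolation along another column. The helper name interpolate_by_sketch is hypothetical and is not a Polars or pandas API; leaving values outside the known range as None is an assumption of this sketch.

```python
def interpolate_by_sketch(values, by):
    """Fill None entries of `values` by linear interpolation along `by`.

    Hypothetical helper for illustration only, not a library API.
    """
    # Known (by, value) points, sorted by the `by` coordinate.
    known = sorted((b, v) for b, v in zip(by, values) if v is not None)
    out = []
    for b, v in zip(by, values):
        if v is not None:
            out.append(float(v))
            continue
        # Nearest known points on each side of the missing row.
        left = [p for p in known if p[0] <= b]
        right = [p for p in known if p[0] >= b]
        if not left or not right:
            # Assumption: no extrapolation outside the known range.
            out.append(None)
            continue
        (x0, y0), (x1, y1) = left[-1], right[0]
        if x1 == x0:
            out.append(float(y0))
        else:
            out.append(y0 + (y1 - y0) * (b - x0) / (x1 - x0))
    return out


ts = [1, 2.01, 2.00333, 4.2]
value = [1, None, None, 3]
result = interpolate_by_sketch(value, ts)
print(result)
```

The two missing rows are filled from the straight line through (1, 1) and (4.2, 3), so the row order of ts does not matter, only its values.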


wouter-in2facts · 2023-09-13T18:53:22Z

Supporting this request; I have a similar challenge:
https://stackoverflow.com/questions/77099610/polars-fill-null-using-rule-of-three-based-of-filtered-set

deanm0000 · 2023-09-13T21:45:32Z

I wrote this function, which I think should interpolate a df with any number of value columns, optional id columns, and a ts column. The ts column name is passed as y_col.

def interp(df, y_col, id_cols=None):
    if not isinstance(y_col, str):
        raise ValueError("y_col should be string")
    if isinstance(id_cols, str):
        id_cols=[id_cols]
    if id_cols is None:
        id_cols=['__dummyid']
        df=df.with_columns(__dummyid=0)
    lf=df.select(id_cols + [y_col]).lazy()
    value_cols=[x for x in df.columns if x not in id_cols and x!=y_col]
    for value_col in value_cols:
        lf=lf.join(
            df.join_asof(
                df.filter(pl.col(value_col).is_not_null())
                .select(
                    *id_cols, y_col,
                    __value_slope=(pl.col(value_col)-pl.col(value_col).shift().over(id_cols))/(pl.col(y_col)-pl.col(y_col).shift().over(id_cols)), 
                    __value_slope_since=pl.col(y_col).shift(),
                    __value_base=pl.col(value_col).shift()
                    ),
                on=y_col, by=id_cols, strategy='forward'
            )
            .select(
                id_cols+ [y_col] + [pl.coalesce(pl.col(value_col), 
                    pl.coalesce(pl.col('__value_base'), pl.col('__value_base').shift(-1))+
                    pl.coalesce(pl.col('__value_slope'), pl.col('__value_slope').shift(-1))*(pl.col(y_col)-
                    pl.coalesce(pl.col('__value_slope_since'), pl.col('__value_slope_since').shift(-1)))).alias(value_col)]
                )
            .lazy(),
            on=[y_col]+id_cols
            )
    if id_cols[0]=='__dummyid':
        lf=lf.select(pl.exclude('__dummyid'))
    return lf.collect()

The usage is just

interp(df, 'ts')
shape: (4, 2)
┌─────────┬──────────┐
│ ts      ┆ value    │
│ ---     ┆ ---      │
│ f64     ┆ f64      │
╞═════════╪══════════╡
│ 1.0     ┆ 1.0      │
│ 2.01    ┆ 1.63125  │
│ 2.00333 ┆ 1.627081 │
│ 4.2     ┆ 3.0      │
└─────────┴──────────┘

MarcoGorelli · 2024-01-26T14:15:30Z

Accepted, but it should be interpolate_by

veylonni · 2024-03-28T15:34:28Z

Big +1 for this feature, it's the only one missing in polars for my daily tasks.
I should add that in pandas, to interpolate one DataFrame df onto the "clock" of another DataFrame df2 (possibly with more rows), you must:

1. Set the index to the "by" column
2. Reindex to add the "clock" of df2 to df, which introduces null values
3. Interpolate, filling in the null values
4. Reindex again to keep only the "clock" of df2
5. Reset the index

import pandas as pd

df = pd.DataFrame(
    {
        "ts": [1, 2.01, 2.00333, 4.2],
        "value": [1, None, None, 3],
    }
)

df2 = pd.DataFrame(
    {
        "ts": [0.0, 0.8, 1.0, 2.0, 3.0, 3.5, 4.0]
    }
)

df = df.set_index("ts")
df = (
    df.reindex(
        df.index.union(df2["ts"])
    )
    .interpolate("index")
    .reindex(df2["ts"])
    .reset_index()
)

print(df)

This is quite cumbersome, hard to remember, and hard to read. It does not need to be more complicated than:

df = df.interpolate(by=df2["ts"])

What do you think?
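For reference, the computation behind "interpolate onto another clock" can be sketched in pure Python as below. The function name interpolate_onto is hypothetical, and returning None for clock points outside the known range is an assumption of this sketch rather than a statement about what pandas or Polars do at the edges.

```python
from bisect import bisect_left


def interpolate_onto(xs, ys, new_xs):
    """Evaluate the piecewise-linear curve through the known (x, y) points at new_xs.

    Hypothetical helper for illustration only, not a library API.
    """
    # Keep only rows with a known value, sorted by x.
    known = sorted((x, y) for x, y in zip(xs, ys) if y is not None)
    kx = [x for x, _ in known]
    ky = [float(y) for _, y in known]
    out = []
    for x in new_xs:
        # Assumption: points outside the known range stay missing.
        if not kx or x < kx[0] or x > kx[-1]:
            out.append(None)
            continue
        i = bisect_left(kx, x)  # first index with kx[i] >= x
        if kx[i] == x:
            out.append(ky[i])
        else:
            x0, x1 = kx[i - 1], kx[i]
            out.append(ky[i - 1] + (ky[i] - ky[i - 1]) * (x - x0) / (x1 - x0))
    return out


ts = [1, 2.01, 2.00333, 4.2]
value = [1, None, None, 3]
clock = [0.0, 0.8, 1.0, 2.0, 3.0, 3.5, 4.0]
on_clock = interpolate_onto(ts, value, clock)
print(on_clock)
```

The five pandas steps collapse into one lookup per clock point, which is roughly what a by-style argument would hide from the user.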

MarcoGorelli · 2024-03-28T15:46:30Z

Agree!

I've been wanting to do this for ages but other higher-prio issues keep coming up 😄 Thanks for bringing my attention back to it

angusl-gr · 2024-04-15T09:54:53Z

This would be really useful for me too, as would the ability to perform the interpolation within groups according to some column(s), via a by parameter or something similar.

MarcoGorelli · 2024-04-21T16:19:24Z

Alright, coming to a Polars near you once I finish cleaning it all up and glueing all things together

angusl-gr · 2024-05-09T16:49:54Z

This looks perfect, thanks Marco! Any idea how close it might be to making it into a release?

MarcoGorelli · 2024-05-09T18:22:17Z

when I get my head out of #16102 😉 I need to resolve some things before 1.0 so I've parked this, hopefully not for too long

MaxPotters · 2024-05-15T07:00:11Z

Oh yes! Exactly what I need as well. Can't wait

MarcoGorelli added the enhancement New feature or an improvement of an existing feature label Jun 29, 2023

MarcoGorelli added the A-timeseries Area: date/time functionality label Sep 5, 2023

MarcoGorelli mentioned this issue Dec 15, 2023

Resample using the nearest value #5495

Closed

MarcoGorelli added the accepted Ready for implementation label Jan 26, 2024

MarcoGorelli mentioned this issue May 18, 2024

feat: add Expr.interpolate_by #16313

Merged

ritchie46 closed this as completed in #16313 May 22, 2024

MKDJr mentioned this issue Jun 6, 2024

Interpolate based on other Float64 column #16794

Open
