Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpolate based on other column #9616

Closed
MarcoGorelli opened this issue Jun 29, 2023 · 10 comments · Fixed by #16313
Closed

Interpolate based on other column #9616

MarcoGorelli opened this issue Jun 29, 2023 · 10 comments · Fixed by #16313
Labels
A-timeseries Area: date/time functionality accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Jun 29, 2023

Problem description

Based on https://stackoverflow.com/questions/76557773/interpolate-based-on-datetimes

Say we start with

df = pl.DataFrame(
    {
        "ts": [1, 2.01, 2.00333, 4.2],
        "value": [1, None, None, 3],
    }
)

and we want to interpolate the missing values. The missing values are at very irregular intervals, so there's no prospect of using upsample

From https://stackoverflow.com/a/76564046/4451315 and https://stackoverflow.com/a/76564321/4451315, it's possible to use numpy's interp, or scipy's interpolate.interp1d

It might be nice to do this without numpy/scipy though, e.g. with one of the following:

  • df.interpolate(by='ts')
  • df.interpolate_by('ts')
  • df.select(pl.col('value').interpolate_by('ts'))

pandas equivalent:

In [23]: df = pd.DataFrame(
    ...:     {
    ...:         "ts": [1, 2.01, 2.00333, 4.2],
    ...:         "value": [1, None, None, 3],
    ...:     }
    ...: )

In [24]: df.set_index('ts').interpolate(method='index')
Out[24]:
            value
ts
1.00000  1.000000
2.01000  1.631250
2.00333  1.627081
4.20000  3.000000
@MarcoGorelli MarcoGorelli added the enhancement New feature or an improvement of an existing feature label Jun 29, 2023
@MarcoGorelli MarcoGorelli added the A-timeseries Area: date/time functionality label Sep 5, 2023
@wouter-in2facts
Copy link

@deanm0000
Copy link
Collaborator

I wrote this function that I think should work to interpolate on a df with any number of value columns, id columns, and a ts column. The id column is optional.

def interp(df, y_col, id_cols=None):
    if not isinstance(y_col, str):
        raise ValueError("y_col should be string")
    if isinstance(id_cols, str):
        id_cols=[id_cols]
    if id_cols is None:
        id_cols=['__dummyid']
        df=df.with_columns(__dummyid=0)
    lf=df.select(id_cols + [y_col]).lazy()
    value_cols=[x for x in df.columns if x not in id_cols and x!=y_col]
    for value_col in value_cols:
        lf=lf.join(
            df.join_asof(
                df.filter(pl.col(value_col).is_not_null())
                .select(
                    *id_cols, y_col,
                    __value_slope=(pl.col(value_col)-pl.col(value_col).shift().over(id_cols))/(pl.col(y_col)-pl.col(y_col).shift().over(id_cols)), 
                    __value_slope_since=pl.col(y_col).shift(),
                    __value_base=pl.col(value_col).shift()
                    ),
                on=y_col, by=id_cols, strategy='forward'
            )
            .select(
                id_cols+ [y_col] + [pl.coalesce(pl.col(value_col), 
                    pl.coalesce(pl.col('__value_base'), pl.col('__value_base').shift(-1))+
                    pl.coalesce(pl.col('__value_slope'), pl.col('__value_slope').shift(-1))*(pl.col(y_col)-
                    pl.coalesce(pl.col('__value_slope_since'), pl.col('__value_slope_since').shift(-1)))).alias(value_col)]
                )
            .lazy(),
            on=[y_col]+id_cols
            )
    if id_cols[0]=='__dummyid':
        lf=lf.select(pl.exclude('__dummyid'))
    return lf.collect()

The usage is just

interp(df, 'ts')
shape: (4, 2)
┌─────────┬──────────┐
│ ts      ┆ value    │
│ ---     ┆ ---      │
│ f64     ┆ f64      │
╞═════════╪══════════╡
│ 1.0     ┆ 1.0      │
│ 2.01    ┆ 1.63125  │
│ 2.00333 ┆ 1.627081 │
│ 4.2     ┆ 3.0      │
└─────────┴──────────┘

@MarcoGorelli MarcoGorelli added the accepted Ready for implementation label Jan 26, 2024
@MarcoGorelli
Copy link
Collaborator Author

Accepted, but it should be interpolate_by

@veylonni
Copy link

Big +1 for this feature, it's the only one missing in polars for my daily tasks.
I must add that in pandas, in order to interpolate one DataFrame df on the "clock" of another one df2 (with possibly more lines), you must :

  1. Set the index to the "by" column
  2. Reindex by adding the "clock" of df2 in df, adding Null values
  3. Interpolate, filling out the Null values
  4. Another reindex to keep only the "clock" of df2
  5. Reset the index
df = pd.DataFrame(
    {
        "ts": [1, 2.01, 2.00333, 4.2],
        "value": [1, None, None, 3],
    }
)

df2 = pd.DataFrame(
    {
        "ts": [0.0, 0.8, 1.0, 2.0, 3.0, 3.5, 4.0]
    }
)

df = df.set_index("ts")
df = (
    df.reindex(
        df.index.union(df2["ts"])
    )
    .interpolate("index")
    .reindex(df2["ts"])
    .reset_index()
)

print(df)

This is quite cumbersome, hard to remember and hard to read. This does not need to be more complicated than :

df = df.interpolate(by=df2["ts"])

What do you think ?

@MarcoGorelli
Copy link
Collaborator Author

Agree!

I've been wanting to do this for ages but other higher-prio issues keep coming up 😄 Thanks for bringing my attention back to it

@angusl-gr
Copy link

angusl-gr commented Apr 15, 2024

This would be really useful for me too, as would the ability to perform the interpolation within groups according to some column(s), via a by parameter or something similar.

@MarcoGorelli
Copy link
Collaborator Author

Alright, coming to a Polars near you once I finish cleaning it all up and glueing all things together

image

@angusl-gr
Copy link

This looks perfect, thanks Marco! Any idea how close it might be to making it into a release?

@MarcoGorelli
Copy link
Collaborator Author

when I get my head out of #16102 😉 I need to resolve some things before 1.0 so I've parked this, hopefully not for too long

@MaxPotters
Copy link

Oh yes! Exactly what I need as well. Can't wait

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-timeseries Area: date/time functionality accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants