Forecasting incrementally with patsy #126

spillz · 2018-06-14T21:25:27Z

This is either a feature request or a request for help with current functionality. I am working with unbalanced panel data and using patsy to forecast some series. Here's a basic example:

import io, pandas, patsy

# Raw panel data indexed on (ID, YEAR). Y is the variable to forecast. Each entity's series has no gaps, but the panel may be unbalanced (entities can have different start and end dates).
data = '''ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
'''
data = io.StringIO(data)
df = pandas.read_csv(data, index_col=['ID','YEAR'], na_values=['.'])
print(df)

def lag(series, n=1):
    return series.groupby(level=0).shift(n)

formula = '1+lag(Y)+B+C+D' #This is the forecast equation for Y
x = patsy.dmatrix(formula,df, return_type='dataframe')
params = pandas.Series([1,2,3,4,5], index=x.columns) #these are the coefficients on the forecast vars

#Now forecast year by year
for yr in range(1997,2010):
    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe').reindex(df.index) 
    x = x.loc[ind]
    df.loc[ind, 'Y'] = df.loc[ind, 'Y'].fillna(x@params)
    print('================')
    print(yr)
    print(df)

Note that to produce the entire forecast we need to call dmatrix over and over. The problem that I'm having is that it is quite inefficient to have to call dmatrix on the entire DataFrame repeatedly, but because the forecast formula can contain arbitrary numbers of lags I can't just pass in a df filtered to the current year (or a set number of lags from the current year). What would be ideal is if I could replace

    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe').reindex(df.index) 
    x = x.loc[ind]

with a version of dmatrix that takes a boolean row mask and evaluates and returns only the rows that are needed:

    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula, df, return_type='dataframe', rows=ind)  # proposed `rows` argument (does not exist today): evaluate only where ind is True and return just those rows

I thought incr_dbuilder might be able to handle this, but it seems to expect each chunk to be completely independent of the previous chunks. That won't work in a time series/panel context, where lagged terms need rows from earlier periods.
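
One possible workaround, sketched here under stated assumptions rather than offered as a patsy feature: parse the formula once, then for each year rebuild the design matrix on a small window of rows with `patsy.build_design_matrices`. The sketch assumes that lags are non-negative, that each entity's series has no gaps (as in the example above), and that it is acceptable for `lag` to record the lag depths it is called with. The names `seen_lags`, `keep_na`, and `max_lag` are illustrative only. `patsy.NAAction(NA_types=[])` is used so that rows with a missing lag are kept as NaN instead of being dropped.

```python
import io

import pandas as pd
import patsy

csv = """ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
"""
df = pd.read_csv(io.StringIO(csv), index_col=["ID", "YEAR"], na_values=["."])

seen_lags = set()  # illustrative helper state: lag depths the formula uses


def lag(series, n=1):
    seen_lags.add(n)
    return series.groupby(level=0).shift(n)


formula = "1 + lag(Y) + B + C + D"
keep_na = patsy.NAAction(NA_types=[])  # treat NaN as data, so no rows are dropped

# One full pass: parses the formula and records which lags it uses.
design_info = patsy.dmatrix(formula, df, NA_action=keep_na).design_info
max_lag = max(seen_lags)  # assumption: only non-negative lags appear
params = pd.Series([1, 2, 3, 4, 5], index=design_info.column_names)

years = df.index.get_level_values("YEAR")
for yr in range(1997, 2003):
    in_year = years == yr
    if not in_year.any():
        continue
    # Only rows from the last max_lag years are needed to evaluate the lags.
    window = df[(years >= yr - max_lag) & (years <= yr)]
    (x,) = patsy.build_design_matrices(
        [design_info], window, NA_action=keep_na, return_type="dataframe"
    )
    x = x[x.index.get_level_values("YEAR") == yr]
    df.loc[in_year, "Y"] = df.loc[in_year, "Y"].fillna(x @ params)

print(df)
```

This still evaluates the formula once per year, but on a window whose size depends on the deepest lag, not on the length of the panel. Whether that is fast enough in practice would need profiling on real data.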

MatthewGerber · 2020-12-23T13:44:59Z

@spillz Did you find a solution to your problem? I've run into a similar issue. In my case, I am calling dmatrix repeatedly (e.g., tens of thousands of times), passing a different DataFrame each time. The DataFrame is small (e.g., 4 rows), but the repeated calls are quite slow. See the attached call graph from profiling.
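
For the many-small-DataFrames case, one thing that may help (a sketch, not a measured fix, since the profile is not reproduced here) is to build the `DesignInfo` once and reuse it through `patsy.build_design_matrices`, so the formula is not re-parsed on every call. The formula, column names, and `template` frame below are made up for illustration. Note that any categorical levels or stateful transforms would be memorized from the frame used to build the `DesignInfo`, so that frame needs to be representative.

```python
import pandas as pd
import patsy

formula = "1 + a + b"  # illustrative formula

# Build the design information once from a representative frame.
template = pd.DataFrame({"a": [0.0, 1.0], "b": [0.0, 1.0]})
design_info = patsy.dmatrix(formula, template).design_info

# Many small frames with the same columns (stand-ins for the real data).
frames = [
    pd.DataFrame({"a": [float(i), i + 1.0], "b": [2.0 * i, 3.0]})
    for i in range(3)
]

# Reuse the parsed design for each frame instead of calling dmatrix again.
matrices = [
    patsy.build_design_matrices([design_info], f, return_type="dataframe")[0]
    for f in frames
]

print(matrices[2])
```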
