Forecasting incrementally with patsy #126

Open
spillz opened this issue Jun 14, 2018 · 1 comment

spillz commented Jun 14, 2018

This is either a feature request or a request for help with current functionality. I am working with unbalanced panel data and using patsy to forecast some series. Here's a basic example:

import io, pandas, patsy

# Raw panel data indexed on (ID, YEAR). Y is the forecast variable of interest.
# There are no gaps in the data for an individual entity, but the panel is
# potentially unbalanced (meaning different start/end dates).
data = '''ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
'''
data = io.StringIO(data)
df = pandas.read_csv(data, index_col=['ID','YEAR'], na_values=['.'])
print(df)

def lag(series, n=1):
    return series.groupby(level=0).shift(n)

formula = '1+lag(Y)+B+C+D' #This is the forecast equation for Y
x = patsy.dmatrix(formula,df, return_type='dataframe')
params = pandas.Series([1,2,3,4,5], index=x.columns) #these are the coefficients on the forecast vars

#Now forecast year by year
for yr in range(1997,2010):
    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe').reindex(df.index) 
    x = x.loc[ind]
    df.loc[ind, 'Y'] = df.loc[ind, 'Y'].fillna(x@params)
    print('================')
    print(yr)
    print(df)

Note that producing the entire forecast requires calling dmatrix over and over. Calling dmatrix on the entire DataFrame repeatedly is quite inefficient, but because the forecast formula can contain an arbitrary number of lags, I can't just pass in a df filtered to the current year (or to a fixed number of lags before the current year). What would be ideal is if I could replace

    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe').reindex(df.index) 
    x = x.loc[ind]

with a version of dmatrix that takes a boolean row mask and evaluates and returns only the rows that are needed:

    ind = df.index.get_level_values('YEAR')==yr
    x = patsy.dmatrix(formula,df, return_type='dataframe', rows=ind) #proposed rows argument: evaluates only on rows where ind==True and returns a dataframe with only those rows

I thought incr_dbuilder might be able to handle this, but it seems to expect each chunk to be completely separate from the previous chunks. That won't work in the time series/panel context, where lagged values cross chunk boundaries.
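One possible workaround for the example above, offered as a sketch and not as an existing patsy feature: build the design matrix once with an `NAAction` that treats nothing as missing, so every row of `df` is retained, and then refresh only the column that depends on Y inside the loop. This assumes you know which design columns depend on the forecast variable (here only `'lag(Y)'`); that knowledge is hard-coded below and would need to be tracked separately for an arbitrary formula.

```python
import io

import pandas as pd
import patsy

csv = '''ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
'''
df = pd.read_csv(io.StringIO(csv), index_col=['ID', 'YEAR'], na_values=['.'])


def lag(series, n=1):
    # Shift within each entity (index level 0) so lags never cross IDs.
    return series.groupby(level=0).shift(n)


formula = '1+lag(Y)+B+C+D'

# NA_types=[] means no value counts as missing, so no rows are dropped
# and the design matrix stays aligned with df row for row.
keep_na = patsy.NAAction(NA_types=[])
x_all = patsy.dmatrix(formula, df, NA_action=keep_na, return_type='dataframe')
params = pd.Series([1, 2, 3, 4, 5], index=x_all.columns)

years = df.index.get_level_values('YEAR')
for yr in range(1997, 2010):
    ind = years == yr
    if not ind.any():
        continue
    # Assumption: 'lag(Y)' is the only column that depends on Y, so it is
    # the only one that needs refreshing after earlier years are filled in.
    x_all.loc[ind, 'lag(Y)'] = lag(df['Y']).to_numpy()[ind]
    df.loc[ind, 'Y'] = df.loc[ind, 'Y'].fillna(x_all.loc[ind] @ params)

print(df)
```

The formula is parsed and the static columns are evaluated only once; each iteration costs one grouped shift on a single Series. Whether that is fast enough for a real panel has not been measured here.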

@MatthewGerber

@spillz Did you find a solution to your problem? I've run into a similar issue. In my case, I am calling dmatrix repeatedly (e.g., tens of thousands of times), passing a different DataFrame each time. The DataFrame is small (e.g., 4 rows), but the repeated calls are quite slow. See the attached call graph from profiling.

[Attached image: Screen Shot 2020-12-23 at 8 31 09 AM, profiling call graph]
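For the many-small-DataFrames case, one thing that may be worth trying is to build the `DesignInfo` once and reuse it through `patsy.build_design_matrices`, the function patsy documents for applying an existing design to new data. The sketch below assumes the same formula is used on every call, which the comment above does not state; the formula, column names, and data are hypothetical stand-ins. It has not been benchmarked against the profile shown.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical formula and columns, standing in for the real ones.
formula = '1 + B + C'

# Build the design once from a small representative frame.
template = pd.DataFrame({'B': [0.0, 1.0], 'C': [0.0, 1.0]})
design_info = patsy.dmatrix(formula, template).design_info

# Many small frames, as in the scenario described above.
rng = np.random.default_rng(0)
frames = [
    pd.DataFrame(rng.normal(size=(4, 2)), columns=['B', 'C'])
    for _ in range(1000)
]

# Reuse the stored design for each frame.
matrices = [
    patsy.build_design_matrices([design_info], frame)[0]
    for frame in frames
]
```

Any stateful transforms in the formula would be fixed at whatever the template frame produced, so this only matches repeated `dmatrix` calls when the formula has no such transforms or when fixing them is what you want.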
