Allow to reference data from index (of pandas dataframes) #3331

Open
fhg-isi opened this issue Feb 14, 2024 · 3 comments

@fhg-isi

fhg-isi commented Feb 14, 2024

"By design Altair only accesses dataframe columns, not dataframe indices":
https://altair-viz.github.io/user_guide/data.html#including-index-data

Please consider supporting indexed pandas dataframes in a future Altair version. Also see:

https://stackoverflow.com/questions/77993730/how-to-use-indexed-data-frames-with-altair/

@binste
Contributor

binste commented Feb 14, 2024

Thanks for the suggestion. I see that it would be easier not to have to write .reset_index(). The challenge is that Altair would need to call .reset_index() internally for every Pandas dataframe to make the index accessible. In the many cases where the index is not needed for the chart, this would add unnecessary data to the Vega-Lite specification; see Altair Internals.

For this to work, Altair would need to know when the index is used and when it isn't so it can call .reset_index in only those cases. Altair can only do this with the help of VegaFusion, see #2428 for details on why.

In short, I think making this work adds a lot of complexity, and it would only work with VegaFusion, which is an additional dependency. I'll leave this open in case I'm missing something.
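
To make the size concern concrete, here is a minimal sketch using only pandas. It assumes the data is serialized as a list of records, which only approximates how a dataframe ends up embedded in a Vega-Lite specification; the frame, its size, and the names "value" and "label" are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with a meaningful (non-default) index.
df = pd.DataFrame(
    {"value": range(1000)},
    index=pd.Index([f"row_{i}" for i in range(1000)], name="label"),
)

# Serialized records without the index: one field per row.
without_index = df.to_json(orient="records")

# After reset_index() the index becomes an extra "label" field in every row.
with_index = df.reset_index().to_json(orient="records")

# The second payload is larger, even if no chart encoding ever uses "label".
print(len(without_index), len(with_index))
```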

@fhg-isi
Author

fhg-isi commented Feb 14, 2024

  • Could the index be accessed without resetting the dataframe (e.g. df.index.values)?
  • Could the type of the index be checked? If it's not a RangeIndex, treat it as an explicit index?
  • Could the reset be called only in the cases that require it, e.g. when an expression like "$index" is used?
  • Provide an extra method or option for indexed dataframes (use_index=True)?
  • Allow the index as a type for mapping, e.g. x = df.index?

Also see

https://stackoverflow.com/questions/20084487/use-index-in-pandas-to-plot-data

import pandas as pd

df = pd.DataFrame(
    [
        {'id_foo': 1, 'energy_carrier': 'oil', '2000': 5, '2020': 10},
        {'id_foo': 2, 'energy_carrier': 'electricity', '2000': 10, '2020': 20},
    ]
)

print(type(df.index))   # <class 'pandas.core.indexes.range.RangeIndex'>

indexed_df = df.pivot_table(
    columns='energy_carrier',
    values=['2000', '2020'],
    aggfunc='sum',
)

print(type(indexed_df.index))  # <class 'pandas.core.indexes.base.Index'>

df.set_index('id_foo', inplace=True)

# <class 'pandas.core.indexes.numeric.Int64Index'> on pandas < 2.0;
# pandas >= 2.0 removed Int64Index and reports a plain Index with int64 dtype.
print(type(df.index))

@binste
Contributor

binste commented Feb 14, 2024

It's always good to get input on how other people use the library. Where do you see the downside of simply doing alt.Chart(df.reset_index()).encode(x="index")? It's a few more characters (.reset_index()) to type, so if it were easy to get rid of them, I'd agree that it would be good to do so, but I don't think it is.
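
One detail worth knowing when applying this workaround: the column that reset_index() creates is only called "index" when the index has no name. A small sketch with made-up data:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"value": [5, 10]}, index=["oil", "electricity"])

# Unnamed index: reset_index() creates a column called "index".
print(list(df.reset_index().columns))  # ['index', 'value']

# Named index: the new column takes the index name instead.
named = df.rename_axis("energy_carrier")
print(list(named.reset_index().columns))  # ['energy_carrier', 'value']
```

With a named index, the encoding would reference that name (here x="energy_carrier") rather than x="index".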

  • Could the index be accessed without resetting the dataframe (e.g. df.index.values)?

We need the index as a proper column in the dataframe, because Altair then needs to convert the Pandas dataframe to JSON (via a dictionary representation with df.to_dict()).
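
A minimal sketch of why the index has to become a column, assuming a record-oriented conversion (orient="records" is an assumption here; the comment above only mentions df.to_dict()). The data is made up:

```python
import pandas as pd

df = pd.DataFrame({"value": [5, 10]}, index=["oil", "electricity"])

# Record-oriented output only contains columns; the index labels are lost.
records = df.to_dict(orient="records")
print(records)  # [{'value': 5}, {'value': 10}]

# After reset_index() the labels travel along as a regular field.
records_with_index = df.reset_index().to_dict(orient="records")
print(records_with_index)
# [{'index': 'oil', 'value': 5}, {'index': 'electricity', 'value': 10}]
```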

  • Could the type of the index be checked? If it's not a RangeIndex, treat it as an explicit index?

Yep, we could use this: call .reset_index() internally when the index is not a RangeIndex. I think this is the best approach so far. We'd need to think through whether this has any unintended side effects. Does Pandas copy the whole dataframe when doing .reset_index()? That might use too much memory in some cases and also slow down chart creation.
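
A rough sketch of what such a check could look like. The helper name expose_index is hypothetical and not part of Altair; it only illustrates the idea under discussion.

```python
import pandas as pd


def expose_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn a non-default index into regular columns."""
    if isinstance(df.index, pd.RangeIndex):
        # Default positional index: nothing worth exposing.
        return df
    # reset_index() returns a new dataframe; the caller's df is left unchanged.
    return df.reset_index()


plain = pd.DataFrame({"value": [5, 10]})
labelled = plain.set_index(pd.Index(["oil", "electricity"], name="energy_carrier"))

print(list(expose_index(plain).columns))     # ['value']
print(list(expose_index(labelled).columns))  # ['energy_carrier', 'value']
```

One side effect to keep in mind: a dataframe filtered with a boolean mask no longer has a RangeIndex, so this check would add an "index" column there as well, even though the user never set an index explicitly.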

  • Could the reset be called only in the cases that require it, e.g. when an expression like "$index" is used?

It's very tricky to parse all expressions as they can appear in many places. Right now, this requires VegaFusion as mentioned above.

  • Provide an extra method or option for indexed dataframes (use_index=True)?

We could do that, but it feels easier to just let the user call .reset_index(), which is about the same number of characters to type.

  • Allow the index as a type for mapping, e.g. x = df.index?

Same reason as with the first suggestion: we need it as a column in the dataset. This is a requirement for generating the Vega-Lite specification (JSON).
