Improve the speed of from_dataframe with a MultiIndex (by 40x!) #4184

Merged
merged 13 commits into pydata:master on Jul 2, 2020

@shoyer shoyer commented Jun 26, 2020

Before:

pandas.MultiIndexSeries.time_to_xarray
======= ========= ==========
--             subset
------- --------------------
dtype     True     False
======= ========= ==========
  int    505±0ms   37.1±0ms
 float   485±0ms   38.3±0ms
======= ========= ==========

After:

pandas.MultiIndexSeries.time_to_xarray
======= ============ ==========
--               subset
------- -----------------------
dtype      True       False
======= ============ ==========
  int    10.7±0.4ms   22.6±1ms
 float   10.0±0.8ms   21.1±1ms
======= ============ ==========

There are still some cases where we have to fall back to the existing
slow implementation, but hopefully they should now be relatively rare.
Edit: now we always use the new implementation

Fixes pydata#2459
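
For readers wondering where a speedup of this size can come from, here is a minimal sketch of one vectorized way to turn a MultiIndex Series into a dense array: use the integer codes of each index level as positions and scatter all values in a single assignment. This is an illustration under assumptions, not necessarily the code merged in this PR; the example data and the variable names (`series`, `dense`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a Series on a two-level MultiIndex with one
# (x, y) combination missing, loosely mirroring the "subset" benchmark case.
index = pd.MultiIndex.from_product([["a", "b"], [0, 1, 2]], names=["x", "y"])
series = pd.Series(np.arange(6.0), index=index).iloc[[0, 1, 2, 3, 5]]

# Drop level values that no longer appear, so the output shape matches the data.
idx = series.index.remove_unused_levels()

# One integer code array per level: each row's position along that dimension.
codes = tuple(idx.codes)
shape = tuple(len(level) for level in idx.levels)

# Pre-fill with NaN for missing combinations, then scatter every value
# in one vectorized assignment instead of reindexing row by row.
dense = np.full(shape, np.nan)
dense[codes] = series.to_numpy()

print(dense)
```

Filling with NaN only works directly for float data; integer data would need either a cast to float or a different fill value, which is a design choice this sketch leaves out.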

Before:

    pandas.MultiIndexSeries.time_to_xarray
    ======= ========= ==========
    --             subset
    ------- --------------------
    dtype     True     False
    ======= ========= ==========
      int    505±0ms   37.1±0ms
     float   485±0ms   38.3±0ms
    ======= ========= ==========

After:

    pandas.MultiIndexSeries.time_to_xarray
    ======= ========= ==========
    --             subset
    ------- --------------------
    dtype     True     False
    ======= ========= ==========
      int    11.5±0ms   39.2±0ms
     float   12.5±0ms   26.6±0ms
    ======= ========= ==========

There are still some cases where we have to fall back to the existing
slow implementation, but hopefully they should now be relatively rare.
@shoyer shoyer changed the title from "Improve the speed of from_dataframe with a MultiIndex" to "Improve the speed of from_dataframe with a MultiIndex (by 40x!)" on Jun 27, 2020

@dcherian dcherian left a comment

LGTM. Thanks @shoyer

@dcherian dcherian mentioned this pull request Jun 29, 2020
23 tasks

shoyer commented Jul 1, 2020

I added a handful of extra commits if anyone wants to take another look.

If not, I'll just merge this in a day or two

@dcherian dcherian left a comment

Looks great to me.

xarray/core/dataset.py (resolved)
@shoyer shoyer merged commit 03d409e into pydata:master Jul 2, 2020
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 16, 2020
* master:
  Add initial cupy tests (pydata#4214)
  Add 0.16.0 release summary
  New whatsnew section
  Release v0.16.0
  Minor reorg of whatsnew for 0.16.0 (pydata#4216)
  fix sphinx warnings (pydata#4199)
  pin isort (pydata#4206)
  get the colorbar label via public methods (pydata#4201)
  Bump minimum versions for 0.16 release (pydata#4175)
  Allow passing axis kwargs to plot (pydata#4020)
  Fix to_unstacked_dataset for single dimension variables. (pydata#4094)
  Improve the speed of from_dataframe with a MultiIndex (by 40x!) (pydata#4184)
  More pint compatibility: silence UnitStrippedWarnings (pydata#4163)
  Fix typo (pydata#4192)
  use the latest image of RTD (pydata#4191)