Avoid pandas constructors in dask.dataframe.core #9570
Conversation
That's a very cool solution to this problem -- and mirrors what @pentschev did for NEP37.
Thanks @rjzamora. Just left a few quick comments while passing by; I'll take a more detailed look soon.
```diff
@@ -7150,7 +7159,7 @@ def cov_corr_combine(data_in, corr=False):
     return out


-def cov_corr_agg(data, cols, min_periods=2, corr=False, scalar=False):
+def cov_corr_agg(data, cols, min_periods=2, corr=False, scalar=False, like_df=None):
```
What is `data` here? Can we get the DataFrame type from that?
`data` is `list[dict[str, np/cupy.ndarray]]` here. Therefore, we would need to extend `serial_frame_constructor` to handle array-like data if we want to avoid the `like_df` argument.
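For context, here is a minimal sketch of what a `like`-based frame-constructor helper could look like. This is an illustration only, not the PR's actual implementation; the real helper also needs to cover more input types.

```python
# Hypothetical sketch: the name matches the PR, but the body is an
# assumption based on the pandas ``_constructor`` protocol, which
# cudf objects also implement.
import pandas as pd


def serial_frame_constructor(like=None):
    """Return the DataFrame constructor matching ``like`` (default: pandas)."""
    if like is None:
        return pd.DataFrame
    # Dask collections carry a serial ``_meta`` object; unwrap it first.
    like = getattr(like, "_meta", like)
    if hasattr(like, "to_frame"):
        # A Series-like input maps to its corresponding frame type.
        like = like.to_frame()
    return getattr(like, "_constructor", pd.DataFrame)
```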
1ba0353 adds a `serial_constructor_from_array` dispatch (but doesn't actually use it yet) to illustrate what it would probably look like.
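For readers unfamiliar with dask's dispatch machinery, the commit presumably sketches something along these lines. `dask.utils.Dispatch` is real; the specific registrations below are assumptions for illustration, not the commit's verified contents.

```python
# Rough illustration of an array-based constructor dispatch.
import numpy as np
import pandas as pd

from dask.utils import Dispatch

serial_constructor_from_array = Dispatch("serial_constructor_from_array")


@serial_constructor_from_array.register(np.ndarray)
def _numpy_constructor(arr):
    # Plain numpy arrays map to pandas constructors.
    return pd.DataFrame


@serial_constructor_from_array.register_lazy("cupy")
def _register_cupy():
    # Only registered the first time a cupy object is dispatched.
    import cupy

    @serial_constructor_from_array.register(cupy.ndarray)
    def _cupy_constructor(arr):
        import cudf

        return cudf.DataFrame
```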
```python
@pytest.mark.gpu
def test_cov_gpu():
```
If the backend PR goes in before this PR, could we just reuse `test_cov` with the backend engine parametrized?
Perhaps. We would need to change the `_compat` dataframe-creation machinery to use dispatchable functions to do this.
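Reusing `test_cov` would presumably look something like the sketch below once the `_compat` helpers are dispatchable. The `dataframe.backend` config key follows the backend PR's direction, but the fixture shape here is an assumption and the test body is elided.

```python
# Sketch of a backend-parametrized test; an illustration of the idea
# being discussed, not code from this PR.
import pytest

import dask


@pytest.mark.parametrize(
    "backend",
    ["pandas", pytest.param("cudf", marks=pytest.mark.gpu)],
)
def test_cov(backend):
    with dask.config.set({"dataframe.backend": backend}):
        ...  # shared test body built on dispatchable _compat helpers
```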
```python
@pytest.mark.gpu
def test_corr_gpu():
```
Similar comment here.
I wonder if there's any way to add a CI check to ensure that future invocations of frame-like constructors go through this path. Nothing lightweight springs to mind, but I imagine it would be fairly easy for me or others to accidentally introduce a pandas constructor when one of these utilities would be more appropriate.
Yea, I also struggled to come up with a clever solution to this problem (beyond the systematic expansion and maintenance of gpuci). One challenge is that …
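One lightweight option (purely a suggestion, not something in this PR) would be a "ratchet"-style test that counts direct pandas constructor calls in `core.py` and fails when new ones appear:

```python
# Hypothetical ratchet check: fail CI if the number of direct pandas
# constructor calls in core.py grows past the current allowed count.
import pathlib
import re

ALLOWED_CALLS = 0  # lower this as remaining call sites are converted


def test_no_new_pandas_constructors():
    source = pathlib.Path("dask/dataframe/core.py").read_text()
    hits = re.findall(r"\bpd\.(?:DataFrame|Series)\(", source)
    assert len(hits) <= ALLOWED_CALLS, (
        f"found {len(hits)} direct pandas constructor calls; "
        "prefer the serial_*_constructor utilities"
    )
```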
Circling back here, @rjzamora: are there any additional changes you'd like to include? It'd be nice to get this into tomorrow's release if feasible.
```diff
@@ -7775,7 +7782,7 @@ def to_datetime(arg, meta=None, **kwargs):
 def to_timedelta(arg, unit=None, errors="raise"):
     if not PANDAS_GT_110 and unit is None:
         unit = "ns"
-    meta = pd.Series([pd.Timedelta(1, unit=unit)])
+    meta = meta_series_constructor(arg)([pd.Timedelta(1, unit=unit)])
```
(Not a blocking comment, just meant for other reviewers.) Noting that this will fail for some inputs (e.g. `numpy.ndarray`s) that are supported by `pandas.to_timedelta`. However, it looks like things already fail with `dd.to_timedelta` for such inputs on `main`, and we arguably get a better error message with the changes in this PR.

On `main`:
```python
In [1]: import numpy as np

In [2]: import dask.dataframe as dd

In [3]: dd.to_timedelta(np.arange(5), unit='s')
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 dd.to_timedelta(np.arange(5), unit='s')

File ~/projects/dask/dask/dask/dataframe/core.py:7779, in to_timedelta(arg, unit, errors)
   7777     unit = "ns"
   7778 meta = pd.Series([pd.Timedelta(1, unit=unit)])
-> 7779 return map_partitions(pd.to_timedelta, arg, unit=unit, errors=errors, meta=meta)

File ~/projects/dask/dask/dask/dataframe/core.py:6668, in map_partitions(func, meta, enforce_metadata, transform_divisions, align_dataframes, *args, **kwargs)
   6665 if collections:
   6666     simple = False
-> 6668 divisions = _get_divisions_map_partitions(
   6669     align_dataframes, transform_divisions, dfs, func, args, kwargs
   6670 )
   6672 if has_keyword(func, "partition_info"):
   6673     partition_info = {
   6674         (i,): {"number": i, "division": division}
   6675         for i, division in enumerate(divisions[:-1])
   6676     }

File ~/projects/dask/dask/dask/dataframe/core.py:6711, in _get_divisions_map_partitions(align_dataframes, transform_divisions, dfs, func, args, kwargs)
   6707 """
   6708 Helper to get divisions for map_partitions and map_overlap output.
   6709 """
   6710 if align_dataframes:
-> 6711     divisions = dfs[0].divisions
   6712 else:
   6713     # Unaligned, dfs is a mix of 1 partition and 1+ partition dataframes,
   6714     # use longest divisions found
   6715     divisions = max((d.divisions for d in dfs), key=len)

IndexError: list index out of range
```
With this PR:
```python
In [4]: dd.to_timedelta(np.arange(5), unit='s')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 dd.to_timedelta(np.arange(5), unit='s')

File ~/projects/dask/dask/dask/dataframe/core.py:7785, in to_timedelta(arg, unit, errors)
   7783 if not PANDAS_GT_110 and unit is None:
   7784     unit = "ns"
-> 7785 meta = meta_series_constructor(arg)([pd.Timedelta(1, unit=unit)])
   7786 return map_partitions(pd.to_timedelta, arg, unit=unit, errors=errors, meta=meta)

File ~/projects/dask/dask/dask/dataframe/utils.py:782, in meta_series_constructor(like)
    780     return like.to_frame()._constructor_sliced
    781 else:
--> 782     raise TypeError(f"{type(like)} not supported by meta_series_constructor")

TypeError: <class 'numpy.ndarray'> not supported by meta_series_constructor
```
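If `numpy.ndarray` support were wanted later, one could imagine `meta_series_constructor` growing a branch like the sketch below. This is an assumption for illustration: only the `to_frame()` branch and the `TypeError` are visible in the traceback above.

```python
# Hedged sketch: the to_frame() branch and the TypeError mirror the
# traceback above; the ndarray branch is hypothetical.
import numpy as np
import pandas as pd


def meta_series_constructor(like):
    like = getattr(like, "_meta", like)  # unwrap a Dask collection
    if isinstance(like, np.ndarray):
        return pd.Series  # a cupy branch could map to cudf.Series
    if hasattr(like, "_constructor_sliced"):
        return like._constructor_sliced  # DataFrame-like input
    if hasattr(like, "to_frame"):
        return like.to_frame()._constructor_sliced  # Series-like input
    raise TypeError(f"{type(like)} not supported by meta_series_constructor")
```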
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Thanks @rjzamora!
There are many places in `dask.dataframe` where `pd.DataFrame`/`pd.Series` constructors are used explicitly. This PR proposes the addition of `serial_frame_constructor` and `serial_series_constructor` utilities that take in an optional `like` parameter to determine which `DataFrame`/`Series` constructor to use (i.e. `pandas` or `cudf`). The defaults are `pandas.DataFrame` and `pandas.Series`.

The optional `like` argument is currently expected to be a serial `DataFrame`, serial `Series`, `DataFrame` collection, or `Series` collection. It may also make sense to handle a numpy/cupy array or array collection (along the lines of @quasiben's suggestion in #11889). However, that feature will probably require the addition of a new `array_to_frame` dispatch as well (or something similar).

Passes `pre-commit run --all-files`
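As a usage illustration of the proposed helpers, following the description above (the import path and exact return behavior are assumptions; the PR may differ):

```python
# Assumed usage; the module path below is hypothetical.
import pandas as pd

from dask.dataframe.utils import serial_frame_constructor  # hypothetical path

pdf = pd.DataFrame({"a": [1, 2, 3]})

constructor = serial_frame_constructor(like=pdf)  # returns pandas.DataFrame
assert constructor is pd.DataFrame

default = serial_frame_constructor()  # no ``like``: defaults to pandas
# With a cudf DataFrame (or a Dask collection backed by one) as ``like``,
# the same call would return cudf.DataFrame instead.
```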