[FEA] Make cudf.pandas not perform redundant CPU<->GPU transfers if there is no in-place write operations #15670

Open
galipremsagar opened this issue May 6, 2024 · 0 comments
Labels
cudf.pandas (Issues specific to cudf.pandas), feature request (New feature or request)

Comments

galipremsagar commented May 6, 2024

Is your feature request related to a problem? Please describe.
In cudf.pandas, we currently move the whole dataframe from CPU to GPU (or vice versa) whenever a step has to run on the other device. We can avoid repeating these transfers by keeping a copy of the dataframe in both memories, so that no time is spent on CPU<->GPU transfers as long as there are no in-place write operations on the frame.

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd

In [3]: df = pd.read_parquet(
   ...:     "nyc_parking_violations_2022.parquet",
   ...:     columns=["Registration State", "Violation Description", "Vehicle Body Type", "Issue Date", "Summons Number"]
   ...: )

In [4]: %time df.count(axis=0)
CPU times: user 1.41 ms, sys: 4.35 ms, total: 5.75 ms
Wall time: 5.15 ms
Out[4]: 
Registration State       15435607
Violation Description    15435607
Vehicle Body Type        15435607
Issue Date               15435607
Summons Number           15435607
dtype: int64

In [5]: %time df.count(axis=1)
CPU times: user 15.7 s, sys: 1.85 s, total: 17.5 s
Wall time: 16.8 s
Out[5]: 
0           5
1           5
2           5
3           5
4           5
           ..
15435602    5
15435603    5
15435604    5
15435605    5
15435606    5
Length: 15435607, dtype: int64

In [6]: %time df.count(axis=0)
CPU times: user 24 s, sys: 2.43 s, total: 26.4 s
Wall time: 25.3 s
Out[6]: 
Registration State       15435607
Violation Description    15435607
Vehicle Body Type        15435607
Issue Date               15435607
Summons Number           15435607
dtype: int64

In [7]: %time df.count(axis=0)
CPU times: user 0 ns, sys: 3.08 ms, total: 3.08 ms
Wall time: 2.75 ms
Out[7]: 
Registration State       15435607
Violation Description    15435607
Vehicle Body Type        15435607
Issue Date               15435607
Summons Number           15435607
dtype: int64

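Notice that df.count(axis=0) in cell 6 takes 25.3 s of wall time, compared with 5.15 ms in cell 4 and 2.75 ms in cell 7, because the dataframe has to be moved from CPU to GPU first. We can avoid this transfer.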

Describe the solution you'd like
Maintain two identical copies of the dataframe: one on the GPU and one on the CPU.
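As a rough illustration of the idea (not the cudf.pandas implementation), the sketch below simulates device copies with plain Python dicts and counts how often a "transfer" happens. The names `DualCopyFrame`, `read`, `write`, and `run_session` are hypothetical, and the sketch assumes, as the timings above suggest, that `df.count(axis=1)` ran on the CPU. With `keep_both=False` it mimics the current single-copy behavior; with `keep_both=True` both copies stay alive until an in-place write makes one of them stale.

```python
import copy


class DualCopyFrame:
    """Hypothetical proxy that tracks which devices hold a valid copy."""

    def __init__(self, data, device="gpu", keep_both=True):
        self._copies = {device: data}  # device name -> data living there
        self._keep_both = keep_both
        self.transfers = 0             # number of simulated CPU<->GPU copies

    def _get(self, device):
        if device not in self._copies:
            source = next(iter(self._copies.values()))
            # deepcopy stands in for a real CPU<->GPU transfer
            self._copies[device] = copy.deepcopy(source)
            self.transfers += 1
            if not self._keep_both:
                # Single-copy mode: the source copy is dropped after the move
                self._copies = {device: self._copies[device]}
        return self._copies[device]

    def read(self, device, func):
        """Run a read-only operation on the given device."""
        return func(self._get(device))

    def write(self, device, func):
        """Run an in-place operation; the other device's copy becomes stale."""
        data = self._get(device)
        func(data)
        self._copies = {device: data}


def count_cols(d):  # stands in for df.count(axis=0)
    return {k: sum(v is not None for v in vals) for k, vals in d.items()}


def count_rows(d):  # stands in for df.count(axis=1)
    return [sum(v is not None for v in row) for row in zip(*d.values())]


def run_session(keep_both):
    """Replay cells 4-7 and return the number of simulated transfers."""
    frame = DualCopyFrame({"a": [1, None, 3], "b": [4, 5, 6]},
                          device="gpu", keep_both=keep_both)
    frame.read("gpu", count_cols)  # cell 4
    frame.read("cpu", count_rows)  # cell 5 (assumed to run on the CPU)
    frame.read("gpu", count_cols)  # cell 6
    frame.read("gpu", count_cols)  # cell 7
    return frame.transfers


print(run_session(keep_both=False))  # 2: GPU->CPU in cell 5, CPU->GPU in cell 6
print(run_session(keep_both=True))   # 1: only GPU->CPU in cell 5
```

The trade-off in this sketch is memory: keeping both copies doubles the footprint of every frame that has been used on both devices, and any in-place write has to invalidate the other copy so that a later read on that device triggers a fresh transfer.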

galipremsagar added the feature request label on May 6, 2024
galipremsagar added this to the Proxying - cudf.pandas milestone on May 6, 2024
galipremsagar added the cudf.pandas label on May 6, 2024
Projects
Status: In Progress