
Distributed Sampling in cuGraph-PyG #4384

Open · wants to merge 112 commits into base: branch-24.06
Conversation

@alexbarghi-nv alexbarghi-nv commented May 1, 2024

Implements distributed sampling in cuGraph-PyG, and renames the existing API to make clear that it is Dask-based.
Adds a dependency on tensordict for cuGraph-PyG, which backs the new TensorDictFeatureStore.
Also stops installing torch-cluster and torch-spline-conv in CI, since importing them causes an ImportError and neither package is needed for testing.

Requires PyG 2.5. Should be merged after #4335

Merge after #4355

Closes #4248
Closes #4249
Closes #3383
Closes #3942
Closes #3836
Closes #4202
Closes #4051
Closes #4326
Closes #4252
Partially addresses #3805

seunghwak and others added 30 commits April 2, 2024 17:28
alexbarghi-nv (Member Author):

The content of this file was migrated to graph_sage_mg.py

alexbarghi-nv (Member Author):

CSR is the fastest compression based on benchmarking
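For context, CSR (compressed sparse row) stores a graph as an array of row offsets plus an array of column indices. A minimal, library-free sketch of building CSR from an edge list (the tiny 3-vertex graph below is a hypothetical illustration, not data from this PR):

```python
# Build CSR (row offsets + column indices) from a small edge list.
# Edges of a 3-vertex directed graph: 0->1, 0->2, 2->1
edges = [(0, 1), (0, 2), (2, 1)]
num_vertices = 3

# Count the out-degree of each source vertex.
degree = [0] * num_vertices
for src, _ in edges:
    degree[src] += 1

# Prefix-sum the degrees into row offsets.
offsets = [0]
for d in degree:
    offsets.append(offsets[-1] + d)

# Fill the column indices bucket by bucket.
cols = [0] * len(edges)
fill = offsets[:-1].copy()
for src, dst in sorted(edges):
    cols[fill[src]] = dst
    fill[src] += 1

print(offsets, cols)  # → [0, 2, 2, 3] [1, 2, 1]
```

The neighbors of vertex `v` are then `cols[offsets[v]:offsets[v + 1]]`, which is why row-wise (per-source) traversal during sampling is cheap with this layout.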

alexbarghi-nv (Member Author):

The MNMG example will be added along with the new WG feature store, and will use the new WG feature store to avoid replicating data across processes.

@alexbarghi-nv alexbarghi-nv marked this pull request as ready for review May 17, 2024 20:31
@alexbarghi-nv alexbarghi-nv requested review from a team as code owners May 17, 2024 20:31
@tingyu66 (Member) left a comment:

Approved with some comments and questions. The overall logic looks reasonable, and I trust you to match the detailed interface of PyG's native GraphStore and FeatureStore.

Review thread on python/cugraph-pyg/cugraph_pyg/examples/gcn_dist_snmg.py (outdated, resolved):
    # Take the max local count across all ranks to get the global count.
    sz = torch.tensor(num_vertices[vtype], device="cuda")
    torch.distributed.all_reduce(sz, op=torch.distributed.ReduceOp.MAX)
    num_vertices[vtype] = int(sz)
return num_vertices
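The snippet above uses `torch.distributed.all_reduce` with `ReduceOp.MAX`. As a library-free illustration of the reduction it performs (the helper name `reduce_num_vertices` and the sample counts below are hypothetical, not part of this PR): each rank may see only a subset of a vertex type, so the global count per type is the maximum of the local counts across ranks.

```python
# Hypothetical stand-in for the MAX all-reduce: given each rank's local
# per-type vertex counts, the global count is the per-type maximum.
def reduce_num_vertices(local_counts_per_rank):
    vtypes = set().union(*(counts.keys() for counts in local_counts_per_rank))
    return {
        vt: max(counts.get(vt, 0) for counts in local_counts_per_rank)
        for vt in vtypes
    }

# Two ranks that each saw a different slice of the graph:
global_counts = reduce_num_vertices([{"paper": 10}, {"paper": 12, "author": 3}])
```

In the real code every rank runs the all-reduce collectively and ends up holding the same `num_vertices`; this sketch only shows the math, not the communication.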
Member:

Should we cache num_vertices and num_edges since they require collective communications?

alexbarghi-nv (Member Author):

We cache the offsets, which are used more frequently. I might add caching for the vertex and edge counts if they turn out to be used often, so maybe in the future.
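The caching pattern discussed here (compute once via a collective, then reuse the result) can be sketched without torch. The class and names below are hypothetical illustrations, not cuGraph-PyG API:

```python
# Hedged sketch: cache the result of an expensive "collective" call so
# repeated property reads do not re-trigger communication.
class GraphStoreSketch:
    def __init__(self, fetch_counts):
        # fetch_counts stands in for the collective that computes counts.
        self._fetch_counts = fetch_counts
        self._num_vertices = None  # populated on first access

    @property
    def num_vertices(self):
        if self._num_vertices is None:
            self._num_vertices = self._fetch_counts()
        return self._num_vertices

calls = []
def expensive_collective():
    calls.append(1)  # record that "communication" happened
    return {"paper": 12}

store = GraphStoreSketch(expensive_collective)
store.num_vertices
store.num_vertices  # second read hits the cache; no extra call
```

One caveat with caching collective results: every rank must take the same code path, since a rank that skips the collective while others enter it would deadlock. Caching after the first collective call avoids that as long as the first access is itself collective.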

@BradReesWork (Member):
/merge

@alexbarghi-nv (Member Author):
CI failed with a bus error again, I think due to OOM. I just pushed a commit that skips that test.
