Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Heterogeneous Renumbering in the C++ API #4412

Open
Tracked by #4299
alexbarghi-nv opened this issue May 10, 2024 · 3 comments
Open
Tracked by #4299

Support Heterogeneous Renumbering in the C++ API #4412

alexbarghi-nv opened this issue May 10, 2024 · 3 comments
Assignees
Labels
feature request New feature or request
Milestone

Comments

@alexbarghi-nv
Copy link
Member

We currently support heterogeneous renumbering in the C++ sampling API. This issue is for extending that to heterogeneous graphs. Fundamentally, the process is the same with a few differences:

Input:

  • A heterogeneous graph can have multiple node types
    • BUT we can always assume they will be renumbered from 0 to N by cuGraph-PyG and cuGraph-DGL
    • We can also assume that each node type has a starting and ending offset (i.e. type 0 is nodes 0-30, type 1 is nodes 31-50, etc.) and that the user must provide this in sampling.
  • A heterogeneous graph can have multiple edge types
    • AND each edge type has its own numbering from 0 to #edges_of_type (cuGraph-PyG/cuGraph-DGL don't renumber these)
    • So from the libcugraph perspective, there may be multiple edges with id 0, but there can only be one id 0 per edge type
  • For each edge type, there is a single input node type and output node type
    • i.e. (author, writes, paper) has "author" as the input node type and "paper" as the output node type.
    • The user will be required to provide a mapping of edge type id to src/dst node type id.

Output:

  • Every edge type and every node type has its own renumbering for each output minibatch
    • i.e. "paper" has node ids from 0 to #papers_in_minibatch - 1 and "author" has node ids from 0 to #authors_in_minibatch - 1
    • These node numberings are consistent across edge types. We only renumber the nodes once, globally, or per hop in the DGL case.
    • This obviously means that each node type needs a renumber map.
  • Each edge type corresponds to a separate subgraph in the output minibatch.
    • libcugraph needs to output a list of offsets for where each edge type graph starts and ends
    • If there is no output for a particular edge type, it should still be noted (i.e. the offsets [0,30,30,70] means that edge type 1 had no output edges in the current minibatch.
    • Each edge type needs a renumber map. These renumber maps are created the same way they are for nodes.
    • In the case of a separate graph outputted per hop (DGL), every hop would still need a separate renumbering, just as it is now.
  • We also use the user-provided node offsets array to subtract off the starting offset for each node type in each node's renumber map

Other Notes:

  • We are probably only going to expose this through pylibcugraph for the time being.
  • We should make sure the node offsets requirement is well-documented since it requires an extra renumbering step that cuGraph-DGL/cuGraph-PyG does (even though this is trivial and already implemented by both frameworks)
  • For MG, the node offsets and edge mapping are global. Each worker will get the exact same values for each array. These should be respected globally when doing renumbering. No concatenation should occur. I suppose we could have an expensive check that verifies these arrays are identical across workers.
@seunghwak
Copy link
Contributor

We currently support heterogeneous renumbering in the C++ sampling API.

Do we? What does heterogenous sampling mean for non-heterogeneous graph?

A heterogeneous graph can have multiple node types
BUT we can always assume they will be renumbered from 0 to N by cuGraph-PyG and cuGraph-DGL
We can also assume that each node type has a starting and ending offset (i.e. type 0 is nodes 0-30, type 1 is nodes 31-50, etc.) and that the user must provide this in sampling.

So, we assume external vertex IDs are in [0, N) where N is # nodes (i.e. vertices). cuGraph internally renumbers vertex IDs, so we may keep nodes types as node property values to quickly query vertex types. Does this align with your understanding?

Every edge type and every node type has its own renumbering for each output minibatch
i.e. "paper" has node ids from 0 to #papers_in_minibatch - 1 and "author" has node ids from 0 to #authors_in_minibatch - 1
These node numberings are consistent across edge types. We only renumber the nodes once, globally, or per hop in the DGL case.
This obviously means that each node type needs a renumber map.

Are we renumbering just node (vertex) IDs or renumbering both node IDs and edge IDs?
Renumbering for different batches are independent (as before), correct?
We have separate node (vertex) renumber map per node type & batch, correct?
"We only renumber the nodes once, globally, or per hop" means within batch, right?
"per hop in the DGL case", this does not mean we have separate renumber map per hop, right? We currently have compress_per_hop option but does not renumber per hop (but we consider hop # a vertex first appears in renumbering).

Each edge type needs a renumber map. These renumber maps are created the same way they are for nodes.

What do you mean by this? This doesn't mean node renumber map, correct?

In the case of a separate graph outputted per hop (DGL), every hop would still need a separate renumbering, just as it is now.

We currently do not renumber separately per hop, but considers hop number in generating node renumbering map.

@seunghwak
Copy link
Contributor

Each edge type corresponds to a separate subgraph in the output minibatch.

And I am not sure how our compress_per_hop option should work with this.

If there are multiple node types, major ranges won't be well separated unless we create separate CSR outputs per major node type.

@alexbarghi-nv
Copy link
Member Author

@seunghwak we can start out just supporting COO for now and figure out CSR later. But I believe we would have a separate CSR per edge type.

Every edge type has a source node type and a destination node type. All edges of a type have the same source/destination type. So the CSR for an edge type only ever has one source node type and one destination node type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants