You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We currently support heterogeneous renumbering in the C++ sampling API. This issue is for extending that to heterogeneous graphs. Fundamentally, the process is the same with a few differences:
Input:
A heterogeneous graph can have multiple node types
BUT we can always assume they will be renumbered from 0 to N by cuGraph-PyG and cuGraph-DGL
We can also assume that each node type has a starting and ending offset (i.e. type 0 is nodes 0-30, type 1 is nodes 31-50, etc.) and that the user must provide this in sampling.
A heterogeneous graph can have multiple edge types
AND each edge type has its own numbering from 0 to #edges_of_type (cuGraph-PyG/cuGraph-DGL don't renumber these)
So from the libcugraph perspective, there may be multiple edges with id 0, but there can only be one id 0 per edge type
For each edge type, there is a single input node type and output node type
i.e. (author, writes, paper) has "author" as the input node type and "paper" as the output node type.
The user will be required to provide a mapping of edge type id to src/dst node type id.
Output:
Every edge type and every node type has its own renumbering for each output minibatch
i.e. "paper" has node ids from 0 to #papers_in_minibatch - 1 and "author" has node ids from 0 to #authors_in_minibatch - 1
These node numberings are consistent across edge types. We only renumber the nodes once, globally, or per hop in the DGL case.
This obviously means that each node type needs a renumber map.
Each edge type corresponds to a separate subgraph in the output minibatch.
libcugraph needs to output a list of offsets for where each edge type graph starts and ends
If there is no output for a particular edge type, it should still be noted (i.e. the offsets [0,30,30,70] means that edge type 1 had no output edges in the current minibatch.
Each edge type needs a renumber map. These renumber maps are created the same way they are for nodes.
In the case of a separate graph outputted per hop (DGL), every hop would still need a separate renumbering, just as it is now.
We also use the user-provided node offsets array to subtract off the starting offset for each node type in each node's renumber map
Other Notes:
We are probably only going to expose this through pylibcugraph for the time being.
We should make sure the node offsets requirement is well-documented since it requires an extra renumbering step that cuGraph-DGL/cuGraph-PyG does (even though this is trivial and already implemented by both frameworks)
For MG, the node offsets and edge mapping are global. Each worker will get the exact same values for each array. These should be respected globally when doing renumbering. No concatenation should occur. I suppose we could have an expensive check that verifies these arrays are identical across workers.
The text was updated successfully, but these errors were encountered:
We currently support heterogeneous renumbering in the C++ sampling API.
Do we? What does heterogenous sampling mean for non-heterogeneous graph?
A heterogeneous graph can have multiple node types
BUT we can always assume they will be renumbered from 0 to N by cuGraph-PyG and cuGraph-DGL
We can also assume that each node type has a starting and ending offset (i.e. type 0 is nodes 0-30, type 1 is nodes 31-50, etc.) and that the user must provide this in sampling.
So, we assume external vertex IDs are in [0, N) where N is # nodes (i.e. vertices). cuGraph internally renumbers vertex IDs, so we may keep nodes types as node property values to quickly query vertex types. Does this align with your understanding?
Every edge type and every node type has its own renumbering for each output minibatch
i.e. "paper" has node ids from 0 to #papers_in_minibatch - 1 and "author" has node ids from 0 to #authors_in_minibatch - 1
These node numberings are consistent across edge types. We only renumber the nodes once, globally, or per hop in the DGL case.
This obviously means that each node type needs a renumber map.
Are we renumbering just node (vertex) IDs or renumbering both node IDs and edge IDs?
Renumbering for different batches are independent (as before), correct?
We have separate node (vertex) renumber map per node type & batch, correct?
"We only renumber the nodes once, globally, or per hop" means within batch, right?
"per hop in the DGL case", this does not mean we have separate renumber map per hop, right? We currently have compress_per_hop option but does not renumber per hop (but we consider hop # a vertex first appears in renumbering).
Each edge type needs a renumber map. These renumber maps are created the same way they are for nodes.
What do you mean by this? This doesn't mean node renumber map, correct?
In the case of a separate graph outputted per hop (DGL), every hop would still need a separate renumbering, just as it is now.
We currently do not renumber separately per hop, but considers hop number in generating node renumbering map.
@seunghwak we can start out just supporting COO for now and figure out CSR later. But I believe we would have a separate CSR per edge type.
Every edge type has a source node type and a destination node type. All edges of a type have the same source/destination type. So the CSR for an edge type only ever has one source node type and one destination node type.
We currently support heterogeneous renumbering in the C++ sampling API. This issue is for extending that to heterogeneous graphs. Fundamentally, the process is the same with a few differences:
Input:
Output:
Other Notes:
The text was updated successfully, but these errors were encountered: