
More repeatability questions for cugraph.leiden #4373

Closed
ChuckHastings opened this issue Apr 24, 2024 · 3 comments

@ChuckHastings (Collaborator)

Follow-up from issue #4072.

@ChuckHastings, thank you for working on this! I've just tested with version `24.06.00a43`. The issue is resolved completely when the adjacency matrix is `float64`, but there is still some variability, although much less, when it's `float32` and the graph is large (75,000 nodes). I'm attaching the zipped [npz file](https://github.com/rapidsai/cugraph/files/15093902/adjacency_75000.zip) of the adjacency matrix. Here are the results I'm getting from 10 runs, as before:
Min. modularity (float32): 0.8989160060882568
Mean modularity (float32): 0.8994227409362793
Max. modularity (float32): 0.9014488458633423

Min. partition count (float32): 67
Max. partition count (float32): 68

Pairwise adjusted Rand indices between the 10 runs:

[[1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan 0.8200713715 1.0000000000 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan 0.8200713715 0.8200713715 0.8200713715 1.0000000000 0.8200713715 0.8200713715]
 [         nan          nan          nan          nan 1.0000000000 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan 1.0000000000 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan 0.8200713715 1.0000000000 1.0000000000]
 [         nan          nan          nan          nan          nan          nan          nan 0.8200713715 0.8200713715]
 [         nan          nan          nan          nan          nan          nan          nan          nan 1.0000000000]]

Min. adjusted Rand index (float32): 0.8200713715385887
Mean adjusted Rand index (float32): 0.9360253765470538
Max. adjusted Rand index (float32): 1.0

Since this is probably a different problem than the dendrogram flattening that you fixed, should I open a new issue? Or can you reopen this one?

Originally posted by @jpintar in #4072 (comment)
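
For reference, here is a minimal sketch of the repeatability check described above. It assumes the attached archive unzips to `adjacency_75000.npz` in SciPy sparse format; the file name and variable names are assumptions, not the original test code:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import adjusted_rand_score

import cudf
import cugraph

# Assumption: the attached zip extracts to a SciPy-sparse .npz file.
adj = sp.load_npz("adjacency_75000.npz").tocoo()

edges = cudf.DataFrame({
    "src": adj.row,
    "dst": adj.col,
    # Switch to np.float64 here to compare against the 32-bit behavior.
    "wgt": adj.data.astype(np.float32),
})

labelings, modularities = [], []
for _ in range(10):
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source="src", destination="dst", edge_attr="wgt")
    parts, mod = cugraph.leiden(G)
    labelings.append(parts.sort_values("vertex")["partition"].to_numpy())
    modularities.append(mod)

print("Min/Mean/Max modularity:",
      min(modularities), sum(modularities) / len(modularities), max(modularities))

# Upper-triangular matrix of pairwise ARIs, as in the quoted output.
n = len(labelings)
ari = np.full((n - 1, n - 1), np.nan)
for i in range(n - 1):
    for j in range(i + 1, n):
        ari[i, j - 1] = adjusted_rand_score(labelings[i], labelings[j])
print(ari)
```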

@ChuckHastings (Collaborator, Author)

A quick first observation: I have run this several times. For 64-bit weights I always see everything align (as you indicate in your post). For 32-bit weights I see some variability: sometimes I get matching answers, sometimes I get differences. So this behavior (in 32-bit) is still non-deterministic.

I wonder whether floating point rounding errors are causing different choices in the clustering decisions. I will attempt to isolate why we are seeing the differences.

ChuckHastings self-assigned this Apr 29, 2024
@ChuckHastings (Collaborator, Author)

I have validated that numerical instability is causing the repeatability issues for 32-bit weights.

I would note that if your graph is large enough you will probably see the same issues with 64-bit weights as well, although the graph would need to be quite large.

Our implementation of Leiden/Louvain executes its computations in parallel, so there is no guarantee that floating point numbers will be added in the same order from call to call. If we have a cluster weight on the order of 1e-5 and an edge weight on the order of 1e-11, the order in which the additions occur subtly changes the result. As an example that I saw in this data (a small reproduction sketch follows the list):

  1. In one run, we computed the delta modularity for moving vertex v into cluster c1 as 5.08771e-5, and the delta modularity for moving it into cluster c2 as 5.08771e-5 (with c1 < c2): a tie (the difference between them was exactly 0). Our tie-breaking procedure is to pick the smaller cluster id, so we put vertex v into cluster c1.
  2. In a separate run, the computation of delta modularity for c2 still gives 5.08771e-5, but for c1 it gives 5.08770e-5. The additions must have included one of the larger values before some of the much smaller values, and because we were dealing with a 32-bit float the contribution of the smaller values was lost. In this case the delta modularity for c2 is larger than for c1, so we decide to move vertex v into cluster c2.
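
The effect is easy to reproduce outside cuGraph. Here is a small NumPy sketch showing that the order of float32 additions changes the accumulated value; the magnitudes are taken from the example above, and the count of 1000 tiny terms is an arbitrary illustration:

```python
import numpy as np

big = np.float32(5.08771e-5)                    # cluster-weight-scale term
tiny = np.full(1000, 1e-11, dtype=np.float32)   # edge-weight-scale terms

# Order 1: sum the tiny terms first (they accumulate almost exactly),
# then add the big term once.
s1 = np.float32(0.0)
for t in tiny:
    s1 += t
s1 += big

# Order 2: start from the big term; each tiny addition is now rounded
# against a value several orders of magnitude larger.
s2 = big
for t in tiny:
    s2 += t

print(s1, s2, s1 == s2)   # the two orders disagree in float32
```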

A single flipped decision like this changes the entire sequence of decisions in the greedy Louvain and Leiden algorithms, so we end up with a different (sometimes radically different) clustering... but still a pretty good modularity score.

We can look into changes to address this, but it would definitely be a longer-term research problem.

If repeatability is a priority for you, I would suggest using 64-bit doubles; that should dramatically reduce the chance of instability in the computation.
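
Concretely, a hedged sketch of forcing 64-bit weights before building the graph; the `edges` DataFrame and its column names are placeholders carried over from the sketch above, while `from_cudf_edgelist` and `cugraph.leiden` are the public APIs:

```python
import cudf
import cugraph

# Assumption: `edges` is a cudf.DataFrame with src/dst/wgt columns.
edges["wgt"] = edges["wgt"].astype("float64")   # force 64-bit weights

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst", edge_attr="wgt")
parts, modularity = cugraph.leiden(G)
```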

@jpintar commented May 3, 2024

Thank you for looking into this!
