Pagerank on Distributed Dask #9721

Manoj-red-hat · 2022-12-06T18:36:03Z

Hi everyone,

I am trying to learn Dask for graph algorithms, and as an exercise I am trying to write PageRank in Dask using the COO graph format.

Question: Is there an existing library built on Dask for graph algorithms? I know of cuGraph, but it is GPU-based, and metagraph, whose development appears to have stopped.

What approach should I take for writing PageRank on Dask?

eriknw · 2022-12-07T13:30:44Z

Hi Manoj, great question!

You could try following the (somewhat sparse) directions for creating sparse dask arrays backed by pydata/sparse or scipy.sparse: https://docs.dask.org/en/latest/array-sparse.html

Another option is to try using dask-grblas. @jim22k, @SultanOrazbayev, and I (and maybe @ParticularMiner) are working to bring dask-grblas up-to-date and plan to continue developing it next year. PageRank is an "easy" algorithm that should hopefully be straightforward to get working with dask-grblas (but this may need our help--things are a little messy atm, sorry!). @jim22k, do we have a PageRank example with dask-grblas, and is it public?

fyi, dask-grblas will soon be renamed to dask-graphblas. It is backed by GraphBLAS via python-graphblas, which can be very fast compared to scipy.sparse and pydata/sparse, and more expressive so it can handle even more graph algorithms.

I think we ought to have an example notebook for distributed PageRank with dask-grblas. @Manoj-red-hat, you are welcome to join our weekly community call (and to suggest an alternative time) for python-graphblas to discuss this further: python-graphblas/python-graphblas#247

2 replies

Manoj-red-hat Dec 7, 2022
Author

Hi @eriknw, thanks. I would also like to be part of this community.

I recently benchmarked PageRank, Louvain, and cosine similarity on Spark, TigerGraph, cuGraph, and custom C++ code, and I am now doing the same on Dask.
The performance ranking I found: cuGraph > C++ > Spark > TigerGraph

If we leave out heterogeneous devices such as GPUs and FPGAs, standalone C++ comes out best (its limitation is that it runs on a single node).

From what I have read, the Dask + GraphBLAS combination looks more scalable, and I am curious to see how it behaves on these graph algorithms.

The final comparison I am planning:
Dask 4 node vs Spark 4 node vs TigerGraph 4 node --> PageRank, Louvain & cosine similarity

eriknw Dec 7, 2022

Cool! Welcome aboard :)

I'm super-curious to see the final results of the benchmarks. I'll do what I can to help, but I'll be on vacation the next two weeks and will be mostly unavailable. Feel free to raise issues as appropriate or join us on Discord for chatting.

SultanOrazbayev · 2022-12-07T14:11:32Z

@Manoj-red-hat you probably have seen this, but nx-scipy's implementation could be a starting point:
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank_scipy.html

The tricky part is that dask arrays are lazy, so checking convergence requires triggering a compute, unless you have a reason to always run the full max_iter iterations.

This is probably OK for a small proof of concept, but it might not scale well to larger dask arrays because dask arrays lack sparse-awareness; see issue #7652.

0 replies

jim22k · 2022-12-07T15:10:56Z

Here is an implementation of pagerank that works with dask-grblas. It doesn't handle personalization, but covers the basic algorithm.

import dask_grblas as dgb
import grblas as gb

def is_converged(xprev, x, tol):
    """Check convergence, L1 norm: err = sum(abs(xprev - x)); err < N * tol

    This modifies `xprev`.
    """
    xprev << gb.binary.minus(xprev | x, require_monoid=False)
    xprev << gb.unary.abs(xprev)
    err = xprev.reduce().value.compute()
    return err < xprev.size * tol

alpha=0.85
max_iter=100
tol=1e-06

A = dgb.Matrix.from_values([0, 1, 1], [0, 0, 1], [1., 2., 3.])

N = A.nrows

# Initial vector
x = dgb.Vector.new(float, N)
x[:] = 1.0 / N

# Personalization vector or scalar
p = 1.0 / N

# Inverse of row_degrees
# Fold alpha constant into S
row_degrees = A.reduce_rowwise("plus")
S = (alpha / row_degrees).new()

semiring = gb.op.plus_times[float]

is_dangling = S.nvals.compute() < N
if is_dangling:
    dangling_mask = dgb.Vector.new(float, N)
    dangling_mask(mask=~S.S) << 1.0
    # Fold alpha constant into dangling_weights (or dangling_mask)
    # Fast case (and common case); is iso-valued
    dangling_mask(mask=dangling_mask.S) << alpha * p

# Fold constant into p
p *= 1 - alpha

# Power iteration: make up to max_iter iterations
xprev = dgb.Vector.new(float, N)
w = dgb.Vector.new(float, N)
for _ in range(max_iter):
    xprev, x = x, xprev

    # x << alpha * ((xprev * S) @ A + "dangling_weights") + (1 - alpha) * p
    x << p
    if is_dangling:
        # Fast case: add a scalar; x is still iso-valued (b/c p is also scalar)
        x += xprev @ dangling_mask
    w << xprev * S
    x += semiring(w @ A)  # plus_first if A.ss.is_iso else plus_times

    if is_converged(xprev, x, tol):  # sum(abs(xprev - x)) < N * tol
        break
else:
    print('Convergence failure')

x.compute()
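For cross-checking results on tiny graphs, the same update rule can be written in plain Python over COO triples. This is only a reference sketch, not part of dask-grblas; the function name `pagerank_coo` is hypothetical, and it assumes every stored edge weight is positive.

```python
def pagerank_coo(rows, cols, vals, n, alpha=0.85, max_iter=100, tol=1e-6):
    """Weighted PageRank by power iteration over COO triples (no personalization)."""
    # Weighted out-degree of each source node (row sums of the adjacency matrix).
    row_degrees = [0.0] * n
    for r, v in zip(rows, vals):
        row_degrees[r] += v
    # Nodes without outgoing edges; their rank is spread uniformly.
    dangling = [i for i in range(n) if row_degrees[i] == 0.0]

    x = [1.0 / n] * n
    for _ in range(max_iter):
        xprev = x
        # Teleport term plus the dangling contribution, identical for all nodes.
        base = (1.0 - alpha) / n + alpha * sum(xprev[i] for i in dangling) / n
        x = [base] * n
        # Equivalent to x += (xprev * alpha / row_degrees) @ A, one edge at a time.
        for r, c, v in zip(rows, cols, vals):
            x[c] += alpha * xprev[r] * v / row_degrees[r]
        # L1 convergence test: sum(abs(xprev - x)) < N * tol
        if sum(abs(a - b) for a, b in zip(x, xprev)) < n * tol:
            return x
    raise RuntimeError("PageRank did not converge within max_iter iterations")


# Same 2x2 example matrix as above: entries (0,0)=1, (1,0)=2, (1,1)=3.
ranks = pagerank_coo([0, 1, 1], [0, 0, 1], [1.0, 2.0, 3.0], n=2)
```

The ranks returned for this matrix sum to 1, which gives a quick sanity check against the distributed result.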

6 replies

eriknw Dec 7, 2022

Question 1

Yes, the algorithm @jim22k shared can be run on a Dask cluster. I would suggest that x be persisted (i.e., x = x.persist()) before calling is_converged to avoid repeated computation.

Question 2

Dask handles communication (see the answer to question 3), and the objects in dask-grblas behave like normal Dask objects. For example, operations build up DAGs, and you trigger compute via persist (keep on cluster) or compute (send back to the client).
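As a small illustration of the persist/compute distinction, here is a sketch that uses a plain dask.array as a stand-in for a dask-grblas object (an assumption made only so the example has no dask-grblas dependency). Without a distributed client, persist runs on the local scheduler.

```python
import dask.array as da

x = da.ones(8, chunks=4)

# Operations only build a task graph (DAG); nothing has run yet.
total = (x * 2).sum()

# persist() runs the graph and returns another dask object whose results
# stay with the workers (on a cluster) or in local memory.
total_persisted = total.persist()

# compute() runs the graph and brings a concrete value back to the client.
value = total.compute()
```

Here `total_persisted` is still a dask array, while `value` is a plain number.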

Question 3

dask-grblas partitions the adjacency matrix of the graph. It relies on dask.array to manage handling of partitions and supports the full flexibility of dask.array partitioning. We don't do anything to choose good partitioning (right now we leave this to users), so there are plenty of opportunities for dask-grblas to be smarter and for tuning.
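To picture what partitioning an adjacency matrix means, here is a rough pure-Python sketch that buckets COO edges into a grid of square blocks with block-local indices. It is not how dask-grblas does this internally; `partition_coo` and `chunk_size` are hypothetical names.

```python
from collections import defaultdict


def partition_coo(rows, cols, vals, chunk_size):
    """Group COO edges by the (row_block, col_block) chunk they fall into."""
    blocks = defaultdict(list)
    for r, c, v in zip(rows, cols, vals):
        key = (r // chunk_size, c // chunk_size)
        # Store indices relative to the top-left corner of the block.
        blocks[key].append((r % chunk_size, c % chunk_size, v))
    return dict(blocks)


# A 4x4 adjacency matrix with four edges, split into 2x2 blocks.
blocks = partition_coo([0, 1, 1, 3], [0, 0, 1, 2], [1.0, 2.0, 3.0, 4.0], chunk_size=2)
```

Blocks with no edges never appear in the result, which hints at why the choice of chunk size matters for skewed graphs.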

Manoj-red-hat Dec 8, 2022
Author

@jim22k, when I run the above code I get an error.

It looks like we can't perform scalar division on a delayed object:

row_degrees = A.reduce_rowwise("plus")
S = (alpha / row_degrees).new()

If I modify the above code, just for testing, to

row_degrees = A.reduce_rowwise("plus")
S = row_degrees.new()

then I get a different error.

jim22k Dec 8, 2022

You're probably running dask-grblas 0.0.1. There is a newer version that fixes those issues.
conda install -c conda-forge jim22k::dask-grblas==0.0.2 conda-forge::grblas==2022.4.0

If you installed using pip+pypi, dask-grblas 0.0.2 isn't pushed there, so the easiest way is probably to clone the github repo and python setup.py develop to get the latest changes.

Note that dask-grblas has been left in limbo for the past year. grblas was renamed to python-graphblas and many new changes were added. dask-grblas was left in a half-finished state. The plan is to circle back and update it with similar changes, but no guarantees on when that will happen.

Manoj-red-hat Dec 8, 2022
Author

Thanks @jim22k, it works. I understand that dask-grblas is in limbo, but I will decide based on the benchmark below:
Dask 4 node vs Spark 4 node vs TigerGraph 4 node --> PageRank, Louvain & cosine similarity

If dask-grblas outperforms the others, I will start working on this project in my free time. I might need your help with that, but don't worry: as my understanding grows, I will start contributing to the project independently.

Manoj-red-hat Dec 8, 2022
Author

How do I submit the above PageRank code to a Dask cluster client?
