-
Hi, all. I'm trying to do some GPU computation on workers (with pycuda), but control the computation from a GPU-less node. The problem is that I can only import pycuda on the workers, not on the main node... and yet whatever I do, dask really insists on importing it on the main node too; this of course fails. Here's an example:

```python
import dask
import dask.distributed
import socket
import dask_jobqueue

cluster = dask_jobqueue.SLURMCluster(cores=1, memory="2GB", queue="gpu", job_extra=["--gpus=1"])
cluster.scale(1)

from dask.distributed import Client
client = Client(cluster)

def f(x):
    import pycuda.driver as cuda
    return cuda.mem_get_info()

print(client.gather(client.map(f, range(10))))
```

Running this gives me:
So my question is: is there a way to make it so that pycuda is never imported on the main node, only on the workers?
-
It is generally recommended that the client and the worker nodes share the same environment; the scheduler node is the only one that is really allowed to differ. See https://distributed.dask.org/en/latest/protocol.html for more information. That being said, I am wondering if you could write a Python file, ship that file to the workers, and call it there. That way the client doesn't have to be responsible for serializing the function definition.
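A sketch of what that could look like, using distributed's `Client.upload_file` (the file name `gpu_code.py` and the function name are made up for illustration):

```python
# gpu_code.py -- written on the client's disk, but only ever executed on workers
def gpu_device_count(x):
    import pycuda.driver as cuda  # imported on the worker, where pycuda is installed
    cuda.init()
    return cuda.Device.count()
```

```python
# On the client: ship the module to every worker, then call the function by name,
# so that cloudpickle only serializes a reference to gpu_code.gpu_device_count.
client.upload_file("gpu_code.py")

import gpu_code  # safe on the client: pycuda is only imported inside the function body
futures = client.map(gpu_code.gpu_device_count, range(10))
print(client.gather(futures))
```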
-
I've been using afar (a small library for running code remotely on a dask cluster) for this recently.
-
I think I've figured out what's going on in the example. Dask first pickles the code of the function using cloudpickle, and by default it serializes modules by reference rather than by value -- so at this point no extra modules are imported on the main node, and the code is successfully transferred and executed on the workers. Then, because I missed `pycuda.init()`, an exception is raised on the workers; this exception is pickled and sent back to the main node. Because the exception is a `pycuda.Something`, unpickling it requires importing the parent module `pycuda` -- and this is where it fails.

The solution is to make sure that neither values nor exceptions belonging to `pycuda` are ever transferred back to the main node, e.g. by wrapping the worker code in

```python
try:
    ...
except Exception as e:
    raise Exception(str(e))
```

... so that I would still get the original exception data, just not as a `pycuda` exception.
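A sketch of the fixed worker function along those lines, assuming one GPU visible per worker (note that `mem_get_info()` also needs an active CUDA context, so the sketch creates one and pops it when done):

```python
def f(x):
    # Everything pycuda-related stays on the worker.
    import pycuda.driver as cuda
    try:
        cuda.init()  # the call that was missing in the original example
        ctx = cuda.Device(0).make_context()
        try:
            free, total = cuda.mem_get_info()
        finally:
            ctx.pop()  # release the context on the worker
        return (free, total)  # plain Python ints, safe to unpickle anywhere
    except Exception as e:
        # Re-raise as a builtin exception so that unpickling it on the
        # main node does not require importing pycuda.
        raise Exception(str(e))
```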