No simple way to pass initializer for process #381

Open
cjw296 opened this issue Jul 28, 2016 · 12 comments · May be fixed by #1525

Comments

@cjw296

cjw296 commented Jul 28, 2016

multiprocessing.Pool has a handy initializer parameter to pass a callable for setting up per-process resources (database connections, loggers, etc.), but joblib doesn't expose a way to pass this.
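
For reference, here is that multiprocessing.Pool feature in a minimal, runnable form (per-process logging setup stands in for a real resource):

import logging
import multiprocessing

def init_worker(level):
    # Runs once in each worker process, before it picks up any tasks.
    logging.basicConfig(level=level)

def work(i):
    logging.getLogger(__name__).info('task %s', i)
    return i * i

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2, initializer=init_worker,
                              initargs=(logging.INFO,)) as pool:
        print(pool.map(work, range(4)))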

I see that in 0.10 I can pass a custom multiprocessing context, which I hope I can use to achieve this, but per-process setup is something many users will likely need, so it would be good if there were an easier way.

(I'd suggest an initializer parameter to Parallel that's picked up by MultiprocessingBackend)
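
Something like this, as a hypothetical sketch (initializer and initargs do not exist on Parallel today; the names just mirror multiprocessing.Pool):

from joblib import Parallel, delayed

def init_worker():
    print('per-process setup')  # e.g. open a database connection

# Hypothetical API: 'initializer' is the proposed parameter,
# not part of joblib's actual signature.
results = Parallel(n_jobs=4, initializer=init_worker)(
    delayed(abs)(i) for i in range(-4, 4))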

@cjw296
Author

cjw296 commented Jul 28, 2016

...and sadly, context is only available on Python 3.4+, and the project I'm working on is Python 2.7 :-(

@cjw296
Author

cjw296 commented Jul 28, 2016

Hmm, actually, I'm not even sure passing a context will let you do per-process initialisation...

@astromancer

astromancer commented Dec 7, 2017

Custom initializers for processes are a must-have feature!

@gdlmx

gdlmx commented Apr 8, 2019

For anyone searching for a dirty workaround, I wrote a simple function that injects an additional initializer into a Parallel instance running the Loky backend.

def with_initializer(self, f_init):
    # Make sure the backend has started its worker executor
    # (entering the Parallel context creates it).
    hasattr(self._backend, '_workers') or self.__enter__()
    origin_init = self._backend._workers._initializer

    def new_init():
        origin_init()
        f_init()

    # Chain onto any existing initializer, otherwise install ours directly.
    self._backend._workers._initializer = new_init if callable(origin_init) else f_init
    return self

Example usage:

import matlab
from joblib import Parallel, delayed

x = matlab.double([[0.0]]) # this object can only be loaded after importing matlab

def f(i):
    print(i, x)

def _init_matlab():
    import matlab

with Parallel(4) as para:
    for _ in with_initializer(para, _init_matlab)(delayed(f)(i) for i in range(10)):
        pass

Data objects from some complex libraries, such as matlab, can only be loaded after importing the corresponding Python module. An initializer seems to be the only way to guarantee that a third-party module is loaded before the child processes try to unpickle those global data objects.

@cwindolf

cwindolf commented Apr 8, 2019

@shwina

shwina commented Nov 14, 2023

Hi -- is this possible today? If not, would a PR implementing the following suggestion be welcome?

(I'd suggest an initializer parameter to Parallel that's picked up by MultiprocessingBackend)

Alternatively, I imagine it's possible to expose the parameter in joblib.parallel_config?
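
Something along these lines, as a hypothetical sketch (parallel_config exists in joblib >= 1.3, but an initializer parameter for it does not at the time of writing):

from joblib import Parallel, delayed, parallel_config

def init_worker():
    print('per-process setup')

# Hypothetical: 'initializer' is the proposed parameter, not an existing one.
with parallel_config(backend='loky', initializer=init_worker):
    Parallel(n_jobs=2)(delayed(print)(i) for i in range(4))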

@jakirkham

What do you think @ogrisel ? 🙂

@astromancer

astromancer commented Nov 15, 2023

I've been using this recipe, adapted from the comment above:

Code

from joblib._parallel_backends import MultiprocessingBackend, SequentialBackend


def noop(*_, **__):
    """Do nothing."""


def initialized(self, initializer=noop, args=(), **kws):
    """
    Custom per-process initialization for `joblib.Parallel`.

    Parameters
    ----------
    initializer : callable, optional
        Your process initializer, by default noop, which does nothing.
    args : tuple, optional
        Parameters for your initializer, by default ()

    Returns
    -------
    joblib.Parallel
    """
    if isinstance(self._backend, SequentialBackend):
        return self

    if isinstance(self._backend, MultiprocessingBackend):
        self._backend_args.update(initializer=initializer, initargs=args)
        return self

    if not hasattr(self._backend, '_workers'):
        self.__enter__()

    workers = self._backend._workers
    original_init = workers._initializer

    def new_init():
        if callable(original_init):
            original_init()

        initializer(*args, **kws)

    workers._initializer = new_init

    return self

Usage

import contextlib as ctx
import multiprocessing as mp
from joblib.parallel import Parallel, delayed


class ContextStack(ctx.ExitStack):
    """Manage nested contexts."""

    def __init__(self, contexts=()):
        super().__init__()
        self.contexts = list(contexts)

    def __enter__(self):
        # Enter every managed context; return the first non-None result.
        return next(filter(None, map(self.enter_context, self.contexts)), None)

    def add(self, context):
        # assert isinstance(context, ctx.AbstractContextManager)
        self.contexts.append(context)


# ---------------------------------------------------------------------------- #
# main
memory_lock = mp.Lock()


def set_lock(lock):
    # Initialize each process with a global variable lock.
    print('process setup')
    global memory_lock
    memory_lock = lock


def work(*args, **kws):
    print('doing work:', args, kws)


def get_workload():
    yield from range(10)


njobs = 10
worker = delayed(work)
context = ContextStack()
executor = Parallel(njobs, verbose=10)
context.add(initialized(executor, set_lock, (memory_lock, )))
with context as compute:
    compute(worker(data) for data in get_workload())
Output

[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
process setup
doing work: (0,) {}
doing work: (1,) {}
doing work: (2,) {}
[Parallel(n_jobs=10)]: Done   3 out of  10 | elapsed:    2.6s remaining:    6.0s
doing work: (3,) {}
doing work: (4,) {}
[Parallel(n_jobs=10)]: Done   5 out of  10 | elapsed:    2.6s remaining:    2.6s
doing work: (5,) {}
doing work: (6,) {}
[Parallel(n_jobs=10)]: Done   7 out of  10 | elapsed:    2.6s remaining:    1.1s
doing work: (7,) {}
process setup
doing work: (8,) {}
doing work: (9,) {}
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    2.6s finished
process setup
process setup
process setup
process setup
process setup
process setup
process setup
process setup

@shwina

shwina commented Nov 15, 2023

Thank you @astromancer, that's slick!

I'm hoping to find a solution for when I don't have access to the Parallel() instance. This is the case when, say, you're not using joblib directly, but rather using a library that itself uses joblib.

I suppose monkeypatching Parallel is a possibility, but as others have said, this would be great to have as part of the "official" joblib API (parallel_config).
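
For what it's worth, a rough sketch of such a monkeypatch, reusing the with_initializer trick from earlier in this thread (it pokes at loky internals and assumes the loky backend, so treat it as an illustration only):

import joblib

def my_init():
    print('per-process setup')  # whatever setup the workers need

_orig_call = joblib.Parallel.__call__

def _patched_call(self, iterable):
    # Same trick as with_initializer above: make sure the loky executor
    # exists, then chain our initializer onto it (joblib internals).
    if not hasattr(self._backend, '_workers'):
        self.__enter__()
    original = self._backend._workers._initializer

    def chained():
        if callable(original):
            original()
        my_init()

    self._backend._workers._initializer = chained
    return _orig_call(self, iterable)

joblib.Parallel.__call__ = _patched_call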

@ogrisel
Contributor

ogrisel commented Nov 15, 2023

If someone wants to work on a PR with at least support for the most common backends (e.g. loky, multiprocessing and maybe dask), I might find the time to review it.

@shwina

shwina commented Nov 15, 2023

Thanks! I'll work on that.

@ogrisel
Contributor

ogrisel commented Nov 15, 2023

Note, however, that unlike the multiprocessing backend, the loky and dask backends can reuse workers across consecutive calls to different Parallel instances. So the worker initialization semantics and the documentation should take that into account.
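
In practice that means an initializer may run once per worker lifetime rather than once per Parallel call, so a defensive pattern is to make it idempotent (a sketch, not official joblib guidance):

_setup_done = False

def init_worker():
    # Safe under worker reuse: the expensive setup runs at most once per
    # process, even if the initializer is invoked more than once.
    global _setup_done
    if _setup_done:
        return
    import matlab  # or open connections, configure logging, etc.
    _setup_done = True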
