Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation step fails when using shared memory with multiprocessing.managers.BaseManager #28899

Open
ElenaKhaustova opened this issue Apr 26, 2024 · 1 comment
Labels

Comments

@ElenaKhaustova
Copy link

ElenaKhaustova commented Apr 26, 2024

Describe the bug

Original issue: kedro-org/kedro#3674

Relates to #28781

We use multiprocessing managers to work with shared memory for pipeline parallelisation. After this validation step was added we are experiencing ValueError: cannot set WRITEABLE flag to True of this array error when objects are retrieved from shared memory and passed to scikit-learn functions, for example fit, including this validation step.

The only solution that works for us so far is making a deep copy of objects before passing them to those methods which is not the desired solution.

Steps/Code to Reproduce

Some findings:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception as _:
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))
    # Replacing pd.Series with pd.DataFrame solves the issue
    # y_train = pd.DataFrame(rng.random((n_samples, 1)), columns=list('E'))

    futures = set()

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))

    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))

Expected Results

No error is thrown.

Actual Results

Traceback (most recent call last):
  File "/pr-scikit-learn/main.py", line 48, in train_model
    regressor.fit(X_train, y_train)
  File "/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1292, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1100, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array

Versions

System:
    python: 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ]
executable: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.5.dev0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: 2.2.2
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Nehalem

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Nehalem
@ElenaKhaustova ElenaKhaustova added Bug Needs Triage Issue requires triage labels Apr 26, 2024
@jeremiedbb
Copy link
Member

Thanks for the report @ElenaKhaustova.

We should not try to force making the array writeable, because it's just not always possible. There's ongoing discussion in #28824 which should clarify when we need a writeable array and then make it properly.

@jeremiedbb jeremiedbb removed the Needs Triage Issue requires triage label Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants