Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC Trigger a copy when copy=False and X is read-only #28824

Open
jeremiedbb opened this issue Apr 12, 2024 · 9 comments
Open

RFC Trigger a copy when copy=False and X is read-only #28824

jeremiedbb opened this issue Apr 12, 2024 · 9 comments
Labels

Comments

@jeremiedbb
Copy link
Member

jeremiedbb commented Apr 12, 2024

Highly related to #14481 and maybe a little bit to #13986.

My understanding of the copy=False parameter of estimators is "allow inplace modifications of X".

When avoiding a copy is not possible (X doesn't have the right dtype or memory layout for instance), a copy is still triggered. I believe that X being read-only is a valid reason for still triggering a copy.

My main argument is that the user isn't always in control of the permissions of an input array within the whole pipeline. Especially when joblib parallelism is enabled, which may create read-only memmaps. We've have a bunch of issues because of that, the latest being #28781. And it's poorly tested because it requires big arrays which we try to avoid in the tests (although joblib 1.13 makes it easy to trigger with small arrays).

I wouldn't make check_array(copy=False) always trigger a copy when X is read-only because the semantic of the copy param of check_array is not the same as the one of estimators. We could introduce a new param in check_array, like copy_if_readonly ?

  • Estimator has no copy param (i.e.) doesn't intend to do inplace modification:
    check_array(copy=False, copy_if_readonly=False)
  • Estimator has copy param:
    check_array(copy=self.copy, copy_if_readonly=True)

It could also be a third option for copy in check_array: True, False, "if_readonly":

  • Estimator has no copy param:
    check_array(copy=False)
  • Estimator has copy param:
    check_array(copy=self.copy or "if_readonly")
@adrinjalali
Copy link
Member

if_readonly makes sense to me, and the estimator would need to know if inplace operations are going to be performed or not.

cc @amueller @thomasjpfan

@thomasjpfan
Copy link
Member

thomasjpfan commented Apr 16, 2024

I'm okay with check_array(copy=self.copy or "if_readonly").

Aside: I alway disliked the semantics of how copy=False can sometimes copy. I rather it raise an error if a copy is triggered because of dtype or order, but it follows np.array semantics so I'm okay with keeping the status quo. (I wish there was a copy="if_required")

@jeremiedbb
Copy link
Member Author

I see a preference toward check_array(copy=True/False/"if_readonly"). I'd like to try to convince you to have a separate parameter instead of a third copy option.

I see writeable as another property of the resulting array that we want to enforce, exactly like order and dtype. And both are params of check_array. The semantic of copy=False already being don't copy unless a conversion is required would be preserved. To be more in line with copy and order, the new param could be writeable=True/False.

If I managed to convince you, great, otherwise I'm fine with copy="if_reandonly" and I'll move forward with this.

@betatim
Copy link
Member

betatim commented Apr 18, 2024

I'm not sure what I think regarding third value vs new argument. It feels like copy=False is already confusing enough that adding copy='if_readonly' will make it even harder to understand? Which would be a reason for an additional argument. However check_array(copy=False, copy_if_readonly=True) is also confusing. At least without reading docs/this discussion I'd have a hard time telling you if this will lead to a copy or not (which of the args wins?).

My main argument is that the user isn't always in control of the permissions of an input array within the whole pipeline. Especially when joblib parallelism is enabled, which may create read-only memmaps.

It seems useful to not raise an error in these cases.

After thinking about this a bit more: what is the problem we are trying to solve?

#13986 makes me think we are trying to avoid bugs (copy=True if you don't actually need it sounds like a bug to me). The discussion in #14481 seems to be more about what copy=False really means. So yeah, what is the problem we are trying to solve?

@jeremiedbb
Copy link
Member Author

jeremiedbb commented Apr 18, 2024

My main argument is that the user isn't always in control of the permissions of an input array within the whole pipeline. Especially when joblib parallelism is enabled, which may create read-only memmaps.

It seems useful to not raise an error in these cases.

After thinking about this a bit more: what is the problem we are trying to solve?

This is exactly the main problem I want to solve: Stop having to worry that a read-only array being created at some point can cause an error. To me it should be covered by the fact that copy=False means don't copy unless required.

One way of ending up with a read-only array is with joblib's memmapping. It caused many issues, the last one being #28781.

Thinking more about it, maybe there are 2 things to consider here:

  • How check_array deals with read-only memmaps

    -> We need to make a copy because we can't make a read-only memmap writeable.

  • How check_array deals with other read-only array-likes

    -> We don't need to make a copy, we just need to make it writeable.
    Here the question would be: if a user passes a read-only array and copy=False to an estimator, should we make it writeable or raise an error ?

In this regard, I think a new parameter is more appropriate. I would propose writeable=True/None:

  • None is the default and means leave the output array as is, no guarantee about it being writeable or not.
  • True means
    • if copy=True, just make sure the copy is writeable
    • if copy=False:
      • if a copy must be made to make it writeable (e.g. read-only memmap), make a copy
      • if it can be made writeable without copy, I'm not sure here between doing it, make a copy or raise an error
  • There's no False option. I don't see a single situation where we need to turn an array read-only. I don't see writeable as a boolean but more as a toggle.

@jeremiedbb
Copy link
Member Author

So yeah, what is the problem we are trying to solve?

Thanks for the question @betatim ! At first I was frustrated, feeling that I had already explained it precisely. But in fact it made me think deeper and find more precisely the underlying details of this issue.

@thomasjpfan
Copy link
Member

For my understanding: If we introduce writable=True/None, can you link examples in the code where we call check_array that will be updated to use writable=True/None?

@jeremiedbb
Copy link
Member Author

jeremiedbb commented Apr 30, 2024

Essentially in the _validate_data of all estimators that have a copy parameter because they do inplace operations.

class InplaceEstimator:  # e.g StandardScaler (mostly transformers but not only)
    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X, y):
        # This estimator performs inplace operations, we need a writeable array
        self._validate_data(X, y, copy=self.copy, writeable=True)


class NotInplaceEstimator:  # e.g. LogisticRegression
    def __init__(self):
        pass

    def fit(self, X, y):
        # This estimator only reads the input data, we don't have to modify the
        # permissions (so nothing to change w.r.t current code)
        self._validate_data(X, y)

@thomasjpfan
Copy link
Member

thomasjpfan commented May 7, 2024

Thank you for the examples. This makes more sense to me now.

How does the default copy=False and writable=None handle pandas dataframes after 3.0? As of #28348, we adjust the NumPy array and make it writable. Does you proposal change this behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants