Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX trigger indptr/indices copy when data are copied in astype #18192

Merged
merged 4 commits into from
Apr 12, 2023

Conversation

glemaitre
Copy link
Contributor

Reference issue

closes gh-8678
closes gh-8679
supersede gh-8679

What does this implement/fix?

In some cases, it happens that the internal arrays of the scipy matrices are memmaped in read-only mode (e.g. using joblib.Parallel). Calling astype calls sum_duplicates (called within _deduped_data) that intends to sort the indices of the matrix in place. It will therefore raise an error due to the read-only property.

Here, I trigger a copy of intptr and indices in the case the dtype does not match.

Additional information

As reported in the issue, this problem was reported several times in scikit-learn: scikit-learn/scikit-learn#15924, scikit-learn/scikit-learn#25935, etc.

@glemaitre
Copy link
Contributor Author

In terms of non-regression tests, the minimal example in the issue #8678 could be used. However, I assume that it would be best to test several formats of sparse matrices.

Unfortunately, I am a bit lost with the test suite. I think that it should go in scipy/sparse/tests/test_base.py. I see a test that is written for astype but since we need special conditions to trigger the regression (read-only + non-sorted indices), I am not sure what is the best way to go forward.

@j-bowhay j-bowhay added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse labels Mar 23, 2023
@perimosocordiae
Copy link
Member

For testing, it should be sufficient to add another method next to the existing test_astype method, like this one:

def test_astype_immutable(self):
    D = array([[2.0 + 3j, 0, 0],
               [0, 4.0 + 5j, 0],
               [0, 0, 0]])
    S = self.spmatrix(D)
    if hasattr(S, 'data'):
        S.data.flags.writeable = False
    for x in supported_dtypes:
        D_casted = D.astype(x)
        S_casted = S.astype(x)
        assert_equal(S_casted.dtype, D_casted.dtype)

I wonder if the root cause issue is actually in the implementation of _deduped_data, rather than astype. If we add support for handling immutable member arrays in _deduped_data, that should fix both this and potentially other lurking bugs.

@glemaitre
Copy link
Contributor Author

I wonder if the root cause issue is actually in the implementation of _deduped_data, rather than astype. If we add support for handling immutable member arrays in _deduped_data, that should fix both this and potentially other lurking bugs.

It was what you intended originally in #8679. Some of the failures in the CI seem to be that the _deduped_data return a data array that is not any more "aligned" with the indptr and indices. For instance, sum_duplicates will sort the indices. So it means that we cannot use self._with_data that try to keep the original indices since they are not synced with the data returned by _deduped_data.

If we make changes in _deduped_data, it seems that we would need to return the associated indices.

@glemaitre
Copy link
Contributor Author

glemaitre commented Mar 24, 2023

The current error is raised when a LIL matrix call to_csr and intends to flatten the array using the lil_flatten_to_array function. The Cython signature does not contain const for the buffer while in the case of read-only, we need to add this qualifier.

Since the signature use object[:] as the first argument, it would not accept a const qualifier. I don't know how to solve this one.

[0, 0, 0]])
S = self.spmatrix(D)
if hasattr(S, 'data'):
S.data.flags.writeable = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we only define the data to be read-only. However, from the original issue, we could also have read-only intptr and indices.

@tylerjereddy tylerjereddy added the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Mar 27, 2023
Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch allows the full test suite to pass for me locally on this branch, though it doesn't win any prizes for elegance. Just providing in case it helps get this unstuck.

diff --git a/scipy/sparse/_csparsetools.pyx.in b/scipy/sparse/_csparsetools.pyx.in
index 5c47852c0..e5972eac0 100644
--- a/scipy/sparse/_csparsetools.pyx.in
+++ b/scipy/sparse/_csparsetools.pyx.in
@@ -176,7 +176,12 @@ def _lil_get_lengths_{{NAME}}(object[:] input,
 
 {{define_dispatch_map('_LIL_GET_LENGTHS_DISPATCH', '_lil_get_lengths', IDX_TYPES)}}
 
-def lil_flatten_to_array(object[:] input,
+ctypedef fused obj_fused:
+    object
+    double
+
+
+def lil_flatten_to_array(const obj_fused[:] input,
                          cnp.ndarray output):
     return _LIL_FLATTEN_TO_ARRAY_DISPATCH[output.dtype](input, output)
 
@@ -311,7 +316,7 @@ def _lil_fancy_set_{{PYIDX}}_{{PYVALUE}}(cnp.npy_intp M, cnp.npy_intp N,
 
 
 def lil_get_row_ranges(cnp.npy_intp M, cnp.npy_intp N,
-                       object[:] rows, object[:] datas,
+                       const obj_fused[:] rows, const obj_fused[:] datas,
                        object[:] new_rows, object[:] new_datas,
                        object irows,
                        cnp.npy_intp j_start,

@glemaitre
Copy link
Contributor Author

Thanks @tylerjereddy, I was indeed locked with trying to use the VALUES_TYPES instead at the C-level function but I did not succeed to make it work.

@tylerjereddy
Copy link
Contributor

Let's see if CJ can stomach my Cython hacks :) I'm actually not super familiar with sparse internals so let's see what he thinks.

@@ -176,7 +176,11 @@ def _lil_get_lengths_{{NAME}}(object[:] input,

{{define_dispatch_map('_LIL_GET_LENGTHS_DISPATCH', '_lil_get_lengths', IDX_TYPES)}}

def lil_flatten_to_array(object[:] input,
ctypedef fused obj_fused:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate, but based on cython/cython#2485 it seems like the best workaround for now.

Let's add a comment mentioning the Cython limitation and link to that issue, so that when/if it gets resolved we can remove the hack.

@perimosocordiae
Copy link
Member

Overall this is looking fine. Working around the Cython limitation isn't so bad, especially if we can remove the fused type hack after cython/cython#4712 is merged.

Eventually I'll need to clean up the _deduped_data() situation, but I don't want to block this PR on that.

@perimosocordiae
Copy link
Member

The one failing test (mac/py3.11) has a strange error message:

meson.build:1:0: ERROR: Cython compiler 'cython' cannot compile programs

It doesn't seem to be related to this PR, though, so I'm +1 to merge now.

@glemaitre
Copy link
Contributor Author

Shall I restart the build or do we need to wait for another review for the PR to be merged?

@perimosocordiae
Copy link
Member

Let's wait a few more days to see if someone else wants to review, and if not I'll merge it then.

@perimosocordiae
Copy link
Member

No more comments, so I'll merge now.

@perimosocordiae perimosocordiae merged commit e574cbc into scipy:main Apr 12, 2023
@j-bowhay j-bowhay added this to the 1.11.0 milestone Apr 12, 2023
@tylerjereddy tylerjereddy modified the milestones: 1.11.0, 1.10.2 Apr 16, 2023
tylerjereddy pushed a commit to tylerjereddy/scipy that referenced this pull request Apr 16, 2023
…#18192)

* FIX trigger indptr/indices copy when data are copied in astype

* TST add unit tests

* iter

* Add comment regarding object memview support
@tylerjereddy tylerjereddy removed the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse
Projects
None yet
Development

Successfully merging this pull request may close these issues.

csr_matrix.astype raise ValueError when indices are not sorted and indices or indptr are read-only
4 participants