Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: scipy.stats.wilcoxon in 1.13 fails on 2D array with nan -- 1D works #20591

Open
JonathanShor opened this issue Apr 26, 2024 · 6 comments · May be fixed by #20592
Open

BUG: scipy.stats.wilcoxon in 1.13 fails on 2D array with nan -- 1D works #20591

JonathanShor opened this issue Apr 26, 2024 · 6 comments · May be fixed by #20592
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Milestone

Comments

@JonathanShor
Copy link

JonathanShor commented Apr 26, 2024

Describe your issue.

A regression seemingly introduced in 1.13. A 2D array with NaNs in some columns ValueErrors out as soon as it reaches a column with a NaN. It must be truly 2D, i.e. A.shape[1] > 1 is True. It also only occurs when there are more than 50 rows, i.e. A.shape[0] > 50.

Passing just that column with the NaNs, i.e. passing it as 1D, operates as expected. For example, stats.wilcoxon(A[:50,:] or stats.wilcoxon(A[:,1]) after the code example below runs fine.

Reproducing Code Example

from scipy import stats
import numpy as np

rng = np.random.default_rng(123)
A = rng.normal(size=(51,2))
A[5,1] = np.nan
stats.wilcoxon(A)

Error message

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/isilon/LFMI/VMdrive/Jonathan/PriorWeightingMooney/Visual/ECoG/Code/.venv/lib/python3.11/site-packages/scipy/_lib/_util.py", line 794, in wrapper
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/isilon/LFMI/VMdrive/Jonathan/PriorWeightingMooney/Visual/ECoG/Code/.venv/lib/python3.11/site-packages/scipy/stats/_axis_nan_policy.py", line 603, in axis_nan_policy_wrapper
    res = np.apply_along_axis(hypotest_fun, axis=0, arr=x)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/isilon/LFMI/VMdrive/Jonathan/PriorWeightingMooney/Visual/ECoG/Code/.venv/lib/python3.11/site-packages/numpy/lib/shape_base.py", line 402, in apply_along_axis
    buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
    ~~~~^^^^^
ValueError: could not broadcast input array from shape (2,) into shape (3,)

SciPy/NumPy/Python version and system information

>>> import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info); scipy.show_config()

1.13.0 1.26.4 sys.version_info(major=3, minor=11, micro=5, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /usr/local/include
    lib directory: /usr/local/lib
    name: openblas
    openblas configuration: USE_64BITINT=0 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=64
    pc file directory: /usr/local/lib/pkgconfig
    version: 0.3.26.dev
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /usr/local/include
    lib directory: /usr/local/lib
    name: openblas
    openblas configuration: USE_64BITINT=0 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=64
    pc file directory: /usr/local/lib/pkgconfig
    version: 0.3.26.dev
  pybind11:
    detection method: config-tool
    include directory: unknown
    name: pybind11
    version: 2.12.0
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 10.2.1
  c++:
    commands: c++
    linker: ld.bfd
    name: gcc
    version: 10.2.1
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.10
  fortran:
    commands: gfortran
    linker: ld.bfd
    name: gcc
    version: 10.2.1
  pythran:
    include directory: ../../tmp/pip-build-env-ilw0s1wu/overlay/lib/python3.11/site-packages/pythran
    version: 0.15.0
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
Python Information:
  path: /opt/python/cp311-cp311/bin/python
  version: '3.11'
@JonathanShor JonathanShor added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Apr 26, 2024
@mdhaber
Copy link
Contributor

mdhaber commented Apr 26, 2024

I'll fix this shortly. In the meantime, please use:

stats.wilcoxon(A, method='approx')

The problem is just with method='auto' (and method='exact' if there are zeros).

@lucascolley lucascolley added this to the 1.14.0 milestone Apr 26, 2024
@JonathanShor
Copy link
Author

I'll fix this shortly. In the meantime, please use:

stats.wilcoxon(A, method='approx')

The problem is just with method='auto' (and method='exact' if there are zeros).

Should I expect this to have worse performance than 1.12 in some situations? This is what I'm seeing, tho haven't set up a formal test. 1.12 using method='auto' is about 35% faster than 1.13 using method='approx'.

@mdhaber
Copy link
Contributor

mdhaber commented Apr 27, 2024

What is the shape of the input, and does it have NaNs?
Try stats.wilcoxon(A, _no_deco=True) if you want a speed boost, but it will behave as though it omits NaNs.

But yes - in specific situations, you may observe a decrease in speed. Between 1.12 and 1.13, wilcoxon was rewritten to be natively vectorized and therefore faster in cases that would typically be slow. This added a few microseconds of constant overhead on short 1D slices, but the gain in performance for >1D arrays is significant, and it also happens to be faster for slices that are very long (>10k elements). Another potential advantage of the new method is that is automatically omits NaNs, which we can use to improve the performance of nan_policy='omit' in the future, which is usually very slow.

However, currently when the input has NaNs and is >1D, the implementation loops over each slice individually (as it did before, even when there were no NaNs). This looping combined with the constant overhead on each slice seems to result in the performance regression you're seeing. However, you can probably get much better performance now than before if you use stats.wilcoxon(A, _no_deco=True) and add NaNs to the elements corresponding with input slices that had NaNs.

@JonathanShor
Copy link
Author

What is the shape of the input, and does it have NaNs? Try stats.wilcoxon(A, _no_deco=True) if you want a speed boost, but it will behave as though it omits NaNs.

Typically 20-200 rows, by ~1500 columns. So, far from 10k elements per slice. Would you expect me to see a slight benefit staying with 1.13 and using _no_deco=True?

Columns with NaNs, should they exist, are only present in a continuous span at the top of the index. Would I benefit from chunking the columns by NaN presence and calling wilcoxon twice?

@JonathanShor
Copy link
Author

I went ahead and tried _no_deco=True since omitting NaNs is my desired behavior, and I'm not seeing a slight benefit, I'm seeing almost two orders of magnitude speed up! I had to double check I had actually turned parallelism in my setup off for the speed test.

Nice job on this update!

@mdhaber
Copy link
Contributor

mdhaber commented Apr 29, 2024

Great to hear : ) Hopefully it will get array API support before too long, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Projects
None yet
4 participants
@mdhaber @JonathanShor @lucascolley and others