Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nan in matrix matrix multiplication on linux/arm64 guest on macos/arm64 host #18061

Closed
ogrisel opened this issue Dec 23, 2020 · 17 comments
Closed

Comments

@ogrisel
Copy link
Contributor

ogrisel commented Dec 23, 2020

I have observed many incorrect values when running the scikit-learn test suite in a docker container on macos with Apple Silicon M1. I used https://docs.docker.com/docker-for-mac/apple-m1/ to install docker on this machine.

Then I run docker with:

% docker run -v `pwd`:/io --platform linux/arm64 --rm -ti python bash

to get a linux/arm64 environment on which I installed build-essential and gfortran using APT and installed numpy and OpenBLAS either with pip or from source.

I could reduce the problem to the following minimal reproducer:

import numpy as np

n = 1000
d = 2
X = np.ones(shape=(n, d), dtype=np.float64)
Y = np.ones(shape=(n, d), dtype=np.float64)

K = X @ Y.T

print(K.shape, K.dtype)
print(K[0, 256])
assert np.isfinite(K).all()

which yields

(1000, 1000) float64
nan
Traceback (most recent call last):
  File "/io/arm64_bug.py", line 12, in <module>
    assert np.isfinite(K).all()
AssertionError

I can reproduce this problem both with the numpy 1.19.4 manylinux aarch64 wheel from PyPI and numpy master built against OpenBLAS master (DYNAMIC_ARCH=1).

Note that the problem disappears if n < 363...

I suspected a bug in OpenBLAS itself, so I tried with the following reproducer candidate which should be equivalent to the numpy code above:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include "cblas.h"


int main() {
    int i;
    int found_nan = 0;
    int m = 1000;
    int n = 1000;
    int k = 2;
    double *a = (double*) malloc(sizeof(double) * m * k);
    double *b = (double*) malloc(sizeof(double) * n * k);
    double *c = (double*) malloc(sizeof(double) * m * n);
    for (i=0; i<m*k; i++) {
        a[i] = 1.0;
    }
    for (i=0; i<n*k; i++) {
        b[i] = 1.0;
    }
    for (i=0; i<m*n; i++) {
        c[i] = 0.0;
    }
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, n, k, 1., a, k, b, k, 0.0, c, n);
    for (i=0; i<m*n; i++) {
        if (isnan(c[i])) {
            printf("c[%d] is nan\n", i);
            found_nan = 1;
        }
    }
    if (!found_nan) {
        printf("dgemm ok\n");
    }
    return 0;
}

but I cannot reproduce in this case:

# gcc -static -o arm64_bug arm64_bug.c -I/usr/local/include/openblas/ -L/usr/local/lib  -lopenblas -lpthread -lgfortran -lm
# ./arm64_bug
dgemm ok

I built the above snippet against OpenBLAS master which could be use to reproduce the original problem when run with numpy.

Any idea how to investigate this problem further?

When using numpy / openblas directly on the macos/arm64 host (installed with conda-forge), I do not get any problem.

EDIT:

I also tried the following variant using np.einsum and it does not fail fail in a similar way but only when optimize=True which makes sense.

import numpy as np

n = 1000
d = 2
X = np.ones(shape=(n, d), dtype=np.float64)
Y = np.ones(shape=(n, d), dtype=np.float64)

K = np.einsum("ik,jk->ij", X, Y, optimize=True)
print(K.shape, K.dtype)
print(K[0, 256])
assert np.isfinite(K).all()
@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

I am not sure what numpy devs can do to fix this, but I would appreciate if someone knowledgeable can double check that my C reproducer is indeed equivalent to the failing numpy code.

Let me know if you want me to run complementary analysis.

@rossbar
Copy link
Contributor

rossbar commented Dec 23, 2020

Can you verify that the Accelerate lib isn't being picked up when building with locally? It's known to be buggy and numpy has checks to ensure that it isn't accidentally used, but maybe those checks aren't working properly on the M1 platform?

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

This is inside a docker container in a Linux VM, so Accelerate is not visible.

Anyway on numpy master, building against accelerate is no longer possible. And the 1.19.4 wheel has a hardcoded rpath that points to the vendored openblas.

But just to be sure, here is the output of the python setup.py config in this docker environment.

root@0d5642a9b27f:/io/code/numpy# python setup.py config
Running from numpy source directory.
/io/code/numpy/setup.py:422: UserWarning: Unrecognized setuptools command, proceeding with generating Cython sources and expanding templates
  run_build = parse_setuppy_commands()
Cythonizing sources
numpy/random/_bounded_integers.pxd.in has not changed
numpy/random/_bounded_integers.pyx has not changed
numpy/random/_bounded_integers.pyx.in has not changed
numpy/random/_common.pyx has not changed
numpy/random/_generator.pyx has not changed
numpy/random/_mt19937.pyx has not changed
numpy/random/_pcg64.pyx has not changed
numpy/random/_philox.pyx has not changed
numpy/random/_sfc64.pyx has not changed
numpy/random/bit_generator.pyx has not changed
numpy/random/mtrand.pyx has not changed
blas_opt_info:
blas_mkl_info:
customize UnixCCompiler
  libraries mkl_rt not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/aarch64-linux-gnu']
  NOT AVAILABLE

blis_info:
  libraries blis not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/aarch64-linux-gnu']
  NOT AVAILABLE

openblas_info:
C compiler: gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC

creating /tmp/tmpzgkl6yi3/tmp
creating /tmp/tmpzgkl6yi3/tmp/tmpzgkl6yi3
compile options: '-c'
gcc: /tmp/tmpzgkl6yi3/source.c
gcc -pthread /tmp/tmpzgkl6yi3/tmp/tmpzgkl6yi3/source.o -L/usr/local/lib -lopenblas -o /tmp/tmpzgkl6yi3/a.out
  FOUND:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

  FOUND:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

non-existing path in 'numpy/distutils': 'site.cfg'
lapack_opt_info:
lapack_mkl_info:
  libraries mkl_rt not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/aarch64-linux-gnu']
  NOT AVAILABLE

openblas_lapack_info:
C compiler: gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC

creating /tmp/tmprsooklep/tmp
creating /tmp/tmprsooklep/tmp/tmprsooklep
compile options: '-c'
gcc: /tmp/tmprsooklep/source.c
gcc -pthread /tmp/tmprsooklep/tmp/tmprsooklep/source.o -L/usr/local/lib -lopenblas -o /tmp/tmprsooklep/a.out
/usr/bin/ld: /usr/local/lib/libopenblas.a(iparmq.f.o): in function `iparmq_':
iparmq.f:(.text+0xf8): undefined reference to `logf'
/usr/bin/ld: /usr/local/lib/libopenblas.a(ztrmv_thread_NUN.c.o): in function `ztrmv_thread_NUN':
ztrmv_thread_NUN.c:(.text+0x4f0): undefined reference to `sqrt'
/usr/bin/ld: /usr/local/lib/libopenblas.a(ztrmv_thread_NLN.c.o): in function `ztrmv_thread_NLN':
ztrmv_thread_NLN.c:(.text+0x5b0): undefined reference to `sqrt'
/usr/bin/ld: /usr/local/lib/libopenblas.a(ztrmv_thread_NUU.c.o): in function `ztrmv_thread_NUU':
ztrmv_thread_NUU.c:(.text+0x45c): undefined reference to `sqrt'
/usr/bin/ld: /usr/local/lib/libopenblas.a(ztrmv_thread_NLU.c.o): in function `ztrmv_thread_NLU':
ztrmv_thread_NLU.c:(.text+0x51c): undefined reference to `sqrt'
/usr/bin/ld: /usr/local/lib/libopenblas.a(ztrmv_thread_TUN.c.o): in function `ztrmv_thread_TUN':
ztrmv_thread_TUN.c:(.text+0x524): undefined reference to `sqrt'
/usr/bin/ld: /usr/local/lib/libopenblas.a(ztrmv_thread_TLN.c.o):ztrmv_thread_TLN.c:(.text+0x5c4): more undefined references to `sqrt' follow
collect2: error: ld returned 1 exit status
  NOT AVAILABLE

openblas_clapack_info:
  libraries openblas,lapack not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/aarch64-linux-gnu']
  NOT AVAILABLE

flame_info:
  libraries flame not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/aarch64-linux-gnu']
  NOT AVAILABLE

atlas_3_10_threads_info:
Setting PTATLAS=ATLAS
  libraries lapack_atlas not found in /usr/local/lib
  libraries tatlas,tatlas not found in /usr/local/lib
  libraries lapack_atlas not found in /usr/lib
  libraries tatlas,tatlas not found in /usr/lib
  libraries lapack_atlas not found in /usr/lib/aarch64-linux-gnu
  libraries tatlas,tatlas not found in /usr/lib/aarch64-linux-gnu
<class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
  NOT AVAILABLE

atlas_3_10_info:
  libraries lapack_atlas not found in /usr/local/lib
  libraries satlas,satlas not found in /usr/local/lib
  libraries lapack_atlas not found in /usr/lib
  libraries satlas,satlas not found in /usr/lib
  libraries lapack_atlas not found in /usr/lib/aarch64-linux-gnu
  libraries satlas,satlas not found in /usr/lib/aarch64-linux-gnu
<class 'numpy.distutils.system_info.atlas_3_10_info'>
  NOT AVAILABLE

atlas_threads_info:
Setting PTATLAS=ATLAS
  libraries lapack_atlas not found in /usr/local/lib
  libraries ptf77blas,ptcblas,atlas not found in /usr/local/lib
  libraries lapack_atlas not found in /usr/lib
  libraries ptf77blas,ptcblas,atlas not found in /usr/lib
  libraries lapack_atlas not found in /usr/lib/aarch64-linux-gnu
  libraries ptf77blas,ptcblas,atlas not found in /usr/lib/aarch64-linux-gnu
<class 'numpy.distutils.system_info.atlas_threads_info'>
  NOT AVAILABLE

atlas_info:
  libraries lapack_atlas not found in /usr/local/lib
  libraries f77blas,cblas,atlas not found in /usr/local/lib
  libraries lapack_atlas not found in /usr/lib
  libraries f77blas,cblas,atlas not found in /usr/lib
  libraries lapack_atlas not found in /usr/lib/aarch64-linux-gnu
  libraries f77blas,cblas,atlas not found in /usr/lib/aarch64-linux-gnu
<class 'numpy.distutils.system_info.atlas_info'>
  NOT AVAILABLE

lapack_info:
  libraries lapack not found in ['/usr/local/lib', '/usr/lib', '/usr/lib/aarch64-linux-gnu']
  NOT AVAILABLE

/io/code/numpy/numpy/distutils/system_info.py:1849: UserWarning: 
    Lapack (http://www.netlib.org/lapack/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [lapack]) or by setting
    the LAPACK environment variable.
  return getattr(self, '_calc_info_{}'.format(name))()
lapack_src_info:
  NOT AVAILABLE

/io/code/numpy/numpy/distutils/system_info.py:1849: UserWarning: 
    Lapack (http://www.netlib.org/lapack/) sources not found.
    Directories to search for the sources can be specified in the
    numpy/distutils/site.cfg file (section [lapack_src]) or by setting
    the LAPACK_SRC environment variable.
  return getattr(self, '_calc_info_{}'.format(name))()
  NOT AVAILABLE

numpy_linalg_lapack_lite:
  FOUND:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]

Warning: attempted relative import with no known parent package
/usr/local/lib/python3.9/distutils/dist.py:274: UserWarning: Unknown distribution option: 'define_macros'
  warnings.warn(msg)
running config

Note that the LAPACK from by local OpenBLAS build is broken, but this is fine because it should not impact a simple matrix matrix product (DGEMM / BLAS level 3).

Also:

root@0d5642a9b27f:~# python -m threadpoolctl -i numpy
[
  {
    "filepath": "/usr/local/lib/python3.9/site-packages/numpy.libs/libopenblasp-r0-c9541ae0.3.9.dev.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9.dev",
    "num_threads": 4,
    "threading_layer": "pthreads"
  }
]

It might be possible that the macos virtualization layer that underpins the linux system used by docker on macos is trying to do clever things and is actually broken. However I would have expected to reproduce the problem in my pure C program that calls into cblas_dgemm directly.

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

I edit the original report to also include an np.einsum variant that does not have the problem...

@seberg
Copy link
Member

seberg commented Dec 23, 2020

@ogrisel einsum used to default to optimize=True for a bit, you can probably preproduce if you pass that. EDIT: That will make many einsum calls use dot internally.

About the C repro, I would have to check more careful, but I think it may be that CblasNoTrans, CblasTrans need to be swapped. The actual call is here:

gemm(typenum, Order, Trans1, Trans2, L, N, M, ap1, lda, ap2, ldb,

Going to (but not changing the order):
CBLAS_FUNC(cblas_dgemm)(order, transA, transB, m, n, k, 1.,

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

Thanks @seberg, I can now reproduce the original problem when using np.einsum with optimize=True. I did not know about this option. Thanks.

I also swapped the transpose flags and now get the following output:

./arm64_bug
 ** On entry to DGEMM  parameter number  8 had an illegal value
dgemm ok

which means that none of the resulting values check the isnan condition, yet DGEMM complained...

@seberg
Copy link
Member

seberg commented Dec 23, 2020

Cool, the only other thing I can think of to note, is that some VMs did end up picking the wrong kernels because VMs pretended to support more than they did. I doubt that is the case, but you can likely still work around the issue by picking a specific OpenBLAS kernel (and checking which kernel is in use, and if using a different one fixes it might be helpful there).

@charris
Copy link
Member

charris commented Dec 23, 2020

If you are compiling with clang, #18005 may be relevant.

So, it turns out this is expected behavior in clang. Clang has a flag to force strict floating point exception compliance by setting -ffp-exception-behavior=strict. The default, however, is set to ignore which means:

The compiler assumes that the exception status flags will not be read and that floating point exceptions will be masked

No wonder I have had trouble with clang before too. The docs has a section on it which you can find by searching for ffp-exception-behavior here: https://clang.llvm.org/docs/UsersManual.html

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

OpenBLAS is detecting the generic ARMv8 architecture (which seems correct):

root@0d5642a9b27f:/io# OPENBLAS_VERBOSE=1 OPENBLAS_CORETYPE=invalid python arm64_bug.py 
Core not found: invalid
Falling back to generic ARMV8 core
(1000, 1000) float64
2.0
Traceback (most recent call last):
  File "/io/arm64_bug.py", line 11, in <module>
    assert np.isfinite(K).all()
AssertionError

For reference:

root@0d5642a9b27f:/io# lscpu 
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  4
Core(s) per socket:  1
Socket(s):           1
Vendor ID:           0x00
Model:               0
Stepping:            0x0
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

I am compiling with the default compiler of the python docker image (based on debian 10.7):

root@0d5642a9b27f:/io# cat /etc/debian_version 
10.7
root@0d5642a9b27f:/io# gcc --version
gcc (Debian 8.3.0-6) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@0d5642a9b27f:/io# clang
bash: clang: command not found

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

Forgot to mention, if I replace K = X @ Y.T by K = X @ X.T, then there is no bug.

@seberg
Copy link
Member

seberg commented Dec 23, 2020

@ogrisel The latter code goes into a different BLAS call (syrk) for optimization, at least on newer NumPy versions (not sure, probably for a few years now).

I was just thinking you could try some very basic core type (haswell if that makes sense?) or so, to see if it changes with the core that is picked.

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 23, 2020

Haswell is for the x86_64 family. Here this is an ARM64 architecture and even with the generic ARMV8 core it fails on this ARM64 machine.

Furthermore, if I run the same code on macos/arm64 directly (without docker and underlying linux vm), openblas does not complain and does not yield the nan:

(dev) ogrisel@arm64-apple-darwin20 ~ % OPENBLAS_VERBOSE=1 OPENBLAS_CORETYPE=armv8 python arm64_bug.py 
(1000, 1000) float64
2.0
(dev) ogrisel@arm64-apple-darwin20 ~ % OPENBLAS_VERBOSE=1 OPENBLAS_CORETYPE=invalid python arm64_bug.py 
Core not found: invalid
Falling back to generic ARMV8 core
(1000, 1000) float64
2.0

@ogrisel
Copy link
Contributor Author

ogrisel commented Dec 24, 2020

I tried to extend the C reproducer to try more shapes and I still cannot reproduce: https://gist.github.com/ogrisel/efcfca806b39bb51d4ce695bad167b9c

@seberg
Copy link
Member

seberg commented Dec 24, 2020

Ohh, somehow I thought when you posted this:

./arm64_bug
 ** On entry to DGEMM  parameter number  8 had an illegal value
dgemm ok

it was a reproducer. Is that actually incorrect with the transpose, or does it at least point to something being off?

@ogrisel
Copy link
Contributor Author

ogrisel commented Jan 5, 2021

Is that actually incorrect with the transpose?

@seberg I think so. The original reproducer seems correct to me. In the extended candidate reproducer https://gist.github.com/ogrisel/efcfca806b39bb51d4ce695bad167b9c, I vary n and k and the results seem correct (I fail to reproduce the bug I observe when calling openblas DGEMM via numpy). I will probably need to use a debugger to double check the exact cblas_dgemm arguments when calling from numpy.

@seberg
Copy link
Member

seberg commented Jan 29, 2024

Going to close this, since arm64 is pretty common by now, so I am hoping it was solved in the meantime (i.e. if it wasn't we would have new issues opend probably).

@seberg seberg closed this as completed Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants