NaN in matrix-matrix multiplication on linux/arm64 guest on macos/arm64 host #18061
Comments
I am not sure what numpy devs can do to fix this, but I would appreciate it if someone knowledgeable could double-check that my C reproducer is indeed equivalent to the failing numpy code. Let me know if you want me to run complementary analysis.
Can you verify that the
This is inside a docker container in a Linux VM, so Accelerate is not visible. Anyway, on numpy master, building against Accelerate is no longer possible. And the 1.19.4 wheel has a hardcoded rpath that points to the vendored openblas. But just to be sure, here is the output of the
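Beyond inspecting rpaths, one can check at runtime which BLAS shared object the process actually loaded. A sketch, assuming Linux (it reads `/proc/self/maps`, which does not exist on macOS):

```python
import numpy as np  # noqa: F401 -- importing numpy forces its BLAS to be loaded

# Linux-only sketch: list the BLAS/LAPACK-looking shared objects
# currently mapped into this process.
with open("/proc/self/maps") as f:
    paths = {line.split()[-1] for line in f if line.split()[-1].startswith("/")}
blas_libs = sorted(p for p in paths if "blas" in p.lower() or "lapack" in p.lower())
print(blas_libs)
```

For a manylinux wheel, the printed paths would typically point into the wheel's vendored `numpy.libs` directory.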
Note that the LAPACK from my local OpenBLAS build is broken, but this is fine because it should not impact a simple matrix-matrix product (DGEMM / BLAS level 3). Also:
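Since only DGEMM (BLAS level 3) is exercised here, one way to isolate it is to compare the BLAS product against `np.einsum` with `optimize=False`, which uses numpy's own loops instead of BLAS. A sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 64  # made-up shapes for illustration
A = rng.standard_normal((n, k))
B = rng.standard_normal((k, n))

C = A @ B  # dispatched to DGEMM (BLAS level 3)
ref = np.einsum("ij,jk->ik", A, B, optimize=False)  # numpy loops, no BLAS

print(np.isnan(C).any(), np.allclose(C, ref))
```

On a healthy BLAS both checks pass; on the broken setup described here, the `@` product would contain NaNs while the einsum reference would not.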
It might be possible that the macos virtualization layer that underpins the linux system used by docker on macos is trying to do clever things and is actually broken. However, I would have expected to reproduce the problem in my pure C program that calls into
I edited the original report to also include an
@ogrisel einsum used to default to

About the C repro, I would have to check more carefully, but I think it may be that numpy/numpy/core/src/common/cblasfuncs.c, line 678 in 669cd13
Going to (but not changing the order): numpy/numpy/core/src/common/cblasfuncs.c, line 36 in 669cd13
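The transpose handling discussed above can be exercised from Python without touching C: whether numpy passes `CblasTrans` or `CblasNoTrans` for an operand in `cblasfuncs.c` depends on its memory layout. A sketch (shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((400, 64))   # C-contiguous operand
B = rng.standard_normal((300, 64))

# B.T is a transposed view, so numpy can pass CblasTrans for it
# instead of copying; this exercises the (NoTrans, Trans) DGEMM path.
C_view = A @ B.T
# A contiguous copy of B.T makes both operands NoTrans instead.
C_copy = A @ np.ascontiguousarray(B.T)

print(np.allclose(C_view, C_copy))
```

If only one of the two variants produced NaNs, that would narrow the bug down to a specific transpose combination in the OpenBLAS kernel.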
Thanks @seberg, I can now reproduce the original problem when using

I also swapped the transpose flags and now get the following output:
which means that none of the resulting values check the
Cool, the only other thing I can think of to note is that some VMs ended up picking the wrong kernels because the VM pretended to support more than it actually did. I doubt that is the case, but you can likely still work around the issue by picking a specific OpenBLAS kernel (and checking which kernel is in use, and whether using a different one fixes it, might be helpful there).
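Picking a specific kernel can be done with the `OPENBLAS_CORETYPE` environment variable (effective only for `DYNAMIC_ARCH` builds, and it must be set before OpenBLAS is loaded). A sketch; the per-architecture core names below are the usual OpenBLAS ones and are assumptions to adapt for your machine:

```python
import os
import platform

# Must be set before numpy (and thus OpenBLAS) is imported.
# ARMV8 is the generic arm64 core; HASWELL is a common x86_64 one.
core = {"aarch64": "ARMV8", "arm64": "ARMV8", "x86_64": "HASWELL"}.get(platform.machine())
if core is not None:
    os.environ.setdefault("OPENBLAS_CORETYPE", core)

import numpy as np

a = np.ones((400, 400))
print(np.isnan(a @ a).any())
```

If the NaNs disappear with one core type but not another, that points at a bug in a specific optimized kernel rather than in the dispatching machinery.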
If you are compiling with clang, #18005 may be relevant.
OpenBLAS is detecting the generic ARMv8 architecture (which seems correct):
For reference:
I am compiling with the default compiler of the python docker image (based on debian 10.7):
Forgot to mention, if I replace |
@ogrisel The latter code goes into a different BLAS call (syrk) for optimization, at least on newer NumPy versions (not sure, probably for a few years now). I was just thinking you could try some very basic core type (haswell, if that makes sense?) or so, to see if the result changes with the core that is picked.
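The syrk shortcut mentioned above can be observed from Python: when both operands share the same data buffer (`A @ A.T`), numpy dispatches to DSYRK rather than DGEMM. A sketch comparing the two paths, with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((363, 50))
B = A.copy()  # equal values, but a distinct buffer

C_syrk = A @ A.T  # same buffer: numpy routes this to DSYRK
C_gemm = A @ B.T  # distinct buffers: plain DGEMM

print(np.allclose(C_syrk, C_gemm))
```

A discrepancy (or NaNs in only one of the two results) would indicate which BLAS routine is at fault.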
Haswell is for the x86_64 family. This is an ARM64 architecture, and it fails on this ARM64 machine even with the generic ARMV8 core. Furthermore, if I run the same code on macos/arm64 directly (without docker and the underlying linux VM), openblas does not complain and does not yield the NaN:
I tried to extend the C reproducer to try more shapes and I still cannot reproduce: https://gist.github.com/ogrisel/efcfca806b39bb51d4ce695bad167b9c
Ohh, somehow I thought when you posted this:
it was a reproducer. Is that actually incorrect with the transpose, or does it at least point to something being off?
@seberg I think so. The original reproducer seems correct to me. In the extended candidate reproducer https://gist.github.com/ogrisel/efcfca806b39bb51d4ce695bad167b9c, I vary n and k and the results seem correct (I fail to reproduce the bug I observe when calling openblas DGEMM via numpy). I will probably need to use a debugger to double check the exact
Going to close this, since arm64 is pretty common by now, so I am hoping it was solved in the meantime (i.e. if it wasn't, we would probably have new issues opened by now).
I have observed many incorrect values when running the scikit-learn test suite in a docker container on macos with Apple Silicon M1. I used https://docs.docker.com/docker-for-mac/apple-m1/ to install docker on this machine.
Then I run docker with:
to get a linux/arm64 environment, on which I installed `build-essential` and `gfortran` using APT, and installed numpy and OpenBLAS either with pip or from source.

I could reduce the problem to the following minimal reproducer:
which yields
I can reproduce this problem both with the numpy 1.19.4 manylinux aarch64 wheel from PyPI and with numpy master built against OpenBLAS master (`DYNAMIC_ARCH=1`).

Note that the problem disappears if `n < 363`...

I suspected a bug in OpenBLAS itself, so I tried with the following reproducer candidate, which should be equivalent to the numpy code above:
but I cannot reproduce in this case:
I built the above snippet against OpenBLAS master, which could be used to reproduce the original problem when run with numpy.
Any idea how to investigate this problem further?
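One possible next step is a small scan around the reported threshold (n = 363, per the report) that flags any product containing NaNs; a healthy BLAS should report none. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
bad = []
for n in range(360, 368):  # bracket the reported n = 363 threshold
    A = rng.standard_normal((n, 50))
    B = rng.standard_normal((50, n))
    if np.isnan(A @ B).any():
        bad.append(n)
print(bad)  # empty on a healthy BLAS
```

Running this inside and outside the VM would show whether the failure threshold is stable across shapes.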
When using numpy / openblas directly on the macos/arm64 host (installed with conda-forge), I do not get any problem.
EDIT: I also tried the following variant using `np.einsum` and it ~~does not fail~~ fails in a similar way, but only when `optimize=True`, which makes sense.
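The `optimize=True` observation is consistent with how einsum dispatches: with `optimize=True` it can route eligible contractions through BLAS (via tensordot/GEMM), while `optimize=False` uses einsum's own C loops and bypasses OpenBLAS entirely. A sketch contrasting the two paths, with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))
B = rng.standard_normal((40, 100))

C_opt = np.einsum("ij,jk->ik", A, B, optimize=True)   # may dispatch to BLAS
C_raw = np.einsum("ij,jk->ik", A, B, optimize=False)  # einsum loops only

print(np.allclose(C_opt, C_raw))
```

On a broken BLAS, only the optimized path would show the NaNs, matching what was reported.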