BUG: Segmentation fault when calling pytorch function after np.exp (numpy 1.21.2) #21714

Open
kokamido opened this issue Jun 10, 2022 · 4 comments
Labels
00 - Bug · 31 - Third-party binaries (Install/import issues other than Anaconda-specific) · 57 - Close? (Issues which may be closable unless discussion continued)

Comments


kokamido commented Jun 10, 2022

Describe the issue:

Hi! There is an issue involving numpy and pytorch. I can't reproduce it with numpy 1.21.3, but it occurs with 1.21.2. If I run the provided code example with SIZE=15, both print calls (they are exactly the same) print True. With SIZE=20, the first print displays True but the second crashes with a segmentation fault. With SIZE=1000, it displays True and then False. If I remove the np.exp call, the code prints True True for any positive int SIZE.
This behavior can be reproduced in the following docker container:

FROM ubuntu:focal-20220531

RUN apt update

RUN apt install -y wget

# Miniconda installation
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh

RUN /opt/conda/bin/conda install  -c pytorch -c nvidia pytorch==1.10.1

RUN /opt/conda/bin/conda install  -c pytorch -c nvidia numpy==1.21.2

ENTRYPOINT /bin/bash

Reproduce the code example:

import torch
import numpy as np

SIZE=19

print(torch.all(torch.isfinite(torch.fft.fft(torch.eye(SIZE), dim=1))))
np.exp([2])
print(torch.all(torch.isfinite(torch.fft.fft(torch.eye(SIZE), dim=1))))

Error message:

No response

NumPy/Python version information:

numpy==1.21.2
pytorch==1.10.1
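
Because no error message was captured, it can help to run the repro in a child process so that a crash is reported rather than taking down the calling shell. The following is a minimal sketch, not part of the original report: `run_snippet` and `REPRO` are hypothetical names, the sizes are the ones from the description above, and it assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_snippet(code: str, timeout: float = 120.0) -> str:
    """Run `code` in a fresh interpreter and describe how it ended."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    if proc.returncode < 0:
        # POSIX only: the child was terminated by signal number -returncode.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    if proc.returncode != 0:
        return f"exit code {proc.returncode}"
    return proc.stdout.strip()


# The repro from the issue, with SIZE turned into a format placeholder.
REPRO = """
import torch
import numpy as np
print(torch.all(torch.isfinite(torch.fft.fft(torch.eye({size}), dim=1))))
np.exp([2])
print(torch.all(torch.isfinite(torch.fft.fft(torch.eye({size}), dim=1))))
"""

if __name__ == "__main__":
    for size in (15, 20, 1000):
        print(size, "->", run_snippet(REPRO.format(size=size)))
```

Running each size in its own interpreter also keeps one crash from hiding the results for the other sizes.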

Member

seberg commented Jun 10, 2022

I suspect this is related to gh-20405, which can cause fairly random behavior. That issue is fixed in 1.22.0 and later.

I am not quite sure how old that issue is. A complication is that a compiler bug was also involved. I will have to dig deeper, but it may be that the issue only "appeared" with a new GCC release, so at the time of the release all may have been fine, and now it is not because the nvidia channel uses a newer compiler...

Member

seberg commented Jun 10, 2022

From the discussion in gh-20356, I suspect that the bug would only occur with gcc 10. I wonder what the best course of action is; this is also somewhat related to gh-21713. EDIT: I am not sure which gcc versions it appears in, or when or whether it got fixed. Older ones probably did not optimize as aggressively and so did not show it.

Maybe we should backport some of these at least as source-only, since channels like the nvidia one can then still pick them up or at least find them.

EDIT: Nvm, the nvidia channel of course only has nvidia packages, this would be from the default anaconda channel.

seberg added the 31 - Third-party binaries (Install/import issues other than Anaconda-specific) label Jun 10, 2022
Member

seberg commented Jun 10, 2022

@kokamido I am not quite sure how to best proceed. Maybe you can confirm that this is on a machine with a SkylakeX CPU? It might be nice to confirm that the specific patch works, but that will require compiling NumPy on an affected machine (I don't have a skylakex machine here).

If it is important to you to get a 1.21.x release that is guaranteed to be fixed, maybe we need to open an Anaconda issue?
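
One way to answer the CPU question on Linux is to look at the flags in /proc/cpuinfo. This is only a sketch under a stated assumption: that "SkylakeX" here corresponds to the AVX-512 flag set that NumPy groups as AVX512_SKX (avx512f, avx512cd, avx512bw, avx512dq, avx512vl). The names `SKX_FLAGS`, `cpu_flags`, and `missing_skx_flags` are hypothetical helpers, not part of NumPy.

```python
# Assumption: these flags are taken to stand for a "SkylakeX"-class CPU.
SKX_FLAGS = {"avx512f", "avx512cd", "avx512bw", "avx512dq", "avx512vl"}


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flags of the first CPU listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_skx_flags(cpuinfo_text: str) -> set:
    """Return the assumed SkylakeX flags that this CPU does not report."""
    return SKX_FLAGS - cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        print("no /proc/cpuinfo available on this system")
    else:
        missing = missing_skx_flags(text)
        print("all assumed SkylakeX flags present" if not missing
              else f"missing: {sorted(missing)}")
```

Newer CPUs can report the same flags, so an empty result does not by itself identify the machine as SkylakeX.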

@kokamido
Author

In my tests the repro triggers on both an Intel Xeon Gold 5320T (which is Ice Lake) and an Intel Core i7-11800H (which is Tiger Lake). It doesn't reproduce with 1.21.3 or 1.21.6 from Anaconda (I haven't tested 1.21.4 and 1.21.5).
I don't need a fixed 1.21.x release because I can use numpy>=1.22. I opened this issue because the problem looked bizarre and I couldn't google anything directly related to it.
If the problem was completely fixed in release 1.22, then this issue can be closed. Otherwise I can test something in my environment if you think it would be useful.
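
The numpy>=1.22 workaround can be enforced with a small startup check. This is a minimal sketch, not an official NumPy facility: `version_tuple` and `has_fix` are hypothetical helpers, and the 1.22.0 threshold is taken from the comment above stating that gh-20405 is fixed in 1.22.0 and later. Pre-release suffixes such as `rc1` are ignored, which is a simplification.

```python
import re


def version_tuple(version: str) -> tuple:
    """Extract the leading numeric (major, minor, patch) part of a version."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def has_fix(version: str) -> bool:
    """True if `version` is at least 1.22.0 (threshold assumed from the thread)."""
    return version_tuple(version) >= (1, 22, 0)


if __name__ == "__main__":
    for candidate in ("1.21.2", "1.21.6", "1.22.0"):
        print(candidate, "->", has_fix(candidate))
```

In an application this would be called as `has_fix(np.__version__)` after `import numpy as np`. Note that the check only encodes the version threshold; it says nothing about 1.21.x builds such as 1.21.3, which did not reproduce the crash in the tests reported here.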
