Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: PyCapsule_SetName segfault #4420

Closed
3 tasks done
tylerjereddy opened this issue Dec 24, 2022 · 17 comments
Closed
3 tasks done

[BUG]: PyCapsule_SetName segfault #4420

tylerjereddy opened this issue Dec 24, 2022 · 17 comments
Labels
triage New bug, unverified

Comments

@tylerjereddy
Copy link

tylerjereddy commented Dec 24, 2022

Required prerequisites

What version (or hash if on master) of pybind11 are you using?

2.10.2

Problem description

A detailed analysis of the situation that caused a hold-up of the SciPy release process on Windows is in this issue: scipy/scipy#17644 (comment)

Our testsuite can't even start with pybind11 2.10.2 -- it literally segfaults during the pytest collection phase, and more specifically import scipy.stats segfaults, and scipy.fft import will fail as desribed here: scipy/scipy#17644 (comment)

Reproducible example code

Unfortunately, I probably don't have the bandwidth to excise an isolated reproducer, but I bet the pybind11 folks will benefit from my DLL analysis over here: scipy/scipy#17644 (comment)

Is this a regression? Put the last known working version here if it is.

2.10.1

@matthew-brett
Copy link

For convenience, here's some information ported over from scipy/scipy#17644 (comment) and similar.

Scipy has a compiled module scipy.fft.pypocketfft. This module compiles and runs fine on all platforms with Pybind 2.10.1.

With Pybind 2.10.2, on Windows only, the compiled pypocketfft module segfaults on import during test collection.

Tyler's investigation revealed this was due a new missing dependency. Direct import generates the following:

>>> import scipy.fft
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\treddy\python310_venv_windows\lib\site-packages\scipy\fft\__init__.py", line 92, in <module>
    from ._helper import next_fast_len
  File "C:\Users\treddy\python310_venv_windows\lib\site-packages\scipy\fft\_helper.py", line 3, in <module>
    from ._pocketfft import helper as _helper
  File "C:\Users\treddy\python310_venv_windows\lib\site-packages\scipy\fft\_pocketfft\__init__.py", line 3, in <module>
    from .basic import *
  File "C:\Users\treddy\python310_venv_windows\lib\site-packages\scipy\fft\_pocketfft\basic.py", line 6, in <module>
    from . import pypocketfft as pfft
ImportError: DLL load failed while importing pypocketfft: A dynamic link library (DLL) initialization routine failed.

Further investigation revealed that there was a difference in declared imports between Pybind 2.10.1 and 2.10.2 - the 2.10.2 binary no longer imported Function PyCapsule_SetName (with Import from module python310.dll).

@rgommers wondered whether this might be due to #4254 .

@rwgk
Copy link
Collaborator

rwgk commented Dec 26, 2022

I left a comment here: scipy/scipy#17644 (comment)

@tylerjereddy
Copy link
Author

I excised the reproducer from SciPy so that you now have a standalone project with a single C++ source file, 1 C++ header file, and a build file: https://github.com/tylerjereddy/pybind_repro

The README there has a short description of reproducing on 2.10.2 vs. success with 2.10.1. It could probably be cut down to an even simpler reproducer that is test-suite appropriate, but that probably gets into the weeds of C++ that are better dealt with by pybind11 team.

At the very least, this should completely uncouple your debugging cycles from SciPy, which is going to save a ton of time most likely.

@EthanSteinberg
Copy link
Collaborator

EthanSteinberg commented Dec 27, 2022

This turned out to not be a pybind11 issue as it was a bug within scipy. See scipy/scipy#17662.

We should fix have better error logging within pybind11 though. See #4426

@rgommers
Copy link

Thanks a lot for getting to the bottom of this @lalaland!

This turned out to not be a pybind11 issue

Given that this went into a bug fix rather than a minor release, and caused issues in SciPy and PyTorch already, it may still be wise to yank this release? You can argue that going from "working in practice but UB" to "hard to debug segfault" is not a pybind11 bug technically, but it's still pretty impactful. It even defeated SciPy's fairly careful pinning for the last already released version ("pybind11>=2.4.3,<2.11.0", here). Yanking is a cheap safety measure here.

@EthanSteinberg
Copy link
Collaborator

EthanSteinberg commented Dec 28, 2022

@rgommers Yeah, I think I agree with you. This is not acceptable for a bug fix change and it should probably be reverted and added back with better error reporting in the next minor release. Super sorry for all the work this caused on your part (especially during this time of the year).

@rwgk
Copy link
Collaborator

rwgk commented Dec 28, 2022

I'm sorry, too, that the failure reporting is so unobvious on some platforms.

But what is "yanking"?

I was thinking we make a 2.10.3 release with the better error reporting asap. Should we do something else?

@rwgk
Copy link
Collaborator

rwgk commented Dec 28, 2022

I forgot to add: only debug builds are affected.

@matthew-brett
Copy link

Yanking is a process you can do on Pypi to remove the release so it does not appear as a candidate for installation.

@rwgk
Copy link
Collaborator

rwgk commented Dec 28, 2022 via email

@henryiii
Copy link
Collaborator

Okay, I've yanked it. Let's get a 2.10.3 out ASAP since anyone relying on the Windows ARM support added in 2.10.2 is now broken (unless they request >=2.10.2, which is not likely just to add a single modern platform).

@henryiii
Copy link
Collaborator

From what I understand, we revert the UB -> error behavior for 2.10.3, and make it an error with better error reporting in 2.11?

@rgommers
Copy link

Super sorry for all the work this caused on your part (especially during this time of the year).

No worries at all, and thanks for your help in resolving the issue!

@rwgk
Copy link
Collaborator

rwgk commented Dec 28, 2022

I wrote down my thoughts here: #4432 (comment)

@henryiii
Copy link
Collaborator

Could you verify the UB was actually working as intended in SciPy (and/or PyTorch)? (The py::none defined was actually correctly an instance of None after Python initialized). If that's the case, I think we should have this new check (added in 2.10.2) disabled by default in 2.10.3 (and enabled in master / 2.11). If not, I think it was exposing a real issue and was fine for a patch release.

@EthanSteinberg
Copy link
Collaborator

@henryiii I highly doubt the UB in SciPy did anything wrong. Calling inc_ref on a none object is unlikely to break anything. Still worth fixing though.

@rgommers
Copy link

Agreed - and no bug reports in ~3 years related to this for heavily used functionality in scipy.fft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage New bug, unverified
Projects
None yet
Development

No branches or pull requests

6 participants