Meson errors with Python 3.10 #11864

Closed
lazka opened this issue Jun 17, 2022 · 42 comments

@lazka
Member

lazka commented Jun 17, 2022

It randomly fails when calling into meson scripts:

  1. mesa:

FAILED: src/util/u_unfilled_gen.c
"D:/M/msys64/mingw64/bin/meson" "--internal" "exe" "--capture" "src/util/u_unfilled_gen.c" "--" "D:\M\msys64\mingw64\bin/python3.EXE" "../../src/util/indices/u_unfilled_gen.py"

FAILED: src/gallium/targets/osmesa/osmesa.dll.p/osmesa.dll.symbols
"D:/M/msys64/mingw64/bin/meson" "--internal" "symbolextractor" "C:/_/mingw-w64-mesa/src/mesa-22.1.2/build/windows-x86_64" src/gallium/targets/osmesa/osmesa.dll "src/gallium/targets/osmesa/osmesa.dll.a" src/gallium/targets/osmesa/osmesa.dll.p/osmesa.dll.symbols

  2. gobject-introspection:

FAILED: girepository/libgirepository-1.0-1.dll.p/libgirepository-1.0-1.dll.symbols
"D:/a/msys64/clang64/bin/meson" "--internal" "symbolextractor" "C:/M/mingw-w64-gobject-introspection/src/build-x86_64-w64-mingw32" girepository/libgirepository-1.0-1.dll "girepository/libgirepository-1.0.dll.a" girepository/libgirepository-1.0-1.dll.p/libgirepository-1.0-1.dll.symbols

There was one more package that failed that way during the Python rebuild (with "symbolextractor", I think), but sadly I didn't write it down.

This was referenced Jun 17, 2022
@eli-schwartz

eli-schwartz commented Jun 17, 2022

It's completely bizarre that the symbolextractor fails without emitting an error message of any sort. It's reminiscent of pypa/pip#10875 (was a bug in the launcher *.exe files pip/distlib uses on Windows, reproducible in CI but not running interactively).

@lazka
Member Author

lazka commented Jun 17, 2022

Thanks, good point. The previous meson version was built with the same setuptools version though, so there shouldn't be any difference regarding the launcher.

I can try porting the meson package from setuptools to builder/installer so it gets a newer launcher. Maybe that makes a difference, or at least gives some more insight.

@naveen521kk
Member

> (was a bug in the launcher *.exe files pip/distlib uses on Windows, reproducible in CI but not running interactively).

Hmm, meson was built with setuptools launcher which shouldn't have this issue afaik.

@lazka
Member Author

lazka commented Jun 17, 2022

same error with the python-installer launcher (#11865)

@eli-schwartz

For the record, last I saw the latest version of distlib was the one with the wonky binaries, and pip specifically worked around this by downgrading the bundled distlib and rereleasing.

I'm not sure what other installers do on Windows specifically. They may bundle different versions of distlib.

@lazka
Member Author

lazka commented Jun 17, 2022

I fetched the meson build dir from a failed CI build, but there is nothing helpful in there.

Just to point out why it might be a different issue: this happens randomly in CI; sometimes it works, sometimes it doesn't. It started with the update to Python 3.10 over the last week. That could be a coincidence, but mesa built fine 2 weeks ago.

I'll (hackily) try forcing a sys.stderr for symbolextractor in CI and see if that helps next.

@lazka
Member Author

lazka commented Jun 17, 2022

After some more printf debugging: the internal meson command runs through fine, so something after that fails somehow. I'm giving up for now.

@lazka lazka pinned this issue Jun 18, 2022
kcgen added a commit to dosbox-staging/dosbox-staging that referenced this issue Jun 18, 2022
Reason for retry is due to intermittent Python symbol extractor
failure under MSYS2 reported in bug:
msys2/MINGW-packages#11864
@kcgen

kcgen commented Jun 18, 2022

@lazka , after seeing you mention the issue was intermittent, I've placed our Meson builds in a diaper script to re-run on failure.

Data points:

  • direct CI launch: run: meson -C build --> 60 jobs, all failing in CI, not one passed
  • direct CI launch: run: ninja -C build --> 18 jobs failing in CI, not one passed
  • script-wrapped launch: run: scripts/retry_command.sh 2 meson compile -C build --> 18 back-to-back good runs, no retries needed.

What's interesting is that the retry logic is never needed, making me suspect that running inside a shell wrapper is altering the parent-level runtime space/behavior in a way that Python likes (perhaps stacksize or other low-level stuff inside MSYS2) versus the "direct-launch" that's giving it troubles.

Perhaps to reproduce this locally, you'd need to launch the CI YAML run: command using whatever mechanism GitHub is using. Perhaps their CI code is Go(?) or Python(?) launching command as a system-call (and probably wrapped in some security-hardened WSL Docker instance).
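
For reference, a retry wrapper along those lines could look like the sketch below. It is not the actual scripts/retry_command.sh: the name `run_with_retries` and its parameters are made up for illustration, and it assumes the leading number is the maximum number of attempts.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=2, delay_seconds=0.0):
    """Run cmd and re-run it while it fails, up to max_attempts times.

    Returns a tuple (exit_code, attempts_used). Hypothetical helper,
    not part of Meson, Ninja or MSYS2.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = 0
    for attempt in range(1, max_attempts + 1):
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return exit_code, attempt
        # Report the failure on stderr so it shows up in the CI log.
        print(
            f"attempt {attempt}/{max_attempts} failed with exit code {exit_code}",
            file=sys.stderr,
        )
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    return exit_code, max_attempts
```

Something like `run_with_retries(["meson", "compile", "-C", "build"], max_attempts=2)` would then stand in for the direct launch. It only papers over the crash, of course.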

@lazka
Member Author

lazka commented Jun 18, 2022

I've forward ported an old ninja PR (ninja-build/ninja#1805) to show the exit status and it's STATUS_ACCESS_VIOLATION. So that could either be the launcher crashing or python itself. I'd guess python itself (?).
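
For anyone reading raw exit codes in their own logs, here is a minimal sketch of how such a status can be recognized. The NTSTATUS values are the documented Windows ones; the helper name `describe_exit_status` and the small lookup table are just for illustration and have nothing to do with the ninja patch itself.

```python
# A few well-known NTSTATUS values that show up as process exit codes.
KNOWN_STATUSES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def describe_exit_status(code):
    """Turn a process exit code into a readable string.

    Tools report the same status either as an unsigned 32-bit value
    (3221225477) or as a signed one (-1073741819), so normalize first.
    """
    unsigned = code & 0xFFFFFFFF
    name = KNOWN_STATUSES.get(unsigned)
    if name is not None:
        return f"{name} (0x{unsigned:08X})"
    return f"exit code {code}"


print(describe_exit_status(3221225477))
```

Both 3221225477 and -1073741819 map to the same access-violation status, which is handy when comparing logs from different tools.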

@shermp

shermp commented Jun 18, 2022

> Perhaps to reproduce this locally, you'd need to launch the CI YAML run: command using whatever mechanism GitHub is using. Perhaps their CI code is Go(?) or Python(?) launching command as a system-call (and probably wrapped in some security-hardened WSL Docker instance).

They may be using either .NET or NodeJS. That's what their self hosted runner and actions are using respectively at least.

@naveen521kk
Member

> So that could either be the launcher crashing or python itself. I'd guess python itself (?).

Oh wow, interesting.

@kcgen

kcgen commented Jun 19, 2022

A user running GDAL was hit with STATUS_ACCESS_VIOLATION on Windows, and configured the stack and environment settings prior to launch:

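import sys
import threading

if __name__ == "__main__":
    if sys.platform[:3] == "win":
        # Raise the recursion limit and the stack size for new threads,
        # then run the workload in such a thread. "driver" is the
        # script's own entry-point function (not shown here).
        sys.setrecursionlimit(10000000)
        threading.stack_size(200000000)
        thread = threading.Thread(target=driver)
        thread.start()

    # launch ...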

https://stackoverflow.com/questions/41000945/why-do-i-get-a-status-access-violation-when-i-run-this-python-script-on-windows

kcgen added a commit to dosbox-staging/dosbox-staging that referenced this issue Jun 23, 2022
This is (hopefully) a temporary work-around for the intermittent
Python 3.10 symbol extractor failure under MSYS2, reported in
ticket: msys2/MINGW-packages#11864
@lb90
Collaborator

lb90 commented Jun 25, 2022

ELI5? Some Meson internal commands fail randomly and it's unclear whether the problem is in Ninja, Meson or Python 3.10?

@lb90
Collaborator

lb90 commented Jun 25, 2022

FWIW, the error in #11915 can be reproduced reliably, meaning that it's not random: it happens any time I run ninja. It may be useful to investigate this issue further.

EDIT: maybe it's not related? The error message is: failed to load doclet 'D:/a/msys64/mingw64/lib/valadoc-0.56\doclets\devhelp'

@Biswa96
Member

Biswa96 commented Jun 25, 2022

Yeah, the vala error is a different topic. I can try to create a PR today to work around that issue.

@lb90
Collaborator

lb90 commented Jun 25, 2022

Ah, thanks! That would be great. I'm not well-versed in valadoc 🙂

@lb90
Collaborator

lb90 commented Jun 25, 2022

@lazka
Member Author

lazka commented Jun 26, 2022

> So that could either be the launcher crashing or python itself. I'd guess python itself (?).

it's python itself..

edit: the faulthandler doesn't print any stacktraces, despite python crashing..
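
For anyone who wants to repeat that experiment, a minimal sketch of one way to switch faulthandler on for a child interpreter, using the standard `-X faulthandler` option. The helper name `run_with_faulthandler` is made up for illustration, and this is not necessarily how it was enabled in the CI runs above.

```python
import subprocess
import sys


def run_with_faulthandler(script_args):
    """Run a Python child process with faulthandler enabled at startup."""
    cmd = [sys.executable, "-X", "faulthandler", *script_args]
    return subprocess.run(cmd, capture_output=True, text=True)


# Check that the child really has the fault handler installed.
result = run_with_faulthandler(
    ["-c", "import faulthandler; print(faulthandler.is_enabled())"]
)
print(result.stdout.strip())
```

Setting `PYTHONFAULTHANDLER=1` in the environment has the same effect. Either way the traceback is only written if the handler actually gets to run, so a crash that produces no output is consistent with what is described here.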

edit: I'm out of ideas now..

@anarazel

anarazel commented Sep 9, 2022

@lazka:

Enabling JIT debugging (6ff1a1f) makes the error go away. But at this point I'm just happy to have a workaround for CI.

I suspect that might be because setting MSYS=winjitdebug causes the error-mode for children to be set to 0. Normally that's set to SEM_FAILCRITICALERRORS | SEM_NOOPENFILEERRORBOX (in normal windows at least). Not having SEM_FAILCRITICALERRORS will cause a bunch of behavioural differences, which could easily prevent the crash from occurring.

myadmin@win10-nojoin UCRT64 ~
$ python -c "import msvcrt; print(hex(msvcrt.GetErrorMode()))"
0x3

$ export MSYS=winjitdebug
$ exec bash

myadmin@win10-nojoin UCRT64 ~
$ python -c "import msvcrt; print(hex(msvcrt.GetErrorMode()))"
0x0
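
If a build is launched from a Python script rather than a shell, the same setting can be passed through the child's environment. This is a rough sketch: the helper name `with_winjitdebug` is made up, and it assumes that `MSYS` holds space-separated options and that existing ones should be kept.

```python
import os


def with_winjitdebug(env=None):
    """Return a copy of env (default: os.environ) with winjitdebug in MSYS.

    Assumption: MSYS is a space-separated option list, so any options
    that are already set are preserved and winjitdebug is appended once.
    """
    base = dict(os.environ if env is None else env)
    options = base.get("MSYS", "").split()
    if "winjitdebug" not in options:
        options.append("winjitdebug")
    base["MSYS"] = " ".join(options)
    return base


print(with_winjitdebug({"MSYS": "winsymlinks:nativestrict"})["MSYS"])
```

The result would be handed to something like `subprocess.run(cmd, env=with_winjitdebug())`. In the session above the new value only showed up after `exec bash`, so the variable presumably has to be in place before the MSYS2 shell that runs the build starts.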

@anarazel

Turns out it's actually the presence of SEM_NOOPENFILEERRORBOX - I can, very very occasionally, reproduce crashes after os._exit() and before process exit. So this looks very likely to be a microsoft C runtime (or kernel) bug. In my case no msys environment is involved, fwiw.

@lb90
Collaborator

lb90 commented Sep 10, 2022

Could it be a DLL doing weird things from their DLL_PROCESS_DETACH then? (or the same from a TLS callback)

Actually in DLL_PROCESS_DETACH a DLL should check whether the process is exiting (lpvReserved != nullptr), and in that case do nothing at all. That's because the program state is undefined at that time:

Also documented on MSDN:

Unfortunately many DLLs don't check lpvReserved

@anarazel

anarazel commented Sep 12, 2022

> Could it be a DLL doing weird things from their DLL_PROCESS_DETACH then? (or the same from a TLS callback)

A quick grep through the python sources doesn't show a problematic DLL_PROCESS_DETACH callback that could be involved here. None of them do anything in the DLL_PROCESS_DETACH case. I didn't see any tls callbacks at all, but might be missing something (not a windows person).

Given the crash only happens with SEM_NOGPFAULTERRORBOX set it seems more likely to be a bug in the CRT or kernel to me, since that being set should lead to less application involvement.

Edit: I copy-pastoed SEM_NOOPENFILEERRORBOX instead of SEM_NOGPFAULTERRORBOX in an earlier version

@lb90
Collaborator

lb90 commented Sep 12, 2022

> I can, very very occasionally, reproduce crashes after os._exit() and before process exit

Can you reproduce locally? If so, may I ask how are you invoking python?

@anarazel

> Can you reproduce locally? If so, may I ask how are you invoking python?

Unfortunately only in CI, windows 2019 in my case. The method of invoking python doesn't seem to matter, it happens even when invoking via path/to/python.exe path/to/script.py

@jeremyd2019
Member

jeremyd2019 commented Nov 7, 2022

Thought I'd try to correct the record a bit here since this got referenced again. error mode 0x3 is actually SEM_FAILCRITICALERRORS|SEM_NOGPFAULTERRORBOX. Both relate to not showing a message box if an error occurs. Unfortunately, MS hooked just-in-time debugging (and WER) to the "GP fault error box" path, so if you disable that you cannot get those either. This is why the option I added is called winjitdebug (because that's what I was trying to enable). Also, mode 0x0 should be the OS default (though the default is also to inherit from the parent process, so you can't necessarily assume what they will be).

https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-seterrormode#parameters

@anarazel

anarazel commented Nov 7, 2022

> Also, mode 0x0 should be the OS default (though the default is also to inherit from the parent process, so you can't necessarily assume what they will

Oh, indeed. I saw 0x8001 because I was using the 'terminal app' to start cmd, rather than just cmd.exe directly. Extremely unhelpful by terminal to start the "sub terminals" with a different error mode than normal...

> Thought I'd try to correct the record a bit here since this got referenced again. error mode 0x3 is actually SEM_FAILCRITICALERRORS|SEM_NOGPFAULTERRORBOX.

The reference to SEM_FAILCRITICALERRORS | SEM_NOOPENFILEERRORBOX was not about the 0x3, but what I saw in the windows terminal. 0x3 / SEM_FAILCRITICALERRORS|SEM_NOGPFAULTERRORBOX is what some msys code sets when MSYS=winjitdebug isn't set.
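
To keep the two values apart, here is a small sketch that decodes an error-mode bitmask into flag names. The flag values are the ones documented for SetErrorMode; the helper name `decode_error_mode` is just for illustration.

```python
# Flag values as documented for the Win32 SetErrorMode function.
SEM_FLAGS = {
    0x0001: "SEM_FAILCRITICALERRORS",
    0x0002: "SEM_NOGPFAULTERRORBOX",
    0x0004: "SEM_NOALIGNMENTFAULTEXCEPT",
    0x8000: "SEM_NOOPENFILEERRORBOX",
}


def decode_error_mode(mode):
    """Return the names of the flags set in an error-mode value."""
    names = [name for bit, name in SEM_FLAGS.items() if mode & bit]
    # Report any bits that are not in the table instead of dropping them.
    unknown = mode & ~sum(SEM_FLAGS)
    if unknown:
        names.append(hex(unknown))
    return names


for value in (0x3, 0x8001, 0x0):
    print(hex(value), decode_error_mode(value))
```

With this, 0x3 decodes to SEM_FAILCRITICALERRORS plus SEM_NOGPFAULTERRORBOX and 0x8001 to SEM_FAILCRITICALERRORS plus SEM_NOOPENFILEERRORBOX, matching the two cases discussed here.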

@lb90
Collaborator

lb90 commented Nov 7, 2022

Regarding this, how could one go about setting up Github action that can reproduce the error?

@vintagepc

vintagepc commented Nov 7, 2022

It seems to happen fairly consistently with the MSYS build in my project repo linked in the referenced issue from 6h ago as it does a fairly involved Meson/Python build.

vintagepc/MINI404#116 has been giving me grief most recently. You're more than welcome to fork it and hack up the workflow to dig in.

@vintagepc

I'll add in that I just tried force-downgrading to the mingw64 python 3.9 package and it seems to have made the problem go away - will rerun the workflow a few times to confirm.

gnomesysadmins pushed a commit to GNOME/glib that referenced this issue Feb 21, 2023
Suggested by Christoph Reiter, this is a workaround for random Python
crashes in Meson which only appear on this platform.

It’s being tracked upstream at
msys2/MINGW-packages#11864, but unfortunately
it seems hard to fix.

Work around the issue the same way that Meson have in their CI, by
enabling JIT debugging. See
https://gitlab.gnome.org/GNOME/glib/-/merge_requests/3280#note_1678973.

Signed-off-by: Philip Withnall <pwithnall@endlessos.org>
@kmilos kmilos mentioned this issue Mar 20, 2023
7 tasks
@lazka
Member Author

lazka commented Jun 6, 2023

We've made some progress finding a potential cause here: #17415

With a bit of luck the update to 3.11 will fix this.

@lazka
Member Author

lazka commented Jun 25, 2023

I'm closing this in favor of #17415

@lazka lazka closed this as completed Jun 25, 2023
@Biswa96 Biswa96 added the bug label Jun 25, 2023
kasper93 added a commit to kasper93/mpv that referenced this issue Jun 25, 2023
sfan5 pushed a commit to mpv-player/mpv that referenced this issue Jun 26, 2023
dyphire pushed a commit to dyphire/mpv that referenced this issue Jul 3, 2023
dyphire pushed a commit to dyphire/mpv that referenced this issue Jul 8, 2023
@lazka
Member Author

lazka commented Jul 30, 2023

One last comment here: We updated to Python 3.11 yesterday, so this should no longer be an issue.
