Explore no-GIL support (free threading) #1038

JukkaL · 2023-11-26T16:10:18Z

PEP 703 (Making the Global Interpreter Lock Optional in CPython) was accepted in Oct 2023. Mypyc will require various changes to support CPython builds that don't use the GIL. We should experiment with the no-GIL build as soon as feasible once it's functional in the CPython main branch, since this is one of the biggest changes to the CPython runtime ever. Many existing and potential mypyc users would also likely want to use it.

Implementation

I haven't tried to think through all the implications carefully (experimentation will also be required), but here are some things that already seem likely:

- We'll need to generate a fine-grained lock for each mutable native instance, and we need to at least acquire the lock on each attribute get/set, for attributes with pointer types.
  - There may be concurrent mutations happening in another thread, and if we just read a reference without locking, another thread might have freed the object before we can increment the reference count.
  - If an attribute is final, we can perhaps read it without locking.
- Accessing mutable C statics needs locking. Also, initializing a final module-level attribute with a non-trivial initializer may need locking.
- We can't safely use borrowed references in some places where they are currently used, since we can't assume there are no concurrent mutations.
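The per-instance locking idea can be modeled at the Python level. This is only a sketch of the intended semantics, not what mypyc would generate: the `Node` class, its `_lock` field and the `get_child`/`set_child` helpers are hypothetical names, and real compiled code would use a C-level lock inside the native object struct.

```python
import threading


class Node:
    """Python-level model of a native instance with a per-object lock."""

    def __init__(self, child=None):
        self._lock = threading.Lock()  # hypothetical per-instance lock
        self._child = child            # stands in for a pointer-typed attribute

    def get_child(self):
        # Load the reference and take ownership of it while holding the
        # lock, so a concurrent setter cannot drop the last reference
        # between the load and the reference count increment.
        with self._lock:
            return self._child

    def set_child(self, value):
        # Swap under the lock; release the old reference only afterwards,
        # since releasing it may run arbitrary code (for example __del__).
        with self._lock:
            old = self._child
            self._child = value
        del old


def hammer(node, rounds):
    # Concurrent readers and writers on the same attribute.
    for _ in range(rounds):
        node.set_child(Node())
        child = node.get_child()
        assert child is None or isinstance(child, Node)


root = Node()
threads = [threading.Thread(target=hammer, args=(root, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_child = root.get_child()
```

In pure Python the lock is redundant for memory safety, because the interpreter already protects reference counts. The point is the shape of the critical sections that compiled attribute access would need.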

There are probably other changes.

Reference counting will become more expensive, and built-in containers, including list, dict and set, would also now use fine-grained locks to protect most operations. These changes will happen behind the scenes if we use the C API. Direct access to struct fields of (at least mutable) built-in objects is perhaps unsafe, unless we are careful to always take the necessary locks.

Performance impact

Obviously, not having GIL should enable better performance in many multi-threaded workloads. This is the main benefit.

Sequential performance is expected to be slower due to the extra synchronization and other changes. The PEP suggests around 7%-8% overhead when running mostly interpreted workloads. For compiled workloads the impact may be bigger, since compilation may not reduce the number of slower operations as much as it reduces other overhead.

Here's a contrived example which highlights the above issue. Let's assume that all the overhead would be from reference counting, and compilation would speed up overall performance by 5x with the GIL. Also, let's assume that compiled code needs the exact same reference count manipulations as interpreted code. Now a 7% overhead for interpreted code could result in an overhead of around 35% in compiled code (the same absolute cost measured against a 5x shorter run time), since reference counting accounts for a much larger fraction of time spent in compiled code.
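The arithmetic behind this contrived example can be spelled out. The sketch below takes the stated assumptions literally: the extra reference counting cost is the same absolute amount of time in both cases, and only the baseline run time shrinks with compilation. The function name `compiled_overhead` is made up for illustration.

```python
def compiled_overhead(interp_overhead, speedup):
    """Relative overhead in compiled code under the contrived assumptions.

    interp_overhead: fractional slowdown of interpreted code (0.07 for 7%).
    speedup: how much faster compiled code is than interpreted, with the GIL.
    """
    interp_time = 1.0                      # normalized interpreted run time
    extra = interp_time * interp_overhead  # absolute extra refcounting cost
    compiled_time = interp_time / speedup  # compiled run time with the GIL
    # The same absolute cost is now measured against a shorter baseline.
    return extra / compiled_time


low = compiled_overhead(0.07, 5.0)
high = compiled_overhead(0.08, 5.0)
print(f"compiled overhead: {low:.0%} to {high:.0%}")
```

With the PEP's 7%-8% range and a 5x speedup this gives about 35% to 40%, that is, the interpreted overhead multiplied by the speedup factor. Real workloads will differ, since compiled code does not perform exactly the same reference count operations.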

Multi-threaded code that uses packed arrays or numeric arrays could see very big benefits, as these could probably be accessed from multiple threads without fine-grained synchronization. Code that spends a lot of time in single-threaded C extensions (that don't use Python containers) could also benefit a lot.
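As an illustration of the packed-array case, here is a minimal sketch in which several threads read disjoint slices of a shared `array.array` and never mutate it. Whether this actually scales depends on the free-threaded build and on how the array accesses are compiled; the worker layout and names are assumptions made for the example.

```python
import threading
from array import array

# Packed array of doubles, shared read-only between the worker threads.
data = array("d", (float(i) for i in range(100_000)))
num_workers = 4
partial = [0.0] * num_workers  # one result slot per worker


def worker(index):
    # Each worker sums its own disjoint slice, so no two threads ever
    # write to the same location.
    chunk = len(data) // num_workers
    start = index * chunk
    stop = len(data) if index == num_workers - 1 else start + chunk
    total = 0.0
    for i in range(start, stop):
        total += data[i]
    partial[index] = total


threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(partial)
```

The array elements are raw doubles rather than object references, so reading them involves no reference counting on shared objects, which is why this kind of code could avoid fine-grained synchronization.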

Since mypy is single-threaded and uses lots of heap-allocated objects and built-in collections, it could experience a fairly high overhead.

Open issues / brainstorming

- Do we want to preserve the exact concurrency semantics and atomicity guarantees of CPython in compiled code? It could be fairly expensive, due to the high level of fine-grained locking required.
  - One option would be to avoid memory corruption, but use a somewhat relaxed memory model otherwise. For example, maybe access to mutable non-pointer values like i64 or float wouldn't require synchronization.
- Could we allow access to final attributes (with pointer types) without locking while conforming to CPython semantics?
  - We'd need to use synchronized reference counting anyway, but maybe we don't need to take a lock on the object which has the attribute. This seems possible at least in some cases, if another thread can't see an uninitialized object, and can only see an initialized object after a memory barrier.
- Would it be reasonable to access module-level finals with lazy initialization without locking? This seems unsafe, but maybe it's a deviation we could document as unsafe. Not sure if it would make much of a difference either way, though.
- More generally, is it always safe to access immutable objects without locking?
- Would it be useful to add some borrowing back when dealing with immutable expressions? For example, in a = x.y.z maybe we can still borrow x.y if it's a final attribute.
- Would it be useful to merge fine-grained locks to reduce locking overhead? For example, in code like self.x = self.y + 's', maybe we'd take a single lock around self instead of locking separately for self.y and self.x.
  - I guess here we can only merge locks if the code can't run arbitrary code, or perform an arbitrary number of loop iterations. Here we'd need to know if string concatenation can run arbitrary code.
  - Within loops, would it make sense to only release the lock once every N iterations, if the loop body can't run arbitrary code?
- Should we try to analyze which object references can never be seen by other threads, and skip locking when using these?
  - For example, if we build a list locally within a function, it's quite possible it can't leak outside the function until it's complete/freed. (We'll need to assume no gc.get_objects() in other threads, which seems acceptable.)
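The lock merging question for self.x = self.y + 's' can be illustrated with a Python-level model. This is a sketch only: `Obj`, `_lock`, `update_separate` and `update_merged` are hypothetical names, and it assumes that concatenating two exact str values cannot run arbitrary code.

```python
import threading


class Obj:
    def __init__(self, y):
        self._lock = threading.Lock()  # hypothetical per-instance lock
        self.x = ""
        self.y = y


def update_separate(o):
    # One lock acquisition per attribute access.
    with o._lock:
        tmp = o.y
    result = tmp + "s"
    with o._lock:
        o.x = result


def update_merged(o):
    # A single critical section covering the read, the concatenation and
    # the write. Only valid if nothing inside can run arbitrary code,
    # since such code could try to take the same (non-reentrant) lock.
    with o._lock:
        o.x = o.y + "s"


a = Obj("item")
b = Obj("item")
update_separate(a)
update_merged(b)
```

Both variants produce the same result when there is no contention. The merged form halves the number of lock operations, and it also makes the read-modify-write atomic with respect to other threads that take the same lock, which is a stronger guarantee than the separate form gives.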

Useful links

- PEP 703 -- Making the Global Interpreter Lock Optional in CPython
- python/cpython#108219 (for tracking progress in CPython)

Tasks

- Get mypyc working with recent Python 3.13 alpha/beta (with GIL)
- Wait until the no-GIL build is at least somewhat functional on CPython main branch
- Try to get some benchmarks to run with the no-GIL build
- Try to get compiled mypy to run with the no-GIL build, and measure performance
- Run some multithreaded benchmarks
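For the benchmarking tasks, a minimal harness might look like the sketch below. It reports whether the interpreter is a free-threaded build and whether the GIL is active, then times a CPU-bound function on one and on four threads. `sys._is_gil_enabled()` is only available from Python 3.13, so it is looked up defensively. The workload and thread counts are arbitrary choices for illustration.

```python
import sys
import sysconfig
import threading
import time


def gil_status():
    # Py_GIL_DISABLED is set for free-threaded builds; on older versions
    # the config variable is missing and this evaluates to False.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled


def work(n):
    # CPU-bound loop; a candidate for compiling with mypyc.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_threads(num_threads, n):
    results = [0] * num_threads

    def target(slot):
        results[slot] = work(n)

    threads = [threading.Thread(target=target, args=(i,)) for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results


elapsed_1, results_1 = run_threads(1, 200_000)
elapsed_4, results_4 = run_threads(4, 200_000)
print("free-threaded build, GIL enabled:", gil_status())
print(f"1 thread: {elapsed_1:.3f}s, 4 threads: {elapsed_4:.3f}s")
```

With the GIL, the four-thread run is expected to take roughly four times as long as the single-thread run. On a working no-GIL build with enough cores, the two times should be much closer.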


JukkaL · 2023-11-26T16:10:34Z

cc @msullivan @ilevkivskyi

colesbury · 2023-11-29T18:51:18Z

Let me know if there's anything I can help with. FYI, it's not yet feasible to experiment with the 3.13 --disable-gil builds, since not enough has been integrated yet. There are probably at least a few more months of work before testing with it is possible.

stonebig · 2024-05-20T12:59:30Z

Hi,

At the moment I can even build a free-threading binary wheel with pip-24.1b1: msvc_runtime-14.38.33135-cp313-cp313t-win_amd64.whl

But then mypyc fails on me if I try to accelerate a pure Python file.

The basic error is:

WPy64-31300b1b\python-3.13.0b1.amd64\include\internal/pycore_frame.h(8): fatal error C1189: #error:  "this header requires Py_BUILD_CORE define"

JukkaL added the feature Supporting previously unsupported Python, new native types, new features, etc. label Nov 26, 2023

JukkaL changed the title ~~Explore no-GIL support ("free threading")~~ Explore no-GIL support (free threading) May 18, 2024

JukkaL mentioned this issue May 18, 2024

Development focus areas for 2024 #785

Open
