PEP 703 (Making the Global Interpreter Lock Optional in CPython) was accepted in Oct 2023. Mypyc will require various changes to support CPython builds that don't use the GIL. We should experiment with the no-GIL build as soon as feasible once it's functional in the CPython main branch, since this is one of the biggest changes to the CPython runtime ever. Many existing and potential mypyc users would also likely want to use it.
Implementation
I haven't tried to think through all the implications carefully (experimentation will also be required), but here are some things that already seem likely:
We'll need to generate a fine-grained lock for each mutable native instance, and we'll need to at least acquire the lock on each get/set of an attribute with a pointer type.
There may be concurrent mutations happening in another thread, and if we just read a reference without locking, another thread might free the object before we can increment its reference count.
If an attribute is final, we can perhaps read it without locking.
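To make the attribute rules above concrete, here is a minimal Python model of what generated code for a native class might do. Real mypyc output would be C. The class and method names (`NativeInstance`, `get_name`, `set_name`, `get_kind`) are hypothetical. Each instance owns a lock. The mutable pointer-typed attribute is read and written under that lock, while the final attribute is assigned once in `__init__` and read without locking.

```python
import threading


class NativeInstance:
    """Hypothetical model of a compiled native object on a no-GIL build."""

    __slots__ = ("_lock", "_name", "kind")

    def __init__(self, kind: str) -> None:
        self._lock = threading.Lock()  # fine-grained lock owned by this instance
        self._name: object = None      # mutable attribute with a pointer type
        self.kind = kind               # final: assigned once, here

    def get_name(self) -> object:
        # In C, the lock would cover loading the pointer *and* the incref,
        # so another thread can't free the object in between.
        with self._lock:
            return self._name

    def set_name(self, value: object) -> None:
        # In C, the old value would be decref'd after the swap.
        with self._lock:
            self._name = value

    def get_kind(self) -> str:
        # Final attribute: never reassigned, so it's read without the lock.
        return self.kind


obj = NativeInstance("config")
threads = [threading.Thread(target=obj.set_name, args=(f"n{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(obj.get_name(), obj.get_kind())
```

Which name wins is nondeterministic. The lock only guarantees that every read sees some complete, live value.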
Accessing mutable C statics needs locking. Also, initializing a final module-level attribute with a non-trivial initializer may need locking.
We can't safely use borrowed references in some places where they are currently used, since we can't assume there are no concurrent mutations.
There are probably other changes.
Reference counting will become more expensive, and built-in containers, including list, dict and set, will also use fine-grained locks to protect most operations. These changes happen behind the scenes if we use the C API. Directly accessing struct fields of built-in objects (at least mutable ones) is perhaps unsafe, unless we are careful to always take the necessary locks.
Performance impact
Obviously, not having a GIL should enable better performance in many multi-threaded workloads. This is the main benefit.
Sequential performance is expected to be slower due to the extra synchronization and other changes. The PEP suggests around 7%-8% overhead for mostly interpreted workloads. For compiled workloads the impact may be bigger, since compilation may not reduce the number of operations that become slower (such as reference count updates) as much as it reduces other overhead.
Here's a contrived example that highlights this issue. Let's assume that all of the overhead comes from reference counting, and that compilation speeds up overall performance by 5x with the GIL. Also, let's assume that compiled code needs exactly the same reference count manipulations as interpreted code. Then a 7% overhead for interpreted code would become roughly a 35% overhead for compiled code: if an interpreted run takes 100 time units and a compiled run takes 20, the same 7 extra units are 35% of 20. This happens because reference counting accounts for a much larger fraction of the time spent in compiled code.
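The contrived example's arithmetic can be checked directly. Under its assumptions, the absolute extra time from reference counting is unchanged by compilation while the baseline shrinks, so the relative overhead is simply the interpreted overhead multiplied by the speedup. The function name below is illustrative only.

```python
def compiled_overhead(interp_overhead: float, speedup: float) -> float:
    """Relative no-GIL overhead of compiled code under the stated assumptions.

    Assumes all overhead comes from reference counting and that compiled code
    performs exactly the same reference count operations as interpreted code.
    """
    interp_time = 1.0                            # interpreted run time, with GIL
    extra_time = interp_overhead * interp_time   # absolute refcount overhead
    compiled_time = interp_time / speedup        # compiled run time, with GIL
    return extra_time / compiled_time


print(f"{compiled_overhead(0.07, 5.0):.0%}")  # 35%
print(f"{compiled_overhead(0.08, 5.0):.0%}")  # 40%
```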
Multi-threaded code that uses packed arrays or numeric arrays could see very big benefits, as these could probably be accessed from multiple threads without fine-grained synchronization. Also, code that spends a lot of time in single-threaded C extensions (that don't use Python containers) could also benefit a lot.
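The access pattern that lets packed arrays skip fine-grained synchronization is sketched below. Each thread works on its own disjoint index range of a fixed-size `array.array`, and the array is never resized. The function name is hypothetical. Whether a particular array type can safely be shared this way on a free-threaded build depends on its implementation, so treat that as an assumption. The sketch mainly shows the kind of code compiled to raw buffer accesses that could benefit.

```python
import threading
from array import array


def parallel_square(data: array, n_threads: int = 4) -> None:
    """Square every element in place, one disjoint chunk per thread.

    No per-element locking: threads never touch each other's indices and the
    array is never resized while they run.
    """
    n = len(data)
    chunk = max(1, (n + n_threads - 1) // n_threads)

    def work(start: int, stop: int) -> None:
        for i in range(start, stop):
            data[i] = data[i] * data[i]

    threads = [
        threading.Thread(target=work, args=(s, min(s + chunk, n)))
        for s in range(0, n, chunk)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


nums = array("q", range(10))
parallel_square(nums, n_threads=3)
print(list(nums))
```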
Since mypy is single-threaded and uses lots of heap-allocated objects and built-in collections, it could experience a fairly high overhead.
Open issues / brainstorming
Do we want to preserve the exact concurrency semantics and atomicity guarantees of CPython in compiled code? It could be fairly expensive, due to the high level of fine-grained locking required.
One option would be to avoid memory corruption but otherwise use a somewhat relaxed memory model. For example, maybe accessing mutable non-pointer values like i64 or float wouldn't require synchronization.
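One way to frame the strict vs. relaxed choice is as a per-attribute policy that the code generator consults. The sketch below is hypothetical: `Semantics`, `Attr` and `needs_lock` are not mypyc APIs. It encodes the distinctions discussed here. Pointer-typed attributes always need protection because a racing writer could free the old object, unboxed values only need it under strict semantics, and finals are assumed to be safely published.

```python
from dataclasses import dataclass
from enum import Enum


class Semantics(Enum):
    STRICT = "strict"    # preserve CPython's atomicity guarantees
    RELAXED = "relaxed"  # only rule out memory corruption


@dataclass(frozen=True)
class Attr:
    name: str
    is_pointer: bool        # object reference (True) vs unboxed i64/float
    is_final: bool = False


def needs_lock(attr: Attr, semantics: Semantics) -> bool:
    """Hypothetical policy: must generated code lock around this attribute?"""
    if attr.is_final:
        # Assumes the object is fully initialized before other threads see it.
        return False
    if attr.is_pointer:
        # A racing writer could free the old object before we incref it.
        return True
    # Unboxed values can't dangle. Under relaxed semantics a racy read would
    # at worst see a stale value (assuming aligned, word-sized accesses).
    return semantics is Semantics.STRICT


for a in (Attr("items", True), Attr("count", False), Attr("name", True, True)):
    print(a.name, needs_lock(a, Semantics.RELAXED), needs_lock(a, Semantics.STRICT))
```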
Could we allow access to final attributes (with pointer types) without locking while conforming to CPython semantics?
We'd need to use synchronized reference counting anyway, but maybe we don't need to take a lock on the object which has the attribute. This seems possible at least in some cases, if another thread can't see an uninitialized object, and can only see an initialized object after a memory barrier.
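The condition above ("another thread can't see an uninitialized object, and can only see an initialized object after a memory barrier") is the classic safe-publication pattern. In this Python sketch, `threading.Event` plays the role of the barrier: `set()` is the release side and `wait()` is the acquire side. `Config`, `publisher` and `reader` are hypothetical names. Once the reader has passed the barrier, it reads the final attribute without taking any per-object lock.

```python
import threading
from typing import Optional


class Config:
    def __init__(self, name: str) -> None:
        self.name = name  # final attribute: set once, never reassigned


_shared: Optional[Config] = None
_published = threading.Event()


def publisher() -> None:
    global _shared
    cfg = Config("prod")   # 1. fully initialize the object first
    _shared = cfg          # 2. then make it reachable
    _published.set()       # 3. release barrier: signal readers


def reader() -> str:
    _published.wait()      # acquire barrier: pairs with set()
    assert _shared is not None
    return _shared.name    # final attribute read without a per-object lock
```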
Would it be reasonable to access module-level finals with lazy initialization without locking? This seems unsafe, but maybe it's a deviation we could document as unsafe. Not sure if it would make much of a difference either way, though.
More generally, is it always safe to access immutable objects without locking?
Would it be useful to add some borrowing back when dealing with immutable expressions? For example, in a = x.y.z maybe we can still borrow x.y if it's a final attribute.
Would it be useful to merge fine-grained locks to reduce locking overhead? For example, in code like self.x = self.y + 's', maybe we'd take a single lock around self instead of locking separately for self.y and self.x.
I guess here we can only merge locks if the code can't run arbitrary code, or perform an arbitrary number of loop iterations. Here we'd need to know if string concatenation can run arbitrary code.
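The lock-merging idea for `self.x = self.y + 's'` can be modeled with two versions of the same statement. The class and function names are hypothetical. The merged version acquires the lock once instead of twice, and it also makes the read and the write atomic together, which is stronger than the separate version. It is only valid if the concatenation can't run arbitrary code. In this model, a non-reentrant `threading.Lock` would deadlock if the `+` called back into code that locks the same object.

```python
import threading


class Obj:
    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.x = ""
        self.y = "a"


def separate_locks(obj: Obj) -> None:
    # self.x = self.y + 's' with one critical section per attribute access.
    with obj.lock:
        y = obj.y
    tmp = y + "s"          # runs with no lock held
    with obj.lock:
        obj.x = tmp


def merged_lock(obj: Obj) -> None:
    # One critical section for the whole statement. Assumes y + "s" can't run
    # arbitrary code (e.g. both operands are known to be exact str objects).
    with obj.lock:
        obj.x = obj.y + "s"
```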
Within loops, would it make sense to only release the lock once every N iterations, if the loop body can't run arbitrary code?
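Releasing the lock only once every N iterations could look like the batched loop below. This is a sketch under the stated condition that the loop body can't run arbitrary code. Here the body only adds ints. `Counter`, `add_all` and the `batch` parameter are hypothetical. The function returns the number of lock acquisitions, which drops from one per element to roughly `len(values) / batch`.

```python
import threading
from typing import List


class Counter:
    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.total = 0


def add_all(counter: Counter, values: List[int], batch: int = 64) -> int:
    """Add values to counter.total, re-acquiring the lock every `batch` items."""
    acquisitions = 0
    i = 0
    n = len(values)
    while i < n:
        stop = min(i + batch, n)
        with counter.lock:
            acquisitions += 1
            for j in range(i, stop):
                counter.total += values[j]
        i = stop  # lock released here so other threads can make progress
    return acquisitions


c = Counter()
print(add_all(c, list(range(100)), batch=32), c.total)
```

A larger `batch` means fewer acquisitions, but other threads wait longer for the lock.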
Should we try to analyze which object references can never be seen by other threads, and skip locking when using these?
For example, if we build a list locally within a function, it's quite possible it can't leak outside the function until it's complete/freed. (We'll need to assume no gc.get_objects() in other threads, which seems acceptable.)
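A conservative escape analysis for locally built lists might look like this toy pass. It runs over a hypothetical mini-IR (`Op` with `kind`, `dest`, `args`), not mypyc's real IR. A value escapes if it is stored into an attribute or another container, passed to a call, or returned. Appending *into* a list does not make that list escape. Like the text, it ignores `gc.get_objects()` in other threads. Lists that never escape could be accessed without locking.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Op:
    kind: str                  # "new_list", "append", "set_attr", "call", "return"
    dest: Optional[str]        # register defined by the op, if any
    args: Tuple[str, ...] = ()


def non_escaping_lists(ops: List[Op]) -> Set[str]:
    """Return list registers that never become visible outside the function."""
    created = {op.dest for op in ops if op.kind == "new_list" and op.dest}
    escaped: Set[str] = set()
    for op in ops:
        if op.kind == "append":
            _target, item = op.args
            escaped.add(item)        # item is now reachable through the target
        elif op.kind == "set_attr":
            _obj, _attr, value = op.args
            escaped.add(value)       # stored into a (possibly shared) object
        elif op.kind in ("call", "return"):
            escaped.update(op.args)  # callee or caller may keep a reference
    return created - escaped


ops = [
    Op("new_list", "a"),
    Op("append", None, ("a", "x")),
    Op("new_list", "b"),
    Op("set_attr", None, ("self", "items", "b")),
    Op("new_list", "c"),
    Op("return", None, ("c",)),
]
print(non_escaping_lists(ops))
```

Only `a` stays local here, so only its operations could skip locking.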
Let me know if there's anything I can help with. FYI, it's not yet feasible to experiment with the 3.13 --disable-gil builds; not enough has been integrated yet. There's probably at least a few more months of work before testing with it is possible.
JukkaL changed the title from Explore no-GIL support ("free threading") to Explore no-GIL support (free threading) on May 18, 2024.