NOGIL: Make loop data cache and dispatch cache thread-safe in nogil build #26348

Merged · 3 commits into numpy:main · Apr 30, 2024

Conversation

@ngoldbaum (Member) commented Apr 25, 2024

This disables the loop data cache and adds locking to the custom hash table storing the dispatch cache. As a result, this fixes segfaults that can happen when these caches are corrupted by calling a ufunc simultaneously in multiple threads, as in the test I added.

This impacts single-threaded performance, but nothing catastrophic; see the comment below for up-to-date benchmarks.

@ngoldbaum added the 39 - no-GIL label (PRs and issues related to support for free-threaded CPython, a.k.a. no-GIL, PEP 703) on Apr 25, 2024
@seberg (Member) commented Apr 26, 2024

> Disabling the loop data cache doesn't seem to affect the ufunc benchmarks

There is really no need to disable these, I think? This is fully static data once created, so the only place you should need a lock is on a cache miss, when the entry is being created.

The custom hash map is a bit trickier; IIRC it gave a ~15% speed improvement at the time for small-array ufunc calls. I am not really attached to it, though, since it adds a bit of annoying complexity.
(I once hoped that this would also help us clean up unused loops, but since Python lacks the concept of ephemerons, cleanup is very restricted even if it made sense to bother about it...)

OTOH, I am starting to wonder: how hard would it be to vendor a simple lock for these types of cases (i.e. a read lock that only blocks writers and a write lock that blocks everything, where we expect writes to very rarely take place)?

I feel a bit like this may be just the beginning, since at some point you probably have to think about locking the Array object (or object dtype) itself, and you will need such a fast lock at that point?
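To make the cache-miss locking idea concrete, here is a minimal sketch, assuming hypothetical cache_find / cache_create_and_insert helpers (not NumPy's actual API): readers take no lock when the entry already exists, and only a creating thread serializes.

```c
#include <Python.h>

/* Hypothetical cache types and helpers, for illustration only. */
typedef struct cache_key cache_key;
typedef struct cache_entry cache_entry;
extern cache_entry *cache_find(cache_key *key);
extern cache_entry *cache_create_and_insert(cache_key *key);

static PyThread_type_lock cache_lock;  /* allocated once at module init */

static cache_entry *
get_or_create(cache_key *key)
{
    /* Fast path: entries are fully static once created, so a hit needs
     * no lock, assuming inserts publish entries atomically. */
    cache_entry *entry = cache_find(key);
    if (entry != NULL) {
        return entry;
    }
    /* Slow path: serialize creation, re-checking under the lock in case
     * another thread created the entry while we waited. */
    PyThread_acquire_lock(cache_lock, WAIT_LOCK);
    entry = cache_find(key);
    if (entry == NULL) {
        entry = cache_create_and_insert(key);
    }
    PyThread_release_lock(cache_lock);
    return entry;
}
```

The catch is that the unlocked read is only safe if insertion never moves or frees existing entries; a table that resizes, or a cache that frees old entries (as the auxdata failure mentioned below suggests), breaks that assumption.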

@ngoldbaum (Member, Author) commented:

> There is really no need to disable these, I think? This is fully static data once created, so the only place you should need a lock is on a cache miss, when the entry is being created.

I didn't try to fully understand what was going on, but the failure mode was NPY_AUXDATA_FREE being called with a NULL pointer because another thread had already freed the auxdata from the cache.

> how hard would it be to vendor a simple lock for these types of cases

I'm hoping that PyMutex will be public in the CPython C API in the next couple of weeks; then I can try experimenting with locking as an alternative approach for both of these caches as well as the allocation cache. See capi-workgroup/decisions#22 for the discussion on that.

Without doing something for these two caches, a lot of operations using numpy in parallel higher in the PyData stack don't work either, so merging this with a slowdown for single-threaded workflows should also unblock work on fixing issues in other libraries. In particular, this PR at least fixes a scikit-learn parallel forest training test and a scipy concurrent minimization test.

@ngoldbaum (Member, Author) commented:

@rgommers just pointed out to me I wasn't clear initially: both changes only affect single-threaded performance in the nogil build. All the changed code is guarded by ifdefs.

@seberg (Member) commented Apr 26, 2024

Yeah, I hadn't noticed that initially and then realized it when you commented. The changes should be fine in any case (but I should look over the dispatching part briefly, maybe).

@seberg (Member) left a comment


OK, it is a bit hacky to have a dict path in a custom object here, but I think it is fine, especially since all of these are really stop-gaps until a better mutex is available.

I do think the test code needs a small fix (I think this was a manual test function to inspect the casts). No need to bother actually testing it on my account, though.

}
Py_INCREF(result);
// caller handles error and missing key cases
PyArrayIdentityHash_GetItemRef(tb, keys, &result);
@seberg (Member) commented:

I guess this isn't actually used; I think I used it for a more manual test, maybe? But I think GetItemRef can probably set an error? So I think you might need: if result == NULL: PyErr_Clear(); result = Py_NewRef(Py_None)?
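Spelled out in C, that suggestion would look roughly like the following; this is a reconstruction of the pseudocode above, not the merged code:

```c
PyObject *result = NULL;
PyArrayIdentityHash_GetItemRef(tb, keys, &result);
if (result == NULL) {
    /* Missing key, or GetItemRef set an error: clear any error and
     * fall back to None, as suggested above. */
    PyErr_Clear();
    result = Py_NewRef(Py_None);
}
```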

PyTuple_GET_ITEM(info, 1), &PyArrayMethod_Type)) {
/* Found the ArrayMethod and NOT a promoter: return it */
// TODO properly manage refcounts, this is safe because
// PyArrayMethodObject instances are never decref'd otherwise
@seberg (Member) commented:

This is OK; just to note that, in theory, NumPy could replace them, but in practice it cannot happen right now. I.e. the TODO may actually be a very, very subtle bug.

@mhvk (Contributor) commented Apr 28, 2024

> Ufuncs never get deallocated

I do not think we should assume this, unless we are absolutely sure we will never expose this cache to any ufuncs outside of numpy and will never write ufuncs that get deallocated ourselves (my https://github.com/mhvk/chain_ufunc certainly assumes short-lived ufuncs are a useful concept).



#ifndef Py_GIL_DISABLED
@mhvk (Contributor) commented:

I think we should have a comment here about the difference, and about how ideally we would have a solution that works for both builds yet does not reduce performance when the GIL is enabled.

Note that ideally we indeed would not have our own hash tables - @seberg and I discussed other options at length when this code was added, but couldn't find an alternative that gave the same improvement.

@seberg (Member) commented:

Right, about the comment: maybe it makes sense to add something like:

/* Once cheap locking is exposed in Python, this should be used instead; see gh-... */

> Ufuncs never get deallocated
>
> I do not think we should assume this

Yeah, we shouldn't. But I think it was more of an observation here rather than something very important for the change. I think the change actually prepares for making things more resilient to an ArrayMethod getting deleted (ufunc lifetime shouldn't matter much, as a ufunc cannot be deallocated while in execution).

Although, I guess the observation is still true in NumPy, so deallocating ufuncs is possibly not tested (well), unfortunately.

@ngoldbaum (Member, Author) commented:

Thanks for looking this over! I realized I didn't have a good intuition for the performance impact of the dict solution vs a PyThread_type_lock or a PyMutex lock.

I spent some time on this yesterday and discovered that using PyThread_type_lock is faster in the single-threaded case than using a dict. The code change is also less invasive and I can keep the replacement check. I also tried with PyMutex and that is faster still. See:

https://gist.github.com/ngoldbaum/a858dbe28b90c1b06583af6fe6a45640

Take these performance measurements with a grain of salt: I'm running this on my laptop, which has CPU throttling and other settings enabled that will affect the results. For example, I'm pretty sure all of the reported changes in the PyMutex results are noise and the actual single-threaded impact of PyMutex is below the measurement threshold, because I can't reproduce any of the changes when I run each benchmark individually.

I've written the version with PyThread_type_lock using a macro, so switching to PyMutex should be easy. The only tricky part is that PyThread_type_lock needs heap allocation and PyMutex doesn't, but switching to PyMutex should just require deleting the deallocation and switching the allocation to a static initialization.
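For reference, a sketch of what the PyMutex variant could look like; PyMutex became public C API in CPython 3.13, and the struct and function names below (other than the PyMutex calls themselves) are illustrative:

```c
#include <Python.h>
#include <string.h>

/* Illustrative table struct; field names are hypothetical. */
typedef struct {
    PyMutex mutex;   /* all-zero bytes is a valid, unlocked mutex */
    /* ... hash table fields ... */
} identity_hash_table;

static void
table_init(identity_hash_table *tb)
{
    /* Static initialization: zeroing the struct is enough. There is no
     * PyThread_allocate_lock() call and no matching free at teardown. */
    memset(tb, 0, sizeof(*tb));
}

static void
table_mutate(identity_hash_table *tb)
{
    PyMutex_Lock(&tb->mutex);
    /* ... insert or resize ... */
    PyMutex_Unlock(&tb->mutex);
}
```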

PyArrayIdentityHash_SetItem(tb, item+1, item[0], 1);
PyObject **tb_item = find_item(tb, item + 1);
tb_item[0] = item[0];
memcpy(tb_item+1, item+1, tb->key_len * sizeof(PyObject *));
@ngoldbaum (Member, Author) commented:

This avoids a recursive call to _resize_if_necessary, and thus acquiring the lock twice. It might also be a teeny bit faster.

#define FREE_LOCK(tb) \
if (tb->mutex != NULL) { \
PyThread_free_lock(tb->mutex); \
}
@ngoldbaum (Member, Author) commented:

These last two don't need to be macros, but writing it this way lets me avoid adding more Py_GIL_DISABLED checks below.
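A plausible reconstruction of the full macro set (the INITIALIZE_TABLE and LOCK_TABLE names appear later in this thread; the bodies here are illustrative, not the exact PR code). On a normal GIL build everything expands to nothing, which is what keeps the call sites free of extra ifdefs:

```c
#ifdef Py_GIL_DISABLED
#define INITIALIZE_TABLE(tb)                                  \
    if ((tb->mutex = PyThread_allocate_lock()) == NULL) {     \
        PyErr_NoMemory();                                     \
        return -1;                                            \
    }
#define LOCK_TABLE(tb)   PyThread_acquire_lock(tb->mutex, WAIT_LOCK)
#define UNLOCK_TABLE(tb) PyThread_release_lock(tb->mutex)
#else
/* GIL build: the GIL already serializes access, so these do nothing. */
#define INITIALIZE_TABLE(tb)
#define LOCK_TABLE(tb)
#define UNLOCK_TABLE(tb)
#define FREE_LOCK(tb)
#endif
```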

@mhvk (Contributor) left a comment

This is rather elegant! And it looks all good, but in case someone else asks for changes, I'd recommend adding a comment like /* Does nothing if GIL enabled */ after INITIALIZE_TABLE(tb) and LOCK_TABLE(tb). Though I think it is really quite clear, so if there are no other comments, that's not worth rerunning CI for.

@seberg (Member) commented Apr 30, 2024

Looks good to me as well, thanks for checking the mutex approach, Nathan.

@seberg seberg merged commit 05f8351 into numpy:main Apr 30, 2024
65 checks passed