NOGIL: Make loop data cache and dispatch cache thread-safe in nogil build #26348

Merged · 3 commits into numpy:main · Apr 30, 2024

Conversation

@ngoldbaum (Member) commented Apr 25, 2024

This disables the loop data cache and adds locking to the custom hash table storing the dispatch cache. As a result, this fixes segfaults that can happen when these caches are corrupted by calling a ufunc simultaneously in multiple threads, as in the test I added.

This impacts single-threaded performance, but nothing catastrophic; see the comment below for up-to-date benchmarks.

@ngoldbaum added the 39 - no-GIL label (PRs and issues related to support for free-threaded CPython, a.k.a. no-GIL, PEP 703) on Apr 25, 2024
@seberg (Member) commented Apr 26, 2024

> Disabling the loop data cache doesn't seem to affect the ufunc benchmarks

There is really no need to disable these, I think? This is fully static data once created, so the only place you should need a lock is on a cache miss, when the entry is being created.

The custom hash map is a bit trickier; IIRC it gave a ~15% speed improvement at the time for small-array ufunc calls. I am not really attached to it, though, since it adds a bit of annoying complexity.
(I once hoped that this would also help us clean up unused loops, but since Python lacks the concept of ephemerons, cleanup is very restricted even if it made sense to bother about it...)

OTOH, I am starting to wonder: how hard would it be to vendor a simple lock for these types of cases (i.e. a read lock that only blocks writers and a write lock that blocks everything, where we expect writes to very rarely take place)?

I feel a bit like this may be just the beginning, since at some point you probably have to think about locking the Array object (or object dtype) itself, and you will need such a fast lock at that point?
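To make the cache-miss locking idea concrete, here is a minimal sketch, assuming hypothetical cache_find / cache_create_and_insert helpers (not NumPy's actual API): readers take no lock when the entry already exists, and only a creating thread serializes.

```c
#include <Python.h>

/* Hypothetical cache types and helpers, for illustration only. */
typedef struct cache_key cache_key;
typedef struct cache_entry cache_entry;
extern cache_entry *cache_find(cache_key *key);
extern cache_entry *cache_create_and_insert(cache_key *key);

static PyThread_type_lock cache_lock;  /* allocated once at module init */

static cache_entry *
get_or_create(cache_key *key)
{
    /* Fast path: entries are fully static once created, so a hit needs
     * no lock, assuming inserts publish entries atomically. */
    cache_entry *entry = cache_find(key);
    if (entry != NULL) {
        return entry;
    }
    /* Slow path: serialize creation, re-checking under the lock in case
     * another thread created the entry while we waited. */
    PyThread_acquire_lock(cache_lock, WAIT_LOCK);
    entry = cache_find(key);
    if (entry == NULL) {
        entry = cache_create_and_insert(key);
    }
    PyThread_release_lock(cache_lock);
    return entry;
}
```

The catch is that the unlocked read is only safe if insertion never moves or frees existing entries; a table that resizes, or a cache that frees old entries (as the auxdata failure mentioned below suggests), breaks that assumption.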

@ngoldbaum (Member, Author) commented:

> There is really no need to disable these, I think? This is fully static data once created, so the only place you should need a lock is on a cache miss, when the entry is being created.

I didn't try to fully understand what was going on, but the failure mode was NPY_AUXDATA_FREE being called with a NULL pointer because another thread had already freed the auxdata from the cache.

> how hard would it be to vendor a simple lock for these types of cases

I'm hoping that PyMutex will be public in the CPython C API in the next couple of weeks; then I can try experimenting with locking as an alternative approach for both of these caches as well as the allocation cache. See capi-workgroup/decisions#22 for the discussion on that.

Without doing something for these two caches, a lot of operations using numpy in parallel higher in the PyData stack don't work either, so merging this with a slowdown for single-threaded workflows should also unblock work on fixing issues in other libraries. In particular, this PR at least fixes a scikit-learn parallel forest training test and a scipy concurrent minimization test.

@ngoldbaum (Member, Author) commented:

@rgommers just pointed out to me I wasn't clear initially: both changes only affect single-threaded performance in the nogil build. All the changed code is guarded by ifdefs.

@seberg (Member) commented Apr 26, 2024

Yeah, I hadn't noticed that initially and then realized it when you commented. The changes should be fine in any case (but I should look over the dispatching part briefly, maybe).

@seberg (Member) left a comment


OK, it is a bit hacky to have a dict path in a custom object here, but I think it is fine, especially since all of these are really stop-gaps until a better mutex is available.

I do think the test code needs a small fix (I think this was a manual test function to inspect the casts). No need to bother actually testing it on my account, though.

}
Py_INCREF(result);
// caller handles error and missing key cases
PyArrayIdentityHash_GetItemRef(tb, keys, &result);
@seberg (Member) commented:

I guess this isn't actually used; I think I used it for a more manual test, maybe? But I think GetItemRef can probably set an error? So I think you might need: if result == NULL: PyErr_Clear(); result = Py_NewRef(Py_None)?
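Spelled out in C, that suggestion would look roughly like the following; this is a reconstruction of the pseudocode above, not the merged code:

```c
PyObject *result = NULL;
PyArrayIdentityHash_GetItemRef(tb, keys, &result);
if (result == NULL) {
    /* Missing key, or GetItemRef set an error: clear any error and
     * fall back to None, as suggested above. */
    PyErr_Clear();
    result = Py_NewRef(Py_None);
}
```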

PyTuple_GET_ITEM(info, 1), &PyArrayMethod_Type)) {
/* Found the ArrayMethod and NOT a promoter: return it */
// TODO properly manage refcounts, this is safe because
// PyArrayMethodObject instances are never decref'd otherwise
@seberg (Member) commented:

This is OK; just to note that, in theory, NumPy could replace them, but in practice it cannot happen right now. I.e. the TODO may actually be a very, very subtle bug.

@mhvk (Contributor) commented Apr 28, 2024

> Ufuncs never get deallocated

I do not think we should assume this, unless we are absolutely sure we will never expose this cache to any ufuncs outside of numpy and will never write ufuncs that get deallocated ourselves (my https://github.com/mhvk/chain_ufunc certainly assumes short-lived ufuncs are a useful concept).



#ifndef Py_GIL_DISABLED
@mhvk (Contributor) commented:

I think we should have a comment here about the difference, and about how ideally we would have a solution that works for both builds yet does not reduce performance when the GIL is enabled.

Note that ideally we indeed would not have our own hash tables - @seberg and I discussed other options at length when this code was added, but couldn't find an alternative that gave the same improvement.

@seberg (Member) commented:

Right, about the comment: maybe it makes sense to add something like:

/* Once cheap locking is exposed in Python, this should be used instead; see gh-... */

> Ufuncs never get deallocated
>
> I do not think we should assume this

Yeah, we shouldn't. But I think it was more of an observation here rather than something very important for the change. I think the change actually prepares for making things more resilient to an ArrayMethod getting deleted (ufunc lifetime shouldn't matter much, as a ufunc cannot be deallocated while in execution).

Although, I guess the observation is still true in NumPy, so deallocating ufuncs is possibly not tested (well), unfortunately.

@ngoldbaum (Member, Author) commented:

Thanks for looking this over! I realized I didn't have a good intuition for the performance impact of the dict solution vs a PyThread_type_lock or a PyMutex lock.

I spent some time on this yesterday and discovered that using PyThread_type_lock is faster in the single-threaded case than using a dict. The code change is also less invasive and I can keep the replacement check. I also tried with PyMutex and that is faster still. See:

https://gist.github.com/ngoldbaum/a858dbe28b90c1b06583af6fe6a45640

Take these performance measurements with a grain of salt: I'm running this on my laptop, which has CPU throttling and other settings enabled that will affect the results. For example, I'm pretty sure all of the reported changes in the PyMutex results are noise and the actual single-threaded impact of PyMutex is below the measurement threshold, because I can't reproduce any of the changes when I run each benchmark individually.

I've written the version with PyThread_type_lock using a macro, so switching to PyMutex should be easy. The only tricky part is that PyThread_type_lock needs heap allocation and PyMutex doesn't, but switching to PyMutex should just require deleting the deallocation and switching the allocation to a static initialization.
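For reference, a sketch of what the PyMutex variant could look like; PyMutex became public C API in CPython 3.13, and the struct and function names below (other than the PyMutex calls themselves) are illustrative:

```c
#include <Python.h>
#include <string.h>

/* Illustrative table struct; field names are hypothetical. */
typedef struct {
    PyMutex mutex;   /* all-zero bytes is a valid, unlocked mutex */
    /* ... hash table fields ... */
} identity_hash_table;

static void
table_init(identity_hash_table *tb)
{
    /* Static initialization: zeroing the struct is enough. There is no
     * PyThread_allocate_lock() call and no matching free at teardown. */
    memset(tb, 0, sizeof(*tb));
}

static void
table_mutate(identity_hash_table *tb)
{
    PyMutex_Lock(&tb->mutex);
    /* ... insert or resize ... */
    PyMutex_Unlock(&tb->mutex);
}
```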

PyArrayIdentityHash_SetItem(tb, item+1, item[0], 1);
PyObject **tb_item = find_item(tb, item + 1);
tb_item[0] = item[0];
memcpy(tb_item+1, item+1, tb->key_len * sizeof(PyObject *));
@ngoldbaum (Member, Author) commented:

This avoids a recursive call to _resize_if_necessary, and thus acquiring the lock twice. It might also be a teeny bit faster.

#define FREE_LOCK(tb) \
if (tb->mutex != NULL) { \
PyThread_free_lock(tb->mutex); \
}
@ngoldbaum (Member, Author) commented:

These last two don't need to be macros, but writing it this way lets me avoid adding more Py_GIL_DISABLED checks below.
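A plausible reconstruction of the full macro set (the INITIALIZE_TABLE and LOCK_TABLE names appear later in this thread; the bodies here are illustrative, not the exact PR code). On a normal GIL build everything expands to nothing, which is what keeps the call sites free of extra ifdefs:

```c
#ifdef Py_GIL_DISABLED
#define INITIALIZE_TABLE(tb)                                  \
    if ((tb->mutex = PyThread_allocate_lock()) == NULL) {     \
        PyErr_NoMemory();                                     \
        return -1;                                            \
    }
#define LOCK_TABLE(tb)   PyThread_acquire_lock(tb->mutex, WAIT_LOCK)
#define UNLOCK_TABLE(tb) PyThread_release_lock(tb->mutex)
#else
/* GIL build: the GIL already serializes access, so these do nothing. */
#define INITIALIZE_TABLE(tb)
#define LOCK_TABLE(tb)
#define UNLOCK_TABLE(tb)
#define FREE_LOCK(tb)
#endif
```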

@mhvk (Contributor) left a comment

This is rather elegant! And it looks all good, but in case someone else asks for changes, I'd recommend adding a comment like /* Does nothing if GIL enabled */ after INITIALIZE_TABLE(tb) and LOCK_TABLE(tb). Though I think it is really quite clear, so if there are no other comments, that's not worth rerunning CI for.

@seberg (Member) commented Apr 30, 2024

Looks good to me as well, thanks for checking the mutex approach, Nathan.

@seberg seberg merged commit 05f8351 into numpy:main Apr 30, 2024
65 checks passed