Severe performance drops in some scenarios if LAPACK is configured to run in multithreaded mode and MulticoreEngine is used #233
Comments
This is a nice discovery @vsnever! I ran into performance issues with beams a month ago. The beam calculations were taking ages when the observation was done with MulticoreEngine. I wasn't able to trace the problem directly to its source, but I figured it might have something to do with caching. I am being pressed by a deadline, so I worked around it by running a quick observation with SerialEngine and then a long one with MulticoreEngine. I know the beam attenuator uses cumtrapz from scipy. Could that be the cause? (Slightly off topic: for the attenuator, the solution could be to use the integrate function available in Raysect.)
I discovered this when I tried to run a new SOLPS demo for generomak (cherab/solps#39 by @jacklovell) after a system upgrade I made today, so it is not related to beams only. @Mateasek, have you tried setting the thread-limiting environment variables at the beginning of your script? If you are running Anaconda Python, use the MKL equivalent instead.
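A minimal sketch of such thread-limiting snippets, assuming an OpenBLAS-backed numpy for a standard install and Intel MKL for an Anaconda install (the variable names are the standard ones for each backend):

```python
# Cap the BLAS thread pool at one thread per process. These variables
# are read when the BLAS library initialises, so they must be set
# before numpy is imported anywhere in the process.
from os import environ

environ['OPENBLAS_NUM_THREADS'] = '1'   # standard OpenBLAS-backed numpy

# Anaconda's numpy is typically built against Intel MKL instead:
environ['MKL_NUM_THREADS'] = '1'

# ...only now import numpy / cherab and build the scene.
```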
I will definitely test it. I was suspecting the way multiprocessing in Python works (at least as far as I managed to understand it). I ran into this issue when working on a multi-ray beam model: with a lot of beams it took ages to get to the point of ray tracing. I suspected that the beam objects were copied to the processes with empty caches, and then each of them had to re-cache (which includes reading the OpenADAS data) when emission was requested. Anyway, I will test what you suggested (when I get time to look at it, I am quite overwhelmed by deadlines lately) and will create a separate issue if it turns out to be something else. Thanks @vsnever.
This is a general problem with Python's multiprocessing module and Numpy: we see it at Culham too, with codes completely unrelated to Cherab and Raysect, whenever code uses multiprocessing to call in parallel a function that relies on numpy.linalg routines backed by parallelised BLAS. At Culham we have various installations of Numpy compiled with ATLAS (which has no run-time multi-threading configuration at all), OpenBLAS and Intel MKL.
True. I figured out what the problem was quickly only because I had already encountered it before with my own software. And still I didn't expect it here.
A quick scan through the code base suggests this is only an issue in the caching functions. Although np.linalg.solve appears in the interpolators too, it is only used when building the interpolator, so it will not be called during a render. The problem appears to be the call to np.linalg.solve in the caching functions themselves. We could push the problem over to raysect, and require that the thread limits be handled there.
Actually, this is only the case for the 1D interpolators. The 2D and 3D interpolators do have this problem. But #50 has already been proposed to fix this.
For some reason the threadpoolctl solution is not working for me: it fails to introspect numpy to get the thread-pool description, and also fails to limit the number of threads at runtime.
Interesting discovery. From a design standpoint, the OpenBLAS library is very strongly at fault here: it should not be spawning threads unless explicitly asked to (multiprocess applications are common and should be permitted to manage their own resources). I feel distinctly uncomfortable working around a fault in another library in Raysect. The environment variable would be the preferred, if irritating, approach.

I was halfway through writing a new set of interpolators in Raysect that don't require solve() etc. There is actually no need to use solve anywhere: the student who wrote these classes missed that the matrix can be pre-inverted once and then multiplied as appropriate. Unfortunately I really have no time to finish the Raysect interpolators at present; I suspect I won't be able to get back to Raysect for a few months due to my current workload.

By the way, the new interpolators include a much needed constrained cubic that doesn't extrapolate beyond the points - perfect for spectra, as it won't cross zero if the data doesn't.
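A minimal sketch of the pre-inversion idea described above (the matrix `A` and the right-hand side `b` are hypothetical stand-ins for an interpolator's coefficient system, not Raysect code):

```python
import numpy as np

# The system matrix depends only on the knot layout, so it can be
# inverted once, when the interpolator is built.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)  # well-conditioned toy system

A_inv = np.linalg.inv(A)  # one-off cost at construction time

def coefficients(b):
    # Per-evaluation work is now a plain matrix-vector product, so no
    # LAPACK solve (and none of its worker threads) is needed per call.
    return A_inv @ b

b = rng.standard_normal(4)
assert np.allclose(coefficients(b), np.linalg.solve(A, b))
```

In production code an LU factorisation (e.g. scipy.linalg.lu_factor / lu_solve) is usually preferred over explicit inversion for numerical stability, but for the small, well-conditioned systems of a cubic interpolator the pre-inverted form is adequate.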
It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects. |
I agree with @jacklovell, this seems to be important, and if it is within my capabilities I will be happy to help.
They would fix the issues, and that is the branch I was working in. I may have a quick look over the code this weekend and see how much work remains to be done. If I remember correctly the cubic coefficient calculation is done; just the differentials and the data-handling infrastructure need implementing. Adding the extrapolation and differentials that are present in the Cherab interpolators would be a second phase of work.

It is a fair bit of tedious development, but not hugely complex. Writing the tests will be quite challenging though. That would be a good area for someone else to focus on: the tests can be set up before the code is complete. See the Cherab test setup as a reference for what I would like rebuilt.
I will also be happy to help with the new interpolators and caching functions. By the way, the current implementation of caching uses the general np.linalg.solve(). Until the fix is ready, should we add a workaround in all affected demos? That is, adding

```python
from os import environ

environ['OMP_NUM_THREADS'] = '1'
environ['OPENBLAS_NUM_THREADS'] = '1'
environ['MKL_NUM_THREADS'] = '1'
environ['NUMEXPR_NUM_THREADS'] = '1'
environ['VECLIB_MAXIMUM_THREADS'] = '1'
```

at the beginning of the scripts.
I don't think we should pollute the demos with additional environment variables: it adds noise and makes them more confusing for new users. The demos produce the correct output at the moment, and performance will be sufficient for most use cases. If users find the performance degradation too debilitating we can point them to this issue and the workarounds contained in it until we can move away from linalg.solve in the interpolators. @CnlPepper perhaps we could pin this issue on GitHub so that it's easily visible to others?
The performance penalty will depend on the number of parallel threads per CPU core. I ran these demos on a 12-core CPU with 24 parallel threads, which means that when a solve() is called, each of the 24 worker processes starts its own set of 24 OpenBLAS threads.
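As a back-of-the-envelope check (a toy calculation, not Cherab code, assuming one worker per logical CPU and the OpenBLAS default thread count), the oversubscription is quadratic in the worker count:

```python
logical_cpus = 24              # 12 cores, 2 hardware threads each
workers = logical_cpus         # one MulticoreEngine worker per logical CPU
blas_threads = logical_cpus    # OpenBLAS default with OPENBLAS_NUM_THREADS unset

total_threads = workers * blas_threads
oversubscription = total_threads / logical_cpus

print(total_threads, oversubscription)  # 576 threads, 24x oversubscribed
```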
@jacklovell I think you should have permission to pin issues. I think it would be a good idea. I would also recommend adding a note to the repository README and the documentation.

@vsnever the cache objects were rapidly cobbled together by a student; there are a number of inefficiencies that need addressing. The new interpolation methods in Raysect should make it easier to completely rebuild these.
I've created #363 to re-write the caching objects using the new utilities provided by raysect for the interpolation. Once that's approved and we've deprecated the original Cherab interpolators we can close this issue. |
Calling any caching or interpolating function that uses numpy.linalg.solve() when render_engine is set to MulticoreEngine() may cause severe drops in performance if numpy is built with OpenBLAS and the OPENBLAS_NUM_THREADS environment variable is not set to 1. This happens because solve() also runs in parallel, so when it is called we get N parallel processes, each of which starts N parallel threads.

The solution is to set OPENBLAS_NUM_THREADS=1 or OMP_NUM_THREADS=1 before running a simulation. I think this is worth mentioning in the documentation, because a user not familiar with the code might not figure it out.

Affected demos: beam_into_slab.py, beam.py, plasma-and-beam.py
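The same fix can be applied without editing the demo scripts, by exporting the variable in the shell before launching Python (a sketch; the demo filenames are those listed above):

```shell
# Set before launching the demo so every MulticoreEngine worker
# process inherits a single-threaded OpenBLAS.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
# then e.g.: python beam.py

echo "$OPENBLAS_NUM_THREADS"   # prints 1
```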