
Severe performance drops in some scenarios if LAPACK is configured to run in multithreaded mode and MulticoreEngine is used #233

Open
vsnever opened this issue Sep 30, 2020 · 17 comments

Comments

@vsnever
Member

vsnever commented Sep 30, 2020

Calling any caching or interpolating function that uses numpy.linalg.solve() while render_engine is set to MulticoreEngine() may cause severe performance drops if numpy is built with OpenBLAS and the OPENBLAS_NUM_THREADS environment variable is not set to 1. This happens because solve() itself also runs in parallel, so when it is called we get N parallel processes, each of which starts N parallel threads.

The solution is to set OPENBLAS_NUM_THREADS=1 or OMP_NUM_THREADS=1 before running a simulation. I think this is worth mentioning in the documentation, because a user not familiar with the code might not figure it out.

Affected demos: beam_into_slab.py, beam.py, plasma-and-beam.py
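
A hypothetical minimal reproduction of the oversubscription pattern, using plain multiprocessing in place of MulticoreEngine (names and sizes here are illustrative only):

import multiprocessing as mp
import numpy as np

def worker(_):
    # Each BLAS-backed solve() call can fan out into its own thread pool.
    a = np.random.rand(500, 500)
    b = np.random.rand(500)
    return np.linalg.solve(a, b)[0]

if __name__ == '__main__':
    # N worker processes, each potentially starting N BLAS threads:
    # roughly N * N threads competing for N cores.
    with mp.Pool() as pool:
        results = pool.map(worker, range(64))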

@Mateasek
Member

Mateasek commented Sep 30, 2020

This is a nice discovery @vsnever! I ran into performance issues with beams a month ago. The beam calculations were taking ages when the observation was done with MulticoreEngine. I wasn't able to trace the problem directly to its source, but I figured it might have something to do with caching. I am being pressed by a deadline, so I worked around it by running a quick observation with a SerialEngine and then a long one with MulticoreEngine. I know the beam attenuator uses cumtrapz from scipy. Could that be the cause?

(Slightly off-topic: for the attenuator, the solution could be to use the integrate function available in Raysect.)

@vsnever
Member Author

vsnever commented Sep 30, 2020

I discovered this when I tried to run a new SOLPS demo for generomak (cherab/solps#39 by @jacklovell) after a system upgrade I made today, so it is not related to beams only.
It seems that cumtrapz does not run in parallel, but solve() does, and solve() is used wherever polynomial interpolation is used.

@Mateasek, have you tried adding

import os
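# NOTE: this must run before numpy is first imported, otherwise
# OpenBLAS will already have initialised its thread pool.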
os.environ['OPENBLAS_NUM_THREADS'] = '1'

at the beginning of your script?

If you are running Anaconda Python, replace it with:

import os
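# Anaconda ships numpy built against Intel MKL, hence the different variable.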
os.environ['MKL_NUM_THREADS'] = '1'

@Mateasek
Member

I will definitely test it. I suspected the way multiprocessing in Python works (at least as far as I understand it). I ran into this issue when working on a multi-ray beam model: with a lot of beams, it took ages to get to the point of raytracing. I suspected that the beam objects were copied to the worker processes with empty caches, and then each of them had to populate its cache (which includes reading OpenADAS data) when emission was requested. Anyway, I will test what you suggested (when I get time to look at it; I am quite overwhelmed by deadlines lately) and I will create a separate issue if it turns out to be something else. Thanks @vsnever.

@jacklovell
Member

jacklovell commented Oct 1, 2020

This is a general problem with Python's multiprocessing module and Numpy: we see it at Culham with codes completely unrelated to Cherab and Raysect too, whenever code uses multiprocessing to call in parallel a function that uses any of the numpy.linalg routines which call parallelised BLAS routines. At Culham we have various installations of Numpy compiled with ATLAS (which has no run-time multi-threading configuration at all), OpenBLAS and Intel MKL.

@vsnever
Member Author

vsnever commented Oct 1, 2020

> This is a general problem with Python's multiprocessing module and Numpy: we see it at Culham with codes completely unrelated to Cherab and Raysect too, whenever code uses multiprocessing to call in parallel a function that uses any of the numpy.linalg routines which make BLAS calls. At Culham we have various installations of Numpy compiled with ATLAS (which has no run-time multi-threading configuration at all), OpenBLAS and Intel MKL.

True. I figured out what the problem was quickly only because I had already encountered it before with my own software, and still I didn't expect linalg calls in low-level functions. Even though this is not a Cherab problem, the instructions should specify that Cherab must be built with a non-parallel version of numpy, or that numpy must be set to run in non-parallel mode via environment variables.

@jacklovell
Member

A quick scan through the code base suggests this is only an issue in the caching functions. Although np.linalg.solve appears in the interpolators too, it's only used when building the interpolator, so it will not be called during a render.

The problem appears to be the call to solve in CachingxD._evaluate. I don't think there's a suitable substitute for this call either, short of including a hand-written implementation of dgesv to solve the system of equations.

We could push the problem over to Raysect and require that MulticoreEngine(processes=n) only ever uses up to n cores; the render engine would then have to prevent BLAS multithreading itself. Interestingly, Numpy's documentation now suggests the threadpoolctl package as a portable way to avoid these sorts of multiprocessing/multithreading problems: this would be a pretty simple way of fixing the immediate issue.
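
For reference, a minimal sketch of how threadpoolctl is typically applied (assuming the package is installed; the limit would need to take effect inside each worker process):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(100, 100)
b = np.random.rand(100)

# Restrict all BLAS implementations (OpenBLAS, MKL, ...) to a single thread
# for the duration of the block; the previous limits are restored on exit.
with threadpool_limits(limits=1, user_api='blas'):
    x = np.linalg.solve(a, b)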

@jacklovell
Member

jacklovell commented Oct 1, 2020

> A quick scan through the code base suggests this is only an issue in the caching functions. Although np.linalg.solve appears in the interpolators too, it's only used when building the interpolator, so it will not be called during a render.

Actually, this is only the case for the 1D interpolators. The 2D and 3D interpolators do have this problem. But #50 has already been proposed to fix this.

@vsnever
Member Author

vsnever commented Oct 1, 2020

For some reason the threadpoolctl solution is not working for me: it fails to introspect numpy to get the thread-pool description and also fails to limit the number of threads at runtime.

@CnlPepper
Member

CnlPepper commented Oct 2, 2020

Interesting discovery. From a design standpoint, the OpenBLAS library is very strongly at fault here: it should not be spawning threads unless explicitly asked to (multi-process applications are common and should be permitted to manage their own resources). I feel distinctly uncomfortable working around a fault in another library in Raysect. The environment variable would be the preferred, if irritating, approach.

I was halfway through writing a new set of interpolators in Raysect that don't require solve() etc. There is actually no need to use solve anywhere: the student who wrote these classes missed that the matrix can be pre-inverted once and then multiplied as appropriate. Unfortunately I really have no time to finish the Raysect interpolators at present; I suspect I won't be able to get back to Raysect for a few months due to my current workload.

By the way, the new interpolators include a much-needed constrained cubic that doesn't extrapolate beyond the points - perfect for spectra, as it won't cross zero if the data doesn't.
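
A sketch of the pre-inversion idea under discussion (hypothetical names, not the actual Raysect classes): the constraint matrix depends only on the grid layout, so it can be inverted once when the interpolator is built, and each evaluation then reduces to a matrix-vector product instead of a solve() call.

import numpy as np

class PolynomialCoefficients:
    # Hypothetical sketch: invert the constraint matrix once at
    # construction time, in the serial part of the program.
    def __init__(self, constraint_matrix):
        self._inverse = np.linalg.inv(constraint_matrix)

    # Called per evaluation: a plain matrix-vector product on
    # precomputed data, with no per-call solve().
    def coefficients(self, rhs):
        return self._inverse @ rhs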

@jacklovell
Member

It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects.

@Mateasek
Member

Mateasek commented Oct 5, 2020

> It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects.

I agree with @jacklovell: this seems important, and if it is within my capabilities I will be happy to help.

@CnlPepper
Member

> It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects.

They would fix the issues, and that is the branch I was working in. I may have a quick look over the code this weekend to see how much work remains. If I remember correctly, the cubic coefficient calculation is done; just the differentials and the data-handling infrastructure need implementing. Adding the extrapolation and differentials that are present in the Cherab interpolators would be a second phase of work. It is a fair bit of tedious development, but not hugely complex. Writing the tests will be quite challenging though; that would be a good area for someone else to focus on, since the tests can be set up before the code is complete. See the Cherab test setup as a reference for what I would like rebuilt.

@vsnever
Member Author

vsnever commented Nov 12, 2020

I will also be happy to help with the new interpolators and caching functions. By the way, the current implementation of caching uses the general find_index routine, ignoring the fact that the function is sampled on a regular grid. This is not optimal, because find_index iterates over the grid, which is not required if the grid is regularly spaced.
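
For a regularly spaced grid the cell index can be computed directly rather than searched for; a minimal sketch (hypothetical helper, not the existing find_index API):

# On a regular grid with spacing dx starting at x_min, the cell index
# follows from simple arithmetic; clamp it to the valid range 0..n-2.
def regular_grid_index(x, x_min, dx, n):
    i = int((x - x_min) / dx)
    return min(max(i, 0), n - 2)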

Until the fix is ready, should we add a workaround in all affected demos?
Something like:

from os import environ
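# NOTE: these assignments must happen before numpy is first imported
# anywhere in the process.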
environ['OMP_NUM_THREADS'] = '1'
environ['OPENBLAS_NUM_THREADS'] = '1'
environ['MKL_NUM_THREADS'] = '1'
environ['NUMEXPR_NUM_THREADS'] = '1'
environ['VECLIB_MAXIMUM_THREADS'] = '1'

at the beginning of the scripts.

@jacklovell
Member

I don't think we should pollute the demos with additional environment variables: it adds noise and makes them more confusing for new users. The demos produce the correct output at the moment, and performance will be sufficient for most use cases. If users find the performance degradation too debilitating, we can point them to this issue and the workarounds in it until we can move away from linalg.solve in the interpolators.

@CnlPepper perhaps we could pin this issue on GitHub so that it's easily visible to others?

@vsnever
Member Author

vsnever commented Nov 12, 2020

> The demos produce the correct output at the moment and performance will be sufficient for most use cases.

The performance will depend on the number of parallel threads per CPU core. I ran these demos on a 12-core CPU with 24 parallel threads, which means that when linalg.solve() was called, 24² = 576 threads were running in parallel. This just hung my system, and I had to stop the execution or it would never have finished. So maybe instead of setting the environment variables we could explicitly limit the number of parallel processes to 4 in these demos: MulticoreEngine(4). Or just use the SerialEngine().

@CnlPepper
Member

CnlPepper commented Nov 12, 2020

@jacklovell I think you should have permission to pin issues. I think it would be a good idea.

I would also recommend adding a note to the repository README and the documentation.

@vsnever the cache objects were rapidly cobbled together by a student; there are a number of inefficiencies that need addressing. The new interpolation methods in Raysect should make it easier to completely rebuild them.

@jacklovell
Member

I've created #363 to rewrite the caching objects using the new interpolation utilities provided by Raysect. Once that's approved and we've deprecated the original Cherab interpolators, we can close this issue.
