
Severe performance drops in some scenarios if LAPACK is configured to run in multithreaded mode and MulticoreEngine is used #233

Open
vsnever opened this issue Sep 30, 2020 · 17 comments

Comments

@vsnever
Member

vsnever commented Sep 30, 2020

Calling any caching or interpolating function that uses numpy.linalg.solve() while render_engine is set to MulticoreEngine() may cause severe performance drops if numpy is built with OpenBLAS and the OPENBLAS_NUM_THREADS environment variable is not set to 1. This happens because solve() itself also runs in parallel, so when it is called we get N parallel processes, each of which starts N parallel threads.

The solution is to set OPENBLAS_NUM_THREADS=1 or OMP_NUM_THREADS=1 before running a simulation. I think this is worth mentioning in the documentation, because a user not familiar with the code might not figure it out.

Affected demos: beam_into_slab.py, beam.py, plasma-and-beam.py
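
A hypothetical minimal reproduction of the oversubscription pattern, using plain multiprocessing in place of MulticoreEngine (names and sizes here are illustrative only):

import multiprocessing as mp
import numpy as np

def worker(_):
    # Each BLAS-backed solve() call can fan out into its own thread pool.
    a = np.random.rand(500, 500)
    b = np.random.rand(500)
    return np.linalg.solve(a, b)[0]

if __name__ == '__main__':
    # N worker processes, each potentially starting N BLAS threads:
    # roughly N * N threads competing for N cores.
    with mp.Pool() as pool:
        results = pool.map(worker, range(64))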

@Mateasek
Member

Mateasek commented Sep 30, 2020

This is a nice discovery @vsnever! I ran into performance issues with beams a month ago. The beam calculations were taking ages when the observation was done with MulticoreEngine. I wasn't able to trace the problem directly to its source, but I figured it might have something to do with caching. I am being pressed by a deadline, so I worked around it by running a quick observation with a SerialEngine and then a long one with MulticoreEngine. I know the beam attenuator uses cumtrapz from scipy. Could that be the cause?

(Slightly off-topic: for the attenuator, the solution could be to use the integrate function available in Raysect.)

@vsnever
Member Author

vsnever commented Sep 30, 2020

I discovered this when I tried to run a new SOLPS demo for generomak (cherab/solps#39 by @jacklovell) after a system upgrade I made today, so it is not related to beams only.
It seems that cumtrapz does not run in parallel, but solve() does, and solve() is used wherever polynomial interpolation is used.

@Mateasek, have you tried adding

import os
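# NOTE: this must run before numpy is first imported, otherwise
# OpenBLAS will already have initialised its thread pool.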
os.environ['OPENBLAS_NUM_THREADS'] = '1'

at the beginning of your script?

If you are running Anaconda Python, replace it with:

import os
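# Anaconda ships numpy built against Intel MKL, hence the different variable.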
os.environ['MKL_NUM_THREADS'] = '1'

@Mateasek
Member

I will definitely test it. I suspected the way multiprocessing in Python works (at least as far as I understand it). I ran into this issue when working on a multi-ray beam model: with a lot of beams, it took ages to get to the point of raytracing. I suspected that the beam objects were copied to the worker processes with empty caches, and then each of them had to populate its cache (which includes reading OpenADAS data) when emission was requested. Anyway, I will test what you suggested (when I get time to look at it; I am quite overwhelmed by deadlines lately) and I will create a separate issue if it turns out to be something else. Thanks @vsnever.

@jacklovell
Member

jacklovell commented Oct 1, 2020

This is a general problem with Python's multiprocessing module and Numpy: we see it at Culham with codes completely unrelated to Cherab and Raysect too, whenever code uses multiprocessing to call in parallel a function that uses any of the numpy.linalg routines which call parallelised BLAS routines. At Culham we have various installations of Numpy compiled with ATLAS (which has no run-time multi-threading configuration at all), OpenBLAS and Intel MKL.

@vsnever
Member Author

vsnever commented Oct 1, 2020

> This is a general problem with Python's multiprocessing module and Numpy: we see it at Culham with codes completely unrelated to Cherab and Raysect too, whenever code uses multiprocessing to call in parallel a function that uses any of the numpy.linalg routines which make BLAS calls. At Culham we have various installations of Numpy compiled with ATLAS (which has no run-time multi-threading configuration at all), OpenBLAS and Intel MKL.

True. I figured out what the problem was quickly only because I had already encountered it before with my own software, and still I didn't expect linalg calls in low-level functions. Even though this is not a Cherab problem, the instructions should specify that Cherab must be built with a non-parallel version of numpy, or that numpy must be set to run in non-parallel mode via environment variables.

@jacklovell
Member

A quick scan through the code base suggests this is only an issue in the caching functions. Although np.linalg.solve appears in the interpolators too, it's only used when building the interpolator, so it will not be called during a render.

The problem appears to be the call to solve in CachingxD._evaluate. I don't think there's a suitable substitute for this call either, short of including a hand-written implementation of dgesv to solve the system of equations.

We could push the problem over to Raysect and require that MulticoreEngine(processes=n) only ever uses up to n cores; the render engine would then have to prevent BLAS multithreading itself. Interestingly, Numpy's documentation now suggests the threadpoolctl package as a portable way to avoid these sorts of multiprocessing/multithreading problems: this would be a pretty simple way of fixing the immediate issue.
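
For reference, a minimal sketch of how threadpoolctl is typically applied (assuming the package is installed; the limit would need to take effect inside each worker process):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(100, 100)
b = np.random.rand(100)

# Restrict all BLAS implementations (OpenBLAS, MKL, ...) to a single thread
# for the duration of the block; the previous limits are restored on exit.
with threadpool_limits(limits=1, user_api='blas'):
    x = np.linalg.solve(a, b)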

@jacklovell
Member

jacklovell commented Oct 1, 2020

> A quick scan through the code base suggests this is only an issue in the caching functions. Although np.linalg.solve appears in the interpolators too, it's only used when building the interpolator, so it will not be called during a render.

Actually, this is only the case for the 1D interpolators. The 2D and 3D interpolators do have this problem. But #50 has already been proposed to fix this.

@vsnever
Member Author

vsnever commented Oct 1, 2020

For some reason the threadpoolctl solution is not working for me: it fails to introspect numpy to get the thread-pool description and also fails to limit the number of threads at runtime.

@CnlPepper
Member

CnlPepper commented Oct 2, 2020

Interesting discovery. From a design standpoint, the OpenBLAS library is very strongly at fault here: it should not be spawning threads unless explicitly asked to (multi-process applications are common and should be permitted to manage their own resources). I feel distinctly uncomfortable working around a fault in another library in Raysect. The environment variable would be the preferred, if irritating, approach.

I was halfway through writing a new set of interpolators in Raysect that don't require solve() etc. There is actually no need to use solve anywhere: the student who wrote these classes missed that the matrix can be pre-inverted once and then multiplied as appropriate. Unfortunately I really have no time to finish the Raysect interpolators at present; I suspect I won't be able to get back to Raysect for a few months due to my current workload.

By the way, the new interpolators include a much-needed constrained cubic that doesn't extrapolate beyond the points - perfect for spectra, as it won't cross zero if the data doesn't.
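
A sketch of the pre-inversion idea under discussion (hypothetical names, not the actual Raysect classes): the constraint matrix depends only on the grid layout, so it can be inverted once when the interpolator is built, and each evaluation then reduces to a matrix-vector product instead of a solve() call.

import numpy as np

class PolynomialCoefficients:
    # Hypothetical sketch: invert the constraint matrix once at
    # construction time, in the serial part of the program.
    def __init__(self, constraint_matrix):
        self._inverse = np.linalg.inv(constraint_matrix)

    # Called per evaluation: a plain matrix-vector product on
    # precomputed data, with no per-call solve().
    def coefficients(self, rhs):
        return self._inverse @ rhs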

@jacklovell
Member

It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects.

@Mateasek
Member

Mateasek commented Oct 5, 2020

> It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects.

I agree with @jacklovell: this seems important, and if it is within my capabilities I will be happy to help.

@CnlPepper
Member

> It sounds like the new interpolators would fix both this and #50 then. Is this the work going on in the raysect/source/feature/interpolation branch? Depending on how much outstanding work there is in that branch, it may be best to focus some Cherab effort on completing that as it's very much beneficial to both projects.

They would fix the issues, and that is the branch I was working in. I may have a quick look over the code this weekend to see how much work remains. If I remember correctly, the cubic coefficient calculation is done; just the differentials and the data-handling infrastructure need implementing. Adding the extrapolation and differentials that are present in the Cherab interpolators would be a second phase of work. It is a fair bit of tedious development, but not hugely complex. Writing the tests will be quite challenging though; that would be a good area for someone else to focus on, since the tests can be set up before the code is complete. See the Cherab test setup as a reference for what I would like rebuilt.

@vsnever
Member Author

vsnever commented Nov 12, 2020

I will also be happy to help with the new interpolators and caching functions. By the way, the current implementation of caching uses the general find_index routine, ignoring the fact that the function is sampled on a regular grid. This is not optimal, because find_index iterates over the grid, which is not required if the grid is regularly spaced.
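
For a regularly spaced grid the cell index can be computed directly rather than searched for; a minimal sketch (hypothetical helper, not the existing find_index API):

# On a regular grid with spacing dx starting at x_min, the cell index
# follows from simple arithmetic; clamp it to the valid range 0..n-2.
def regular_grid_index(x, x_min, dx, n):
    i = int((x - x_min) / dx)
    return min(max(i, 0), n - 2)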

Until the fix is ready, should we add a workaround in all affected demos?
Something like:

from os import environ
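# NOTE: these assignments must happen before numpy is first imported
# anywhere in the process.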
environ['OMP_NUM_THREADS'] = '1'
environ['OPENBLAS_NUM_THREADS'] = '1'
environ['MKL_NUM_THREADS'] = '1'
environ['NUMEXPR_NUM_THREADS'] = '1'
environ['VECLIB_MAXIMUM_THREADS'] = '1'

at the beginning of the scripts.

@jacklovell
Member

I don't think we should pollute the demos with additional environment variables: it adds noise and makes them more confusing for new users. The demos produce the correct output at the moment, and performance will be sufficient for most use cases. If users find the performance degradation too debilitating, we can point them to this issue and the workarounds in it until we can move away from linalg.solve in the interpolators.

@CnlPepper perhaps we could pin this issue on GitHub so that it's easily visible to others?

@vsnever
Member Author

vsnever commented Nov 12, 2020

> The demos produce the correct output at the moment and performance will be sufficient for most use cases.

The performance will depend on the number of parallel threads per CPU core. I ran these demos on a 12-core CPU with 24 parallel threads, which means that when linalg.solve() was called, 24² = 576 threads were running in parallel. This just hung my system, and I had to stop the execution or it would never have finished. So maybe instead of setting the environment variables we could explicitly limit the number of parallel processes to 4 in these demos: MulticoreEngine(4). Or just use the SerialEngine().

@CnlPepper
Member

CnlPepper commented Nov 12, 2020

@jacklovell I think you should have permission to pin issues. I think it would be a good idea.

I would also recommend adding a note to the repository README and the documentation.

@vsnever the cache objects were rapidly cobbled together by a student; there are a number of inefficiencies that need addressing. The new interpolation methods in Raysect should make it easier to completely rebuild them.

@jacklovell
Member

I've created #363 to rewrite the caching objects using the new interpolation utilities provided by Raysect. Once that's approved and we've deprecated the original Cherab interpolators, we can close this issue.
