ENH: Decrease wall time of ma.cov and ma.corrcoef
#26285
This PR significantly decreases the wall time of the ma.cov and ma.corrcoef functions, which currently perform slowly. The estimation of covariance and correlation matrices, particularly from data containing missing observations, is important in a number of fields such as forecasting.

This improvement is achieved by:
- [...] ma.extras._covhelper from its current np.int64 to np.float32 if the integers in the [...] np.float64, resulting in faster computation of the denominator in both cases.
- [...] ma.dot in ma.cov and ma.corrcoef when estimating [...] np.dot. This results in one less dot product, as ma.dot contains an additional dot product to compute the resultant mask, which is obsolete in ma.cov and ma.corrcoef as we can simply use the factor to determine it instead by identifying locations where [...]
- [...] ma.corrcoef that estimates the normalisation denominator, [...] ma.extras._covhelper, the estimation of [...] stats::cov in R). This makes the assumption that ma.cov and ma.corrcoef return a naïve estimation of the covariance structure and correlation coefficients respectively in the presence of masked values.

Note: ma.cov returns exactly the same result in this PR's implementation as the current implementation.

Further work should be done in a future PR to incorporate a stable single-pass algorithm in C, such as Welford's online algorithm adapted for the presence of missing values and including the calculation of the sum of squared deviations from the means, for estimating the covariance and correlation coefficients, to ensure alignment with other popular implementations. As mentioned previously, stats::cov in R has a great implementation. If anybody is interested, I am happy to send them my parallelised implementation in Numba (LLVM). However, I do believe these naïve approaches have their place as extremely fast estimators of the covariance structure and correlation coefficients, as in the majority of cases they return similar results within a small tolerance in a fraction of the time.

A test has also been added for ma.extras._covhelper to reflect the changes.

The benchmarks and code to generate them are below.
ma.cov
ma.corrcoef
* The function was still evaluating after an hour, and would probably take between 5–8 hours to finish based on the previous two results.
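
For readers who want to see the idea behind the faster denominator in isolation, here is a minimal sketch. It is not the code in this PR: the helper names pairwise_counts and naive_masked_cov are hypothetical, and the 2**24 cut-off is an assumption based on the fact that float32 represents every integer up to 2**24 exactly. It counts the observations shared by each pair of variables with a floating-point dot product of the not-mask, then uses a plain np.dot on zero-filled, centred data and masks entries whose denominator is not positive.

```python
import numpy as np


def pairwise_counts(x):
    """Count, for every pair of rows, the columns observed in both rows.

    Hypothetical helper: x is a 2-D masked array, variables in rows.
    """
    notmask = ~np.ma.getmaskarray(x)
    n_obs = notmask.shape[1]
    # Assumption: no count can exceed n_obs, and float32 holds every
    # integer up to 2**24 exactly, so float32 is safe below that limit.
    dtype = np.float32 if n_obs <= 2**24 else np.float64
    nm = notmask.astype(dtype)
    # Floating-point matrix product instead of an integer one.
    return nm @ nm.T


def naive_masked_cov(x, ddof=1):
    """Naive pairwise covariance of the rows of a masked array (sketch)."""
    x = np.ma.array(x, ndmin=2, dtype=float)
    denom = pairwise_counts(x) - ddof
    # Centre each row on the mean of its unmasked values.
    centred = x - x.mean(axis=1, keepdims=True)
    # Masked entries contribute zero to the products.
    filled = centred.filled(0.0)
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = np.dot(filled, filled.T) / denom
    # The mask comes from the denominator, not from a second dot product.
    return np.ma.masked_where(denom <= 0, cov)


data = np.ma.array(
    [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 0.0, 8.0], [1.0, 0.0, 1.0, 0.0]],
    mask=[[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
)
print(pairwise_counts(data))
print(naive_masked_cov(data))
```

On small inputs like this one, the result can be compared against np.ma.cov(data) as a sanity check.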