ENH: Decrease wall time of ma.cov and ma.corrcoef #26285

Open · wants to merge 9 commits into main
Conversation

christopher-titchen

This PR significantly decreases the wall time of `ma.cov` and `ma.corrcoef`, both of which currently perform slowly. Estimating covariance and correlation matrices, particularly from data containing missing observations, is important in a number of fields such as forecasting.

This improvement is achieved by:

  • Casting the NOT mask in `ma.extras._covhelper` from its current `np.int64` to `np.float32` when the integers in the $N_{X_{i}X_{j}} - ddof$ normalisation can be represented exactly in single precision (i.e. $N \le 2^{24}$), and to `np.float64` otherwise, which speeds up computation of the denominator in both cases.
  • Replacing the call to `ma.dot` in `ma.cov` and `ma.corrcoef` with `np.dot` when estimating $\Sigma_{X_{i}X_{j}}$. This saves one dot product: `ma.dot` performs an additional dot product to compute the resultant mask, which is redundant in `ma.cov` and `ma.corrcoef` because the normalisation factor already determines it, namely the locations where $N_{X_{i}X_{j}} - ddof \le 0$.
  • Removing the loop in `ma.corrcoef` that estimates the normalisation denominator, $\sigma_{X_{i}}\sigma_{X_{j}}$, pairwise for each combination of variables. Dropping this pairwise estimation of the denominator makes the correlation coefficients slightly more naïve (depending on the mask), but significantly improves wall time. I do not believe this is an issue: in both this PR and the current implementation of `ma.extras._covhelper`, the estimation of $X_{i} - \bar{X_{i}}$ and $X_{j} - \bar{X_{j}}$ is already naïve in that it uses the sample mean rather than the pairwise sample mean used by other implementations (e.g. `stats::cov` in R). This assumes that $\bar{X_{i}}$ and $\bar{X_{j}}$ are the same for every pairwise combination of $i$ and $j$, which holds only when there is complete temporal alignment in the observations. Consequently, even the current implementations of `ma.cov` and `ma.corrcoef` return naïve estimates of the covariance structure and correlation coefficients in the presence of masked values. Note: `ma.cov` returns exactly the same result under this PR as under the current implementation.
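To make the first two points above concrete, here is a minimal, self-contained sketch of the idea; the helper name and structure are illustrative, not the actual `ma.extras._covhelper` code. The pairwise counts $N_{X_{i}X_{j}} - ddof$ are computed as a dot product of the NOT mask in the narrowest exact floating-point dtype, and the result mask falls out of the counts for free:

```python
import numpy as np

def _pairwise_counts(mask, ddof=0):
    """Illustrative helper: pairwise valid-observation counts from a mask.

    `mask` is (n_observations, n_variables), True where a value is missing.
    """
    n_obs = mask.shape[0]
    # Counts never exceed n_obs, so float32 is exact when n_obs <= 2**24.
    dtype = np.float32 if n_obs <= 2**24 else np.float64
    valid = (~mask).astype(dtype)
    # N_ij - ddof: number of jointly observed values for each variable pair.
    counts = np.dot(valid.T, valid) - ddof
    # Pairs with N_ij - ddof <= 0 cannot be normalised; mask them in the result.
    result_mask = counts <= 0
    return counts, result_mask

# Example: 5 observations of 2 variables, one missing value in each column.
mask = np.array([[False, False],
                 [True,  False],
                 [False, True],
                 [False, False],
                 [False, False]])
counts, bad = _pairwise_counts(mask, ddof=1)
```

Here `counts` holds $N_{X_{i}X_{j}} - ddof$ (3 on the diagonal, 2 off-diagonal for this mask), and `bad` is the result mask that would otherwise require the extra dot product inside `ma.dot`.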

Further work should be done in a future PR to implement a numerically stable single-pass algorithm in C, such as Welford's online algorithm adapted for missing values and extended to accumulate the sums of squared deviations from the means, for estimating the covariance and correlation coefficients, to ensure alignment with other popular implementations. As mentioned above, `stats::cov` in R has a great implementation. If anybody is interested, I am happy to share my parallelised implementation in Numba (LLVM). That said, I believe these naïve approaches have their place as extremely fast estimators of the covariance structure and correlation coefficients: in the majority of cases they return similar results within a small tolerance in a fraction of the time.
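As a rough illustration of the single-pass direction mentioned above, a Welford-style covariance update with pairwise deletion of missing values might look like the following pure-Python sketch (the function name is hypothetical, and this is not the proposed C implementation):

```python
import math

def pairwise_welford_cov(x, y, ddof=1):
    """Single-pass sample covariance of two sequences, skipping pairs
    where either value is missing (represented here as NaN)."""
    n = 0
    mean_x = mean_y = 0.0
    c = 0.0  # running sum of co-deviations
    for xi, yi in zip(x, y):
        if math.isnan(xi) or math.isnan(yi):
            continue  # pairwise deletion of missing observations
        n += 1
        dx = xi - mean_x       # deviation from the *previous* mean of x
        mean_x += dx / n
        mean_y += (yi - mean_y) / n
        c += dx * (yi - mean_y)  # pre-update dx times post-update dy
    if n - ddof <= 0:
        return math.nan  # not enough jointly observed pairs to normalise
    return c / (n - ddof)

# Only the jointly observed pairs (1, 2) and (2, 4) contribute here.
x = [1.0, 2.0, float("nan"), 4.0]
y = [2.0, 4.0, 6.0, float("nan")]
cov_xy = pairwise_welford_cov(x, y)
```

Unlike the sample-mean approach in `_covhelper`, this uses the pairwise means of the jointly observed values, which is the behaviour `stats::cov` exposes via pairwise-complete observations.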

A test has also been added for ma.extras._covhelper to reflect the changes.

The benchmarks and code to generate them are below.

**`ma.cov`**

|                        | Current        | PR            | Wall Time Reduction |
| ---------------------- | -------------- | ------------- | ------------------- |
| Small (100 x 10)       | $653 \ \mu s$  | $325 \ \mu s$ | $50.2$%             |
| Medium (1,000 x 100)   | $96.8 \ ms$    | $10.2 \ ms$   | $89.5$%             |
| Large (10,000 x 1,000) | $82 \ s$       | $2.81 \ s$    | $96.6$%             |

**`ma.corrcoef`**

|                        | Current          | PR            | Wall Time Reduction |
| ---------------------- | ---------------- | ------------- | ------------------- |
| Small (100 x 10)       | $1.78 \ s$       | $546 \ \mu s$ | $99.97$%            |
| Medium (1,000 x 100)   | $181 \ s$        | $36.7 \ ms$   | $99.98$%            |
| Large (10,000 x 1,000) | $\gt 3600 \ s$\* | $5.81 \ s$    | $\gt 99.84$%\*      |

\* The function was still evaluating after an hour, and would probably take between 5–8 hours to finish based on the previous two results.

%load_ext autoreload
%autoreload 2

import gc
import timeit

import numpy as np

# Configure the array sizes.
array_sizes = {
    "Small (100 x 10)": (100, 10),
    "Medium (1,000 x 100)": (1000, 100),
    "Large (10,000 x 1,000)": (10000, 1000),
}

# Set the proportion of masked values.
prop_mask = 0.2

# Create masked arrays of each specified size.
rng = np.random.default_rng()
arrays = {
    desc: rng.random(n, dtype=np.float64) for desc, n in array_sizes.items()
}
masked_arrays = {
    desc: np.ma.array(arr, mask=(arr <= prop_mask)) for desc, arr in arrays.items()
}

# Tidy up before benchmarking.
del array_sizes, prop_mask, rng, arrays
gc.collect()

# Benchmark the functions.
!git rev-parse --short HEAD

for fun in [np.ma.cov, np.ma.corrcoef]:
    print("\n", "=" * 72, "\n", fun.__name__, "\n", "=" * 72, sep="")
    for desc, arr in masked_arrays.items():
        print("\n", desc, "\n", "-" * len(desc), "\n", sep="")
        if desc.split()[0] == "Large":
            %timeit -t -r1 -n1 fun(arr)
        else:
            %timeit -t fun(arr)

@rgommers rgommers added the component: numpy.ma masked arrays label Apr 16, 2024
@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Apr 17, 2024