ENH: Decrease wall time of ma.cov and ma.corrcoef #26285

Open · wants to merge 9 commits into main
Conversation

christopher-titchen

This PR significantly decreases the wall time of `ma.cov` and `ma.corrcoef`, both of which currently perform slowly. Estimating covariance and correlation matrices, particularly from data containing missing observations, is important in a number of fields such as forecasting.

This improvement is achieved by:

  • Casting the NOT mask in `ma.extras._covhelper` from its current `np.int64` to `np.float32` when the integers in the $N_{X_{i}X_{j}} - ddof$ normalisation can be represented exactly in single precision (i.e. $N \le 2^{24}$), and to `np.float64` otherwise, which speeds up computation of the denominator in both cases.
  • Replacing the call to `ma.dot` in `ma.cov` and `ma.corrcoef` with `np.dot` when estimating $\Sigma_{X_{i}X_{j}}$. This saves one dot product: `ma.dot` performs an additional dot product to compute the resultant mask, which is redundant in `ma.cov` and `ma.corrcoef` because the normalisation factor already determines it, namely the locations where $N_{X_{i}X_{j}} - ddof \le 0$.
  • Removing the loop in `ma.corrcoef` that estimates the normalisation denominator, $\sigma_{X_{i}}\sigma_{X_{j}}$, pairwise for each combination of variables. Dropping this pairwise estimation of the denominator makes the correlation coefficients slightly more naïve (depending on the mask), but significantly improves wall time. I do not believe this is an issue: in both this PR and the current implementation of `ma.extras._covhelper`, the estimation of $X_{i} - \bar{X_{i}}$ and $X_{j} - \bar{X_{j}}$ is already naïve in that it uses the sample mean rather than the pairwise sample mean used by other implementations (e.g. `stats::cov` in R). This assumes that $\bar{X_{i}}$ and $\bar{X_{j}}$ are the same for every pairwise combination of $i$ and $j$, which holds only when there is complete temporal alignment in the observations. Consequently, even the current implementations of `ma.cov` and `ma.corrcoef` return naïve estimates of the covariance structure and correlation coefficients in the presence of masked values. Note: `ma.cov` returns exactly the same result under this PR as under the current implementation.
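To make the first two points above concrete, here is a minimal, self-contained sketch of the idea; the helper name and structure are illustrative, not the actual `ma.extras._covhelper` code. The pairwise counts $N_{X_{i}X_{j}} - ddof$ are computed as a dot product of the NOT mask in the narrowest exact floating-point dtype, and the result mask falls out of the counts for free:

```python
import numpy as np

def _pairwise_counts(mask, ddof=0):
    """Illustrative helper: pairwise valid-observation counts from a mask.

    `mask` is (n_observations, n_variables), True where a value is missing.
    """
    n_obs = mask.shape[0]
    # Counts never exceed n_obs, so float32 is exact when n_obs <= 2**24.
    dtype = np.float32 if n_obs <= 2**24 else np.float64
    valid = (~mask).astype(dtype)
    # N_ij - ddof: number of jointly observed values for each variable pair.
    counts = np.dot(valid.T, valid) - ddof
    # Pairs with N_ij - ddof <= 0 cannot be normalised; mask them in the result.
    result_mask = counts <= 0
    return counts, result_mask

# Example: 5 observations of 2 variables, one missing value in each column.
mask = np.array([[False, False],
                 [True,  False],
                 [False, True],
                 [False, False],
                 [False, False]])
counts, bad = _pairwise_counts(mask, ddof=1)
```

Here `counts` holds $N_{X_{i}X_{j}} - ddof$ (3 on the diagonal, 2 off-diagonal for this mask), and `bad` is the result mask that would otherwise require the extra dot product inside `ma.dot`.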

Further work should be done in a future PR to implement a numerically stable single-pass algorithm in C, such as Welford's online algorithm adapted for missing values and extended to accumulate the sums of squared deviations from the means, for estimating the covariance and correlation coefficients, to ensure alignment with other popular implementations. As mentioned above, `stats::cov` in R has a great implementation. If anybody is interested, I am happy to share my parallelised implementation in Numba (LLVM). That said, I believe these naïve approaches have their place as extremely fast estimators of the covariance structure and correlation coefficients: in the majority of cases they return similar results within a small tolerance in a fraction of the time.
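As a rough illustration of the single-pass direction mentioned above, a Welford-style covariance update with pairwise deletion of missing values might look like the following pure-Python sketch (the function name is hypothetical, and this is not the proposed C implementation):

```python
import math

def pairwise_welford_cov(x, y, ddof=1):
    """Single-pass sample covariance of two sequences, skipping pairs
    where either value is missing (represented here as NaN)."""
    n = 0
    mean_x = mean_y = 0.0
    c = 0.0  # running sum of co-deviations
    for xi, yi in zip(x, y):
        if math.isnan(xi) or math.isnan(yi):
            continue  # pairwise deletion of missing observations
        n += 1
        dx = xi - mean_x       # deviation from the *previous* mean of x
        mean_x += dx / n
        mean_y += (yi - mean_y) / n
        c += dx * (yi - mean_y)  # pre-update dx times post-update dy
    if n - ddof <= 0:
        return math.nan  # not enough jointly observed pairs to normalise
    return c / (n - ddof)

# Only the jointly observed pairs (1, 2) and (2, 4) contribute here.
x = [1.0, 2.0, float("nan"), 4.0]
y = [2.0, 4.0, 6.0, float("nan")]
cov_xy = pairwise_welford_cov(x, y)
```

Unlike the sample-mean approach in `_covhelper`, this uses the pairwise means of the jointly observed values, which is the behaviour `stats::cov` exposes via pairwise-complete observations.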

A test has also been added for ma.extras._covhelper to reflect the changes.

The benchmarks and code to generate them are below.

**`ma.cov`**

|                        | Current        | PR            | Wall Time Reduction |
| ---------------------- | -------------- | ------------- | ------------------- |
| Small (100 x 10)       | $653 \ \mu s$  | $325 \ \mu s$ | $50.2$%             |
| Medium (1,000 x 100)   | $96.8 \ ms$    | $10.2 \ ms$   | $89.5$%             |
| Large (10,000 x 1,000) | $82 \ s$       | $2.81 \ s$    | $96.6$%             |

**`ma.corrcoef`**

|                        | Current          | PR            | Wall Time Reduction |
| ---------------------- | ---------------- | ------------- | ------------------- |
| Small (100 x 10)       | $1.78 \ s$       | $546 \ \mu s$ | $99.97$%            |
| Medium (1,000 x 100)   | $181 \ s$        | $36.7 \ ms$   | $99.98$%            |
| Large (10,000 x 1,000) | $\gt 3600 \ s$\* | $5.81 \ s$    | $\gt 99.84$%\*      |

\* The function was still evaluating after an hour, and would probably take between 5–8 hours to finish based on the previous two results.

%load_ext autoreload
%autoreload 2

import gc
import timeit

import numpy as np

# Configure the array sizes.
array_sizes = {
    "Small (100 x 10)": (100, 10),
    "Medium (1,000 x 100)": (1000, 100),
    "Large (10,000 x 1,000)": (10000, 1000),
}

# Set the proportion of masked values.
prop_mask = 0.2

# Create masked arrays of each specified size.
rng = np.random.default_rng()
arrays = {
    desc: rng.random(n, dtype=np.float64) for desc, n in array_sizes.items()
}
masked_arrays = {
    desc: np.ma.array(arr, mask=(arr <= prop_mask)) for desc, arr in arrays.items()
}

# Tidy up before benchmarking.
del array_sizes, prop_mask, rng, arrays
gc.collect()

# Benchmark the functions.
!git rev-parse --short HEAD

for fun in [np.ma.cov, np.ma.corrcoef]:
    print("\n", "=" * 72, "\n", fun.__name__, "\n", "=" * 72, sep="")
    for desc, arr in masked_arrays.items():
        print("\n", desc, "\n", "-" * len(desc), "\n", sep="")
        if desc.split()[0] == "Large":
            %timeit -t -r1 -n1 fun(arr)
        else:
            %timeit -t fun(arr)

@rgommers rgommers added the component: numpy.ma masked arrays label Apr 16, 2024
@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Apr 17, 2024