Replace known-problematic variance algorithm (for column_variance) #1320

Open · wants to merge 1 commit into base: `dev`
Conversation

@andrewjradcliffe (Author)

The existing algorithm for `column_variance` uses the textbook formula (`E[X^2] - E[X]^2`), which is well established to have numerical issues. While the intention of the extant algorithm (traversal of the elements in column-major order) is apparent, we should not sacrifice precision when we do not need to -- the two-pass algorithm for variance (N.B. the existing algorithm is already a two-pass algorithm anyway), using the formula `E[(x - E[x])(x - E[x])]`, can be substituted without issue.
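
As a rough sketch of what the substitution amounts to (hypothetical code, not necessarily the exact implementation in this PR, and assuming `column_variance` produces one population-variance entry per row), a two-pass per-row variance that still visits the elements in column-major order could look like this:

```rust
use nalgebra::DMatrix;

// Hypothetical sketch: per-row population variance via the two-pass formula,
// visiting the elements column by column (column-major order) in each pass.
fn column_variance_two_pass(m: &DMatrix<f64>) -> Vec<f64> {
    let (nrows, ncols) = m.shape();
    if ncols == 0 {
        return vec![0.0; nrows];
    }
    let denom = ncols as f64;

    // First pass: the mean of each row, E[x].
    let mut means = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            means[i] += m[(i, j)];
        }
    }
    for mean in &mut means {
        *mean /= denom;
    }

    // Second pass: the mean squared deviation of each row, E[(x - E[x])(x - E[x])].
    let mut vars = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            let d = m[(i, j)] - means[i];
            vars[i] += d * d;
        }
    }
    for v in &mut vars {
        *v /= denom;
    }
    vars
}
```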

Notably, the other variance implementations in the `statistics` module use `E[(x - E[x])(x - E[x])]`. Loss of precision aside, keeping the existing implementation of `column_variance` leads to the obvious absurdity:

```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
assert_ne!(m.column_variance().transpose(), m.transpose().row_variance());
```

We can eliminate both the loss of precision and the glaring inconsistency by switching to the implementation provided by this PR.

For a comprehensive analysis of variance algorithms, see [this reference](https://ds.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf), in particular Table 2. The "two-pass" algorithm described in the paper is the implementation given in this PR. In terms of simplicity (hence ease of maintenance), "two-pass" is a suitable choice; in terms of runtime performance and precision, it strikes a good balance (cf. Youngs & Cramer and "textbook"). Furthermore, it is consistent with the variance algorithm used in the other `*variance` methods in the `statistics` module.
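
To make the precision issue concrete, here is a small, self-contained comparison (illustrative only, not code from this PR): with `f32` data carrying a common offset around `1.0e4`, the squares are around `1.0e8` and already exhaust the `f32` mantissa, so the textbook formula loses the small true variance to cancellation, while the two-pass formula computes it from exactly representable deviations.

```rust
// Illustrative comparison of the two formulas on f32 data with a large common
// offset. The exact population variance of [1.0e4, 1.0e4 + 1.0, 1.0e4 + 2.0]
// is 2/3; the textbook result is expected to drift because the squares
// (~1.0e8) have an f32 spacing of about 8.
fn main() {
    let xs: [f32; 3] = [1.0e4, 1.0e4 + 1.0, 1.0e4 + 2.0];
    let n = xs.len() as f32;

    let mean = xs.iter().sum::<f32>() / n;

    // Textbook: E[X^2] - E[X]^2.
    let textbook = xs.iter().map(|&x| x * x).sum::<f32>() / n - mean * mean;

    // Two-pass: E[(x - E[x])(x - E[x])].
    let two_pass = xs.iter().map(|&x| (x - mean) * (x - mean)).sum::<f32>() / n;

    println!("textbook = {textbook}, two-pass = {two_pass}, exact = {}", 2.0 / 3.0);
}
```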

@tpdickso (Collaborator) left a comment:

LGTM! We might be able to use `compress_columns` here, to make this effectively just a clone of the implementation of `variance` -- however, I think that might introduce unnecessary column clones when dealing with an unconstrained scalar type and dynamically-sized columns, so I think this element-wise approach makes sense.

@andrewjradcliffe (Author):

I considered the `compress_columns` route, but wanted to avoid unnecessary allocations with dynamically-sized data.

@Andlon (Collaborator) commented Apr 3, 2024:

Stats aren't really my forte, but this looks good to me. However, if there's an improvement in precision, I think it would be good to actually verify this with a test that fails with the old implementation and passes with the new one.

I'm also concerned about the general lack of tests for this method. There appears to be only the single doc test in its documentation. This is not the fault of @andrewjradcliffe, of course, but it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

@andrewjradcliffe (Author):

> it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

Indeed, it would be nice to have test coverage for the statistics-related methods -- though, computation of variance is really the only place mistakes can occur.

I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new?

@Andlon (Collaborator) commented Apr 11, 2024:

> it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

> Indeed, it would be nice to have test coverage for the statistics-related methods -- though, computation of variance is really the only place mistakes can occur.

Well, I think ideally we would already have tests for:

- a square matrix
- a rectangular matrix m x n with m < n
- a rectangular matrix m x n with m > n
- edge cases: 0 x 0, m x 0, 0 x n

Unfortunately we don't have this at the moment, so we can only hope that the existing implementation works correctly for these cases. However, I'm a little reluctant to merge a new implementation without these tests, at the very least to make sure we don't regress on any of these cases. Would you be able to add these tests?
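
For the shapes listed above, a partial sketch of such tests might look like the following (hypothetical: the expected values assume `column_variance` returns, for each row, the population variance of that row's entries, and the edge-case behaviour would need to be confirmed against the actual implementation):

```rust
use nalgebra::{DMatrix, Matrix2x3, Matrix3x2};

// Hypothetical shape tests; expected values assume per-row population variance.
#[test]
fn column_variance_rectangular_wide() {
    // 2 x 3 (m < n): each row is an arithmetic sequence with variance 2/3.
    let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
    let v = m.column_variance();
    for i in 0..2 {
        assert!((v[i] - 2.0 / 3.0).abs() < 1.0e-12);
    }
}

#[test]
fn column_variance_rectangular_tall() {
    // 3 x 2 (m > n): each row holds two values differing by 1, so variance 0.25.
    let m = Matrix3x2::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
    let v = m.column_variance();
    for i in 0..3 {
        assert!((v[i] - 0.25).abs() < 1.0e-12);
    }
}

#[test]
fn column_variance_empty() {
    // 0 x 0 edge case: no rows, so the result should be empty.
    let m = DMatrix::<f64>::zeros(0, 0);
    assert_eq!(m.column_variance().len(), 0);
}
```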

> I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new?

I think, if the claim is that the new implementation has higher precision, it would indeed be prudent to somehow try to verify that it actually delivers this — at least for one test case.
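
A sketch of what such a precision test might look like (illustrative only; the data are chosen so that the expected value is the same along either axis, and the per-row, divide-by-n semantics are an assumption to be checked):

```rust
use nalgebra::Matrix2x2;

// Hypothetical precision test: a symmetric 2 x 2 matrix, so the population
// variance along rows and columns is the same known value, 0.25. With the
// offset 1.0e8 the squares (~1.0e16) exceed 2^53, so E[X^2] - E[X]^2 loses
// the answer to cancellation, while the deviations (x - mean) = +/-0.5 are
// exactly representable and the two-pass result is exactly 0.25.
#[test]
fn column_variance_large_offset() {
    let a = 1.0e8_f64;
    let b = 1.0e8_f64 + 1.0;
    let m = Matrix2x2::new(a, b, b, a);

    for &v in m.column_variance().iter() {
        assert!((v - 0.25).abs() <= 1.0e-12);
    }
}
```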
