Replace known-problematic variance algorithm (for column_variance) #1320

Open · wants to merge 1 commit into base: `dev`
Conversation

@andrewjradcliffe (Author)

The existing algorithm for `column_variance` uses the textbook formula (`E[X^2] - E[X]^2`), which is well established to have numerical issues. While the intention of the extant algorithm (traversal of the elements in column-major order) is apparent, we should not sacrifice precision when we do not need to -- the two-pass algorithm for variance (N.B. the existing algorithm is already a two-pass algorithm anyway), using the formula `E[(x - E[x])(x - E[x])]`, can be substituted without issue.
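
As a rough sketch of what the substitution amounts to (hypothetical code, not necessarily the exact implementation in this PR, and assuming `column_variance` produces one population-variance entry per row), a two-pass per-row variance that still visits the elements in column-major order could look like this:

```rust
use nalgebra::DMatrix;

// Hypothetical sketch: per-row population variance via the two-pass formula,
// visiting the elements column by column (column-major order) in each pass.
fn column_variance_two_pass(m: &DMatrix<f64>) -> Vec<f64> {
    let (nrows, ncols) = m.shape();
    if ncols == 0 {
        return vec![0.0; nrows];
    }
    let denom = ncols as f64;

    // First pass: the mean of each row, E[x].
    let mut means = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            means[i] += m[(i, j)];
        }
    }
    for mean in &mut means {
        *mean /= denom;
    }

    // Second pass: the mean squared deviation of each row, E[(x - E[x])(x - E[x])].
    let mut vars = vec![0.0; nrows];
    for j in 0..ncols {
        for i in 0..nrows {
            let d = m[(i, j)] - means[i];
            vars[i] += d * d;
        }
    }
    for v in &mut vars {
        *v /= denom;
    }
    vars
}
```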

Notably, the other variance implementations in the `statistics` module use `E[(x - E[x])(x - E[x])]`. Loss of precision aside, keeping the existing implementation of `column_variance` leads to the obvious absurdity:

```rust
use nalgebra::Matrix2x3;

let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
assert_ne!(m.column_variance().transpose(), m.transpose().row_variance());
```

We can eliminate both the loss of precision and the glaring inconsistency by switching to the implementation provided by this PR.

For a comprehensive analysis of variance algorithms, see [this reference](https://ds.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf), in particular Table 2. The "two-pass" algorithm described in the paper is the implementation given in this PR. In terms of simplicity (hence ease of maintenance), "two-pass" is a suitable choice; in terms of runtime performance and precision, it strikes a good balance (cf. Youngs & Cramer and "textbook"). Furthermore, it is consistent with the variance algorithm used in the other `*variance` methods in the `statistics` module.
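
To make the precision issue concrete, here is a small, self-contained comparison (illustrative only, not code from this PR): with `f32` data carrying a common offset around `1.0e4`, the squares are around `1.0e8` and already exhaust the `f32` mantissa, so the textbook formula loses the small true variance to cancellation, while the two-pass formula computes it from exactly representable deviations.

```rust
// Illustrative comparison of the two formulas on f32 data with a large common
// offset. The exact population variance of [1.0e4, 1.0e4 + 1.0, 1.0e4 + 2.0]
// is 2/3; the textbook result is expected to drift because the squares
// (~1.0e8) have an f32 spacing of about 8.
fn main() {
    let xs: [f32; 3] = [1.0e4, 1.0e4 + 1.0, 1.0e4 + 2.0];
    let n = xs.len() as f32;

    let mean = xs.iter().sum::<f32>() / n;

    // Textbook: E[X^2] - E[X]^2.
    let textbook = xs.iter().map(|&x| x * x).sum::<f32>() / n - mean * mean;

    // Two-pass: E[(x - E[x])(x - E[x])].
    let two_pass = xs.iter().map(|&x| (x - mean) * (x - mean)).sum::<f32>() / n;

    println!("textbook = {textbook}, two-pass = {two_pass}, exact = {}", 2.0 / 3.0);
}
```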

@tpdickso (Collaborator) left a comment:

LGTM! We might be able to use `compress_columns` here, to make this effectively just a clone of the implementation of `variance` -- however, I think that might introduce unnecessary column clones when dealing with an unconstrained scalar type and dynamically-sized columns, so I think this element-wise approach makes sense.

@andrewjradcliffe (Author):

I considered the `compress_columns` route, but wanted to avoid unnecessary allocations with dynamically-sized data.

@Andlon (Collaborator) commented Apr 3, 2024:

Stats aren't really my forte, but this looks good to me. However, if there's an improvement in precision, I think it would be good to actually verify this with a test that fails with the old implementation and passes with the new one.

I'm also concerned about the general lack of tests for this method. There appears to be only the single doc test in its documentation. This is not the fault of @andrewjradcliffe, of course, but it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

@andrewjradcliffe (Author):

> it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

Indeed, it would be nice to have test coverage for the statistics-related methods -- though, computation of variance is really the only place mistakes can occur.

I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new?

@Andlon (Collaborator) commented Apr 11, 2024:

> it would be nice to have a few more tests that ensure that we're not introducing any new bugs here.

> Indeed, it would be nice to have test coverage for the statistics-related methods -- though, computation of variance is really the only place mistakes can occur.

Well, I think ideally we would already have tests for:

- a square matrix
- a rectangular matrix m x n with m < n
- a rectangular matrix m x n with m > n
- edge cases: 0 x 0, m x 0, 0 x n

Unfortunately we don't have this at the moment, so we can only hope that the existing implementation works correctly for these cases. However, I'm a little reluctant to merge a new implementation without these tests, at the very least to make sure we don't regress on any of these cases. Would you be able to add these tests?
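
For the shapes listed above, a partial sketch of such tests might look like the following (hypothetical: the expected values assume `column_variance` returns, for each row, the population variance of that row's entries, and the edge-case behaviour would need to be confirmed against the actual implementation):

```rust
use nalgebra::{DMatrix, Matrix2x3, Matrix3x2};

// Hypothetical shape tests; expected values assume per-row population variance.
#[test]
fn column_variance_rectangular_wide() {
    // 2 x 3 (m < n): each row is an arithmetic sequence with variance 2/3.
    let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
    let v = m.column_variance();
    for i in 0..2 {
        assert!((v[i] - 2.0 / 3.0).abs() < 1.0e-12);
    }
}

#[test]
fn column_variance_rectangular_tall() {
    // 3 x 2 (m > n): each row holds two values differing by 1, so variance 0.25.
    let m = Matrix3x2::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
    let v = m.column_variance();
    for i in 0..3 {
        assert!((v[i] - 0.25).abs() < 1.0e-12);
    }
}

#[test]
fn column_variance_empty() {
    // 0 x 0 edge case: no rows, so the result should be empty.
    let m = DMatrix::<f64>::zeros(0, 0);
    assert_eq!(m.column_variance().len(), 0);
}
```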

> I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new?

I think, if the claim is that the new implementation has higher precision, it would indeed be prudent to somehow try to verify that it actually delivers this — at least for one test case.
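
A sketch of what such a precision test might look like (illustrative only; the data are chosen so that the expected value is the same along either axis, and the per-row, divide-by-n semantics are an assumption to be checked):

```rust
use nalgebra::Matrix2x2;

// Hypothetical precision test: a symmetric 2 x 2 matrix, so the population
// variance along rows and columns is the same known value, 0.25. With the
// offset 1.0e8 the squares (~1.0e16) exceed 2^53, so E[X^2] - E[X]^2 loses
// the answer to cancellation, while the deviations (x - mean) = +/-0.5 are
// exactly representable and the two-pass result is exactly 0.25.
#[test]
fn column_variance_large_offset() {
    let a = 1.0e8_f64;
    let b = 1.0e8_f64 + 1.0;
    let m = Matrix2x2::new(a, b, b, a);

    for &v in m.column_variance().iter() {
        assert!((v - 0.25).abs() <= 1.0e-12);
    }
}
```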
