Optimize multiline diff for (some metric of) edit-distance #27

nickolay · 2019-01-14T01:52:52Z

I'm trying to use pretty-assertions for snapshot testing. I compare the snapshotted multiline-string with the new value using the PrettyString trick from #24.

When the actual result got indented (compared to the snapshot), the diff output became very confused:

The reason is that two unrelated lines (Foo's and Bar's lorems) got matched up in the first (line-based) comparison pass, which should instead realize there's a new line Some(, followed by a few modified lines, followed by a new line ).

#[test]
fn test_multiline_diff() {
    assert_eq!(PrettyString(r#"Foo {
    lorem: "Hello World!",
    ipsum: 42,
    Bar {
        lorem: "Hello World!",
    }
}"#), PrettyString(r#"Some(
    Foo {
        lorem: "Hello World!",
        ipsum: 42,
        Bar {
            lorem: "Hello World!",
        }
    }
)"#));
}

The text was updated successfully, but these errors were encountered:

colin-kiegel · 2019-02-15T17:52:55Z

I agree, that the presentation is confusing - despite being a correct solution, it doesn´t seem to be an optimal solution.

You already mentioned, that the current heuristic algorithm is doing two passes and the first is line-based, while the second pass is character-based. Thats the problem: The first pass doesnt look at whitespaces (or any metric of edit-distance)...

I´m open for suggestions about a better algorithm. ^^
(This must have been solved before.. ;-))

cormacrelf · 2019-11-06T05:14:20Z

I deleted a previous comment that was wildly inaccurate.

What we actually want here is

A subjectively better diff algorithm than the difference crate's plain LCS, like Patience or Histogram, which is a faster version of Patience: look through the figures in https://rdcu.be/bWfbo for a sense of how much better it is than Myers, or tell Git to use diff.algorithm = histogram and try it out.
Whitespace-insensitivity in the first/line-based pass. A newtype wrapper over &str that implements ~~(Partial)Ord~~ (Partial)Eq by ignoring consecutive whitespace would do this without infecting the diff algorithm code. Is this desirable for all pretty-assertions, or would it need to be configurable? I would guess it could be applied everywhere. Any absolute differences in the entire debug string, including whitespace, should still cause test failure IMO. A 'no diff output, which indicates changes in whitespace only' message would help.

Edit: it would also need to implement Hash with the same whitespace-ignoring strategy to work with Histogram/Patience. Also that paper's on arXiv https://arxiv.org/abs/1902.02467

colin-kiegel added the help wanted label Feb 15, 2019

colin-kiegel changed the title ~~Multiline diff gets confused by whitespace~~ Optimize multiline diff for (some metric of) edit-distance Feb 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize multiline diff for (some metric of) edit-distance #27

Optimize multiline diff for (some metric of) edit-distance #27

nickolay commented Jan 14, 2019

colin-kiegel commented Feb 15, 2019

cormacrelf commented Nov 6, 2019 •

edited

Optimize multiline diff for (some metric of) edit-distance #27

Optimize multiline diff for (some metric of) edit-distance #27

Comments

nickolay commented Jan 14, 2019

colin-kiegel commented Feb 15, 2019

cormacrelf commented Nov 6, 2019 • edited

cormacrelf commented Nov 6, 2019 •

edited