Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize multiline diff for (some metric of) edit-distance #27

Open
nickolay opened this issue Jan 14, 2019 · 2 comments
Open

Optimize multiline diff for (some metric of) edit-distance #27

nickolay opened this issue Jan 14, 2019 · 2 comments

Comments

@nickolay
Copy link
Contributor

I'm trying to use pretty-assertions for snapshot testing. I compare the snapshotted multiline-string with the new value using the PrettyString trick from #24.

When the actual result got indented (compared to the snapshot), the diff output became very confused:
snap

The reason is that two unrelated lines (Foo's and Bar's lorems) got matched up in the first (line-based) comparison pass, which should instead realize there's a new line Some(, followed by a few modified lines, followed by a new line ).

#[test]
fn test_multiline_diff() {
    assert_eq!(PrettyString(r#"Foo {
    lorem: "Hello World!",
    ipsum: 42,
    Bar {
        lorem: "Hello World!",
    }
}"#), PrettyString(r#"Some(
    Foo {
        lorem: "Hello World!",
        ipsum: 42,
        Bar {
            lorem: "Hello World!",
        }
    }
)"#));
}
@colin-kiegel
Copy link
Collaborator

I agree, that the presentation is confusing - despite being a correct solution, it doesn´t seem to be an optimal solution.

You already mentioned, that the current heuristic algorithm is doing two passes and the first is line-based, while the second pass is character-based. Thats the problem: The first pass doesnt look at whitespaces (or any metric of edit-distance)...

I´m open for suggestions about a better algorithm. ^^
(This must have been solved before.. ;-))

@colin-kiegel colin-kiegel changed the title Multiline diff gets confused by whitespace Optimize multiline diff for (some metric of) edit-distance Feb 21, 2019
@cormacrelf
Copy link

cormacrelf commented Nov 6, 2019

I deleted a previous comment that was wildly inaccurate.

What we actually want here is

  1. A subjectively better diff algorithm than the difference crate's plain LCS, like Patience or Histogram, which is a faster version of Patience: look through the figures in https://rdcu.be/bWfbo for a sense of how much better it is than Myers, or tell Git to use diff.algorithm = histogram and try it out.
  2. Whitespace-insensitivity in the first/line-based pass. A newtype wrapper over &str that implements (Partial)Ord (Partial)Eq by ignoring consecutive whitespace would do this without infecting the diff algorithm code. Is this desirable for all pretty-assertions, or would it need to be configurable? I would guess it could be applied everywhere. Any absolute differences in the entire debug string, including whitespace, should still cause test failure IMO. A 'no diff output, which indicates changes in whitespace only' message would help.

Edit: it would also need to implement Hash with the same whitespace-ignoring strategy to work with Histogram/Patience. Also that paper's on arXiv https://arxiv.org/abs/1902.02467

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants