Unexpected diffArrays result on a basic example #474
Comments
But why is this what you expect? The perspective of the Myers diff algorithm is that the total number of insertions and deletions is the same with either series of edits, and therefore they're equally good. What's the alternative perspective, or intuition about what makes a better or worse diff, that leads to regarding this diff (where I've marked insertions as deletions with
... and this one as incorrect?:
I guess - but it'd be helpful if you thought about it and let me know if you agree - that what's underlying your perspective here is that you're thinking about the transformation between the two texts in terms of substitutions, not just insertions and deletions. (I guess the mention of "Change"ing words is a giveaway!) That is, the way you'd represent the transformation between these texts is something like this (this time using ▲s and ▼s to mark words that get substituted)
If we introduce this idea of a substitution being a single edit (rather than having to be represented as two edits, namely a deletion and an insertion), then your intuition about the right diff here follows automatically; there is suddenly a single optimal diff and it involves three substitutions and no insertions or deletions. You're not the only person to have posted an issue essentially because they expected the diffing algorithm to think in terms of substitutions. #378 is another I closed just yesterday. But jsdiff simply has no concept of a "substitution", because it's based on the Myers diff algorithm in which edits can only be insertions or deletions. So without changing the core algorithm to something other than Myers, or at least adding an option to use an alternative core algorithm, it'd be very difficult to produce the results you want.
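To make the two cost models above concrete, here is a minimal sketch in plain JavaScript. It is not jsdiff code: `indelDistance` and `levenshteinDistance` are hypothetical helper names, and the token arrays are made up for illustration. The first function counts edits when only insertions and deletions are allowed (the quantity a Myers-style diff minimizes); the second also allows a substitution as a single edit.

```javascript
// Hypothetical helpers for illustration only; neither is part of jsdiff.

// Edit count when only insertions and deletions are allowed.
// Equals a.length + b.length - 2 * (length of the longest common subsequence).
function indelDistance(a, b) {
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      lcs[i][j] = a[i - 1] === b[j - 1]
        ? lcs[i - 1][j - 1] + 1
        : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
    }
  }
  return a.length + b.length - 2 * lcs[a.length][b.length];
}

// Edit count when a substitution also counts as one edit (Levenshtein distance).
function levenshteinDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i; // delete everything in a
  for (let j = 0; j <= b.length; j++) d[0][j] = j; // insert everything in b
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,                   // deletion
        d[i][j - 1] + 1,                   // insertion
        d[i - 1][j - 1] + substitutionCost // keep or substitute
      );
    }
  }
  return d[a.length][b.length];
}

// Made-up arrays: three unchanged tokens, three changed, three unchanged.
const oldTokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'];
const newTokens = ['a', 'b', 'c', 'x', 'y', 'z', 'g', 'h', 'i'];

console.log(indelDistance(oldTokens, newTokens));       // 6 (3 deletions + 3 insertions)
console.log(levenshteinDistance(oldTokens, newTokens)); // 3 (3 substitutions)
```

Under the insertion/deletion model, any arrangement of the three deletions and three insertions costs the same six edits, which is why several diffs tie; under the substitution model the three-substitution diff is strictly cheaper than the alternatives.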
This is actually a side effect of something I consider a bug in (you can test this yourself at http://incaseofstairs.com/jsdiff/). I'm planning on fixing this, but in the process, I'll end up breaking the "correct" behaviour you're observing in this example. I think the solution that would make everyone happy is to have an option you can pass to diffing functions that tells them to compute Levenshtein diffs with substitutions instead of Myers diffs. I'm going to open an issue for that.
In the above, does the Myers algorithm produce only the 2nd variant, or does it consider both options and choose the 2nd because they're considered equal? If it's the latter, maybe a tie breaker could be to try and minimize the distance between the insertions and deletions (which would produce the 1st variant). Edit: Oops, switched the 1st and 2nd variant.
The former, I'm afraid. After it's considered an incomplete path through the edit graph (i.e. a start of a diff) that keeps So the tiebreaker approach you consider isn't possible - at least not without some kind of significant change to the algorithm (which, since it would involve considering more possible paths through the edit graph, would unavoidably have at least modest negative performance consequences... and perhaps even really severe ones in some pathological cases, though I'm not sure of that).
Thanks for your detailed response - much appreciated. Some notes below to answer your questions as you think about #475
The expectation comes from the name of the function more than any detailed understanding of the Myers algorithm; I would expect diffArrays() to produce output which shows the difference between two arrays. In the original test case, arrayElements[0..2] and [6..8] are unchanged and so the expectation is that leads to
If one were to reverse the order of the words in the test case: "end to words change to word start to words" then the current implementation would show "end to words" as unchanged - as a human I can trivially see this pattern and expected that diffArrays() would behave similarly. Perhaps in my mind I'm simultaneously processing the arrays both forwards and backwards to identify the changed portion in the middle..
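The "forwards and backwards" intuition described here can be sketched as trimming the common prefix and common suffix before looking at what is left in the middle. This is only an illustration under assumptions: `splitCommonEnds` is a hypothetical helper, not a jsdiff function, the arrays are invented, and the approach only isolates a single changed region.

```javascript
// Hypothetical helper for illustration only; not part of jsdiff.
// Scans forwards for the shared prefix and backwards for the shared suffix,
// then reports whatever remains in the middle as removed/added.
function splitCommonEnds(a, b) {
  const max = Math.min(a.length, b.length);

  let prefix = 0;
  while (prefix < max && a[prefix] === b[prefix]) prefix++;

  // The suffix may not overlap the prefix already claimed.
  let suffix = 0;
  while (
    suffix < max - prefix &&
    a[a.length - 1 - suffix] === b[b.length - 1 - suffix]
  ) {
    suffix++;
  }

  return {
    prefix: a.slice(0, prefix),
    removed: a.slice(prefix, a.length - suffix),
    added: b.slice(prefix, b.length - suffix),
    suffix: a.slice(a.length - suffix),
  };
}

// Invented arrays where the new middle repeats tokens from the suffix.
const before = ['keep', 'to', 'words', 'old1', 'old2', 'old3', 'words', 'to', 'end'];
const after = ['keep', 'to', 'words', 'words', 'to', 'new', 'words', 'to', 'end'];

console.log(splitCommonEnds(before, after));
// prefix:  ['keep', 'to', 'words']
// removed: ['old1', 'old2', 'old3']
// added:   ['words', 'to', 'new']
// suffix:  ['words', 'to', 'end']
```

When the two arrays differ in more than one place, the middle slices would still need a real diff; the trimming step only guarantees that matching ends are reported as unchanged.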
Regardless it sounds like the current jsdiff implementation is behaving as intended and so I will need to look elsewhere for a solution.
In a sense I can see the notion of substitution (or edit/change) being a useful concept but addition/deletion need to be retained for cases where the two array lengths are different. Thanks for your input - I need to revisit my design approach and look for another solution..
Yeah, the Levenshtein algorithm allows each edit to be a single insertion, single deletion, or single substitution, and tries to minimize the number of edits needed. I'm not 100% sure from your description but suspect it's what you need here. I'm not sure what JS libraries are out there that support getting a Levenshtein diff, though (i.e. getting the actual sequence of edits); for some reason lots of Levenshtein algorithm implementations only return the Levenshtein distance, i.e. the number of edits, not the edits themselves. I would be interested to hear about it if you find one.
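For reference, recovering the edits themselves (rather than just the distance) only needs a backtrace through the usual dynamic-programming table. The following is a minimal sketch in plain JavaScript, not an existing library API: `levenshteinEdits` and the shape of the returned objects are assumptions made for this example, and the O(n·m) table makes it suitable only for modest input sizes.

```javascript
// Hypothetical helper for illustration only; not part of jsdiff or any library.
// Returns an edit script of keep / substitute / delete / insert operations
// that transforms array a into array b with the minimum number of edits.
function levenshteinEdits(a, b) {
  const n = a.length;
  const m = b.length;

  // d[i][j] = Levenshtein distance between a[0..i) and b[0..j).
  const d = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = 0; i <= n; i++) d[i][0] = i;
  for (let j = 0; j <= m; j++) d[0][j] = j;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }

  // Walk back from the bottom-right corner, recording which move was taken.
  const edits = [];
  let i = n;
  let j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && a[i - 1] === b[j - 1] && d[i][j] === d[i - 1][j - 1]) {
      edits.push({ op: 'keep', value: a[i - 1] });
      i--; j--;
    } else if (i > 0 && j > 0 && d[i][j] === d[i - 1][j - 1] + 1) {
      edits.push({ op: 'substitute', from: a[i - 1], to: b[j - 1] });
      i--; j--;
    } else if (i > 0 && d[i][j] === d[i - 1][j] + 1) {
      edits.push({ op: 'delete', value: a[i - 1] });
      i--;
    } else {
      edits.push({ op: 'insert', value: b[j - 1] });
      j--;
    }
  }
  return edits.reverse();
}

// Made-up input: one substitution in the middle and one insertion at the end.
const script = levenshteinEdits(['a', 'b', 'c', 'd'], ['a', 'x', 'c', 'd', 'e']);
console.log(script.map((e) => e.op).join(','));
// keep,substitute,keep,keep,insert
```

The backtrace prefers a substitution over a delete/insert pair whenever both are optimal, which is one possible tie-breaking choice among several.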
diffArrays() from diff@5.1.0 reports misleading change data in the following case.
[ ...'words', 'to', ... ]
in the array are reported as unchanged then added, where the expected behaviour is that they should be reported as added then unchanged.
yields unexpected:
expected:
By contrast, the diffWords() implementation produces expected output.
yields expected: