Unexpected diffArrays result on a basic example #474

pbmallinson · 2024-01-10T23:06:57Z

diffArrays() from diff@5.1.0 reports misleading change data in the following case.

specifically, the second and third instances of [ ...'words', 'to', ... ] in the array are reported as unchanged then added, where the expected behaviour is that they should be reported as added then unchanged.

"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
const diff_1 = require("diff");
const a = ["words", "to", "start", "Words", "To", "Change", "words", "to", "end",];
const b = ["words", "to", "start", "words", "to", "change", "words", "to", "end",];
console.log("diffArrays result:", (0, diff_1.diffArrays)(a, b));

yields unexpected:

diffArrays result: [
  { count: 3, value: [ 'words', 'to', 'start' ] },
  {
    count: 3,
    added: undefined,
    removed: true,
    value: [ 'Words', 'To', 'Change' ]
  },
  { count: 2, value: [ 'words', 'to' ] },
  {
    count: 3,
    added: true,
    removed: undefined,
    value: [ 'change', 'words', 'to' ]
  },
  { count: 1, value: [ 'end' ] }
]

expected:

diffArrays result: [
  { count: 3, value: [ 'words', 'to', 'start' ] },
  {
    count: 3,
    added: undefined,
    removed: true,
    value: [ 'Words', 'To', 'Change' ]
  },
  {
    count: 3,
    added: true,
    removed: undefined,
    value: [ 'words', 'to', 'change' ]
  },
  { count: 3, value: [ 'words', 'to', 'end' ] }
]

By contrast, the diffWords() implementation produces expected output.

"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
const diff_1 = require("diff");
const orginalString = "words to start Words To Change words to end";
const updatedString = "words to start words to change words to end";
console.log("diffWords result:", (0, diff_1.diffWords)(orginalString, updatedString));

yields expected:

diffWords result: [
  { count: 6, value: 'words to start ' },
  { count: 1, added: undefined, removed: true, value: 'Words' },
  { count: 1, added: true, removed: undefined, value: 'words' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'To' },
  { count: 1, added: true, removed: undefined, value: 'to' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'Change' },
  { count: 1, added: true, removed: undefined, value: 'change' },
  { count: 6, value: ' words to end' }
]

The text was updated successfully, but these errors were encountered:

ExplodingCabbage · 2024-01-11T11:10:41Z

are reported as unchanged then added, where the expected behaviour is that they should be reported as added then unchanged

But why is this what you expect? The perspective of the Myers diff algorithm is that the total number of insertions and deletions is the same with either series of edits, and therefore they're equally good. What's the alternative perspective, or intuition about what makes a better or worse diff, that leads to regarding this diff (where I've marked insertions as deletions with +++s and ---s) as correct...

["words", "to", "start", "Words", "To", "Change", ++++++++, +++, ++++++++, "words", "to", "end",];
["words", "to", "start", -------, ----, --------, "words", "to", "change", "words", "to", "end",];

... and this one as incorrect?:

["words", "to", "start", "Words", "To", "Change", "words", "to", ++++++++, +++++++, ++++, "end",];
["words", "to", "start", -------, ----, --------, "words", "to", "change", "words", "to", "end",];

I guess - but it'd be helpful if you thought about it and let me know if you agree - that what's underlying your perspective here is that you're thinking about the transformation between the two texts in terms of substitutions, not just insertions and deletions. (I guess the mention of "Change"ing words is a giveaway!) That is, the way you'd represent the transformation between these texts is something like this (this time using ▲s and ▼s to mark words that get substituted)

["words", "to", "start", ▼"Words"▼, ▼"To"▼, ▼"Change"▼, "words", "to", "end",];
["words", "to", "start", ▲"words"▲, ▲"to"▲, ▲"change"▲, "words", "to", "end",];

If we introduce this idea of a substitution being a single edit (rather than having to be represented as two edits, namely a deletion and an insertion), then your intuition about the right diff here follows automatically; there is suddenly a single optimal diff and it involves three substitutions and no insertions or deletions.

You're not the only person to have posted an issue essentially because they expected the diffing algorithm to think in terms of substitutions. #378 is another I closed just yesterday.

But jsdiff simply has no concept of a "substitution", because it's based on the Myers diff algorithm in which edits can only be insertions or deletions. So without changing the core algorithm to something other than Myers, or at least adding an option to use an alternative core algorithm, it'd be very difficult to produce the results you want.

By contrast, the diffWords() implementation produces expected output.

This is actually a side effect of something I consider a bug in diffWords(): it treats the whitespace sequences between words as tokens, which tends to skew diffs towards basically doing substitutions - i.e. insertions and deletions between the same pairs of whitespace tokens. This might sound like exactly what you want, but it's not quite the same as actually supporting treating substitutions as single edits and has consequences that feel clearly wrong like this being the diff between one two three four five six and four five six seven eight nine:

(you can test this yourself at http://incaseofstairs.com/jsdiff/). I'm planning on fixing this, but in the process, I'll end up breaking the "correct" behaviour you're observing in this example.

I think the solution that would make everyone happy is to have an option you can pass to diffing functions that tells them to compute Levenshtein diffs with substitutions instead of Myers diffs. I'm going to open an issue for that.

ehoogeveen-medweb · 2024-01-11T12:00:38Z

In the above, does the Myers algorithm produce only the 2nd variant, or does it consider both options and choose the 2nd because they're considered equal? If it's the latter, maybe a tie breaker could be to try and minimize the distance between the insertions and deletions (which would produce the 1st variant).

Edit: Oops, switched the 1st and 2nd variant.

ExplodingCabbage · 2024-01-11T12:41:57Z

In the above, does the Myers algorithm produce only the 2nd variant, or does it consider both options and choose the 2nd because they're considered equal?

The former, I'm afraid. After it's considered an incomplete path through the edit graph (i.e. a start of a diff) that keeps "words", "to", "start" and deletes "Words", "To", "Change", it immediately decides it should extend that path by preserving the common words "words", "to" and doesn't even consider extending the path in any other way. This is actually one of the key ideas in the Myers algorithm, arguably the key idea - that any time you get the opportunity to continue your diff by keeping some common words instead of doing insertions or deletions, it's absolutely always optimal (or at least joint-optimal, as in this case) to do so, and so you needn't even consider doing an insertion or deletion instead.

So the tiebreaker approach you consider isn't possible - at least not without some kind of significant change to the algorithm (which, since it would involve considering more possible paths through the edit graph, would unavoidably have at least modest negative performance consequences... and perhaps even really severe ones in some pathological cases, though I'm not sure of that).

pbmallinson · 2024-01-13T06:36:16Z

Thanks for your detailed respone - much appreciated.
I'm not familiar with detail of the Myers algorithm but the conclusion seems to be that Myers diff algorithm and therefore jsdiff is not going to work in this use-case.
Is anyone aware of any alternate algorithms/packages which may be worth exploring?

Some notes below to answer your questions as you think about #475

are reported as unchanged then added, where the expected behaviour is that they should be reported as added then unchanged

But why is this what you expect? The perspective of the Myers diff algorithm is that the total number of insertions and deletions is the same with either series of edits, and therefore they're equally good. What's the alternative perspective, or intuition about what makes a better or worse diff, that leads to regarding this diff (where I've marked insertions as deletions with +++s and ---s) as correct...

The expectaions comes from the name of the function more than any detailed understanding of the Myers algorithm; I would expect diffArrays() to produce output which shows the difference between two arrays. In the original test case, arrayElements[0..2] and [6..8] are unchanged and so the expectation is that leads to

["words", "to", "start", "Words", "To", "Change", ++++++++, +++, ++++++++, "words", "to", "end",];
["words", "to", "start", -------, ----, --------, "words", "to", "change", "words", "to", "end",];

If one were to reverse the order of the words in the test case: "end to words change to word start to words" then the current implementation would show "end to words" as unchanged - as a human I can trivially see this pattern and expected that diffArrays() would behave similarly. Perhaps in my mind I'm simultaneously processing the arrays both forwards and backwards to identify the changed portion in the middle..

Sidebar: I just tested the reversed array and (surpisingly?) it produces the expected output of:

diffArrays result: { count: 3, value: [ 'end', 'to', 'words' ] } {
  count: 3,
  added: undefined,
  removed: true,
  value: [ 'Change', 'To', 'Words' ]
} {
  count: 3,
  added: true,
  removed: undefined,
  value: [ 'change', 'to', 'words' ]
} { count: 3, value: [ 'start', 'to', 'words' ] }

"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
const diff_1 = require("diff");
const rorginalStringArray = ["end", "to", "words", "Change", "To", "Words", "start", "to", "words",];
const rupdatedStringArray = ["end", "to", "words", "change", "to", "words", "start", "to", "words",];
console.log("diffArrays result:", ...(0, diff_1.diffArrays)(rorginalStringArray, rupdatedStringArray));

Regardless it sounds like the current jsdiff implementation is behaving as intended and so I will need to look elsewhere for a solution.

I guess - but it'd be helpful if you thought about it and let me know if you agree - that what's underlying your perspective here is that you're thinking about the transformation between the two texts in terms of substitutions, not just insertions and deletions. (I guess the mention of "Change"ing words is a giveaway!) That is, the way you'd represent the transformation between these texts is something like this (this time using ▲s and ▼s to mark words that get substituted)

In a sense I can see the notion of substitution (or edit/change) being a useful concept but addition/deletion need to be retained for cases where the two array lengths are different.
The bigger issue for what I'm currently doing is that the Myers algoritm treats all elemets which pass the equality test as equivalent (in the project I'm working on this is not the case as each array element has associated meta-data). To funciton in the way which is required by my current use case the algorithm output would need to take account of how deletion/insertion impact entity array position and create a change set which minimises cahnges to element sequences from the initial array.

Thanks for your input - I need to revisit my design approach and look for another solution..

ExplodingCabbage · 2024-01-13T09:52:28Z

In a sense I can see the notion of substitution (or edit/change) being a useful concept but addition/deletion need to be retained for cases where the two array lengths are different.

Yeah, the Levenshtein algorithm allows each edit to be a single insertion, single deletion, or single substitution, and tries to minimize the number of edits needed. I'm not 100% sure from your description but suspect it's what you need here. I'm not sure what JS libraries are out there that support getting a Levenshtein diff, though (i.e. getting the actual sequence of edits); for some reason lots of Levenshtein algorithm implementations only return the Levenshtein distance, i.e. the number of edits, not the edits themselves. I would be interested to hear about it if you find one.

ExplodingCabbage closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024

ExplodingCabbage mentioned this issue Jan 11, 2024

Support diffs with substitutions #475

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected diffArrays result on a basic example #474

Unexpected diffArrays result on a basic example #474

pbmallinson commented Jan 10, 2024

ExplodingCabbage commented Jan 11, 2024

ehoogeveen-medweb commented Jan 11, 2024 •

edited

ExplodingCabbage commented Jan 11, 2024 •

edited

pbmallinson commented Jan 13, 2024

ExplodingCabbage commented Jan 13, 2024

Unexpected diffArrays result on a basic example #474

Unexpected diffArrays result on a basic example #474

Comments

pbmallinson commented Jan 10, 2024

ExplodingCabbage commented Jan 11, 2024

ehoogeveen-medweb commented Jan 11, 2024 • edited

ExplodingCabbage commented Jan 11, 2024 • edited

pbmallinson commented Jan 13, 2024

ExplodingCabbage commented Jan 13, 2024

ehoogeveen-medweb commented Jan 11, 2024 •

edited

ExplodingCabbage commented Jan 11, 2024 •

edited