improperly diffing of words #121

dpastoor · 2016-06-10T16:11:34Z

Given original text of:

ggplot(data = sd_oral_richpk, aes(x = Weight)) + 
  geom_histogram(binwidth = 4, color = "black", fill = "grey") + 
  theme_bw() + 
  base_theme()

with a final of:

ggplot(data = sd_oral_richpk, aes(x = Weight)) + 
  geom_histogram(binwidth = 4, color = "black") + 
  labs(
    x="Weight (kg)", 
    y = "Count") + 
  theme_bw() + 
  base_theme()

get the following when diff'ing by words

It should instead just show a red removal for the fill="grey" on line 2, with the labs portion being all green.

Tchanders · 2016-08-31T20:29:20Z

I guess this is just an oddity of the longest common subsequence approach: it matches /= "/ from within /fill = "grey"/ with /= "/ from within /y = "Count"/.
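The match described above can be reproduced with a small sketch. It assumes a tokenizer that splits text into words, whitespace runs, and punctuation runs, which only approximates diffWords, and it uses a plain dynamic-programming longest common subsequence as a stand-in for the diff engine. The names `tokenize` and `longest_common_subsequence` are hypothetical and not part of jsdiff.

```python
import re


def tokenize(text):
    # Assumed tokenization: words, whitespace runs, and punctuation runs.
    # This only approximates diffWords; it is not jsdiff's tokenizer.
    return re.findall(r"\w+|\s+|[^\w\s]+", text)


def longest_common_subsequence(a, b):
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one LCS.
    common, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


removed = tokenize('fill = "grey"')
added = tokenize('y = "Count"')
shared = longest_common_subsequence(removed, added)

# Non-whitespace tokens the two fragments have in common.
print([t for t in shared if not t.isspace()])  # ['=', '"', '"']
```

Because `=` and the quote characters appear in both fragments, any diff that maximises the number of kept tokens is free to treat them as unchanged, which is what splits the removed and added text into interleaved pieces.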

Mingun · 2018-03-10T13:46:15Z

I think that in this case it is better to build the diff in 2 passes: first by lines, and then by words within the changed lines.

ExplodingCabbage · 2023-12-20T11:46:33Z

Figuring out behaviour for diffWords that will make everyone happy is a pain; I've been going through the backlog of issues on jsdiff and diffWords yielding results that people don't expect is something that comes up again and again.

I think I should probably do two things:

1. Document the behaviour of the existing functions more precisely than they're documented right now. We don't give enough detail in the README to let people predict what diff something like diffWords will output without either reading the code or experimenting, and the result is that everyone expects it to just magically do the thing they find intuitive for their particular use case and then feels disappointed when it doesn't. With better docs we can at least move the disappointment earlier in the process, before potential users have wasted effort.
2. Think about what some good general tactics are for doing diffs of code, and add docs or functions for this. Possible tactics to think about:
- line diff first, then per-line word or character diffs, as @Mingun suggests above. (Very difficult for an end user to implement using jsdiff right now; the hard part is figuring out which non-identical lines to treat as slightly-changed versions of each other and diff.)
- run the code through a language-specific tokenizer and diff the arrays of tokens. (I suspect this will give the best results in cases where it works, but may simply not be possible in some cases due to the language being unknown or due to one or both of the texts to be diffed containing syntax errors and the tokenizer not having any kind of mode where it forgives syntax errors and keeps tokenizing.)
- some general strategy for handling words and punctuation that differs slightly from diffWords, e.g.
  - treat every punctuation character as its own element in the array of elements to diff, instead of joining consecutive punctuation characters into a single element. Or...
  - treat inserts/deletions of punctuation as having lower edit cost than inserts/deletions of words. Or...
  - merge each word into a single array element with its adjacent punctuation, but treat replacing an element with an element that differs from it only by punctuation as having a low edit cost
    (It's non-obvious to me whether any of these would actually be good ideas! The point is just that there's a vast number of ways to tinker with diffWords and maybe some of them are better for code; it requires some careful thought to decide one way or another.)
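As a rough illustration of the first tactic (line diff first, then word diffs within lines), here is a minimal sketch. It uses Python's `difflib.SequenceMatcher`, which is not the Myers algorithm and not jsdiff, and it pairs changed lines with a naive greedy similarity test. The names `words`, `word_diff`, `pair_block`, `two_pass_diff` and the `cutoff` of 0.6 are all hypothetical choices made for this example.

```python
import re
from difflib import SequenceMatcher


def words(line):
    # Assumed tokenization: words, whitespace runs, punctuation runs.
    return re.findall(r"\w+|\s+|[^\w\s]+", line)


def word_diff(old_line, new_line):
    # Second pass: diff the tokens of two lines judged to be related.
    a, b = words(old_line), words(new_line)
    ops = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("equal", "".join(a[i1:i2])))
            continue
        if i1 < i2:
            ops.append(("delete", "".join(a[i1:i2])))
        if j1 < j2:
            ops.append(("insert", "".join(b[j1:j2])))
    return ops


def pair_block(old_block, new_block, cutoff):
    # Naive pairing: each old line takes the first remaining new line
    # whose token similarity reaches the cutoff; order is preserved.
    out, j = [], 0
    for old_line in old_block:
        match = next(
            (k for k in range(j, len(new_block))
             if SequenceMatcher(None, words(old_line), words(new_block[k]),
                                autojunk=False).ratio() >= cutoff),
            None,
        )
        if match is None:
            out.append(("delete", old_line))
        else:
            out += [("insert", line) for line in new_block[j:match]]
            out.append(("change", word_diff(old_line, new_block[match])))
            j = match + 1
    out += [("insert", line) for line in new_block[j:]]
    return out


def two_pass_diff(old, new, cutoff=0.6):
    # First pass: diff whole lines; only "replace" blocks get a word diff.
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out += [("equal", line) for line in old_lines[i1:i2]]
        elif tag == "replace":
            out += pair_block(old_lines[i1:i2], new_lines[j1:j2], cutoff)
        else:
            out += [("delete", line) for line in old_lines[i1:i2]]
            out += [("insert", line) for line in new_lines[j1:j2]]
    return out


# The example from this issue, with trailing spaces trimmed.
OLD = "\n".join([
    'ggplot(data = sd_oral_richpk, aes(x = Weight)) +',
    '  geom_histogram(binwidth = 4, color = "black", fill = "grey") +',
    '  theme_bw() +',
    '  base_theme()',
])
NEW = "\n".join([
    'ggplot(data = sd_oral_richpk, aes(x = Weight)) +',
    '  geom_histogram(binwidth = 4, color = "black") +',
    '  labs(',
    '    x="Weight (kg)",',
    '    y = "Count") +',
    '  theme_bw() +',
    '  base_theme()',
])

result = two_pass_diff(OLD, NEW)
for entry in result:
    print(entry)
```

On this input the sketch pairs the two `geom_histogram` lines, reports `", fill = "grey` as the only deletion inside that line, and marks the three `labs(...)` lines as whole-line insertions. The greedy pairing in `pair_block` is the fragile part: it is only one of many possible heuristics for deciding which changed lines belong together, and a different cutoff or pairing rule can give quite different results.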

I don't think (contra the framing in the issue here) that diffWords is doing anything wrong in this example, per se. It implements a simple algorithm (approximately: split each text into an alternating array of [word, run of punctuation, word, run of punctuation, ...] and then diff the two arrays against each other using the Myers diff algorithm) that typically gives reasonable results for human language text and that jsdiff never promised would give good diffs for code.

Mingun · 2023-12-20T13:30:53Z

> run the code through a language-specific tokenizer and diff the arrays of tokens

I think that this is the best strategy; the current behavior could then serve as the tokenizer for the "unknown" language. Actually, jsdiff already works that way; it's just that its tokenizer tries to fit everyone's needs, which can be good in some situations and bad in others.
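A minimal sketch of that idea, assuming a registry of per-language tokenizers with a generic fallback. Everything here is hypothetical: `generic_tokens`, `r_like_tokens`, `TOKENIZERS` and `tokens_for` are names invented for the example, and `r_like_tokens` is a crude regex that keeps double-quoted strings whole, not a real R lexer (it ignores escapes, single quotes, and comments).

```python
import re


def generic_tokens(text):
    # Fallback for an unknown language: words, whitespace runs,
    # and punctuation runs (an approximation of word-style diffing).
    return re.findall(r"\w+|\s+|[^\w\s]+", text)


def r_like_tokens(text):
    # Crude language-aware variant: a double-quoted string is one token
    # and every other punctuation character is its own token.
    return re.findall(r'"[^"]*"|\w+|\s+|[^\w\s]', text)


# Hypothetical registry keyed by language name.
TOKENIZERS = {"r": r_like_tokens}


def tokens_for(text, language=None):
    # Unknown or missing languages fall back to the generic tokenizer.
    return TOKENIZERS.get(language, generic_tokens)(text)


print(tokens_for('fill = "grey"', "r"))
print(tokens_for('fill = "grey"'))
```

With the language-aware tokenizer, `"grey"` and `"Count"` become single tokens, so the quote characters can no longer be matched against each other on their own; `=` can still match, so tokenization alone does not remove every surprising alignment.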

ExplodingCabbage added the diffWords behaviour label Dec 20, 2023
