Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improperly diffing of words #121

Open
dpastoor opened this issue Jun 10, 2016 · 4 comments
Open

improperly diffing of words #121

dpastoor opened this issue Jun 10, 2016 · 4 comments

Comments

@dpastoor
Copy link

Given original text of:

ggplot(data = sd_oral_richpk, aes(x = Weight)) + 
  geom_histogram(binwidth = 4, color = "black", fill = "grey") + 
  theme_bw() + 
  base_theme()

with a final of:

ggplot(data = sd_oral_richpk, aes(x = Weight)) + 
  geom_histogram(binwidth = 4, color = "black") + 
  labs(
    x="Weight (kg)", 
    y = "Count") + 
  theme_bw() + 
  base_theme()

get the following when diff'ing by words

image

it should instead just have red removal for the fill="grey" on line 2, with the labs portion being all green

@Tchanders
Copy link

I guess this is just an oddity of the longest common sub-string approach: it matches /= "/ from within /fill = "grey"/ with /= "/ from within /y = "Count"/

@Mingun
Copy link
Contributor

Mingun commented Mar 10, 2018

I think that in this case it is better to build distinctions in 2 passes -- at first in the lines, and then in lines

@ExplodingCabbage
Copy link
Collaborator

Figuring out behaviour for diffWords that will make everyone happy is a pain; I've been going through the backlog of issues on jsdiff and diffWords yielding results that people don't expect is something that comes up again and again.

I think I should probably do two things:

  • document the behaviour of the existing functions more precisely than they're documented right now. We don't give enough detail in the README to let people predict what diff something like diffWords will output without either reading the code or experimenting, and the result is that everyone expects it to just magically do the thing they find intuitive for their particular use case and then feels disappointed when it doesn't. With better docs we can at least move the disappointment earlier in the process, before potential users have wasted effort.
  • think about what some good general tactics are for doing diffs of code, and add docs or functions for this. Possible tactics to think about:
    • line diff first, then per-line word or character diffs, as @Mingun suggests above. (Very difficult for an end user to implement using jsdiff right now; the hard part is figuring out which non-identical lines to treat as slightly-changed versions of each other and diff.)
    • run the code through a language-specific tokenizer and diff the arrays of tokens. (I suspect this will give the best results in cases where it works, but may simply not be possible in some cases due to the language being unknown or due to one or both of the texts to be diffed containing syntax errors and the tokenizer not having any kind of mode where it forgives syntax errors and keeps tokenizing.)
    • some slightly different general strategy for handling words and punctuation that slightly differs from diffWords, e.g.
      • treat every punctuation character as its own element in the array of elements to diff, instead of joining consecutive punctuation characters into a single element. Or...
      • treat inserts/deletions of punctuation as having lower edit cost than inserts/deletions of words. Or...
      • merge each word into a single array element with its adjacent punctuation, but treat replacing an element with an element that differs from it only by punctuation as having a low edit cost
        (It's non-obvious to me whether any of these would actually be good ideas! The point is just that there's a vast number of ways to tinker with diffWords and maybe some of them are better for code; it requires some careful thought to decide one way or another.)

I don't think (contra the framing in the issue here) that diffWords is doing anything wrong in this example, per se. It implements a simple algorithm (approximately: split each text into an alternating array of [word, run of punctuation, word, run of punctuation, ...] and then diff the two arrays against each other using the Myers diff algorithm) that typically gives reasonable results for human language text and that jsdiff never promised would give good diffs for code.

@Mingun
Copy link
Contributor

Mingun commented Dec 20, 2023

run the code through a language-specific tokenizer and diff the arrays of tokens

I think that this is the best strategy; current behavior then could be the "unknown" language tokenizer. Actually, jsdiff is working that way, just the tokenizer tries to fit of everyone needs, which could be good in one situations and bad in other.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants