Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML in text handling error? #373

Closed
designosis opened this issue May 28, 2022 · 2 comments
Closed

HTML in text handling error? #373

designosis opened this issue May 28, 2022 · 2 comments

Comments

@designosis
Copy link

designosis commented May 28, 2022

In the demo, with the text <p>Guess what?</p> in the first field, and a space after the ? in the second ...

Screen Shot 2022-05-27 at 10 53 52 PM

It seems to think the </ has changed as well? Perhaps it groups all non-alphanumerics? I'd suggest adding a parameter that gives tags special treatment.

Edit: I guess this isn't designed to handle HTML tags at all :) Is there a way to render existing tags (such as <p></p> or <h1></h1>)?

@ExplodingCabbage
Copy link
Collaborator

Perhaps it groups all non-alphanumerics?

Yeah, essentially this. diffWords basically splits the text into an array of tokens where each token can be:

  • a word, or
  • a newline, or
  • a run of non-newline whitespace characters, or
  • a run of punctuation characters

and then diffs the sequence of tokens. So without the space, '?</' is one token, and with the space, '?', ' ', '</' are three distinct tokens.

@ExplodingCabbage
Copy link
Collaborator

More thoughts:

I'm hoping to significantly rework the diffWords behaviour soon in the next breaking-changes release of jsdiff, but tbh I don't think I'll fix this. diffWords is intended for natural language text, not HTML or other code. To do some kind of semantics-aware diff of HTML, you'll want to tokenize the text yourself. e.g. perhaps with the help of a HTML parser you could construct an array where each token is one of:

  • a doctype declaration
  • a tag
  • a comment
  • an attribute name+value
  • the text content of an element

Then you could diff this with diffArrays, perhaps using a comparator that ignores semantically-irrelevant whitespace changes. (I've recently added a brief section in the README discussing this approach.)

The crux from my perspective is, though, a tokenization approach optimised for meaningfully diffing HTML is likely to look very different for one meant for diffing natural language text, because the syntax of HTML vs ordinary text is meaningfully different, and so I wouldn't even really want to add options to diffWords to try to optimize it for diffing code; I'd expect it to still do a poor job and end up confusingly complicated in the process.

It's fine, of course, to ignore this and use diffWords as a quick-and-dirty way of diffing code if warts like the one shown in this issue aren't important to you. I just think this approach should be seen as a quick-and-dirty hack, and that users shouldn't expect diffWords to magically work nicely on both prose and code. Right now it kind of sucks for both (see #311, #214, #436 for some of the ways it behaves horribly even on prose), and I'd like to write off using it for diffing code as something we don't support so I have more freedom to try to fix those issues and optimise it for diffing prose.

For that reason, although this issue is not unreasonable, I'm gonna close it as "Won't fix".

@ExplodingCabbage ExplodingCabbage closed this as not planned Won't fix, can't repro, duplicate, stale Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants