If I want to do text differencing to see which Chinese characters were added/removed, would jsdiff work with that? What about other languages with combining characters and such, like Devanagari or Hebrew? I am not super versed in how "text diff" algorithms work, but I imagine it might be very English-/Latin-centric. Is that the case? Or does it work for any other language? If not, what is the general approach to work with other languages like Chinese? Thank you for the help!
diffChars splits texts into UTF-16 code units (JavaScript strings are effectively sequences of UTF-16 code units; e.g. '𢈘'.length is 2) and diffs the resulting sequences of code units. This works fine for any text where every character is represented by a single UTF-16 code unit (e.g. English), but badly for CJK text, where some characters, like 𢈘, are represented by a "surrogate pair" of two UTF-16 code units.
PRs #395 and #461 both aim to fix this, though neither is merged yet.
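To make the code-unit problem concrete, here is a minimal sketch in plain JavaScript (no jsdiff involved). The helper name `toCodePoints` is made up for illustration; it relies on the fact that string iteration walks code points rather than UTF-16 code units.

```javascript
// '𢈘' lies outside the Basic Multilingual Plane, so it occupies two
// UTF-16 code units (a surrogate pair) in a JavaScript string.
const ch = '𢈘';
console.log(ch.length);            // 2
console.log(ch.split('').length);  // 2 (split('') cuts between the surrogates)

// Hypothetical helper: split by code point instead of by code unit.
// Array.from uses the string iterator, which keeps surrogate pairs together.
function toCodePoints(text) {
  return Array.from(text);
}

console.log(toCodePoints(ch).length);      // 1
console.log(toCodePoints('a𢈘b').length);  // 3
```

Note that splitting by code point still does not keep a base character together with its combining marks; that would need grapheme-level segmentation.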
diffWords is currently a mess, and multilingual support is one of its weakest points. Roughly speaking, diffWords attempts to split text into an array of tokens, where each token is either a "word", a run of whitespace, or a run of punctuation/special characters, and then diffs this sequence of tokens. But this tokenization logic is extremely Latin-centric: the tokenizer treats all non-Latin characters as special characters / punctuation. There is also a bug in the handling of accents, so even non-English European languages are only dubiously supported. And there is no support at all for languages where words are not separated by spaces, like Chinese.
There are a whole bunch of issues/PRs about this that you may want to track if you're interested in seeing when support gets added:
> what is the general approach to work with other languages like Chinese?
Right now, I would suggest tokenizing into words using an Intl.Segmenter and then diffing using diffArrays. But perhaps I'll improve things soon and there'll be a nicer option using diffWords.
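A rough sketch of that suggestion follows, assuming a runtime that ships Intl.Segmenter (recent Node.js versions and modern browsers). The helper name `segmentWords` and the sample sentences are made up for illustration, and the exact word boundaries depend on the runtime's ICU data, so no specific tokenization is shown for the Chinese text.

```javascript
// Hypothetical helper (not part of jsdiff): split text into word-level
// tokens using the built-in Intl.Segmenter.
function segmentWords(text, locale = 'zh') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // Whitespace and punctuation come back as their own segments, so
  // joining the tokens reproduces the original text.
  return Array.from(segmenter.segment(text), (item) => item.segment);
}

const oldTokens = segmentWords('我喜欢苹果');
const newTokens = segmentWords('我喜欢香蕉');
console.log(oldTokens, newTokens);

// With jsdiff installed (npm install diff), the token arrays can be diffed:
//   const { diffArrays } = require('diff');
//   const changes = diffArrays(oldTokens, newTokens);
//   // each change carries `value` (an array of tokens) and may be
//   // flagged `added` or `removed`
```

Passing `{ granularity: 'grapheme' }` instead should give character-level tokens that keep combining marks attached to their base characters, which may suit scripts like Devanagari or Hebrew better than splitting on code units.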