
Does jsdiff work with Chinese (or other non-English script languages)? #377

Closed
lancejpollard opened this issue Jul 1, 2022 · 1 comment

@lancejpollard

If I want to do text differencing to see which Chinese characters were added/removed, would jsdiff work with that? What about other languages with combining characters and such, like Devanagari or Hebrew? I am not super versed in how "text diff" algorithms work, but I imagine it might be very English-/Latin-centric. Is that the case? Or does it work for any other language? If not, what is the general approach to work with other languages like Chinese? Thank you for the help!

@ExplodingCabbage (Collaborator) commented Jan 10, 2024

Currently, jsdiff is English-centric.

diffChars splits texts into UTF-16 code units (since JavaScript strings are, in effect, arrays of UTF-16 code units; e.g. '𢈘'.length is 2) and diffs the sequences of code units. This works fine for any language where every character is a single UTF-16 code unit (e.g. English), but badly for CJK text, where some characters, like 𢈘, are represented by a "surrogate pair" of two UTF-16 code units.

PRs #395 and #461 both aim to fix this, though neither is merged yet.
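For concreteness, here is a minimal sketch of the code-unit problem plus one interim workaround (diffing by code point via diffArrays). Treat the calls below as probes rather than documented output; the exact change objects depend on your jsdiff version:

```js
const Diff = require('diff');

// '𢈘' lies outside the Basic Multilingual Plane, so a JavaScript
// string stores it as a surrogate pair of two UTF-16 code units.
console.log('𢈘'.length);      // 2
console.log([...'𢈘'].length); // 1 (string iteration yields code points)

// diffChars compares code units, so an edit that touches a surrogate
// pair can yield change objects containing lone (invalid) surrogates.
console.log(Diff.diffChars('𢈘肉', '鹿肉'));

// Interim workaround: split into code points with Array.from and diff
// the arrays, so surrogate pairs are never broken apart.
console.log(Diff.diffArrays(Array.from('𢈘肉'), Array.from('鹿肉')));
```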

diffWords is currently a mess, and poor multilingual support is one facet of that. Roughly speaking, diffWords attempts to split text into an array of tokens, where each token is either a "word", a run of whitespace, or a run of punctuation/special characters, and then diffs this sequence of tokens. But this tokenization logic is extremely Latin-centric: all non-Latin characters are treated as special characters / punctuation by the tokenizer. There is also a bug in the handling of accents, so even non-English European languages are only dubiously supported, and there is no support at all for languages that do not separate words with spaces, like Chinese.
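You can probe the tokenizer's Latin bias yourself by comparing diffWords on English and on Chinese input. This is a probe, not a specification of the output, which varies by version:

```js
const Diff = require('diff');

// The English diff is word-by-word, as intended...
console.log(Diff.diffWords('I like apples', 'I like bananas'));

// ...but the Chinese text contains no Latin word characters at all, so
// it falls entirely into the tokenizer's punctuation/special-character
// bucket and the resulting diff is not word-by-word.
console.log(Diff.diffWords('我喜欢吃苹果', '我喜欢吃香蕉'));
```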

There are a whole bunch of issues/PRs about this that you may want to track if you're interested in seeing when support gets added.

> what is the general approach to work with other languages like Chinese?

Right now, I would suggest tokenizing into words using an Intl.Segmenter and then diffing using diffArrays. But perhaps I'll improve things soon and there'll be a nicer option using diffWords.
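A minimal sketch of that approach (the locale, word granularity, and sample strings are illustrative choices, not anything jsdiff prescribes):

```js
const Diff = require('diff');

// Intl.Segmenter (built into modern JS engines) knows word boundaries
// for languages like Chinese that don't separate words with spaces.
function toWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text), s => s.segment);
}

const before = toWords('我喜欢吃苹果', 'zh');
const after = toWords('我喜欢吃香蕉', 'zh');

// diffArrays compares the token sequences; each change object carries
// an array of tokens in `value`, plus `added`/`removed` flags.
for (const part of Diff.diffArrays(before, after)) {
  const tag = part.added ? '+' : part.removed ? '-' : ' ';
  console.log(tag, part.value.join(''));
}
```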
