
Does jsdiff work with Chinese (or other non-English script languages)? #377

Closed
lancejpollard opened this issue Jul 1, 2022 · 1 comment

@lancejpollard

If I want to do text differencing to see which Chinese characters were added/removed, would jsdiff work with that? What about other languages with combining characters and such, like Devanagari or Hebrew? I am not super versed in how "text diff" algorithms work, but I imagine it might be very English-/Latin-centric. Is that the case? Or does it work for any other language? If not, what is the general approach to work with other languages like Chinese? Thank you for the help!

@ExplodingCabbage (Collaborator) commented Jan 10, 2024

Currently, jsdiff is English-centric.

diffChars splits texts into UTF-16 code units (since JavaScript strings are, in effect, arrays of UTF-16 code units; e.g. '𢈘'.length is 2) and diffs the sequences of code units. This works fine for any language where every character is a single UTF-16 code unit (e.g. English), but badly for CJK text, where some characters, like 𢈘, are represented by a "surrogate pair" of two UTF-16 code units.

PRs #395 and #461 both aim to fix this, though neither is merged yet.
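For concreteness, here is a minimal sketch of the code-unit problem plus one interim workaround (diffing by code point via diffArrays). Treat the calls below as probes rather than documented output; the exact change objects depend on your jsdiff version:

```js
const Diff = require('diff');

// '𢈘' lies outside the Basic Multilingual Plane, so a JavaScript
// string stores it as a surrogate pair of two UTF-16 code units.
console.log('𢈘'.length);      // 2
console.log([...'𢈘'].length); // 1 (string iteration yields code points)

// diffChars compares code units, so an edit that touches a surrogate
// pair can yield change objects containing lone (invalid) surrogates.
console.log(Diff.diffChars('𢈘肉', '鹿肉'));

// Interim workaround: split into code points with Array.from and diff
// the arrays, so surrogate pairs are never broken apart.
console.log(Diff.diffArrays(Array.from('𢈘肉'), Array.from('鹿肉')));
```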

diffWords is currently a mess, and poor multilingual support is one facet of that. Roughly speaking, diffWords attempts to split text into an array of tokens, where each token is either a "word", a run of whitespace, or a run of punctuation/special characters, and then diffs this sequence of tokens. But this tokenization logic is extremely Latin-centric: all non-Latin characters are treated as special characters / punctuation by the tokenizer. There is also a bug in the handling of accents, so even non-English European languages are only dubiously supported, and there is no support at all for languages that do not separate words with spaces, like Chinese.
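You can probe the tokenizer's Latin bias yourself by comparing diffWords on English and on Chinese input. This is a probe, not a specification of the output, which varies by version:

```js
const Diff = require('diff');

// The English diff is word-by-word, as intended...
console.log(Diff.diffWords('I like apples', 'I like bananas'));

// ...but the Chinese text contains no Latin word characters at all, so
// it falls entirely into the tokenizer's punctuation/special-character
// bucket and the resulting diff is not word-by-word.
console.log(Diff.diffWords('我喜欢吃苹果', '我喜欢吃香蕉'));
```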

There are a whole bunch of issues/PRs about this that you may want to track if you're interested in seeing when support gets added.

> what is the general approach to work with other languages like Chinese?

Right now, I would suggest tokenizing into words using an Intl.Segmenter and then diffing using diffArrays. But perhaps I'll improve things soon and there'll be a nicer option using diffWords.
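A minimal sketch of that approach (the locale, word granularity, and sample strings are illustrative choices, not anything jsdiff prescribes):

```js
const Diff = require('diff');

// Intl.Segmenter (built into modern JS engines) knows word boundaries
// for languages like Chinese that don't separate words with spaces.
function toWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(segmenter.segment(text), s => s.segment);
}

const before = toWords('我喜欢吃苹果', 'zh');
const after = toWords('我喜欢吃香蕉', 'zh');

// diffArrays compares the token sequences; each change object carries
// an array of tokens in `value`, plus `added`/`removed` flags.
for (const part of Diff.diffArrays(before, after)) {
  const tag = part.added ? '+' : part.removed ? '-' : ' ';
  console.log(tag, part.value.join(''));
}
```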
