Improve word tokenization for non-Latin characters #328

Open · wants to merge 1 commit into master

Conversation

jihunleekr (Author):

Diff.diffWords is not working on non-Latin characters like Korean.

let charsCannotBecomeWord = '\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E'; // Basic Latin
charsCannotBecomeWord += '\u00A0-\u00BF\u00D7\u00F7'; // Latin-1 Supplement
charsCannotBecomeWord += '\u02B9-\u02DD\u02E5-\u02FF'; // Spacing Modifier Letters
charsCannotBecomeWord += '\u0300-\u036F'; // Combining Diacritical Marks
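
For context, a hedged sketch of how ranges like these might be assembled into the regular expressions referenced in the tokenizer code later in this thread (the actual construction is not shown in this excerpt):

// Sketch only, assuming charsCannotBecomeWord holds the concatenated ranges above.
const cannotBecomeWordRegExp = new RegExp('[' + charsCannotBecomeWord + ']');
// spaceRegExp is assumed to match a single whitespace character.
const spaceRegExp = /\s/;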

Collaborator:

Hmm... this can't be right. Treating combining diacritics as word breaks is definitely wrong for European languages that I am familiar with (and I would've thought that it was wrong universally), since it leads to outcomes like this:

> a = 'festou\u0300'
'festoù'
> b = 'impube\u0300re'
'impubère'
> diff.diffWords(a, b)
[
  { count: 1, added: undefined, removed: true, value: 'festou' },
  { count: 1, added: true, removed: undefined, value: 'impube' },
  { count: 1, value: '̀' },
  { count: 1, added: true, removed: undefined, value: 're' }
]

That diff ought to just show the one word in the old text being deleted and a totally new word from the new text being added, but instead it shows us preserving the code point for the grave accent in the middle of the word and adding and deleting stuff around the accent. That's nonsense from the perspective of French or any other language I know with accents; the è (an e with a grave accent) in impubère is simply a letter in the word, just like the b and r next to it, and doesn't semantically represent a word break.
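
In other words, a sketch of the result one would expect here (illustrative, not actual library output):

// Expected shape: the whole old word removed and the whole new word added,
// with nothing preserved in between.
[
  { count: 1, added: undefined, removed: true, value: 'festoù' },
  { count: 1, added: true, removed: undefined, value: 'impubère' }
]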

Do combining diacritics work differently in Korean (... if they even exist in Korean?) or some other language you're familiar with such that it makes sense to treat them as word breaks? I am afraid I don't speak any non-European languages and am a bit out of my depth! I can imagine in theory a language where accents get used as punctuation, but I have never encountered such a thing...

jihunleekr (Author):

I hadn't considered such cases due to my limited experience with other languages. It seems like an aspect that requires further thought.

Comment on lines +29 to +49
const tokens = [];
let prevCharType = '';
for (let i = 0; i < value.length; i++) {
  const char = value[i];
  if (spaceRegExp.test(char)) {
    // Whitespace: extend the previous space token or start a new one.
    if (prevCharType === 'space') {
      tokens[tokens.length - 1] += ' ';
    } else {
      tokens.push(' ');
    }
    prevCharType = 'space';
  } else if (cannotBecomeWordRegExp.test(char)) {
    // Punctuation-like characters become single-character tokens.
    tokens.push(char);
    prevCharType = '';
  } else {
    // Word characters are appended to the current word token.
    if (prevCharType === 'word') {
      tokens[tokens.length - 1] += char;
    } else {
      tokens.push(char);
    }
    prevCharType = 'word';

Collaborator:

Why do we need logic changes here? It's not obvious to me why the fix for Korean text should involve anything more than just treating Korean letters as letters instead of word boundaries...
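
For illustration, a minimal sketch of that narrower approach, assuming the tokenizer keeps a list of characters that can form words; the identifier names and the specific Hangul ranges below are illustrative, not taken from this PR:

// Sketch only: treat Hangul as word characters rather than as word breaks,
// leaving the rest of the tokenizer untouched.
const wordChars = 'a-zA-Z0-9_' +
  '\u1100-\u11FF' +   // Hangul Jamo
  '\u3130-\u318F' +   // Hangul Compatibility Jamo
  '\uAC00-\uD7AF';    // Hangul Syllables
const wordCharRegExp = new RegExp('[' + wordChars + ']');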

jihunleekr (Author):

It seems like this is also an excessive change. My intention was to separate characters that should exist as single characters into individual tokens. (For example: U+0021, U+0022, ...)

const char = value[i];
if (spaceRegExp.test(char)) {
  if (prevCharType === 'space') {
    tokens[tokens.length - 1] += ' ';

Collaborator:

In the edge case where there's a long run of spaces in the text somewhere, this is going to take O(n^2) time where n is the number of consecutive spaces.

It also rewrites all space characters of any kind to an ordinary ' '.

Both these aspects of the behaviour here seem wrong!
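
A hedged sketch of one way to address both points: consume each whitespace run in a single step and keep the original characters (spaceRegExp and the surrounding loop are assumed to be as in the excerpt above).

// Sketch only: replacement for the per-character space handling shown above.
if (spaceRegExp.test(char)) {
  // Scan ahead to the end of the whitespace run...
  let j = i + 1;
  while (j < value.length && spaceRegExp.test(value[j])) {
    j++;
  }
  // ...and push the run as a single token, preserving tabs, newlines, etc.
  tokens.push(value.slice(i, j));
  i = j - 1;
  prevCharType = 'space';
}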

jihunleekr (Author):

I agree. It seems like an excessive change.

ExplodingCabbage (Collaborator):

If I understand right, the problem is that right now we treat all CJK characters as if they were punctuation marks / word breaks, and the fix here treats them as letters instead. But:

  • the fix here also messes with combining diacritics in ways that seem to me to break existing working behaviour for languages with accents
  • the fix also changes other aspects of the logic of tokenize beyond which characters are treated as letters vs word breaks, and I can't figure out why
  • the fix doesn't help us with Japanese or Chinese, since those languages don't use spaces and need a fundamentally different tokenization algorithm, like the one provided by Intl.Segmenter (a sketch follows below). That doesn't by itself make this a bad idea, but it makes me wonder whether we ought to be making a more radical change...

I'll come back to this in due course. Would love to get your input in the meantime, @jihunleekr, but I understand if in the 2 years since you opened this PR you've lost interest!
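
For reference, a hedged sketch of what Intl.Segmenter-based word tokenization looks like (not part of this PR; availability depends on the JavaScript runtime):

// Sketch only: dictionary-based word segmentation for text written without spaces.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const tokens = Array.from(segmenter.segment('これはテストです'), s => s.segment);
// tokens comes out as something like ['これ', 'は', 'テスト', 'です']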
