Improve word tokenization for non-Latin characters #328

Open · wants to merge 1 commit into base: master
66 changes: 34 additions & 32 deletions src/diff/word.js
@@ -1,25 +1,19 @@
import Diff from './base';
import {generateOptions} from '../util/params';

// Based on https://en.wikipedia.org/wiki/Latin_script_in_Unicode
//
// Ranges and exceptions:
// Latin-1 Supplement, 0080–00FF
// - U+00D7 × Multiplication sign
// - U+00F7 ÷ Division sign
// Latin Extended-A, 0100–017F
// Latin Extended-B, 0180–024F
// IPA Extensions, 0250–02AF
// Spacing Modifier Letters, 02B0–02FF
// - U+02C7 ˇ Caron
// - U+02D8 ˘ Breve
// - U+02D9 ˙ Dot Above
// - U+02DA ˚ Ring Above
// - U+02DB ˛ Ogonek
// - U+02DC ˜ Small Tilde
// - U+02DD ˝ Double Acute Accent
// Latin Extended Additional, 1E00–1EFF
const extendedWordChars = /^[a-zA-Z\u{C0}-\u{FF}\u{D8}-\u{F6}\u{F8}-\u{2C6}\u{2C8}-\u{2D7}\u{2DE}-\u{2FF}\u{1E00}-\u{1EFF}]+$/u;
const spaceChars = ' \f\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff';
let charsCannotBecomeWord = '';
charsCannotBecomeWord += '\n\r';
charsCannotBecomeWord +=
'\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E'; // Basic Latin
charsCannotBecomeWord += '\u00A0-\u00BF\u00D7\u00F7'; // Latin-1 Supplement
charsCannotBecomeWord += '\u02B9-\u02DD\u02E5-\u02FF'; // Spacing Modifier Letters
charsCannotBecomeWord += '\u0300-\u036F'; // Combining Diacritical Marks
Collaborator:

Hmm... this can't be right. Treating combining diacritics as word breaks is definitely wrong for European languages that I am familiar with (and I would've thought that it was wrong universally), since it leads to outcomes like this:

> a = 'festou\u0300'
'festoù'
> b = 'impube\u0300re'
'impubère'
> diff.diffWords(a, b)
[
  { count: 1, added: undefined, removed: true, value: 'festou' },
  { count: 1, added: true, removed: undefined, value: 'impube' },
  { count: 1, value: '̀' },
  { count: 1, added: true, removed: undefined, value: 're' }
]

That diff ought to just show the one word in the old text being deleted and a totally new word from the new text being added, but instead it shows us preserving the code point for the grave accent in the middle of the word and adding and deleting stuff around the accent. That's nonsense from the perspective of French or any other language I know with accents; the è (an e with a grave accent) in impubère is simply a letter in the word, just like the b and r next to it, and doesn't semantically represent a word break.

Do combining diacritics work differently in Korean (... if they even exist in Korean?) or some other language you're familiar with such that it makes sense to treat them as word breaks? I am afraid I don't speak any non-European languages and am a bit out of my depth! I can imagine in theory a language where accents get used as punctuation, but I have never encountered such a thing...

Author:

I hadn't considered such cases due to my limited experience with other languages. It seems like an aspect that requires further thought.
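As a side note not drawn from the thread itself: JavaScript's built-in Unicode support already treats U+0300 as a combining mark that belongs to the preceding letter, which backs up the reviewer's point that such characters are word-internal. A minimal illustration (not part of this PR):

```javascript
// U+0300 is a combining grave accent; NFC normalization folds it into the
// base letter, producing the precomposed 'ù' (U+00F9). A combining mark is
// part of the word it follows, not a boundary between words.
const a = 'festou\u0300';

console.log(a.normalize('NFC'));      // 'festoù'
console.log(/\p{M}/u.test('\u0300')); // true: classified as a Mark, not punctuation
```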

charsCannotBecomeWord += '\u1000-\u1FAFF'; // Mahjong Tiles - Symbols and Pictographs Extended-A
charsCannotBecomeWord += '\u2000-\u2BFF'; // General Punctuation - Miscellaneous Symbols and Arrows
charsCannotBecomeWord += '\u3000-\u303F'; // CJK Symbols and Punctuation
const spaceRegExp = new RegExp(`[${spaceChars}]`);
const cannotBecomeWordRegExp = new RegExp(`[${charsCannotBecomeWord}]`);

const reWhitespace = /\S/;

@@ -32,21 +26,29 @@ wordDiff.equals = function(left, right) {
return left === right || (this.options.ignoreWhitespace && !reWhitespace.test(left) && !reWhitespace.test(right));
};
wordDiff.tokenize = function(value) {
// All whitespace symbols except newlines group into one token; each newline goes in a separate token
let tokens = value.split(/([^\S\r\n]+|[()[\]{}'"\r\n]|\b)/);

// Join the boundary splits that we do not consider to be boundaries. This is primarily the extended Latin character set.
for (let i = 0; i < tokens.length - 1; i++) {
// If we have an empty string in the next field and we have only word chars before and after, merge
if (!tokens[i + 1] && tokens[i + 2]
&& extendedWordChars.test(tokens[i])
&& extendedWordChars.test(tokens[i + 2])) {
tokens[i] += tokens[i + 2];
tokens.splice(i + 1, 2);
i--;
const tokens = [];
let prevCharType = '';
for (let i = 0; i < value.length; i++) {
const char = value[i];
if (spaceRegExp.test(char)) {
if(prevCharType === 'space') {
tokens[tokens.length - 1] += ' ';
Collaborator:

In the edge case where there's a long run of spaces in the text somewhere, this is going to take O(n^2) time where n is the number of consecutive spaces.

It also rewrites all space characters of any kind to an ordinary ' '.

Both these aspects of the behaviour here seem wrong!
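To make both objections concrete, here is a reduced sketch of the space-handling branch under review (the `spaceRegExp` character class is copied from the diff; the function name is mine). Every space-class character is appended as a literal `' '`, so the original characters are lost, and each append copies the accumulated token, which is quadratic in the length of the run:

```javascript
// Simplified sketch of the PR's space-handling branch (illustration only).
const spaceRegExp = /[ \f\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]/;

function tokenizeSpaces(value) {
  const tokens = [];
  let prevCharType = '';
  for (const char of value) {
    if (spaceRegExp.test(char)) {
      if (prevCharType === 'space') {
        // String append copies the whole accumulated run each time: O(n^2)
        // for a run of n spaces. It also discards the actual character.
        tokens[tokens.length - 1] += ' ';
      } else {
        tokens.push(' ');
      }
      prevCharType = 'space';
    } else {
      tokens.push(char);
      prevCharType = '';
    }
  }
  return tokens;
}

// A tab and a no-break space both come back as plain spaces:
console.log(tokenizeSpaces('a\t\u00a0b')); // [ 'a', '  ', 'b' ]
```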

Author:

I agree. It seems like an excessive change.

} else {
tokens.push(' ');
}
prevCharType = 'space';
} else if (cannotBecomeWordRegExp.test(char)) {
tokens.push(char);
prevCharType = '';
} else {
if(prevCharType === 'word') {
tokens[tokens.length - 1] += char;
} else {
tokens.push(char);
}
prevCharType = 'word';
Comment on lines +29 to +49
Collaborator:

Why do we need logic changes here? It's not obvious to me why the fix for Korean text should involve anything more than just treating Korean letters as letters instead of word boundaries...

Author:

It seems like this is also an excessive change. My intention was to separate characters that should exist as single characters into individual tokens. (For example: U+0021, U+0022, ...)
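The behavior the author describes can be sketched in reduced form (the character class below is abbreviated from the diff, and this is an illustration rather than the PR's exact code): punctuation-class characters each become a one-character token, while word characters accumulate into the current token.

```javascript
// Abbreviated "cannot become word" class: ASCII punctuation/controls plus
// CJK Symbols and Punctuation (the full PR covers more ranges).
const cannotBecomeWordRegExp = /[\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E\u3000-\u303F]/;

function tokenize(value) {
  const tokens = [];
  let inWord = false;
  for (const char of value) {
    if (cannotBecomeWordRegExp.test(char)) {
      tokens.push(char); // e.g. U+0021 '!' always stands alone
      inWord = false;
    } else if (inWord) {
      tokens[tokens.length - 1] += char;
    } else {
      tokens.push(char);
      inWord = true;
    }
  }
  return tokens;
}

console.log(tokenize('안녕!?')); // [ '안녕', '!', '?' ]
```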

}
}

return tokens;
};

1 change: 1 addition & 0 deletions test/diff/word.js
@@ -64,6 +64,7 @@ describe('WordDiff', function() {
it('should token unicode characters safely', function() {
expect(wordDiff.removeEmpty(wordDiff.tokenize('jurídica'))).to.eql(['jurídica']);
expect(wordDiff.removeEmpty(wordDiff.tokenize('wir üben'))).to.eql(['wir', ' ', 'üben']);
expect(wordDiff.removeEmpty(wordDiff.tokenize('안녕.'))).to.eql(['안녕', '.']);
});

it('should include count with identity cases', function() {