Improve word tokenization for non-Latin characters #328

Open · wants to merge 1 commit into master

Conversation

jihunleekr (Author):

Diff.diffWords is not working on non-Latin characters like Korean.

let charsCannotBecomeWord = '\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E'; // Basic Latin
charsCannotBecomeWord += '\u00A0-\u00BF\u00D7\u00F7'; // Latin-1 Supplement
charsCannotBecomeWord += '\u02B9-\u02DD\u02E5-\u02FF'; // Spacing Modifier Letters
charsCannotBecomeWord += '\u0300-\u036F'; // Combining Diacritical Marks
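
For context, a hedged sketch of how ranges like these might be assembled into the regular expressions referenced in the tokenizer code later in this thread (the actual construction is not shown in this excerpt):

// Sketch only, assuming charsCannotBecomeWord holds the concatenated ranges above.
const cannotBecomeWordRegExp = new RegExp('[' + charsCannotBecomeWord + ']');
// spaceRegExp is assumed to match a single whitespace character.
const spaceRegExp = /\s/;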

Collaborator:

Hmm... this can't be right. Treating combining diacritics as word breaks is definitely wrong for European languages that I am familiar with (and I would've thought that it was wrong universally), since it leads to outcomes like this:

> a = 'festou\u0300'
'festoù'
> b = 'impube\u0300re'
'impubère'
> diff.diffWords(a, b)
[
  { count: 1, added: undefined, removed: true, value: 'festou' },
  { count: 1, added: true, removed: undefined, value: 'impube' },
  { count: 1, value: '̀' },
  { count: 1, added: true, removed: undefined, value: 're' }
]

That diff ought to just show the one word in the old text being deleted and a totally new word from the new text being added, but instead it shows us preserving the code point for the grave accent in the middle of the word and adding and deleting stuff around the accent. That's nonsense from the perspective of French or any other language I know with accents; the è (an e with a grave accent) in impubère is simply a letter in the word, just like the b and r next to it, and doesn't semantically represent a word break.
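
In other words, a sketch of the result one would expect here (illustrative, not actual library output):

// Expected shape: the whole old word removed and the whole new word added,
// with nothing preserved in between.
[
  { count: 1, added: undefined, removed: true, value: 'festoù' },
  { count: 1, added: true, removed: undefined, value: 'impubère' }
]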

Do combining diacritics work differently in Korean (... if they even exist in Korean?) or some other language you're familiar with such that it makes sense to treat them as word breaks? I am afraid I don't speak any non-European languages and am a bit out of my depth! I can imagine in theory a language where accents get used as punctuation, but I have never encountered such a thing...

jihunleekr (Author):

I hadn't considered such cases due to my limited experience with other languages. It seems like an aspect that requires further thought.

Comment on lines +29 to +49
const tokens = [];
let prevCharType = '';
for (let i = 0; i < value.length; i++) {
  const char = value[i];
  if (spaceRegExp.test(char)) {
    // Whitespace: extend the previous space token or start a new one.
    if (prevCharType === 'space') {
      tokens[tokens.length - 1] += ' ';
    } else {
      tokens.push(' ');
    }
    prevCharType = 'space';
  } else if (cannotBecomeWordRegExp.test(char)) {
    // Punctuation-like characters become single-character tokens.
    tokens.push(char);
    prevCharType = '';
  } else {
    // Word characters are appended to the current word token.
    if (prevCharType === 'word') {
      tokens[tokens.length - 1] += char;
    } else {
      tokens.push(char);
    }
    prevCharType = 'word';

Collaborator:

Why do we need logic changes here? It's not obvious to me why the fix for Korean text should involve anything more than just treating Korean letters as letters instead of word boundaries...
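
For illustration, a minimal sketch of that narrower approach, assuming the tokenizer keeps a list of characters that can form words; the identifier names and the specific Hangul ranges below are illustrative, not taken from this PR:

// Sketch only: treat Hangul as word characters rather than as word breaks,
// leaving the rest of the tokenizer untouched.
const wordChars = 'a-zA-Z0-9_' +
  '\u1100-\u11FF' +   // Hangul Jamo
  '\u3130-\u318F' +   // Hangul Compatibility Jamo
  '\uAC00-\uD7AF';    // Hangul Syllables
const wordCharRegExp = new RegExp('[' + wordChars + ']');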

jihunleekr (Author):

It seems like this is also an excessive change. My intention was to separate characters that should exist as single characters into individual tokens. (For example: U+0021, U+0022, ...)

const char = value[i];
if (spaceRegExp.test(char)) {
  if (prevCharType === 'space') {
    tokens[tokens.length - 1] += ' ';

Collaborator:

In the edge case where there's a long run of spaces in the text somewhere, this is going to take O(n^2) time where n is the number of consecutive spaces.

It also rewrites all space characters of any kind to an ordinary ' '.

Both these aspects of the behaviour here seem wrong!
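
A hedged sketch of one way to address both points: consume each whitespace run in a single step and keep the original characters (spaceRegExp and the surrounding loop are assumed to be as in the excerpt above).

// Sketch only: replacement for the per-character space handling shown above.
if (spaceRegExp.test(char)) {
  // Scan ahead to the end of the whitespace run...
  let j = i + 1;
  while (j < value.length && spaceRegExp.test(value[j])) {
    j++;
  }
  // ...and push the run as a single token, preserving tabs, newlines, etc.
  tokens.push(value.slice(i, j));
  i = j - 1;
  prevCharType = 'space';
}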

jihunleekr (Author):

I agree. It seems like an excessive change.

ExplodingCabbage (Collaborator):

If I understand right, the problem is that right now we treat all CJK characters as if they were punctuation marks / word breaks, and the fix here treats them as letters instead. But:

  • the fix here also messes with combining diacritics in ways that seem to me to break existing working behaviour for languages with accents
  • the fix also changes other aspects of the logic of tokenize beyond which characters are treated as letters vs word breaks, and I can't figure out why
  • the fix doesn't help us with Japanese or Chinese, since those languages don't use spaces and need a fundamentally different tokenization algorithm, like the one provided by Intl.Segmenter (a sketch follows below). That doesn't by itself make this a bad idea, but it makes me wonder whether we ought to be making a more radical change...

I'll come back to this in due course. Would love to get your input in the meantime, @jihunleekr, but I understand if in the 2 years since you opened this PR you've lost interest!
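
For reference, a hedged sketch of what Intl.Segmenter-based word tokenization looks like (not part of this PR; availability depends on the JavaScript runtime):

// Sketch only: dictionary-based word segmentation for text written without spaces.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const tokens = Array.from(segmenter.segment('これはテストです'), s => s.segment);
// tokens comes out as something like ['これ', 'は', 'テスト', 'です']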
