Improve mixed CJK/Latin linebreaking. #1986

Merged
merged 1 commit into from Sep 6, 2022


bigfarts
Contributor

This avoids prioritizing kana above spaces: we now break at the first possible break location, rather than following an implicit priority order between the kinds of break.

Before: (two screenshots)

After: (two screenshots)

@emilk
Owner

emilk commented Sep 5, 2022

I have no knowledge of kana, but in English text I would say it is preferable to break at a space rather than at punctuation or dashes. Take for instance `Temperature: 3.2 Kelvin`. We do not want to break this as:

Temperature: 3.
2 Kelvin

So the current ordering is very deliberate when it comes to spaces, dashes and punctuation, and I don't want to break that.
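
To make the ordering concrete, here is a minimal sketch of a priority chain built on `Option::or`. The `Candidates` struct, its `add` classification rules, and the `break_index` helper are hypothetical names invented for illustration; they are not the actual layout code, and indices here are byte offsets of the candidate character.

```rust
/// Hypothetical stand-in for the break candidates tracked while filling a row.
/// Each field remembers the byte index of the latest candidate of that kind.
#[derive(Default)]
struct Candidates {
    space: Option<usize>,
    dash: Option<usize>,
    punctuation: Option<usize>,
    any: Option<usize>,
}

impl Candidates {
    /// Record `c` at `index` under the kind it belongs to (assumed rules).
    fn add(&mut self, index: usize, c: char) {
        if c.is_whitespace() {
            self.space = Some(index);
        } else if c == '-' {
            self.dash = Some(index);
        } else if c.is_ascii_punctuation() {
            self.punctuation = Some(index);
        }
        // Any character is a last-resort candidate.
        self.any = Some(index);
    }

    /// Spaces win over dashes, dashes over punctuation, and `any` is the fallback.
    fn best(&self) -> Option<usize> {
        self.space.or(self.dash).or(self.punctuation).or(self.any)
    }
}

/// Pick a break index among the first `max_chars` characters of `text`.
fn break_index(text: &str, max_chars: usize) -> Option<usize> {
    let mut candidates = Candidates::default();
    for (index, c) in text.char_indices().take(max_chars) {
        candidates.add(index, c);
    }
    candidates.best()
}
```

With a row that fits only `Temperature: 3.`, `break_index("Temperature: 3.2 Kelvin", 15)` returns `Some(12)`, the space after the colon, even though the `.` at index 14 is a later candidate.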

If the problem is that spaces are prioritized over kana, let's just focus on that.

Perhaps something like:

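// Keep the existing priority among the non-kana candidates.
let best = self.space.or(self.logogram).or(self.dash).or(self.punctuation);
// Then take whichever of that candidate and the kana candidate comes later.
let pos = match (best, self.kana) {
    (None, None) => None,
    (None, Some(pos)) => Some(pos),
    (Some(pos), None) => Some(pos),
    (Some(best), Some(kana)) => Some(best.max(kana)),
};
pos.or(self.any)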

Or we special-case it based on whether or not there is kana:

if let Some(kana) = self.kana {
    // Whatever logic makes sense for kana
} else {
    self.space.or(self.logogram).or(self.dash).or(self.punctuation).or(self.any)
}
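
For reference, the first suggestion can be turned into a self-contained sketch like the one below. The `BreakCandidates` struct and the `best_break` method are hypothetical names; only the field names are taken from the snippets above, and this is one possible reading of the idea rather than the code that was merged.

```rust
/// Hypothetical candidate set; field names follow the snippets above.
#[derive(Default, Clone, Copy)]
struct BreakCandidates {
    space: Option<usize>,
    logogram: Option<usize>,
    kana: Option<usize>,
    dash: Option<usize>,
    punctuation: Option<usize>,
    any: Option<usize>,
}

impl BreakCandidates {
    /// Keep the space > logogram > dash > punctuation ordering, but let a kana
    /// candidate win whenever it lies further along the row.
    fn best_break(&self) -> Option<usize> {
        let best = self
            .space
            .or(self.logogram)
            .or(self.dash)
            .or(self.punctuation);
        let pos = match (best, self.kana) {
            (None, None) => None,
            (None, Some(pos)) | (Some(pos), None) => Some(pos),
            (Some(best), Some(kana)) => Some(best.max(kana)),
        };
        // Fall back to breaking anywhere only when nothing better exists.
        pos.or(self.any)
    }
}
```

With `space: Some(3)` and `kana: Some(7)` this yields `Some(7)`; with `space: Some(9)` and `kana: Some(7)` it yields `Some(9)`, so neither kind is unconditionally preferred.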

It would also be great if you added a test for this so we don't break it in the future!

@bigfarts bigfarts changed the title Treat all types of row breaking the same (except any). Improve mixed CJK/Latin linebreaking. Sep 6, 2022
@bigfarts
Contributor Author

bigfarts commented Sep 6, 2022

This should be a better solution: breaking on a CJK character (kana or logogram; Hangul is not supported because I don't know much about Hangul) is now prioritized at the same level as spaces, and breaking before a CJK character is prioritized at that level too. This handles cases like:

CJK break:

日本語と
Englishの混在
した文章

Pre-CJK break:

日本語とEnglish
の混在した文章

(actually the K part is a lie because I don't know much about Korean typesetting, but it should be easy to implement)

This changes the break on space rule to break on space, CJK, or pre-CJK, e.g.:

    aaaあああ
       ^ break is inserted here
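
A rough sketch of the combined rule, under a few assumptions: the helper names (`is_kana`, `is_cjk_ideograph`, `is_cjk`, `break_offsets`) are made up for illustration, the Unicode ranges cover only the Hiragana, Katakana, and CJK Unified Ideographs blocks, "CJK break" is read as a break after a CJK character, and "pre-CJK break" as a break before one. It lists every byte offset at which a new row could start with space-level priority.

```rust
/// Hiragana (U+3040..=U+309F) or Katakana (U+30A0..=U+30FF).
fn is_kana(c: char) -> bool {
    ('\u{3040}'..='\u{309F}').contains(&c) || ('\u{30A0}'..='\u{30FF}').contains(&c)
}

/// CJK Unified Ideographs block only (an assumption; extensions are ignored).
fn is_cjk_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

fn is_cjk(c: char) -> bool {
    is_kana(c) || is_cjk_ideograph(c)
}

/// Byte offsets at which a new row may start: after a space, after a CJK
/// character, or before a CJK character.
fn break_offsets(text: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut prev: Option<char> = None;
    for (index, c) in text.char_indices() {
        if let Some(p) = prev {
            if p.is_whitespace() || is_cjk(p) || is_cjk(c) {
                offsets.push(index);
            }
        }
        prev = Some(c);
    }
    offsets
}
```

For `aaaあああ` the first offset returned is 3, the position marked by the caret above; no offset falls inside the Latin run `aaa`.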
@emilk
Owner

emilk commented Sep 6, 2022

Great!

@emilk emilk merged commit 0e62c0e into emilk:master Sep 6, 2022