Improve performance of Reline::Unicode.get_mbchar_width #632

@tompng tompng commented Jan 4, 2024

Performance and Regression check

Tested with this benchmark

# Ignore characters whose width Reline was calculating incorrectly
chars = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.reject { _1 =~ /\p{M}/ && _1 !~ /\p{Mn}/ } - ["\u2e3b"];
measure
def Reline.ambiguous_width = 3 # Set to any number except 0, 1, 2 to make the checksum work better.
chars.map{Reline::Unicode.get_mbchar_width(_1)*_1.ord}.sum # checksum to ensure no regression
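The `measure` command above is an IRB feature. Outside IRB, the same timing and checksum could be collected with the standard `benchmark` library. This is only a sketch: `timed_checksum` is a hypothetical helper name, and the width function is passed as a block so the helper does not depend on Reline being loaded.

```ruby
require 'benchmark'

# Hypothetical helper: returns [elapsed_seconds, checksum].
# The block receives one character and must return its display width.
def timed_checksum(chars)
  checksum = nil
  seconds = Benchmark.realtime do
    # Same checksum as the snippet above: width * codepoint, summed.
    checksum = chars.sum { |char| yield(char) * char.ord }
  end
  [seconds, checksum]
end

# Possible usage, assuming reline is loaded and `chars` is built as above:
#   seconds, checksum = timed_checksum(chars) { |c| Reline::Unicode.get_mbchar_width(c) }
#   puts format('processing time: %.6fs #=> %d', seconds, checksum)
```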

Result with yjit

master branch
processing time: 0.784064s #=> 924024865779
this branch
processing time: 0.404117s #=> 924024865779

Result without yjit

master branch
processing time: 0.900042s #=> 924024865779
this branch
processing time: 0.566599s #=> 924024865779

Implementation

Uses bsearch_index. The time complexity of bsearch_index is O(log(N)), where N is the total count of Unicode characters.
We could also choose an O(1) lookup that uses less memory (a shallow tree with bignums, https://gist.github.com/tompng/6be795d487e1a0105ada41e24f9528c4), but the generated file would be unreadable.

I think this bsearch_index is a good choice because:

  • Most chars are ASCII, so multibyte width calculation is not that important
  • It strikes a good balance between performance, simplicity of the unicode.rb code, and readability of the generated east_asian_width.rb
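As a rough illustration of the bsearch_index approach, a width lookup over sorted range boundaries might look like the sketch below. The constant names `RANGE_LAST` and `RANGE_WIDTH`, the method name `sketch_width`, and the table contents are all assumptions made for this example; they are toy values, not the real generated data.

```ruby
# Toy table (illustrative only, not real Unicode data).
# RANGE_LAST[i] is the last codepoint of the i-th run of characters
# that share one width; RANGE_WIDTH[i] is that width.
RANGE_LAST  = [0x7f, 0x10ff, 0x115f, 0x10ffff].freeze
RANGE_WIDTH = [1,    1,      2,      1].freeze

def sketch_width(char)
  ord = char.ord
  # Find-minimum mode: returns the first index whose boundary is >= ord.
  # The block result goes from false to true as the boundaries grow,
  # which is what binary search requires.
  index = RANGE_LAST.bsearch_index { |last| ord <= last }
  index ? RANGE_WIDTH[index] : 1
end

sketch_width("a")      # looks up the first run
sketch_width("\u1100") # looks up the third run in this toy table
```

Because the table is a pair of flat, sorted arrays, a generated file in this shape stays easy to read and diff, at the cost of a logarithmic search per lookup.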

Bug fixes

Fixed these two types of characters. They are excluded from the performance/regression benchmark.

Nonspacing Mark

Reline returned 0 for /\p{M}/ (Mark). I think this was a mistake and /\p{Mn}/ (Nonspacing Mark) was intended.
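To make the difference concrete: /\p{M}/ covers all mark categories (Mn, Mc, Me), while /\p{Mn}/ covers only nonspacing marks. A small sketch, where `mark_kind` is a hypothetical helper written for this example:

```ruby
# Hypothetical helper that classifies a single character.
def mark_kind(char)
  if char.match?(/\p{Mn}/)
    :nonspacing_mark   # zero width is expected
  elsif char.match?(/\p{M}/)
    :other_mark        # Mc or Me: matched by \p{M} but not by \p{Mn}
  else
    :not_a_mark
  end
end

mark_kind("\u0301") # COMBINING ACUTE ACCENT, category Mn
mark_kind("\u0903") # DEVANAGARI SIGN VISARGA, category Mc
mark_kind("a")
```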

# Chars that match /\p{M}/ but not /\p{Mn}/ (465 chars)
marks = (0..0x10ffff).filter_map{''<<_1 rescue nil}.select{_1 =~ /\p{M}/ && _1 !~ /\p{Mn}/}
# Measure actual width in terminal emulator by "\e[6n" (Device Status Report)
marks.count{ $><< "\ra#{_1}b\e[6n";STDIN.raw{STDIN.readpartial(10)[/\e\[\d+;(\d+)R/, 1]}.to_i - 1 == 2 }
# =>
# 0 means /\p{Mn}/ is correct, 465 means /\p{M}/ is correct.
# Terminal.app: 0
# iTerm2: 36
# Alacritty: 13
# VSCode Terminal: 14
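The one-liner above is dense, so here is an expanded sketch of the same measurement. The helper names `parse_cursor_column` and `measured_width` are assumptions for this example. It has to run in an interactive terminal, and like the original it assumes the whole reply arrives in a single read.

```ruby
require 'io/console'

# Extract the 1-based column from a Cursor Position Report such as "\e[12;5R".
# Returns nil when the response does not contain a report.
def parse_cursor_column(response)
  response[/\e\[\d+;(\d+)R/, 1]&.to_i
end

# Print "a<char>b" starting at column 1, then send "\e[6n" so the terminal
# reports the cursor position. "a" and "b" take one column each, so the
# reported 1-based column is 3 + width of char.
def measured_width(char, input: $stdin, output: $stdout)
  output << "\ra#{char}b\e[6n"
  output.flush
  response = input.raw { input.readpartial(10) }
  column = parse_cursor_column(response)
  column && column - 3
end

# Possible usage in a terminal, with `marks` built as above:
#   zero_width_count = marks.count { |c| measured_width(c) == 0 }
```

A result of 0 from `measured_width` corresponds to the `== 2` check in the original snippet.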

Three Em Dash

Reline returned 3 for three em dash "\u2e3b".
Reline returned 1 for two em dash "\u2e3a".
The three em dash is defined as N (Neutral), so its width should be 1. Terminal.app, VSCode, iTerm, and Alacritty use width=1 (but the glyph overflows because the font is very wide).
