Improve performance of `Reline::Unicode.get_mbchar_width` #632

tompng · 2024-01-04T15:29:09Z

Performance and Regression check

Tested with this benchmark

# Ignore some characters for which Reline was calculating the wrong width
chars = (0..0x10ffff).filter_map{_1.chr('utf-8') rescue nil}.reject { _1 =~ /\p{M}/ && _1 !~ /\p{Mn}/ } - ["\u2e3b"];
measure
def Reline.ambiguous_width = 3 # Set to any number except 0, 1, 2 to make the checksum work better.
chars.map{Reline::Unicode.get_mbchar_width(_1)*_1.ord}.sum # checksum to ensure no regression

Result with yjit

master branch
processing time: 0.784064s #=> 924024865779
this branch
processing time: 0.404117s #=> 924024865779

Result without yjit

master branch
processing time: 0.900042s #=> 924024865779
this branch
processing time: 0.566599s #=> 924024865779

Implementation

Uses bsearch_index. The time complexity of bsearch_index is O(log(N)), where N is the total count of Unicode characters.
A lower-memory O(1) lookup is also possible (a shallow tree with bignums, https://gist.github.com/tompng/6be795d487e1a0105ada41e24f9528c4), but the generated file would be unreadable.

I think this bsearch_index is a good choice because:

Most chars are ASCII, so multibyte width calculation is not that important
It strikes a good balance between performance, simplicity of the unicode.rb code, and readability of the generated east_asian_width.rb
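
To illustrate the general idea of a bsearch_index-based width lookup, here is a minimal sketch. It is not the code in this PR: the `TOY_STARTS` and `TOY_WIDTHS` tables and the `toy_width` method are hypothetical names, and the table is a tiny truncated toy rather than the generated east_asian_width.rb data.

```ruby
# Hypothetical toy table (NOT the generated data): each entry in TOY_STARTS
# is the first code point of a run, and TOY_WIDTHS holds the width of that run.
# The table is truncated, so results above U+1160 are not meaningful.
TOY_STARTS = [0x0000, 0x0300, 0x0370, 0x1100, 0x1160].freeze
TOY_WIDTHS = [1, 0, 1, 2, 1].freeze

# Look up the width of a single character with a binary search.
def toy_width(char)
  ord = char.ord
  # Find-minimum mode: index of the first run that starts after ord.
  next_run = TOY_STARTS.bsearch_index { |start| start > ord }
  # nil means ord lies at or beyond the last run start.
  run = next_run ? next_run - 1 : TOY_STARTS.size - 1
  TOY_WIDTHS[run]
end

puts toy_width("a")      # ASCII letter
puts toy_width("\u0301") # combining acute accent, inside the 0x0300 run
puts toy_width("\u1100") # Hangul choseong, inside the 0x1100 run
```

Storing only run boundaries keeps the table small and readable, at the cost of an O(log(N)) search per lookup instead of a direct index.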

Bug fixes

Fixed these two types of chars. They are excluded from the performance/regression benchmark.

Nonspacing Mark

Reline returned 0 for /\p{M}/ (Mark). I think this was a mistake and /\p{Mn}/ (Nonspacing Mark) was intended.

# Chars that match /\p{M}/ but not /\p{Mn}/, 465 chars
marks = (0..0x10ffff).filter_map{''<<_1 rescue nil}.select{_1 =~ /\p{M}/ && _1 !~ /\p{Mn}/}
# Measure actual width in terminal emulator by "\e[6n" (Device Status Report)
marks.count{ $><< "\ra#{_1}b\e[6n";STDIN.raw{STDIN.readpartial(10)[/\e\[\d+;(\d+)R/, 1]}.to_i - 1 == 2 }
# =>
# 0 means /\p{Mn}/ is correct, 465 means /\p{M}/ is correct.
# Terminal.app: 0
# iTerm2: 36
# Alacritty: 13
# VSCode Terminal: 14
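
As a small illustration of the difference between the two properties, the sketch below checks one spacing mark and one nonspacing mark. The constant and method names are made up for this example and are not part of Reline.

```ruby
# Example characters chosen for illustration (names are hypothetical):
SPACING_MARK    = "\u0903" # DEVANAGARI SIGN VISARGA, General_Category Mc
NONSPACING_MARK = "\u0301" # COMBINING ACUTE ACCENT, General_Category Mn

# True for any mark (Mn, Mc, Me).
def mark?(char)
  char.match?(/\p{M}/)
end

# True only for nonspacing marks (Mn).
def nonspacing_mark?(char)
  char.match?(/\p{Mn}/)
end

[SPACING_MARK, NONSPACING_MARK].each do |char|
  puts format("U+%04X  \\p{M}: %-5s  \\p{Mn}: %s",
              char.ord, mark?(char), nonspacing_mark?(char))
end
```

A spacing mark such as U+0903 matches /\p{M}/ but not /\p{Mn}/, so a width rule keyed on /\p{M}/ would treat it as zero-width while one keyed on /\p{Mn}/ would not.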

Three Em Dash

Reline returned 3 for three em dash "\u2e3b".
Reline returned 1 for two em dash "\u2e3a".
The three em dash is defined as N (Neutral) and should be 1. Terminal.app, VSCode, iTerm, and Alacritty use width=1 (but the glyph overflows because the font is very wide).

Calculate mbchar width with bsearch

0ceb65d
