Variation Selector 15 (VS-15, U+FE0E) support. #120

jquast · 2024-02-14T20:04:23Z

I did a few spot checks of VS-15 when implementing VS-16, and erroneously believed that all emojis in VS-15 sequences were already listed by EastAsianWidth.txt as having a width of 1.

But that's not true. There are several emojis that are "wide" that are changed to "narrow" with VS-15.
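As a rough illustration of the behaviour described above, the sketch below narrows a wide character when VS-15 follows it. It is not the code from this PR: `toy_char_width` and `width_with_vs15` are hypothetical names, the width function is a crude stand-in built on the standard `unicodedata` module, and the four-entry table is only the subset of code points visible in the `table_vs15.py` excerpt of this review.

```python
import unicodedata

VS15 = "\ufe0e"  # Variation Selector-15 (text presentation)

# Assumption: only the four code points visible in the review excerpt of
# table_vs15.py; the real table is larger.
VS15_NARROWS = {0x2B55, 0x3030, 0x303D, 0x3297}


def toy_char_width(char):
    """Hypothetical stand-in for a per-character width lookup."""
    if unicodedata.east_asian_width(char) in ("W", "F"):
        return 2
    if unicodedata.category(char) in ("Mn", "Me", "Cf"):
        return 0
    return 1


def width_with_vs15(text):
    """Sum character widths, narrowing a wide character followed by VS-15."""
    total = 0
    last_measured = None
    for char in text:
        if char == VS15:
            # VS-15 itself takes no cells; it may narrow the previous character.
            if (last_measured is not None
                    and ord(last_measured) in VS15_NARROWS
                    and toy_char_width(last_measured) == 2):
                total -= 1
            last_measured = None
            continue
        width = toy_char_width(char)
        total += width
        if width > 0:
            last_measured = char
    return total
```

Under these assumptions, U+2B55 alone measures 2 cells, and U+2B55 followed by U+FE0E measures 1.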

Reported by @rivo in muesli/reflow#73 (comment)

@rivo: you declare that our "Specification" is "missing some things", I would appreciate any further things that you find wrong.


codecov · 2024-02-14T20:11:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (056ee4b) 100.00% compared to head (f00fba5) 100.00%.

Additional details and impacted files

@@            Coverage Diff            @@
##            master      #120   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         6    +1     
  Lines          105       115   +10     
  Branches        25        28    +3     
=========================================
+ Hits           105       115   +10


Add any additional U+FE0F/U+FE0E check in sequence of wcswidth() to ensure 100% code coverage

penguinolog · 2024-02-15T15:37:09Z

tox.ini

 basepython = python3.11
 commands = {envbindir}/pylint --rcfile={toxinidir}/.pylintrc \
-           --ignore=tests,docs,setup.py,conf.py,build,distutils,.pyenv,.git,.tox \
+           --ignore=tests,docs,setup.py,conf.py,build,distutils,.pyenv,.git,.tox,table_wide.py,table_vs15.py \


IMHO, this belongs in .pylintrc or in the pylint section of pyproject.toml.

[tool:pylint] may also be included in tox.ini, but I decided it was less complex to just include it here with the others

penguinolog · 2024-02-15T15:38:28Z

wcwidth/wcwidth.py

@@ -201,6 +202,17 @@ def wcswidth(pwcs, n=None, unicode_version='auto'):
                last_measured_char = None
            idx += 1
            continue
+        if char == u'\uFE0E' and last_measured_char:


u'\uFE0E' - maybe it's time to kill Python 2.7?

rivo · 2024-02-15T23:01:26Z

@rivo: you declare that our "Specification" is "missing some things", I would appreciate any further things that you find wrong.

I guess VS15 was the main one. I noticed other minor things like U+FF9E or U+FF9F which the grapheme cluster spec lists as "extending characters", thus width of 0 IMO but would get a width of 1 according to your algorithm, thus, e.g. ｷﾞ would result in a width of 2.

I'm also not sure what exactly this means:

Any character following ZWJ (U+200D) when in sequence by function wcwidth.wcswidth().

How do you determine when such a sequence ends? E.g. how do you determine the width of a string such as this one:

👩‍👩‍👦‍👦abc

I personally also think that characters such as U+2E3B should be much wider (width of 1 in your library). I've seen it span at least 4 cells in various macOS applications.

There may be more, or maybe not, I don't know. Validating your algorithm would be a lot of effort.

I put the word Specification in quotes because while you do note that it's a description of your implementation, I find that writing "I authored a formal Specification detailing how characters should be measured" makes it sound more authoritative than it is, especially in the context of assigning grades to other applications based on how well they "perform". In my comment, I was trying to make the point that there is no official specification and so far, every attempt I have seen has had flaws.

Even my own implementation is not perfect. Consider U+FDFD, to which both of our libraries assign a width of 1:

U+FDFD: ﷽

In iTerm2, it spans 5 cells. In Chrome on macOS with a monospace font, it's even 11 cells wide.

It would be interesting to render out these characters on different platforms with the most common fonts and check how their widths compare to our calculated widths. This might reveal some general flaws or at the very least it would identify outliers such as U+FDFD.

I haven't had time for such a project yet, though.

GalaxySnail · 2024-02-16T10:40:58Z

I noticed other minor things like U+FF9E or U+FF9F which the grapheme cluster spec lists as "extending characters", thus width of 0 IMO but would get a width of 1 according to your algorithm, thus, e.g. ｷﾞ would result in a width of 2.

Shouldn't the width of ｷﾞ be 2? This string is aligned with East Asian Wide characters in my web browser.

ｷﾞ一二三
一二三四

rivo · 2024-02-16T11:38:43Z

Shouldn't the width of ｷﾞ be 2?

This is U+FF77 and U+FF9E. U+FF77 is a half-width character, thus width=1. Is U+FF9E a separate character? The Grapheme Cluster Spec says it's an "extending character" and those typically don't take up extra space. They typically fall into the "Mn" category.

I'm aware that some fonts will still create extra space for them, which is likely why it looks ok in your browser. I guess it's nothing big to worry about. (That's why I wrote "minor" above.)
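For anyone wanting to check the properties under discussion, here is a small sketch using only Python's standard `unicodedata` module. The helper name `describe` is hypothetical, and the results depend on the Unicode database bundled with the interpreter. Note that `unicodedata` does not expose the Grapheme_Extend property, so this only shows the general category and East Asian Width. On the databases I know of, U+FF9E reports category Lm (not Mn) with East Asian Width H, and NFKC normalization composes the pair into the single wide character U+30AE.

```python
import unicodedata


def describe(char):
    """Collect a few Unicode properties of a single character."""
    return {
        "codepoint": "U+%04X" % ord(char),
        "name": unicodedata.name(char),
        "category": unicodedata.category(char),
        "east_asian_width": unicodedata.east_asian_width(char),
    }


base = describe("\uff77")  # halfwidth katakana KI
mark = describe("\uff9e")  # halfwidth katakana voiced sound mark

# Compatibility normalization maps the halfwidth pair to fullwidth forms
# and then composes them.
composed = unicodedata.normalize("NFKC", "\uff77\uff9e")

for info in (base, mark):
    print(info)
print(describe(composed))
```

This does not settle which width is "right"; it only shows what the data files say about each code point.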

jquast · 2024-02-16T17:20:25Z

Thank you for your feedback, @rivo, I honestly appreciate it.

Any character following ZWJ (U+200D) when in sequence by function wcwidth.wcswidth().

How do you determine when such a sequence ends? E.g. how do you determine the width of a string such as this one:

👩‍👩‍👦‍👦abc (U+1f469, U+200d, U+1f469, U+200d, U+1f466, U+200d, U+1f466, 'a', 'b', 'c')

The characters that follow U+200d are not counted. So the first character, U+1f469 is of width 2, but the characters following the three U+200d's (U+1f469, U+1f466, U+1f466) are not counted, while 'abc' is counted normally. So the final width is 2 + 3 = 5:

>>> import wcwidth
>>> wcwidth.wcswidth('👩‍👩‍👦‍👦abc')
5

The algorithm in wcswidth is fairly basic:

wcwidth/wcwidth/wcwidth.py

Lines 189 to 192 in 056ee4b

            if char == u'\u200D':
                # Zero Width Joiner, do not measure this or next character
                idx += 2
                continue
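To make the arithmetic above easy to reproduce without installing anything, here is a minimal standalone sketch of the same skip-after-ZWJ loop. It mirrors the quoted snippet but is not the library's code: `toy_char_width` and `zwj_aware_width` are hypothetical names, and the per-character width is a crude approximation based on the standard `unicodedata` module.

```python
import unicodedata

ZWJ = "\u200d"  # Zero Width Joiner


def toy_char_width(char):
    """Hypothetical stand-in for a per-character width lookup."""
    if unicodedata.east_asian_width(char) in ("W", "F"):
        return 2
    if unicodedata.category(char) in ("Mn", "Me", "Cf"):
        return 0
    return 1


def zwj_aware_width(text):
    """Sum widths, skipping each ZWJ and the single character after it."""
    total = 0
    idx = 0
    while idx < len(text):
        char = text[idx]
        if char == ZWJ:
            # Do not measure the joiner or the character it joins.
            idx += 2
            continue
        total += toy_char_width(char)
        idx += 1
    return total


family = "\U0001F469\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
print(zwj_aware_width(family + "abc"))
```

Only the first U+1F469 is measured (2 cells), each joined character is skipped, and 'abc' adds 3, which gives the same total of 5 as the wcswidth call above.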

There may be more, or maybe not, I don't know. Validating your algorithm would be a lot of effort.

Maybe you missed the details of the ucs-detect tool that I have written, but it does validate that the ZWJ algorithm is 100% compliant with Konsole, foot, iTerm2, and WezTerm. (A+ score for "ZWJ" at https://ucs-detect.readthedocs.io/results.html)

I find that writing "I authored a formal Specification detailing how characters should be measured" makes it sound more authoritative than it is

My apologies for that. I will modify all references to be very specific that it is the "Specification of how python wcwidth package measures..." etc. I can only say that I try to be as terse as possible. Because wcwidth is of interest to non-English speakers who may be reading it with difficulty or through translation, I try not to mince too many words, so it may come across as more authoritative than I intended.

U+FDFD: ﷽

In iTerm2, it spans 5 cells. In Chrome on macOS with a monospace font, it's even 11 cells wide.

As for these kinds of scripts (Arabic in this case), the fixed-width fonts and monospace constraints of a terminal are not appropriate. We can only do so much to interpret unicode.org specifications for the terminal environment, but measuring this kind of script isn't possible with the data files that they publish. Supporting this kind of thing would require digging into the font and its rendering engine, which wouldn't be very reasonable to implement for a general purpose command-line application library. Even terminal emulators don't often dig into the font engine. I don't believe that folks who use these languages will be very successful in designing interactive curses applications.

Aside, I do wish for there to be a terminal sequence to display variable-width fonts and be released from the constraints of monospacing for such languages. Such a sequence could be used or detected at the application level to assume that the position is indeterminate, and rely on "cursor position report" queries for only an approximation of the nearest current cell.

Because popular multi-language terminals (mlterm, foot, iTerm2, Konsole) measure it as a width of 1, that is what I wish for my library to report. I don't wish to invent any new specification or standards, and I apologize if it is ever interpreted that way. I will include more phrasing in the README.rst to make that clear. For example, if I found a statement in a unicode.org document that disagreed with all popular terminals, I would rather our library and specification match the most popular terminals.

rivo · 2024-02-17T13:30:39Z

I appreciate your thoughtful response.

monospace constraints of a terminal

Our goals may differ but in my Golang implementation, I don't think I mention terminals at all. Of course, it's a common area where these libraries are being used. But, for example, lots of people use VS Code and other IDEs which also use monospace fonts (except for the few who swear by variable fonts for programming but let's ignore them for now). Increasingly, such editors are used for more than just programming. Markdown, for example, appears to be integrated in more and more contexts (e.g. blog publishing, note taking, e-book authoring) and since IDEs have good support for writing Markdown, people will naturally gravitate towards using them. So I expect that these kinds of algorithms are / will be relevant outside of terminals and for many different languages.

You are completely right in that there is value in offering an algorithm that matches the most commonly used terminals, even when they're "wrong". There is no point in deciding a character is 2 cells wide when all terminals render it in 1 cell. In tview, my terminal UI library, I can say ﷽ is 5 cells wide but when it gets rendered out to the terminal, I need to assume the terminal assumes it to be 1 cell, therefore I have to output another 4 space characters to match my own interpretation. So I need both my own width library and the wcwidth library which matches most terminals.
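The padding idea in the previous paragraph can be sketched in a few lines. This is only an illustration of the arithmetic, written in Python rather than Go, and it is not taken from tview: `pad_for_terminal` is a hypothetical helper, and both widths are passed in by the caller instead of being computed by any particular library.

```python
def pad_for_terminal(text, own_width, terminal_width):
    """Append spaces so the terminal cursor ends up where the caller expects.

    own_width: cells the application decided the text occupies.
    terminal_width: cells the terminal is assumed to advance for it.
    """
    missing = max(0, own_width - terminal_width)
    return text + " " * missing


# The application treats U+FDFD as 5 cells wide, while the terminal is
# assumed to advance by 1 cell, so 4 spaces are appended.
padded = pad_for_terminal("\ufdfd", own_width=5, terminal_width=1)
```

The `max(0, ...)` guard simply avoids negative padding when the terminal is assumed to use more cells than the application does; that case would need a different strategy.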

Jules-Bertholet · 2024-02-28T02:56:12Z

wcwidth/table_vs15.py

+        (0x02b55, 0x02b55,),  # Heavy Large Circle
+        (0x03030, 0x03030,),  # Wavy Dash
+        (0x0303d, 0x0303d,),  # Part Alternation Mark
+        (0x03297, 0x03297,),  # Circled Ideograph Congratulation


CJK ideographs should never be narrow, emoji presentation or not.

jquast added 3 commits February 14, 2024 15:04

Set PR hyperlink in changelog (7542270)

Merge branch 'master' into jq/vs15 (eb4cc23)

jquast marked this pull request as ready for review February 14, 2024 20:10

Increase code coverage (f00fba5)

Add any additional U+FE0F/U+FE0E check in sequence of wcswidth() to ensure 100% code coverage

jquast mentioned this pull request Feb 14, 2024

Add Variation Selector-15 support jquast/ucs-detect#13

Draft

penguinolog reviewed Feb 15, 2024


jquast mentioned this pull request Feb 15, 2024

Drop support for EOL Python 2.7 and 3.5 #117

Open

Jules-Bertholet reviewed Feb 28, 2024
