Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed StringWidth() implementation by using proper Unicode grapheme cluster segmentation. Fixes #28 #29

Merged
merged 7 commits into from Jan 11, 2021

Conversation

rivo
Copy link

@rivo rivo commented Aug 29, 2019

This introduces an implementation of StringWidth() using Unicode grapheme clusters which should be the correct way to split a string into its individual characters. The built-in assumption is that if we have combined runes (emojis, flags etc.), their width is the width of the first non-zero-width rune. Many of these combined runes were previously not handled correctly by this package.

Please note:

  • This introduces a dependency: https://github.com/rivo/uniseg/ (I don't see another way to make this work short of copying most code from rivo/uniseg over.)
  • No special zero-width joiner (ZWJ) handling is required anymore. Because code and especially configuration options related to ZWJ were thus removed, this branch is not backwards-compatible to the previous commit.
  • The changes result in a failure of the TestStringWidth test case but only the part where EastAsianWidth = true. I'm not very familiar with this flag so I don't know how to fix that. You may want to review this.

@mattn
Copy link
Owner

mattn commented Sep 1, 2019

Thanks your contribution. I'm thinking there are terminals which does not support ZWJ yet. So I prefer to disable ZWJ.

@mattn
Copy link
Owner

mattn commented Sep 2, 2019

How to add enable/disable ZWJ?

@rivo
Copy link
Author

rivo commented Sep 3, 2019

Using rivo/uniseg means we're doing proper grouping of runes now. There are many more rules than just the handling of a zero-width joiner, see here. As an example, flags such as 🇯🇵 consist of two runes but don't contain zero-width joiners, they follow other rules.

If you want to support terminals that don't render these characters, the question is how do they render them? If the difference is that some terminals render characters rune-by-rune, then a flag like UnicodeSegmentation would be more appropriate.

But singling out the zero-width joiner rule (which is only one of 13 rules) is probably not very useful.

Let me know what you think about this.

@mattn
Copy link
Owner

mattn commented Sep 3, 2019

Most of terminals which does not support ZWJ displays Unicode characters as what we can see. For example, 👪 is displayed as 👨‍👩‍👦.

@rivo
Copy link
Author

rivo commented Sep 3, 2019

So you're saying that these terminals support all kinds of Unicode characters (e.g. flags, Hangul, spacing/prepend marks) but ZWJ is the one they don't support?

Do you have an example for such a terminal?

@mattn
Copy link
Owner

mattn commented Sep 3, 2019

For example, gnome-terminal displays 👪 as 👨‍👩‍👦.

@mattn
Copy link
Owner

mattn commented Sep 3, 2019

Also ZeroWidthJoiner is already used in some projects.

https://github.com/4avka/tview/blob/d7a9d6c70ab205d578fc64f02f4e681ccb5b5180/util.go#L56

@rivo
Copy link
Author

rivo commented Sep 3, 2019

What about 뢴, 🇩🇪, , or 🏳️‍🌈? How do these display in a gnome terminal?

(tview is my project. I will remove the ZWJ handling. I only used it becausego-runewidth required setting it.)

@mattn
Copy link
Owner

mattn commented Sep 3, 2019

I'll check it in later. But probably

뢴 is 2 cells
🇩🇪 is 2 cells
ö is 1 or 2 cells
🏳️‍🌈 is 4 cells

@rivo
Copy link
Author

rivo commented Sep 3, 2019

Ok. Let me know when you got it.

image

Actually, here on GitHub, your answer suggests gnome terminal uses ZWJ properly. Both emojis look identical here.

@mattn
Copy link
Owner

mattn commented Sep 3, 2019

What I look is:

image

@rivo
Copy link
Author

rivo commented Sep 3, 2019

Interesting. Well, let me know what the result of these other characters is.

If we need to be able to turn off individual Unicode breaking rules, then that functionality needs to be implemented in uniseg as well. Actually, in my opinion, it would even be better to set these flags not here in go-runewidth but in uniseg because these are grapheme breaking flags and that's the package that does it. It would then automatically affect your package, too.

I'll wait for the widths of the characters you wanted to look up and we'll see then how to proceed.

@alecrabbit
Copy link

alecrabbit commented Sep 20, 2019

What about 뢴, 🇩🇪, ö, or 🏳️‍🌈? How do these display in a gnome terminal?

If it helps here how these chars look in my terminal(Ubuntu 18.04/Gnome)

// Character StringWidth uniseg.Graphemes
image

php-wcwidth's algo gets incorrect width for 뢴 -> 4, in terminal it displayed with 2

@mattn
Copy link
Owner

mattn commented Oct 22, 2019

@rivo Can you revert removing ZeroWidthJoiner?

@mattn
Copy link
Owner

mattn commented Jan 9, 2020

Any thought?

@rivo
Copy link
Author

rivo commented Jan 9, 2020

Apologies for the late reply. And thanks to @alecrabbit for providing the screenshot. Well, reintroducing the ZeroWidthJoiner (ZWJ) flag will not solve the issue with the family icon (👪) because it doesn't even contain a ZWJ code point (your own package, prior to this PR, gives 2 as a width, no matter how the flag is set). As mentioned before, the issue is larger than just the zero-width joiner topic. I.e. reverting will not solve much.

Also, the rivo/uniseg package has no way to disable individual rules so I don't even know how I would implement a ZWJ deactivation. Especially because it could have side effects on other rules.

If you insist on keeping that flag in there, I think you'll need to ignore this pull request.

I think it's generally difficult or even impossible to make this package work on all platforms. One would have to keep a database of how the different platforms render different characters. Since new emojis are introduced all the time, that's whole bunch of work. And it seems that not even popular packages get this right.

Here's iTerm on macOS 10.14:

image

Same with Apple's Terminal:

image

In short, I have no solution. There will always be someone who will complain. Unless someone is willing to put the effort into building a huge database of terminals+OSs and how they render characters, there will be mistakes.

But personally, I think this PR gets us closer to the correct way to handle this than the current version.

@mattn
Copy link
Owner

mattn commented Feb 29, 2020

Could you please fix conflict?

@rivo
Copy link
Author

rivo commented Feb 29, 2020

Conflicts are resolved. However, some tests still fail. Specifically "TestStringWidth" (see my last bullet point in my opening comment above) and "TestZeroWidthJoiner". Regarding "TestZeroWidthJoiner", the problem appears not to be the zero-width joiner but the fact that the white flag 🏳️ (U+1F3F3) is assigned a width of 1 although its actual width is 2.

My terminal (iTerm2 on macOS) seems to make the same mistake:

image

And so is Apple's native Terminal application:

image

So this seems to be a problem across applications. In go-runewidth, U+1F3F3 is not contained in the "doublewidth" table:

{0x1F3CF, 0x1F3D3}, {0x1F3E0, 0x1F3F0}, {0x1F3F4, 0x1F3F4},

I believe it should be in there. If you change the end of this line to

{0x1F3F3, 0x1F3F4}

the test will pass. (I can do it in this PR if you want.)

@mattn
Copy link
Owner

mattn commented Feb 29, 2020

the test will pass. (I can do it in this PR if you want.)

Yes, please

@rivo
Copy link
Author

rivo commented May 7, 2020

the test will pass. (I can do it in this PR if you want.)

Yes, please

I made that change and also fixed the test case failures resulting from it.

I also fixed the Truncate() function which had problems (see here for details).

Please note that as mentioned way above, the TestStringWidth test case still fails:

=== RUN   TestStringWidth
    TestStringWidth: runewidth_test.go:246: StringWidth("■㈱の世界①") = 10, want 12 (EA)
    TestStringWidth: runewidth_test.go:246: StringWidth("スター☆") = 7, want 8 (EA)
    TestStringWidth: runewidth_test.go:246: StringWidth("つのだ☆HIRO") = 11, want 12 (EA)

I'm not familiar with the EastAsianWidth flag so I hope you'll be able to fix this. If you need my input on this, please let me know.

@mattn
Copy link
Owner

mattn commented May 15, 2020

Thanks. Could you please fix conflicts?

@rivo
Copy link
Author

rivo commented May 15, 2020

Could you please tell me what changes were made in the main branch? I cannot fix the conflicts if I don't understand what changes were made in parallel.

@rivo
Copy link
Author

rivo commented May 15, 2020

Thanks. This still leaves the failing test regarding the EastAsianWidth flag. Please see my message above.

@rivo
Copy link
Author

rivo commented May 15, 2020

Your merge conflict resolutions removed the necessary changes that I made, which is why your test failed. I made them again. Tests pass now.

@mattn
Copy link
Owner

mattn commented May 15, 2020

@rivo
Copy link
Author

rivo commented May 15, 2020

I don't know why this is. I fixed these tests. It works for me, see here:

image

@gnojus
Copy link

gnojus commented Jun 17, 2020

Travis-ci is running go generate, which results in regeneration of runewidth_table.go. This changes back one value on line 52 that was changed in this PR and the test fails.

@mattn mattn mentioned this pull request Jul 11, 2020
@dolmen
Copy link

dolmen commented Jul 11, 2020

With this change this package isn't about rune (as defined in the Go spec) anymore but about graphemes. A package name change would be appropriate.

…day, the rainbow flag only has a width of 1.
@codecov-commenter
Copy link

codecov-commenter commented Jul 12, 2020

Codecov Report

Merging #29 into master will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #29   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         2           
  Lines          160       151    -9     
=========================================
- Hits           160       151    -9     
Impacted Files Coverage Δ
runewidth.go 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14e809f...f1f639b. Read the comment docs.

@rivo
Copy link
Author

rivo commented Jul 12, 2020

Thanks @nojus297 for this hint. It took me quite a long time to figure it out. It's a surprisingly difficult issue and it looks like there is not satisfying solution at the moment. So here we go:

go-runewidth uses the doublewidth table to check if a rune has a width of 2. That table is generated by extracting all runes from the East Asian Width table that have a property "W" which means "wide".

The white flag 🏳️ (which is the basis for the rainbow flag 🏳️‍🌈) is not contained in that table. It's probably new. The way I understand it, the East Asian Width table only includes emojis for historical reasons (East Asian code points were used when emojis were introduced).

Unicode Annex #11 "East Asian Width" says (Section 5, "Recommendations"):

[UTS51] emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.

I believe go-runewidth should actually check the "emoji presentation table" and return a width of 2 if a rune is contained. However, there are two problems with this:

  1. go-runewidth does not maintain such a table. It only uses the Unicode "Extended_Pictographic" table but that table also contains emojis that have a width of 1.
  2. The white flag is not even contained in the Unicode "Emoji_Presentation" table. I don't know if this is a bug or if the recommendation I quoted above is expected to fail for some cases.

In any case, for now, I have changed the test such that for the rainbow flag, a width of 1 is expected. This appears to be in line with the official Unicode information I could find. And as mentioned further above, that's how the terminals also treat it. It's not ideal because obviously, the actual flag is wider than one character. But a solution would probably require the Unicode consortium to make changes (or at least clarify).

All checks are passing now.

@rivo
Copy link
Author

rivo commented Jan 11, 2021

Any thoughts?

@mattn
Copy link
Owner

mattn commented Jan 11, 2021

As I mentioned above, I don't want to remove ZeroWidthJoiner. So any part using uniseg should be separeted with flag ZeroWidthJoiner.

@rivo
Copy link
Author

rivo commented Jan 11, 2021

As I mentioned, my project tview is the only one using ZeroWidthJoiner. (I searched for it.) That's because I was the one who requested it way back. No one else is using it.

As soon as this PR is merged, I will remove ZeroWidthJoiner from tview and then nobody will use it anymore.

@mattn
Copy link
Owner

mattn commented Jan 11, 2021

As I mentioned, my project tview is the only one using ZeroWidthJoiner. (I searched for it.) That's because I was the one who requested it way back. No one else is using it.

Ah, I see. Thanks your explaining. Looks good to me. I'll check this soon.

@mattn mattn merged commit 59616a2 into mattn:master Jan 11, 2021
@mattn
Copy link
Owner

mattn commented Jan 11, 2021

Thank you!

1 similar comment
@rivo
Copy link
Author

rivo commented Jan 11, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants