Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Width is 1 when it should be 2 #59

Open
rivo opened this issue Feb 16, 2022 · 16 comments
Open

Width is 1 when it should be 2 #59

rivo opened this issue Feb 16, 2022 · 16 comments

Comments

@rivo
Copy link

rivo commented Feb 16, 2022

I stumbled over a character that, when output to the console directly, takes up two characters. But StringWidth() gives me 1. This is because the first rune of this character has a width of 1 and that's what's being used, see here. I know I wrote this code and I'm sure that you cannot simply add up the widths of individual runes ("🏳️‍🌈" would then have a width of 4 which is obviously wrong) and using the first rune's width worked fine so far. But it turns out that it fails in some cases.

I'm not familiar with Indian characters but it seems to me that the second rune is a modifier that turns the character from a width of 1 into a width of 2. Are you aware of any logic that we could add to go-runewidth that makes this right?

Here's example code that illustrates the issue:

package main

import (
	"fmt"

	runewidth "github.com/mattn/go-runewidth"
)

func main() {
	s := "खा"
	fmt.Println("0123456789")
	fmt.Println(s + "<")
	fmt.Printf("String width: %d\n", runewidth.StringWidth(s))
	var i int
	for _, r := range s {
		fmt.Printf("Rune %s  (%d) width: %d\n", string(r), i, runewidth.RuneWidth(r))
		i++
	}
}

Output (on macOS with iTerm2):

image

@dricottone
Copy link

I don't think I've ever used StringWidth. I was introduced to this library by way of tcell and it's demos. You can see here that another strategy is to use RuneWidth in a loop.

And incidentally that code looks similar to the implementation of StringWidth, but without a lot of (probably important) edge case handling.

@rivo
Copy link
Author

rivo commented Feb 17, 2022

Yeah, we've been through this before. Just looking at individual runes is not enough to determine the width of a string. Characters can consist of multiple runes ("🏳️‍🌈" is composed of 4 runes and 14 bytes). StringWidth currently uses only the width of the first rune, which is a rule I implemented because I didn't know better. Now we know that this is not correct in some cases. But to be honest, I don't even know where to look for the correct rule. The width of characters on terminal screens is not part of any official specification.

I guess I'll have to dig into wcwidth.c which appears to be used by most terminals.

@ericwq
Copy link

ericwq commented May 21, 2022

runwidth can handle east asia character, but emoj/flag is very different. should be handled differently. https://unicode.org/emoji/charts/emoji-list.html

@rivo
Copy link
Author

rivo commented May 23, 2022

It seems that runewidth does not handle East Asian characters correctly, as you can see in the initial comment of this thread. I tried to look at other implementations of this but wasn't able to find a good answer as to how other packages solve this issue. The question remains, how these (and emoji) characters should be handled when composed of multiple runes.

@mattn
Copy link
Owner

mattn commented May 23, 2022

खा 0x0916 is not Ambiguous (A). It is Neutral (N).

https://unicode.org/Public/13.0.0/ucd/EastAsianWidth.txt

image

@rivo
Copy link
Author

rivo commented May 23, 2022

What does this mean for the issue at hand? Can you elaborate?

@mattn
Copy link
Owner

mattn commented May 23, 2022

If the field is W, runewidth.StringWidth return 2. Table of runewidth is generated from EastAsianWidth.txt. So runewidith works correctly, I think.

@rivo
Copy link
Author

rivo commented May 23, 2022

If the field is W, runewidth.StringWidth return 2.

I don't know what you mean by that. As I mentioned above, it currently returns 1 for the character at hand.

So are you saying we should add up the rune widths of all runes this character is composed of? Then how do we deal with emojis?

(I know I keep repeating myself here but maybe I'm not expressing myself clear enough.)

@mattn
Copy link
Owner

mattn commented May 23, 2022

Ah, sorry. I did misreading. I didn't understand खा have two runes.

@ericwq
Copy link

ericwq commented May 23, 2022

Here is my fix. It's not perfect, but it works for me, tested with combining character, chinese, emoji, flags.

// In the loop, national flag's width got 1+1=2.
func runesWidth(runes []rune) (width int) {
	// quick pass for iso8859-1
	if len(runes) == 1 && runes[0] < 0x00fe {
		return 1
	}

	cond := runewidth.NewCondition()
	cond.StrictEmojiNeutral = false
	cond.EastAsianWidth = true

	// loop for multi rune
	width = 0
	for i := 0; i < len(runes); i++ {
		width += cond.RuneWidth(runes[i])
	}

	return width
}

There is a precondition: You must use uniseg to split it into multi runes.

graphemes := uniseg.NewGraphemes(str)
for graphemes.Next() {
	input = graphemes.Runes()
}

@rivo
Copy link
Author

rivo commented Jul 27, 2022

I just tried this and it gives me:

StringWidth("🏳️‍🌈") == 6

where it should be 2.

I think we're all underestimating this topic. If there's one thing I learned, it's that Unicode handling is complicated. Especially for emojis, there are various ways of constructing them, and simply adding up code point widths (or using the width of the first code point, like I suggested earlier) will not work.

I guess I have to do the grunt work and get into all the details.

@rivo
Copy link
Author

rivo commented Sep 11, 2022

Ok, so I went down the rabbit hole of calculating the monospace width of strings, like this package attempts. Unfortunately, the whole premise of doing this code point by code point (or, in Golang terms, rune by rune) will never lead to satisfactory results. It works ok for non-English Characters, e.g. 世界 or खा. You can simply add up the width of all individual runes.

But that completely fails for emojis. If you look at Unicode Technical Standard #51, you'll find that emojis can be encoded in tons of different ways. Zero-Width Joiner is just one small part of it. There are modifiers that force text presentation (=width of 1), modifiers that force normal characters like the digit "1" into emoji presentation (=width of 2), there are regional indicators (flags), complex emoji sequences, just to name a few.

To handle these correctly, we need to look at grapheme clusters as a whole. We already attempted this in this package, but it requires a lot more analysis of these clusters than what's happening now. I don't actually know how the original wcwidth handles emojis but similar packages like the Ruby version, janlelis/unicode-display_width, makes that extension optional, i.e. there is special handling for emojis. I also found that many terminals will render some emojis with the wrong width. So I guess wcwidth (presumably used by most of them) still has some flaws.

By now, I've implemented my own version of wcwidth in rivo/uniseg. uniseg already has all the tools in place for grapheme cluster parsing, code point classification etc. You can check out the algorithm here.

As for this package and this issue, I think there is value in being compatible with the original wcwidth functionality, as this will allow us to write Go programs that work on most terminals (assuming most terminals use wcwidth). If someone has the time, please research how wcwidth handles emojis. In my opinion, this would be the optimal solution.

For a quick fix, though, I would suggest the following:

Within a grapheme cluster, add up the widths of all individual runes. But if the first rune is an emoji (Extended Pictographic or Regional Indicator), always return a total width of 2.

This is not perfect but it will work for most use cases.

If you want to do emoji handling in a better way, another quick fix would be:

Within a grapheme cluster, add up the widths of all individual runes. But if the first rune is an emoji (Extended Pictographic or Regional Indicator), use the width returned by rivo/uniseg.

We're already using rivo/uniseg to determine the grapheme clusters so this will come for free. (This will still not handle conjoining Hangul characters correctly but maybe that's ok.)

If you want to get rid of rivo/uniseg again, sure, you can go back to adding up all rune widths again like it was in the beginning of this package. But many emojis will have the wrong width then. It's not something I would suggest but this is not my package, I'm not the one to decide.

@rivo
Copy link
Author

rivo commented Sep 11, 2022

I updated PR #63 to reflect this. It's the "quick fix" solution I described above.

@ericwq
Copy link

ericwq commented Sep 11, 2022

glad to hear that. i will give it a try if i find some time.

@mattn
Copy link
Owner

mattn commented Sep 11, 2022

Thanks. I'll look into it in later.

@ericwq
Copy link

ericwq commented Sep 29, 2022

@rivo It works perfectly. Thanks for your contribution!

junegunn pushed a commit to junegunn/go-runewidth that referenced this issue Jan 7, 2024
… to speedier version.

Adapted to short-notice change in rivo/uniseg.

Upgraded to latest rivo/uniseg. Also implemented basic emoji handling in StringWidth. See mattn#59

Added a test for Truncate that includes emojis.

Split the code so it can be upgraded once we move to Go1.18+

The wrong uniseg version was used. Fixed it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants