zwj codepoints, skin tones, families, and kisses #2

Closed
isaacs opened this issue Jul 20, 2016 · 19 comments · Fixed by #39

Comments

@isaacs

isaacs commented Jul 20, 2016

Consider these various glyphs:

  1. 👶
  2. 👶🏽
  3. 👩‍👩‍👦‍👦
  4. 👨‍❤️‍💋‍👨

The first is a generic "simpsons-colored" baby. This module correctly interprets it as a single column. (One might argue it really ought to be considered full-width, or 2 columns, since most terminals render emoji as extra wide, but one would be wrong to make that argument, because most terminals also "incorrectly" overlap the next character on top of the emoji, so it actually only "takes up" one column.)

The second is a baby with a specific skin tone. This module doesn't handle the zero-width-joiner (or "zwj", pronounced "zwidge") properly, so it reads as 2 columns.

The third is a "woman [zwj] woman [zwj] boy [zwj] boy". It's a full 25 bytes of familial goodness, and this module treats it as 7 columns.

The fourth is "man [zwj] heart [zwj] kiss [zwj] man", and comes in at 8 columns.
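For reference, the numbers above can be checked with a minimal sketch like the one below. The `inspectGlyph` helper and the `glyphs` names are hypothetical, made up for illustration; the glyphs are written as escapes so the code points are visible. The reported column counts (2, 7, 8) match the code point counts, and the second glyph is built from a skin tone modifier (U+1F3FD) rather than a ZWJ (U+200D).

```javascript
// The four glyphs from the list, written as escapes so the source stays ASCII.
const glyphs = {
  baby: '\u{1F476}',
  babyWithSkinTone: '\u{1F476}\u{1F3FD}',
  family: '\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}',
  kiss: '\u{1F468}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}',
};

// Hypothetical helper: report code point count and UTF-8 byte length.
function inspectGlyph(str) {
  return {
    codePoints: [...str].length, // string iteration yields whole code points
    utf8Bytes: new TextEncoder().encode(str).length,
  };
}

for (const [name, str] of Object.entries(glyphs)) {
  console.log(name, inspectGlyph(str));
}
```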

Is this problem even solvable? Conceivably, something like "fireman [zwj] cat" could be turned into "fire cat" by Apple or Google or Microsoft tomorrow, and a current 2 column set of code points could become 1.

If not, it seems like maybe it should be called out in the readme as just an impossible thing we can never hope to account for? Another way would be to optimistically treat anything with zero width joiners as single chars, but that might be too optimistic?
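The "optimistic" option could look roughly like the sketch below. This is not this module's code: `optimisticWidth` is a hypothetical name, and it assumes one column per visible code point, that anything joined by a ZWJ collapses into the preceding glyph, and that the variation selector U+FE0F and the skin tone modifiers U+1F3FB to U+1F3FF take no column of their own.

```javascript
const ZWJ = 0x200d;

// Assumption: these code points modify the previous glyph and add no width.
const isZeroWidthModifier = (cp) =>
  cp === 0xfe0f || (cp >= 0x1f3fb && cp <= 0x1f3ff);

// Hypothetical helper: optimistic column count for a string.
function optimisticWidth(str) {
  let width = 0;
  let joinNext = false;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp === ZWJ) {
      // Only join when there is something before the ZWJ to join onto.
      joinNext = width > 0;
      continue;
    }
    if (isZeroWidthModifier(cp)) continue;
    if (!joinNext) width += 1; // a joined code point reuses the previous column
    joinNext = false;
  }
  return width;
}
```

Under these assumptions all four glyphs from the list come out as 1 column, which is exactly the "too optimistic" risk: a terminal that does not combine the sequence will still draw several glyphs.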

@sindresorhus
Owner

I honestly didn't even know about zwj codepoints until a few weeks ago.

Calling @mathiasbynens (Unicode wizard). Do you happen to know if there's any way to do this? Maybe you happen to have a module for it. ;)

@isaacs
Author

isaacs commented Jul 20, 2016

Is it just me, or does the first baby look a lot bigger?

screenshot 2016-07-20 15 19 58

@sindresorhus
Owner

Yup, GitHub are "nice" enough to replace some emojis with web components...

<g-emoji alias="baby" fallback-src="https://assets-cdn.github.com/images/icons/emoji/unicode/1f476.png">👶</g-emoji>

Same thing here: https://github.com/sindresorhus/skin-tone

screen shot 2016-07-21 at 00 41 12

@mathiasbynens

mathiasbynens commented Jul 22, 2016

For inspiration, take a look at how lodash attempts to solve this (see its internal stringToArray function). +@jdalton

> Is this problem even solvable?

This is the right question. There’s no way to detect whether the current environment renders the given set of code points as a single grapheme/glyph/emoji, which is what you really want here.

As for emoji + ZWJ, you could programmatically account for the combinations listed here: http://unicode.org/emoji/charts/emoji-zwj-sequences.html But that list changes over time, doesn’t necessarily reflect the environment your code runs in, and excludes non-emoji uses of ZWJ.
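A list-based approach along those lines might be sketched as follows. The names `KNOWN_ZWJ_SEQUENCES` and `widthWithKnownSequences` are hypothetical, the set holds only two illustrative entries rather than the full Unicode list, and the fallback of one column per code point (ZWJ included) is an assumption that mirrors the behaviour described at the top of this issue.

```javascript
// Illustrative subset only. A real list would be generated from the Unicode
// emoji ZWJ sequence data and would change between Unicode versions.
const KNOWN_ZWJ_SEQUENCES = new Set([
  '\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}', // family: woman, woman, boy, boy
  '\u{1F468}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}', // kiss: man, man
]);

// Hypothetical helper: known sequences count as 1 column, everything else
// falls back to one column per code point.
function widthWithKnownSequences(str, known = KNOWN_ZWJ_SEQUENCES) {
  // Try longer sequences first so a short entry never shadows a longer one.
  const candidates = [...known].sort((a, b) => b.length - a.length);
  let width = 0;
  let i = 0;
  while (i < str.length) {
    const match = candidates.find((seq) => str.startsWith(seq, i));
    if (match) {
      width += 1;
      i += match.length;
    } else {
      // Unknown here: advance by one whole code point and count it.
      const ch = String.fromCodePoint(str.codePointAt(i));
      width += 1;
      i += ch.length;
    }
  }
  return width;
}
```

Passing the set as a parameter lets a caller state which combinations their environment actually renders as one glyph, since the code cannot detect that itself.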

@jdalton

jdalton commented Jul 22, 2016

A link to the stringToArray reference.

Woo!

_.size('👶') // 1
_.size('👶🏽') // 1
_.size('👩‍👩‍👦‍👦') // 1
_.size('👨‍❤️‍💋‍👨') // 1

@isaacs
Author

isaacs commented Jul 22, 2016

@jdalton I think @mathiasbynens hit the nail on the head. Observe:

screenshot 2016-07-22 09 32 34

screenshot 2016-07-22 09 31 57

I'm ok with closing this issue with a doc patch, but it seems like the only way to get the right answer programmatically is an option to specify whether zwj chars should be respected, or perhaps even an option to specify which zwj combinations should be treated as combined.

@jdalton

jdalton commented Jul 22, 2016

To me, if the user specifies something with a zwj, it's their intent to consider it as part of the joined whole regardless of how the device renders the emoji. Our methods like _.truncate respect it too. For example, you wouldn't want the family unit emoji to truncate the siblings, breaking the family apart.
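To illustrate the truncation point, here is a minimal sketch that cuts on grapheme cluster boundaries. It is not lodash's implementation: it uses the standard `Intl.Segmenter` API instead and assumes a runtime that provides it. `truncateGraphemes` is a hypothetical name.

```javascript
// Hypothetical helper: keep at most `maxGraphemes` user-perceived characters,
// never cutting inside a ZWJ sequence.
function truncateGraphemes(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === maxGraphemes) break;
    out += segment; // each segment is one whole grapheme cluster
    count += 1;
  }
  return out;
}

const familyEmoji = '\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}';
console.log(truncateGraphemes('ab' + familyEmoji + 'cd', 3) === 'ab' + familyEmoji);
```

By contrast, a plain `slice` counts UTF-16 code units and can cut the family sequence partway through.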

@mathiasbynens

@jdalton I don’t disagree: for all scenarios in which you’d want to count grapheme clusters you’d want 👩‍👩‍👦‍👦 to count as a single unit. But note that this project is about getting the visual width of a string, which is inherently dependent on the environment it runs in.

@jdalton

jdalton commented Jul 22, 2016

Ya, no worries. I just popped in, after being mentioned, to give a woo & a 🙌 for fancy emojis.

@eamodio

eamodio commented Sep 18, 2017

I know this issue is pretty old, but are there any solutions (or workable hacks) for this?

@Offirmo

Offirmo commented Sep 22, 2017

@eamodio I'm on it, if everything goes well...

@eamodio

eamodio commented Sep 22, 2017

@Offirmo Thanks! FYI, I'm not sure it helps with this, but I hacked together a solution for my use-case here: https://github.com/eamodio/vscode-gitlens/blob/99d6da9c9032e244a3dcaeb6f86ca65eeebfbd8c/src/system/string.ts#L130-L188

@Offirmo

Offirmo commented Sep 24, 2017

@eamodio thanks! I had a look at your implementation and I believe my pending one will be more generic (e.g. emojis don't always take 2 columns). But an interesting read!

@eamodio

eamodio commented Sep 24, 2017

@Offirmo thanks. Definitely looking forwards to a more robust generic solution!

@unjello

unjello commented May 9, 2018

Hi there. I know it's a bit old, but I think ⚠️ and 🛑 are getting me into the same trouble. They're interpreted as 2 columns, although visually they're definitely 1. Any update?

@eight04

eight04 commented May 9, 2018

I had switched to power-assert-util-string-width, which depends on eastasianwidth.

@lvleihere

So, is this done?

@Offirmo

Offirmo commented Jan 30, 2019

I stopped working on it, sorry. 1) It ended up being super complicated. 2) It needed a refactor of this lib, and @sindresorhus wasn't keen on changing the API.

@sindresorhus
Owner

> I had switched to power-assert-util-string-width, which depends on eastasianwidth.

This package now depends on that too.

fisker added a commit to fisker/string-width that referenced this issue Feb 28, 2022
@fisker fisker mentioned this issue Feb 28, 2022
sindresorhus pushed a commit that referenced this issue Feb 28, 2022
9 participants