Update to Unicode 9.0.0 #10

mbrubeck · 2016-12-13T00:17:15Z

This is a work-in-progress update to Unicode 9.0.0. I'm submitting it here in case someone else wants to work on it, because I'm not sure whether I'll have time to finish it soon.

The grapheme cluster changes are all implemented and the tests are all passing, but there are some new inefficiencies in reverse iteration over grapheme clusters. See the "TODO" comments for details.

The word boundary changes are mostly not implemented yet, so some word boundary tests are failing.

Manishearth · 2016-12-20T02:04:49Z

I tried an optimization for the regional indicators in Manishearth@2297574. Does that make sense, or did you mean something else?

The state doesn't persist when you jump between forward and reverse iteration because that sounds like a rare use case and I don't want to add extra overhead to the forward iteration case.

Manishearth · 2016-12-20T22:02:02Z

Added support for the RI stuff too. The tests don't seem to handle flags at all, so this was completely missed, which isn't good. Can we contribute some tests? A simple one: 🇦🇫🇦🇽🇦🇱🇩🇿🇦🇸🇦🇩🇦🇴 should decompose into individual flags when word-sliced.
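For illustration, the pairing behaviour that such a test would check can be sketched as below. This is a minimal stand-alone sketch, not the crate's implementation: `is_regional_indicator` and `split_flags` are hypothetical helpers, and the input is assumed to consist only of Regional Indicator characters.

```rust
/// Returns true if `c` is a Regional Indicator symbol (U+1F1E6..=U+1F1FF).
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Splits a run of Regional Indicators into flag-sized pieces by breaking
/// only after every second indicator. Hypothetical helper for illustration;
/// it panics on any other kind of character.
fn split_flags(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0; // byte offset where the current flag begins
    let mut count = 0; // indicators consumed in the current flag
    for (i, c) in s.char_indices() {
        assert!(is_regional_indicator(c), "sketch handles RI-only input");
        if count == 2 {
            // Two indicators already collected: close the flag here.
            out.push(&s[start..i]);
            start = i;
            count = 0;
        }
        count += 1;
    }
    if start < s.len() {
        // Trailing flag, or a lone unpaired indicator.
        out.push(&s[start..]);
    }
    out
}
```

With the seven-flag string from the comment above, `split_flags` would return seven two-character slices; an odd trailing indicator ends up as its own piece.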

Manishearth · 2016-12-20T22:17:54Z

Yeah, none of their tests contain more than two regional indicators :|

Manishearth · 2016-12-21T00:39:29Z

All tests pass!

On a serious note, the official tests are lacking, so this should be reviewed closely. I added some tests for the missing cases which I found, but there are more things I may have missed.

r? @mbrubeck @SimonSapin

mbrubeck · 2016-12-21T05:12:39Z

src/grapheme.rs

+                // rule GB9b: include any preceding Prepend characters
+                for (i, c) in self.string[..idx].char_indices().rev() {
+                    // TODO: Cache this to avoid repeated lookups in the common case.
+                    match gr::grapheme_category(c) {


This grapheme_category call still seems like it will be doing duplicate work. The caching I had in mind was saving the result of this call from the last pass through this loop, so that grapheme_category doesn't run multiple times on the same char.

mbrubeck · 2016-12-21T05:29:17Z

src/grapheme.rs

+                for (i, c) in self.string[..idx].char_indices().rev() {
+                    // TODO: Cache this to avoid repeated lookups in the common case.
+                    match gr::grapheme_category(c) {
+                        gr::GC_Prepend => idx = i,


Hmm, this needs to either invalidate or update self.catb.

This also means we are missing a test case for this code, specifically for the case where multiple adjacent Prepend characters come between two "normal" characters.
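A rough sketch of the scan and of the missing test case, under stated assumptions: `Cat`, `category`, and `include_prepend` are hypothetical stand-ins rather than the crate's types, and U+0600..=U+0605 are simply treated as Prepend here. The function hands back the category of the character that stopped the scan, so a caller could reuse it instead of looking that character up again.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Cat {
    Prepend,
    Other,
}

/// Hypothetical stand-in for a category table lookup.
/// Treats U+0600..=U+0605 as Prepend, everything else as Other.
fn category(c: char) -> Cat {
    if ('\u{0600}'..='\u{0605}').contains(&c) {
        Cat::Prepend
    } else {
        Cat::Other
    }
}

/// Walks backwards from byte offset `idx` and pulls any run of Prepend
/// characters into the cluster starting at `idx` (rule GB9b).
/// Returns the new start offset and the category of the character that
/// ended the scan, if any, so that it can be cached by the caller.
fn include_prepend(s: &str, mut idx: usize) -> (usize, Option<Cat>) {
    let mut stopper = None;
    for (i, c) in s[..idx].char_indices().rev() {
        match category(c) {
            Cat::Prepend => idx = i,
            other => {
                stopper = Some(other);
                break;
            }
        }
    }
    (idx, stopper)
}
```

For `"a\u{0600}\u{0601}b"` with `idx` at the `b`, the scan moves the start back over both Prepend characters and reports `Other` for the `a`.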

mbrubeck · 2016-12-21T05:41:17Z

src/word.rs

@@ -80,9 +80,9 @@ enum UWordBoundsState {
    Numeric,
    Katakana,
    ExtendNumLet,
-    Regional,
+    Regional(/* half */ bool),


Do you think it would be more readable to split these into separate enum variants, { Regional, RegionalHalf, Zwj, ZwjTainted }?
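As a sketch of what the split-variant shape could look like for the regional-indicator part only: the `State` enum and both functions below are hypothetical, and the meaning of "half" (an odd number of indicators seen so far) is an assumption, not something the diff confirms.

```rust
/// Assumed meaning of the two regional-indicator states:
/// `RegionalHalf` = an odd number of RIs seen so far (half a flag),
/// `Regional` = an even, non-zero number (a complete flag).
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Start,
    RegionalHalf,
    Regional,
}

/// State after consuming one more Regional Indicator.
fn on_regional_indicator(state: State) -> State {
    match state {
        State::RegionalHalf => State::Regional,
        State::Start | State::Regional => State::RegionalHalf,
    }
}

/// Whether a boundary falls before the RI that is about to be consumed.
/// Only a half flag glues to the next indicator in this reduced model.
fn breaks_before_ri(state: State) -> bool {
    state != State::RegionalHalf
}
```

With separate variants the `match` arms name each case directly, instead of destructuring a `bool` whose meaning lives in a comment.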

mbrubeck · 2016-12-21T05:46:12Z

Added a fix and test for the Prepend caching, and some stylistic changes.

Manishearth · 2016-12-21T06:38:28Z

well, r=me on your code, I went through that pretty thoroughly when figuring out what was missing

kwantam · 2016-12-21T06:49:39Z

Hey @Manishearth, @mbrubeck:

I know I've been entirely an absentee on this, so first: thank you both for the flurry of work here!

I have full faith in both of you and your efforts here, but I don't want to (appear to) entirely shirk my responsibility (though I will admit that I may be well past avoiding that appearance...), so: would I offend either of you if I took a very quick look over this tomorrow before merging?

Once again: thank you!!!

Manishearth · 2016-12-21T06:50:44Z

Sounds fine by me!

Like I said, the official tests are incomplete (especially the word break ones) so it would be good to have more review.

cmyr · 2016-12-21T20:54:00Z

hi all,

I've run this branch against the tests mentioned in #8 and my single ZWJ case is still failing. I would be happy to help write more test cases around RI symbols and emoji ZWJ stuff if it would be helpful?

Manishearth · 2016-12-21T21:05:37Z

Ugh, WB4 raises its ugly head again.

Manishearth · 2016-12-21T22:05:15Z

Fixed. The spec is a bit unclear on the exact greediness of WB4. In particular, it's not clear whether sequences like Any Format Format ZWJ Extend EBG should be treated as Any(..)÷EBG (and thus broken after the Extend), or first treated as Any Format Format ZWJ(..)×EBG, which becomes Any(..)ZWJ(..)×EBG and then Any(..)×EBG.

In the current implementation WB4 is not greedy and will let WB3c create any no_boundaries it wants first, so complicated Format/Extend sequences containing ZWJ are equivalent to ZWJ in the context of WB3c.

kwantam · 2016-12-21T22:20:26Z

Is it worth checking whether other implementations make the same decision regarding greediness? Who else has this implemented? Perhaps libiconv?

Manishearth · 2016-12-21T22:28:30Z

So.. I realized that we also fail https://github.com/cmyr/rust-unicode-segmentation-tests/blob/master/src/lib.rs#L18, because we treat the space as its own word.

And the spec says nothing about spaces. Spaces are a regular Any character. Logically, it makes sense to drop spaces, but the tests contain these two cases:

÷ 0020 × 200D ÷ 0646 ÷	#  ÷ [0.2] SPACE (Other) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) ÷ [999.0] ARABIC LETTER NOON (ALetter) ÷ [0.3]
÷ 0646 × 200D ÷ 0020 ÷	#  ÷ [0.2] ARABIC LETTER NOON (ALetter) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) ÷ [999.0] SPACE (Other) ÷ [0.3]

which means that spaces still need to join with ZWJ, so the rules aren't that simple.
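To make the two test lines easier to read: `÷` marks a boundary, `×` marks a position with no boundary, and everything after `#` is a comment. A minimal sketch of turning such a line into an input string plus expected segments is below; `parse_break_test` is a hypothetical helper, not the crate's test generator.

```rust
/// Parses one line in the style quoted above, for example
/// "÷ 0020 × 200D ÷ 0646 ÷  # comment", into the input string and the
/// segments it is expected to split into. Returns None on malformed input.
fn parse_break_test(line: &str) -> Option<(String, Vec<String>)> {
    // Everything after '#' is a human-readable comment.
    let data = line.split('#').next()?.trim();
    let mut input = String::new();
    let mut segments = Vec::new();
    let mut current = String::new();
    for token in data.split_whitespace() {
        match token {
            // Boundary: close the segment collected so far.
            "÷" => {
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
            // No boundary: keep extending the current segment.
            "×" => {}
            // Anything else must be a hexadecimal code point.
            hex => {
                let c = char::from_u32(u32::from_str_radix(hex, 16).ok()?)?;
                input.push(c);
                current.push(c);
            }
        }
    }
    // A well-formed line ends with "÷", so nothing may be left over.
    if !current.is_empty() {
        return None;
    }
    Some((input, segments))
}
```

For the first line above this yields the input SPACE, ZWJ, NOON and the expected segments SPACE+ZWJ and NOON, which is the "spaces still join with ZWJ" behaviour described in the comment.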

I'm pushing up the tests.

Manishearth · 2016-12-21T22:30:02Z

I plan to open a dialogue with the spec authors about:

Contributing back tests
Greediness of WB4
Spaces
Adding better grapheme break rules for Indic consonant clusters

so I'd prefer to merge this based on the current spec, hash out clarifications, and later fix any issues.

mbrubeck · 2016-12-21T22:34:01Z

> In the current implementation WB4 is not greedy and will let WB3c create any no_boundaries it wants first

The rules are specified as applying in order, so this is correct: WB3c should match before WB4 is applied.

Manishearth · 2016-12-21T22:38:36Z

No, that's the point, that's not clear at all. WB3c does not apply to Any Format Format ZWJ Extend EBG. So you move on to WB4.

WB4 can apply to that string in multiple ways. It can collapse it into Any(...)EBG (ellipses mean collapsed format/extends as per WB4), Any Format Format ZWJ(...)EBG, or something else.

The problem is that WB4 makes this algorithm loop in on itself. Suddenly, the greediness of WB4 matters because it affects WB3c in the next iteration after an equivalence has been collapsed.

Manishearth · 2016-12-21T22:47:17Z

Okay, the spaces thing isn't an issue -- I confused "Word boundaries" and "extracted words" in http://www.unicode.org/reports/tr29/proposed.html#Word_Boundaries

We probably should expose an API for the latter if we don't already. So the tests as-committed are fine.

Manishearth · 2016-12-21T23:42:18Z

(Sent the email, focused on WB4. The spaces thing isn't a problem, and I'll discuss the Indic thing on a different list.)

kwantam · 2016-12-22T04:43:22Z

Question: is this ready to be merged or are we waiting on answers to questions?

Manishearth · 2016-12-22T04:45:01Z

Ready to be merged.

kwantam · 2016-12-22T04:51:27Z

src/test.rs

+        ("\u{1f468}\u{200d}\u{1f468}\u{200d}\u{1f466}",  &["\u{1f468}\u{200d}\u{1f468}\u{200d}\u{1f466}"]),
+        ("😌👎🏼",  &["😌", "👎🏼"]),
+        // perhaps wrong, spaces should not be included?
+        ("hello world", &["hello", " ", "world"]),


Will the answer to @Manishearth's question to the Unicode folks answer this question? If not, how can I help us resolve this question?

No, this is independent of the spec. The spec tells us that the boundaries in hello world are hello| |world. We implement that. However, when we think of words we ignore spaces, so a "word iterator" would give hello|world. The question is what API we should expose.

It's probably better to have a notion of word iterator built on top of the regular word boundary iterator. This can be handled separately.
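A minimal sketch of that layering, assuming the boundary iterator's output is already available as a slice of segments. `words_only` is a hypothetical helper, and the "contains at least one alphanumeric character" criterion is an assumption about what should count as a word.

```rust
/// Keeps only the segments that look like words, here taken to mean
/// "contains at least one alphanumeric character" (an assumed criterion).
/// `segments` stands in for the output of a word-boundary iterator.
fn words_only<'a>(segments: &[&'a str]) -> Vec<&'a str> {
    segments
        .iter()
        .copied()
        .filter(|seg| seg.chars().any(char::is_alphanumeric))
        .collect()
}
```

Given the boundary segments `["hello", " ", "world"]` from the test above, this keeps `"hello"` and `"world"` and drops the space, while the boundary iterator itself stays spec-conformant.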

kwantam · 2016-12-22T05:06:35Z

@Manishearth just the one question above.

Also: thoughts on version numbering? The Rust API has no breaking change, but we might stretch our notion of API to include "Unicode version," in which case it would arguably be reasonable to increment the major number.

Finally: thoughts on squashing vs. keeping separate commits?

(@mbrubeck please chime in too, of course)

Manishearth · 2016-12-22T05:10:17Z

Semver is not just about compile-breaking changes; semantic breaking changes can be counted too. I would do a major version bump. Keeping separate commits is fine IMO.

kwantam · 2016-12-22T05:23:03Z

@mbrubeck @Manishearth @cmyr thanks for your hard work on this 👍

cmyr · 2016-12-22T15:24:01Z

thanks @Manishearth @kwantam @mbrubeck, I'll play around with this today and see if I can find any more weird ZWJ behaviour.

Manishearth · 2016-12-23T03:21:29Z

So my interpretation of WB4 was wrong. We have to apply it in order of precedence. Thus, we cannot transform the input before attempting to apply WB3 (and the previous rules). Only then do we transform, and at that point the previous rules can no longer be applied again.
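A reduced sketch of that ordering, following the reading stated in this comment: WB3c is tested on the raw neighbours first, and only afterwards does WB4 suppress breaks before Extend, Format, and ZWJ. The `Wb` enum and `is_boundary` are hypothetical, only WB3c, WB4, and WB999 are modelled, and Glue_After_Zwj plus the start-of-text and newline exceptions are left out.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Wb {
    Any,
    Extend,
    Format,
    Zwj,
    Ebg,
}

/// Is there a word boundary between cats[i - 1] and cats[i]?
/// Requires 0 < i < cats.len(). Rules are tried strictly in order.
fn is_boundary(cats: &[Wb], i: usize) -> bool {
    let (prev, next) = (cats[i - 1], cats[i]);
    // WB3c: ZWJ × EBG, tested on the untransformed input.
    if prev == Wb::Zwj && next == Wb::Ebg {
        return false;
    }
    // WB4: Extend, Format and ZWJ attach to whatever precedes them.
    if matches!(next, Wb::Extend | Wb::Format | Wb::Zwj) {
        return false;
    }
    // WB999: otherwise break. WB3c is not retried after the WB4 collapse.
    true
}
```

Under this reading, Any Format Format ZWJ Extend EBG breaks before the EBG, because the character immediately before it is Extend rather than ZWJ, while a bare ZWJ EBG pair stays joined.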

kwantam · 2016-12-23T22:58:02Z

Clarification: does that mean the current functionality is wrong? If so I can open an issue.

Manishearth · 2016-12-23T22:59:43Z

Yes, it does. Fixing this can get complicated.

kwantam · 2016-12-23T23:03:20Z

Thanks. Opened #11 for this. I'll try to find some time to think about it.

mbrubeck added 6 commits December 12, 2016 16:07

Remove unnecessary and unstable SliceExt import

5397fa4

Update data to Unicode 9.0.0

bcfd39c

Implement new rules GB9/GB11 for zero-width joiner (ZWJ)

74ea683

Implement rule GB10 (Emoji modifier sequences)

5568777

Implement new GB12/GB13 rules for RI sequences

e3754bc

Implement new rule GB9b (Do not break after Prepend)

bc121b5

mbrubeck mentioned this pull request Dec 13, 2016

problems with emoji word/grapheme segmentation #8

Closed

mbrubeck force-pushed the unicode-9-wip branch from 5d0a61d to 624aba4 Compare December 13, 2016 17:01

Manishearth force-pushed the unicode-9-wip branch 4 times, most recently from 49f312b to 4849b31 Compare December 21, 2016 00:29

mbrubeck commented Dec 21, 2016

Manishearth and others added 11 commits December 20, 2016 22:34

Cache flag indicators during reverse iteration

7c320b5

Cache prepend lookbehinds

db6e78f

Handle ZWJ in rules WB3c and WB4

f9f7076

WB14 is renamed to WB999 in Unicode 9.0.0

675f347

Implement new rule WB14 (Emoji modifier sequences)

c80e5a3

Fix forward word boundary iteration

858d594

Support flags in forward iteration

f3ea31d

Reverse word iteration -- get regional indicators working

605dc7d

Fix Emoji state in reverse iteration for words

41b11e6

Fix same emoji bug for forward iteration. All tests pass.

6ff3993

Add some extra tests

8e06fc9

mbrubeck force-pushed the unicode-9-wip branch from 38aa317 to 8e06fc9 Compare December 21, 2016 06:35

mbrubeck changed the title ~~[DO NOT MERGE] Update to Unicode 9.0.0~~ Update to Unicode 9.0.0 Dec 21, 2016

Fix precedence of WB3c and ZWJ handling

731d346

Add @cmyr's tests

8bac7c7

Manishearth force-pushed the unicode-9-wip branch from e318038 to 8bac7c7 Compare December 21, 2016 22:28

kwantam reviewed Dec 22, 2016

kwantam merged commit 8bac7c7 into unicode-rs:master Dec 22, 2016

Manishearth deleted the unicode-9-wip branch December 22, 2016 05:32
