Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to Unicode 9.0.0 #10

Merged
merged 19 commits into from
Dec 22, 2016
Merged

Update to Unicode 9.0.0 #10

merged 19 commits into from
Dec 22, 2016

Conversation

mbrubeck
Copy link
Contributor

This is a work-in-progress update to Unicode 9.0.0. I'm submitting it here in case someone else wants to work on it, because I'm not sure whether I'll have time to finish it soon.

The grapheme cluster changes are all implemented and the tests are all passing, but there are some new inefficiencies in reverse iteration over grapheme clusters. See the "TODO" comments for details.

The word boundary changes are mostly not implemented yet, so some word boundary tests are failing.

@Manishearth
Copy link
Member

Manishearth commented Dec 20, 2016

I tried an optimization for the regional indicators in Manishearth@2297574 . Does that make sense, or did you mean something else?

The state doesn't persist when you jump between forward and reverse iteration because that sounds like a rare use case and I don't want to add extra overhead to the forward iteration case.

@Manishearth
Copy link
Member

Added support for the RI stuff too. The tests don't seem to handle flags at all? This was completely missed by the tests, which isn't good. Can we contribute some tests? Simple test is that 🇦🇫🇦🇽🇦🇱🇩🇿🇦🇸🇦🇩🇦🇴 should decompose into individual flags when word-sliced.

@Manishearth
Copy link
Member

Yeah, none of their tests contain more than two regional indicators :|

@Manishearth Manishearth force-pushed the unicode-9-wip branch 4 times, most recently from 49f312b to 4849b31 Compare December 21, 2016 00:29
@Manishearth
Copy link
Member

All tests pass, :shipit:!

On a serious note, the official tests are lacking, so this should be reviewed closely. I added some tests for the missing cases which I found, but there are more things I may have missed.

r? @mbrubeck @SimonSapin

// rule GB9b: include any preceding Prepend characters
for (i, c) in self.string[..idx].char_indices().rev() {
// TODO: Cache this to avoid repeated lookups in the common case.
match gr::grapheme_category(c) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This grapheme_category call still seems like it will be doing duplicate work. The caching I was thinking of adding was in saving the result of this call on the last time through this loop, so that grapheme_category doesn't run multiple times on the same char.

for (i, c) in self.string[..idx].char_indices().rev() {
// TODO: Cache this to avoid repeated lookups in the common case.
match gr::grapheme_category(c) {
gr::GC_Prepend => idx = i,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this needs to either invalidate or update self.catb.

This also means we are missing a test case for this code, specifically for the case where multiple adjacent Prepend characters come between two "normal" characters.

@@ -80,9 +80,9 @@ enum UWordBoundsState {
Numeric,
Katakana,
ExtendNumLet,
Regional,
Regional(/* half */ bool),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be more readable to split these into separate enum variants, { Regional, RegionalHalf, Zwj, ZwjTainted }?

@mbrubeck
Copy link
Contributor Author

Added a fix and test for the Prepend caching, and some stylistic changes.

@mbrubeck mbrubeck changed the title [DO NOT MERGE] Update to Unicode 9.0.0 Update to Unicode 9.0.0 Dec 21, 2016
@Manishearth
Copy link
Member

well, r=me on your code, I went through that pretty thoroughly when figuring out what was missing

@kwantam
Copy link
Member

kwantam commented Dec 21, 2016

Hey @Manishearth, @mbrubeck:

I know I've been entirely an absentee on this, so first: thank you both for the flurry of work here!

I have full faith in both of you and your efforts here, but I don't want to (appear to) entirely shirk my responsibility (though I will admit that I may be well past avoiding that appearance...), so: would I offend either of you if I took a very quick look over this tomorrow before merging?

Once again: thank you!!!

@Manishearth
Copy link
Member

Sounds fine by me!

Like I said, the official tests are incomplete (especially the word break ones) so it would be good to have more review.

@cmyr
Copy link

cmyr commented Dec 21, 2016

hi all,

I've run this branch against the tests mentioned in #8 and my single ZWJ case is still failing. I would be happy to help write more test cases around RI symbols and emoji ZWJ stuff if it would be helpful?

@Manishearth
Copy link
Member

Ugh, WB4 raises its ugly head again.

@Manishearth
Copy link
Member

Fixed. The spec is a bit unclear on the exact greediness of WB4. In particular, it's not clear whether or not sequences like Any Format Format ZWJ Extend EBG should be treated as Any(..)÷EBG (and thus broken after the Extend, or first treated as Any Format Format ZWJ(..)×EBG which becomes Any(..)ZWJ(..)×EBG, becoming Any(..)×EBG.

In the current implementation WB4 is not greedy and will let WB3c create any no_boundaries it wants first, so complicated format/extend sequences containing zwj are equivalent to ZWJ in the context of WB3c

@kwantam
Copy link
Member

kwantam commented Dec 21, 2016

Is it worth checking whether other implementations make the same decision regarding greediness? Who else has this implemented? Perhaps libiconv?

@Manishearth
Copy link
Member

Manishearth commented Dec 21, 2016

So.. I realized that we also fail https://github.com/cmyr/rust-unicode-segmentation-tests/blob/master/src/lib.rs#L18, because we treat the space as its own word.

And the spec says nothing about spaces. Spaces are a regular Any character. Logically, it makes sense to drop spaces, but the tests contain these two cases:

÷ 0020 × 200D ÷ 0646 ÷	#  ÷ [0.2] SPACE (Other) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) ÷ [999.0] ARABIC LETTER NOON (ALetter) ÷ [0.3]
÷ 0646 × 200D ÷ 0020 ÷	#  ÷ [0.2] ARABIC LETTER NOON (ALetter) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) ÷ [999.0] SPACE (Other) ÷ [0.3]

which means that spaces still need to join with ZWJ, so the rules aren't that simple.

I'm pushing up the tests.

@Manishearth
Copy link
Member

I plan to open a dialogue with the spec authors about:

  • Contributing back tests
  • Greediness of WB4
  • Spaces
  • Adding better grapheme break rules for Indic consonant clusters

so I'd prefer to merge this based on the current spec, hash out clarifications, and later fix any issues.

@mbrubeck
Copy link
Contributor Author

In the current implementation WB4 is not greedy and will let WB3c create any no_boundaries it wants first

The rules are specified as applying in order, so this is correct: WB3c should match before WB4 is applied.

@Manishearth
Copy link
Member

No, that's the point, that's not clear at all. WB3c does not apply to Any Format Format ZWJ Extend EBG. So you move on to WB4.

WB4 can apply to that string in multiple ways. It can collapse it into Any(...)EBG (ellipses mean collapsed format/extends as per WB4), Any Format Format ZWJ(...)EBG, or something else.

The problem is that WB4 makes this algorithm loop in on itself. Suddenly, the greediness of WB4 matters because it affects WB3c in the next iteration after an equivalence has been collapsed.

@Manishearth
Copy link
Member

Okay, the spaces thing isn't an issue -- I confused "Word boundaries" and "extracted words" in http://www.unicode.org/reports/tr29/proposed.html#Word_Boundaries

We probably should expose an API for the latter if we don't already. So the tests as-committed are fine.

@Manishearth
Copy link
Member

Manishearth commented Dec 21, 2016

(Sent the email, focused on WB4. The spaces thing isn't a problem, and I'll discuss the indic thing on a different list)

@kwantam
Copy link
Member

kwantam commented Dec 22, 2016

Question: is this ready to be merged or are we waiting on answers to questions?

@Manishearth
Copy link
Member

Ready to be merged.

("\u{1f468}\u{200d}\u{1f468}\u{200d}\u{1f466}", &["\u{1f468}\u{200d}\u{1f468}\u{200d}\u{1f466}"]),
("😌👎🏼", &["😌", "👎🏼"]),
// perhaps wrong, spaces should not be included?
("hello world", &["hello", " ", "world"]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the answer to @Manishearth's question to the Unicode folks answer this question? If not, how can I help us resolve this question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is independent of the spec. The spec tells us that the boundaries in hello world are hello| | world We implement that. However, when we think of words we ignore spaces, so a "word iterator" would give hello|world. The question is what API we should expose.

It's probably better to have a notion of word iterator built on top of the regular word boundary iterator. This can be handled separately.

@kwantam
Copy link
Member

kwantam commented Dec 22, 2016

@Manishearth just the one question above.

Also: thoughts on version numbering? The Rust API has no breaking change, but we might stretch our notion of API to include "Unicode version," in which case it would arguably be reasonable to increment the major number.

Finally: thoughts on squashing vs. keeping separate commits?

(@mbrubeck please chime in too, of course)

@Manishearth
Copy link
Member

semver is not just about compile-breaking changes, semantic breaking changes can be counted too. I would do a major version bump. Keeping separate commits is fine IMO.

@kwantam kwantam merged commit 8bac7c7 into unicode-rs:master Dec 22, 2016
@kwantam
Copy link
Member

kwantam commented Dec 22, 2016

@mbrubeck @Manishearth @cmyr thanks for your hard work on this 👍

@Manishearth Manishearth deleted the unicode-9-wip branch December 22, 2016 05:32
@cmyr
Copy link

cmyr commented Dec 22, 2016

thanks @Manishearth @kwantam @mbrubeck, I'll play around with this today and see if I can find any more weird ZWJ behaviour.

@Manishearth
Copy link
Member

So my interpretation of WB4 was wrong. We have to apply it in order of precedence. Thus, we cannot transform the input before attempting to apply WB3 (and the previous rules). Only then do we transform, and at that point the previous rules can no longer be applied again.

@kwantam
Copy link
Member

kwantam commented Dec 23, 2016

Clarification: does that mean the current functionality is wrong? If so I can open an issue.

@Manishearth
Copy link
Member

Yes, it does. Fixing this can get complicated.

@kwantam
Copy link
Member

kwantam commented Dec 23, 2016

Thanks. Opened #11 for this. I'll try to find some time to think about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants