Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Grapheme_Cluster_Break, Word_Break, and Sentence_Break Unicode properties #1233

Merged
merged 7 commits into from
Nov 2, 2021

Conversation

aethanyc
Copy link
Contributor

Fixed #1214.

The TOML file was obtained from Azure artifact archive from commit
2921a81ee4c67459ff455e31c599e7d7a09086ab titled "ICU-21811 TZ update
2021a (2021e)" on maint/maint-70 branch.

This commit imports TrieType::Small flavor of the uprops files.
@aethanyc aethanyc requested review from echeran and sffc October 28, 2021 19:05
@aethanyc aethanyc requested a review from a team as a code owner October 28, 2021 19:05
@aethanyc
Copy link
Contributor Author

Some thoughts:

  1. This PR is ready for review, but it will conflict with Connect properties provider to the icu4x_datagen exporter tool #1204 and need some tweak once 1204 merged.
  2. To test codepointtrie getter API such as get_word_break, I need to import TOML files.
  3. We need another PR to connect these properties to icu4x_datagen to convert the TOML files to a binary format to serve them in a production-ready segmenter.

Comment on lines 452 to 453
pub const RegionalIndicator: GraphemeClusterBreak = GraphemeClusterBreak(12); // name="RI"
pub const ZWJ: GraphemeClusterBreak = GraphemeClusterBreak(17); // name="ZWJ"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 .. 16 are marked as unused in the spec, and no v=13 .. v=16 in the TOML, so I left them undefined. Should I add them for completeness?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. ICU of course has constants for these values because at some point some characters had them, the constants are stable API, and we also have API that lets people map between value names and enum constant values.

You don't need to support data from old Unicode versions, so maybe you don't need now-unused values.

However, another question is how \p{gcb=E_Base} would be supported. It should yield an empty set rather than an unknown-value error. Is it easier to do so if unused values are defined?

So... my hunch is that it's cleaner / more user-friendly to define all values, even ones that currently aren't used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is this set of 18 values stable, or will it grow over time?

If the set is stable, I think we should use a numeric enum rather than a newtype. (If the set will grow, I think the newtype is correct.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... my hunch is that it's cleaner / more user-friendly to define all values, even ones that currently aren't used.

Makes sense. I add them to avoid the confusion of the gaps in the enum values.

Question: is this set of 18 values stable, or will it grow over time?

From the spec's evolution and ICUC's doc UGraphemeClusterBreak, the set of value can grow overtime. For example, ZWJ and now unused value 13 .. 16 are introduced in Unicode 9, so it's hard to predict we won't ever introduce a new value.

We probably want all break properties, include LineBreak to be added in the future, in one format so that segmenter can access them uniformly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most enumerated properties can get new values added in future versions of Unicode. The ...Break properties frequently do.

Comment on lines 486 to 487
pub const DoubleQuote: WordBreak = WordBreak(16); // name=DQ
pub const ZWJ: WordBreak = WordBreak(21); // name="ZWJ"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. 17.. 20 are obsoleted and used in the spec https://www.unicode.org/reports/tr29/#Default_Word_Boundaries

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above w.r.t. stability (also below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my reply in grapheme cluster breaker. Also added the obsolete values.

@@ -27,3 +27,30 @@ impl TrieValue for Script {
u16::try_from(i).map(Script)
}
}

impl TrieValue for GraphemeClusterBreak {
const DATA_GET_ERROR_VALUE: GraphemeClusterBreak = GraphemeClusterBreak::Other;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I know this is pre-existing, just sayin' --

The trie itself stores an error value. It would be nice to fetch that (and map it to the return type) rather than having to hardcode it for each type.
@echeran @sffc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This DATA_GET_ERROR_VALUE is a "last resort fallback" if the data is malformed and we can't get the error value out of it. We already have a ticket #1183 to discuss whether this is the best approach.

Comment on lines 74 to 75
/// let root_dir = icu_testdata::paths::uprops_toml_root();
/// let provider = EnumeratedPropertyCodePointTrieProvider::new(root_dir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, here and elsewhere: as of Friday, you can (and should) replace this with icu_testdata::get_provider(), which will work after you update your branch and run cargo make testdata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added necessary glue code to run cargo make testdata, and regenerated the data.

@@ -42,6 +42,8 @@ zerovec = { version = "0.4", path = "../../utils/zerovec", features = ["serde"]

[dev-dependencies]
icu = { path = "../../components/icu", default-features = false }
icu_testdata = { version = "0.3", path = "../../provider/testdata" }
icu_provider_uprops = { version = "0.3", path = "../../provider/uprops" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: As of Friday, you can (and should) remove this dependency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

components/properties/src/props.rs Show resolved Hide resolved
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice job!

@aethanyc
Copy link
Contributor Author

aethanyc commented Nov 2, 2021

@markusicu @sffc Thank you for the review.

@aethanyc aethanyc merged commit 0c855b1 into unicode-org:main Nov 2, 2021
@aethanyc aethanyc deleted the uax29-props branch November 2, 2021 17:02
hsivonen pushed a commit to hsivonen/icu4x that referenced this pull request Nov 11, 2021
…ode properties (unicode-org#1233)

The obsolete enum values in GraphemeClusterBreak and WordBreak are added
to retain the compatibility with ICU.

The TOML file was obtained from Azure artifact archive built on
unicode-org/icu, commit 2921a81ee4c67459ff455e31c599e7d7a09086ab titled
"ICU-21811 TZ update 2021a (2021e)" on maint/maint-70 branch. This
commit imports TrieType::Small flavor of the uprops files.

The json and postcard files are generated via `cargo make testdata`.
@echeran echeran mentioned this pull request Mar 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Unicode properties required by UAX 29
3 participants