-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Grapheme_Cluster_Break, Word_Break, and Sentence_Break Unicode properties #1233
Conversation
…ode properties required by UAX 29
The TOML file was obtained from Azure artifact archive from commit 2921a81ee4c67459ff455e31c599e7d7a09086ab titled "ICU-21811 TZ update 2021a (2021e)" on maint/maint-70 branch. This commit imports TrieType::Small flavor of the uprops files.
Some thoughts:
|
components/properties/src/props.rs
Outdated
pub const RegionalIndicator: GraphemeClusterBreak = GraphemeClusterBreak(12); // name="RI" | ||
pub const ZWJ: GraphemeClusterBreak = GraphemeClusterBreak(17); // name="ZWJ" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
13 .. 16 are marked as unused in the spec, and no v=13
.. v=16
in the TOML, so I left them undefined. Should I add them for completeness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. ICU of course has constants for these values because at some point some characters had them, the constants are stable API, and we also have API that lets people map between value names and enum constant values.
You don't need to support data from old Unicode versions, so maybe you don't need now-unused values.
However, another question is how \p{gcb=E_Base}
would be supported. It should yield an empty set rather than an unknown-value error. Is it easier to do so if unused values are defined?
So... my hunch is that it's cleaner / more user-friendly to define all values, even ones that currently aren't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is this set of 18 values stable, or will it grow over time?
If the set is stable, I think we should use a numeric enum rather than a newtype. (If the set will grow, I think the newtype is correct.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... my hunch is that it's cleaner / more user-friendly to define all values, even ones that currently aren't used.
Makes sense. I add them to avoid the confusion of the gaps in the enum values.
Question: is this set of 18 values stable, or will it grow over time?
From the spec's evolution and ICUC's doc UGraphemeClusterBreak
, the set of value can grow overtime. For example, ZWJ
and now unused value 13 .. 16 are introduced in Unicode 9, so it's hard to predict we won't ever introduce a new value.
We probably want all break properties, include LineBreak
to be added in the future, in one format so that segmenter can access them uniformly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most enumerated properties can get new values added in future versions of Unicode. The ...Break properties frequently do.
components/properties/src/props.rs
Outdated
pub const DoubleQuote: WordBreak = WordBreak(16); // name=DQ | ||
pub const ZWJ: WordBreak = WordBreak(21); // name="ZWJ" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same. 17.. 20 are obsoleted and used in the spec https://www.unicode.org/reports/tr29/#Default_Word_Boundaries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as above w.r.t. stability (also below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my reply in grapheme cluster breaker. Also added the obsolete values.
@@ -27,3 +27,30 @@ impl TrieValue for Script { | |||
u16::try_from(i).map(Script) | |||
} | |||
} | |||
|
|||
impl TrieValue for GraphemeClusterBreak { | |||
const DATA_GET_ERROR_VALUE: GraphemeClusterBreak = GraphemeClusterBreak::Other; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This DATA_GET_ERROR_VALUE
is a "last resort fallback" if the data is malformed and we can't get the error value out of it. We already have a ticket #1183 to discuss whether this is the best approach.
components/properties/src/maps.rs
Outdated
/// let root_dir = icu_testdata::paths::uprops_toml_root(); | ||
/// let provider = EnumeratedPropertyCodePointTrieProvider::new(root_dir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, here and elsewhere: as of Friday, you can (and should) replace this with icu_testdata::get_provider()
, which will work after you update your branch and run cargo make testdata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added necessary glue code to run cargo make testdata
, and regenerated the data.
components/properties/Cargo.toml
Outdated
@@ -42,6 +42,8 @@ zerovec = { version = "0.4", path = "../../utils/zerovec", features = ["serde"] | |||
|
|||
[dev-dependencies] | |||
icu = { path = "../../components/icu", default-features = false } | |||
icu_testdata = { version = "0.3", path = "../../provider/testdata" } | |||
icu_provider_uprops = { version = "0.3", path = "../../provider/uprops" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: As of Friday, you can (and should) remove this dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
The json and postcard files are generated via `cargo make testdata`.
…compatible with ICU. These are obsolete values, so the generated testdata shouldn't change. Running `cargo make testdata` won't change the testdata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice job!
@markusicu @sffc Thank you for the review. |
…ode properties (unicode-org#1233) The obsolete enum values in GraphemeClusterBreak and WordBreak are added to retain the compatibility with ICU. The TOML file was obtained from Azure artifact archive built on unicode-org/icu, commit 2921a81ee4c67459ff455e31c599e7d7a09086ab titled "ICU-21811 TZ update 2021a (2021e)" on maint/maint-70 branch. This commit imports TrieType::Small flavor of the uprops files. The json and postcard files are generated via `cargo make testdata`.
Fixed #1214.