Rename enums for General_Category #1355
…eralCategoryGroup)
…into GeneralCategory and GeneralCategoryGroup
LGTM with one naming suggestion. I'd like @iainireland to review the "Bifurcate gc property values" commit a15c6f4.
I have no objection to changing the names, but changing GeneralCategoryGroup so that it can no longer represent individual general categories seems like a step backwards to me in terms of ergonomics. Do we have a written justification for that change?
# This file is part of ICU4X. For terms of use, please see the file
# called LICENSE at the top level of the ICU4X source tree
# (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

max_width = 200 # length of line
Where is this necessary? In my experience rustfmt does a better job of breaking lines in a readable way than, say, GitHub diffs or fixed-width editor windows, so all else being equal I think we should avoid changing the line length settings more than necessary.
Done. Removed this file and reverted to default formatting. I thought there was an API with an enum-to-data-key mapping where the names became long enough for the formatter to turn one line into three. Maybe the renaming helped (?)... currently only `GeneralCategoryGroup::ConnectorPunctuation => ...` is the exception.
/// It does not support grouped categories (eg `Letter`). For grouped categories, use [`GeneralCategory`].
/// Enumerated property General_Category.
///
/// General_Category specifies the most general classification of a code point, usually
Nit: it might be confusing to say "the most general classification of a code point" here when (as mentioned in the next paragraph) this doesn't include things like "Letter" and "Number". Maybe something like:
"General_Category partitions code points into a set of mutually exclusive categories, usually determined based on the primary characteristic of the assigned character. For example, `UppercaseLetter` and `InitialPunctuation` are general categories. For grouped categories like `Letter` or `Punctuation`, use [`GeneralCategoryGroup`]."
I disagree here. What Elango has here is straight from the horse's mouth: https://www.unicode.org/reports/tr44/#General_Category_Values
“The General_Category property of a code point provides for the most general classification of that code point.”
A single code point never maps to a group/grouping with more than one element.
/// determined based on the primary characteristic of the assigned character. For example, is the
/// character a letter, a mark, a number, punctuation, or a symbol, and if so, of what type?
/// Instances of `GeneralCategoryGroup` represent the defined multi-category
/// values that are useful for users in certain contexts, such as regex. In
Nit: my intuition is that GeneralCategoryGroup is going to be the more useful representation for many/most use cases.
I'm not sure it's helpful to give people the impression that it's specialized for "certain contexts, such as regex". In my head, the distinction is more that we expect people to query ICU4X using GeneralCategoryGroup (because that lets them ask for useful categories like "Letter"), and then return the results in terms of GeneralCategory (because that is most specific).
The fundamental property is the General_Category value (like Lt), and this is what is stored and returned in/by low-level structures, but it should be easy to get both. ICU also defines "properties" for both versions. Getting the group value from the single value is just a bit shift.
I don't see anything wrong with this text here though.
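To illustrate the "just a bit shift" point, here is a minimal sketch, assuming each single category has a small integer discriminant and a group is a bitmask with one bit per category. The enum subset, the discriminant values, the `CASED_LETTER` constant, and the `contains` method are assumptions made for this example and are not taken from the ICU4X source.

```rust
/// Illustrative subset of General_Category values. The discriminants are
/// assumed for this sketch and are not taken from the ICU4X source.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum GeneralCategory {
    UppercaseLetter = 1,
    LowercaseLetter = 2,
    TitlecaseLetter = 3,
    DecimalNumber = 9,
}

/// A set of general categories, stored as one bit per category (assumed layout).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct GeneralCategoryGroup(u32);

impl GeneralCategoryGroup {
    /// Hypothetical group constant covering Lu, Ll, and Lt.
    const CASED_LETTER: Self = Self(
        (1u32 << (GeneralCategory::UppercaseLetter as u32))
            | (1u32 << (GeneralCategory::LowercaseLetter as u32))
            | (1u32 << (GeneralCategory::TitlecaseLetter as u32)),
    );

    /// Returns true if `gc` is one of the categories in this group.
    const fn contains(self, gc: GeneralCategory) -> bool {
        (self.0 & (1u32 << (gc as u32))) != 0
    }
}

/// Converting a single category into a one-element group is a single shift.
impl From<GeneralCategory> for GeneralCategoryGroup {
    fn from(gc: GeneralCategory) -> Self {
        Self(1u32 << (gc as u32))
    }
}

fn main() {
    let lt = GeneralCategory::TitlecaseLetter;
    // The single value maps to a group with exactly one bit set.
    assert_eq!(GeneralCategoryGroup::from(lt), GeneralCategoryGroup(1 << 3));
    // Query with a group; the stored value stays the specific category.
    assert!(GeneralCategoryGroup::CASED_LETTER.contains(lt));
    assert!(!GeneralCategoryGroup::CASED_LETTER.contains(GeneralCategory::DecimalNumber));
}
```

Under this layout a caller can query with a group such as "Letter" while the low-level structures keep storing and returning the specific category, which is the split discussed in the comments above.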
Co-authored-by: Markus Scherer <markus.icu@gmail.com>
…ta key mapping in the properties sets API
lgtm
LGTM
post-rebase rslgtm
Codecov Report
@@ Coverage Diff @@
## main #1355 +/- ##
=======================================
Coverage 76.66% 76.66%
=======================================
Files 291 291
Lines 16652 16652
=======================================
Hits 12766 12766
Misses 3886 3886
Renames enums for General_Category and adjusts APIs accordingly.
Fixes #1296