Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename enums for General_Category #1355

Merged
merged 10 commits into from
Dec 14, 2021
Merged

Rename enums for General_Category #1355

merged 10 commits into from
Dec 14, 2021

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Dec 3, 2021

Renames enums for General_Category and adjusts APIs accordingly.

Fixes #1296

@echeran echeran requested a review from a team as a code owner December 3, 2021 02:51
sffc
sffc previously approved these changes Dec 6, 2021
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one naming suggestion. I'd like @iainireland to review the "Bifurcate gc property values" commit a15c6f4.

components/properties/src/error.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@iainireland iainireland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objection to changing the names, but changing GeneralCategoryGroup so that it can no longer represent individual general categories seems like a step backwards to me in terms of ergonomics. Do we have a written justification for that change?

Comment on lines 1 to 5
# This file is part of ICU4X. For terms of use, please see the file
# called LICENSE at the top level of the ICU4X source tree
# (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

max_width = 200 # length of line
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this necessary? In my experience rustfmt does a better job of breaking lines in a readable way than, say, github diffs or fixed-width editor windows, so all else being equal I think we should avoid messing with the line length settings more than necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed this file and reverted to default formatting. I thought there was an API with an enum to data key mapping where the names became long enough for the formatter to turn 1 line into 3. Maybe the renaming helped (?)... currently only GeneralCategoryGroup::ConnectorPuncutation => ... is the exception.

components/properties/src/error.rs Outdated Show resolved Hide resolved
/// It does not support grouped categories (eg `Letter`). For grouped categories, use [`GeneralCategory`].
/// Enumerated property General_Category.
///
/// General_Category specifies the most general classification of a code point, usually
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it might be confusing to say "the most general classification of a code point" here when (as mentioned in the next paragraph) this doesn't include things like "Letter" and "Number". Maybe something like:

"General_Category partitions code points into a set of mutually exclusive categories, usually determined based on the primary characteristic of the assigned number. For example, UppercaseLetter and InitialPunctuation are general categories. For grouped categories like Letter or Punctuation, use [GeneralCategoryGroup]."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree here. What Elango has here is straight from the horse's mouth: https://www.unicode.org/reports/tr44/#General_Category_Values
“The General_Category property of a code point provides for the most general classification of that code point.”

A single code point never maps to a group/grouping with more than one element.

components/properties/src/props.rs Show resolved Hide resolved
/// determined based on the primary characteristic of the assigned character. For example, is the
/// character a letter, a mark, a number, punctuation, or a symbol, and if so, of what type?
/// Instances of `GeneralCategoryGroup` represent the defined multi-category
/// values that are useful for users in certain contexts, such as regex. In
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: my intuition is that GeneralCategoryGroup is going to be the more useful representation for many/most use cases.

I'm not sure it's helpful to give people the impression that it's specialized for "certain contexts, such as regex". In my head, the distinction is more that we expect people to query ICU4X using GeneralCategoryGroup (because that lets them ask for useful categories like "Letter"), and then return the results in terms of GeneralCategory (because that is most specific).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fundamental property is the General_Category value (like Lt), and this is what is stored and returned in/by low-level structures, but it should be easy to get both. ICU also defines "properties" for both versions. Getting the group value from the single value is just a bit shift.

I don't see anything wrong with this text here though.

components/properties/src/props.rs Outdated Show resolved Hide resolved
components/properties/src/error.rs Outdated Show resolved Hide resolved
components/properties/src/error.rs Outdated Show resolved Hide resolved
components/properties/src/props.rs Outdated Show resolved Hide resolved
/// It does not support grouped categories (eg `Letter`). For grouped categories, use [`GeneralCategory`].
/// Enumerated property General_Category.
///
/// General_Category specifies the most general classification of a code point, usually
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree here. What Elango has here is straight from the horse's mouth: https://www.unicode.org/reports/tr44/#General_Category_Values
“The General_Category property of a code point provides for the most general classification of that code point.”

A single code point never maps to a group/grouping with more than one element.

/// determined based on the primary characteristic of the assigned character. For example, is the
/// character a letter, a mark, a number, punctuation, or a symbol, and if so, of what type?
/// Instances of `GeneralCategoryGroup` represent the defined multi-category
/// values that are useful for users in certain contexts, such as regex. In
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fundamental property is the General_Category value (like Lt), and this is what is stored and returned in/by low-level structures, but it should be easy to get both. ICU also defines "properties" for both versions. Getting the group value from the single value is just a bit shift.

I don't see anything wrong with this text here though.

components/properties/src/props.rs Outdated Show resolved Hide resolved
components/properties/src/sets.rs Show resolved Hide resolved
components/properties/src/maps.rs Outdated Show resolved Hide resolved
markusicu
markusicu previously approved these changes Dec 10, 2021
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

iainireland
iainireland previously approved these changes Dec 13, 2021
Copy link
Contributor

@iainireland iainireland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@echeran echeran dismissed stale reviews from iainireland and markusicu via a38628e December 14, 2021 18:43
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-rebase rslgtm

@codecov-commenter
Copy link

Codecov Report

Merging #1355 (a38628e) into main (b081f84) will not change coverage.
The diff coverage is 53.84%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1355   +/-   ##
=======================================
  Coverage   76.66%   76.66%           
=======================================
  Files         291      291           
  Lines       16652    16652           
=======================================
  Hits        12766    12766           
  Misses       3886     3886           
Impacted Files Coverage Δ
components/properties/src/error.rs 0.00% <0.00%> (ø)
components/properties/src/maps.rs 0.00% <ø> (ø)
components/properties/src/sets.rs 8.79% <ø> (ø)
provider/uprops/src/enum_uniset.rs 94.21% <ø> (ø)
components/properties/src/props.rs 18.75% <25.00%> (ø)
components/properties/src/ule.rs 26.00% <33.33%> (ø)
components/properties/src/trievalue.rs 25.00% <100.00%> (ø)
provider/uprops/src/enum_codepointtrie.rs 81.53% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b081f84...a38628e. Read the comment docs.

@echeran echeran merged commit bea61e5 into unicode-org:main Dec 14, 2021
@echeran echeran deleted the gc-naming branch June 8, 2023 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Naming for GeneralCategory
5 participants