Add Unicode Script into built-in rules. #751

Merged
6 commits merged on Dec 23, 2022


huacnlee
Member

@huacnlee huacnlee commented Dec 22, 2022

This change adds the Unicode Script properties to the built-in rules.

https://unicode.org/standard/supported.html

The built-in rules are generated from data published by unicode.org, so the resulting rules are correct.

This makes them easy to use without having to look up Unicode ranges.

For example:

  • Chinese - HAN
  • Japanese - KATAKANA, HIRAGANA
  • Korean - HANGUL
  • Persian - ARABIC

Usage example

real_name = { HAN{2,5} }
address = { (HAN | ASCII_ALPHANUMERIC | PUNCTUATION)+ }
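To make the benefit concrete, the sketch below shows roughly what a caller would otherwise write by hand: an explicit code point range plus a length check. It is plain Rust rather than pest, and it assumes only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the generated `HAN` rule covers the full Han script, which is larger. The function names are hypothetical, and unlike the pest rule this check requires the whole string to match.

```rust
/// Hand-written stand-in for `HAN`.
/// ASSUMPTION: only the basic CJK Unified Ideographs block is covered;
/// the generated built-in rule covers more ranges than this.
fn is_han_basic(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Rough analogue of `real_name = { HAN{2,5} }`, applied to the whole input
/// (hypothetical helper, not part of pest).
fn is_real_name(s: &str) -> bool {
    let count = s.chars().count();
    (2..=5).contains(&count) && s.chars().all(is_han_basic)
}

fn main() {
    for input in ["李华", "李", "Li Hua"] {
        println!("{:?} -> {}", input, is_real_name(input));
    }
}
```

With the built-in rule, the range table and its maintenance move into pest itself, so the grammar only needs the name `HAN`.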

All additional rules from Unicode (Script)

ADLAM
AHOM
ANATOLIAN_HIEROGLYPHS
ARABIC
ARMENIAN
AVESTAN
BALINESE
BAMUM
BASSA_VAH
BATAK
BENGALI
BHAIKSUKI
BOPOMOFO
BRAHMI
BRAILLE
BUGINESE
BUHID
CANADIAN_ABORIGINAL
CARIAN
CAUCASIAN_ALBANIAN
CHAKMA
CHAM
CHEROKEE
CHORASMIAN
COMMON
COPTIC
CUNEIFORM
CYPRIOT
CYPRO_MINOAN
CYRILLIC
DESERET
DEVANAGARI
DIVES_AKURU
DOGRA
DUPLOYAN
EGYPTIAN_HIEROGLYPHS
ELBASAN
ELYMAIC
ETHIOPIC
GEORGIAN
GLAGOLITIC
GOTHIC
GRANTHA
GREEK
GUJARATI
GUNJALA_GONDI
GURMUKHI
HAN
HANGUL
HANIFI_ROHINGYA
HANUNOO
HATRAN
HEBREW
HIRAGANA
IMPERIAL_ARAMAIC
INHERITED
INSCRIPTIONAL_PAHLAVI
INSCRIPTIONAL_PARTHIAN
JAVANESE
KAITHI
KANNADA
KATAKANA
KAWI
KAYAH_LI
KHAROSHTHI
KHITAN_SMALL_SCRIPT
KHMER
KHOJKI
KHUDAWADI
LAO
LATIN
LEPCHA
LIMBU
LINEAR_A
LINEAR_B
LISU
LYCIAN
LYDIAN
MAHAJANI
MAKASAR
MALAYALAM
MANDAIC
MANICHAEAN
MARCHEN
MASARAM_GONDI
MEDEFAIDRIN
MEETEI_MAYEK
MENDE_KIKAKUI
MEROITIC_CURSIVE
MEROITIC_HIEROGLYPHS
MIAO
MODI
MONGOLIAN
MRO
MULTANI
MYANMAR
NABATAEAN
NAG_MUNDARI
NANDINAGARI
NEW_TAI_LUE
NEWA
NKO
NUSHU
NYIAKENG_PUACHUE_HMONG
OGHAM
OL_CHIKI
OLD_HUNGARIAN
OLD_ITALIC
OLD_NORTH_ARABIAN
OLD_PERMIC
OLD_PERSIAN
OLD_SOGDIAN
OLD_SOUTH_ARABIAN
OLD_TURKIC
OLD_UYGHUR
ORIYA
OSAGE
OSMANYA
PAHAWH_HMONG
PALMYRENE
PAU_CIN_HAU
PHAGS_PA
PHOENICIAN
PSALTER_PAHLAVI
REJANG
RUNIC
SAMARITAN
SAURASHTRA
SHARADA
SHAVIAN
SIDDHAM
SIGNWRITING
SINHALA
SOGDIAN
SORA_SOMPENG
SOYOMBO
SUNDANESE
SYLOTI_NAGRI
SYRIAC
TAGALOG
TAGBANWA
TAI_LE
TAI_THAM
TAI_VIET
TAKRI
TAMIL
TANGSA
TANGUT
TELUGU
THAANA
THAI
TIBETAN
TIFINAGH
TIRHUTA
TOTO
UGARITIC
VAI
VITHKUQI
WANCHO
WARANG_CITI
YEZIDI
YI
ZANABAZAR_SQUARE
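
The names above look like the Unicode Script property value names (for example `Old_Hungarian` or `SignWriting`) converted to upper case. The sketch below assumes that convention purely for illustration; `script_to_rule_name` is a hypothetical helper and not part of pest, where the names are generated with `ucd-generate`.

```rust
/// Hypothetical helper: derive a built-in rule name from a Unicode script name.
/// ASSUMPTION: the rule name is simply the script name in upper case.
fn script_to_rule_name(script: &str) -> String {
    script.to_ascii_uppercase()
}

fn main() {
    for script in ["Han", "Old_Hungarian", "SignWriting"] {
        println!("{} -> {}", script, script_to_rule_name(script));
    }
}
```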

Make this change to add `CJK`, `HAN`, `HANGUL`, `KATAKANA`, `HIRAGANA` to built-in rules.

https://unicode.org/faq/han_cjk.html

- Chinese - `HAN`
- Japanese - `KATAKANA`, `HIRAGANA`
- Korean - `HANGUL`

This makes it easy to match CJK characters.
@huacnlee huacnlee requested a review from a team as a code owner December 22, 2022 13:20
@huacnlee huacnlee requested review from NoahTheDuke and removed request for a team December 22, 2022 13:20
…roperty_names.

-  will generate property names by use macro.
-  has been removed.
@huacnlee huacnlee changed the title Add CJK unicode into built-in rules. Add Unicode Script into built-in rules. Dec 22, 2022
… BY_NAME values by `ucd-generate` generated.

And export all property names from Unicode (Script).
Contributor

@tomtau tomtau left a comment


Awesome work, thanks @huacnlee ! Just one observation about the static item that was public, but ok to merge it.
Do you also plan to update the docs: https://github.com/pest-parser/book/blob/master/src/grammars/built-ins.md#unicode-rules ?

meta/src/lib.rs (resolved)
@huacnlee
Member Author

Pest book updated:

pest-parser/book#27

pest/src/unicode/mod.rs (outdated, resolved)
@tomtau tomtau merged commit 25ba0a2 into pest-parser:master Dec 23, 2022
@huacnlee huacnlee deleted the feat/built-in-CJK branch December 23, 2022 13:59