Upgrade to Unicode 14 (to support Vithkuqi) #877

Closed
pemistahl opened this issue Jul 3, 2022 · 5 comments · Fixed by #879

Comments

@pemistahl

What version of regex are you using?

1.5.6

Describe the bug at a high level.

The letters of the Vithkuqi script, a script for writing the Albanian language, were added in Unicode version 14.0. The corresponding Unicode block runs from U+10570 to U+105BF. I discovered that the regex \w+ does not match the letters in this block. Additionally, a case-insensitive regex (one starting with (?i)) built from an uppercase Vithkuqi letter does not match the corresponding lowercase letter.

What are the steps to reproduce the behavior?

use regex::Regex;

fn main() {
    let upper = "\u{10570}";          // Vithkuqi Capital Letter A
    let lower = upper.to_lowercase(); // Vithkuqi Small Letter A (U+10597)

    // Case-insensitive match against the capital letter
    let r1 = Regex::new("(?i)^\u{10570}$").unwrap();
    // One or more word characters
    let r2 = Regex::new("^\\w+$").unwrap();

    println!("{}", r1.is_match(upper));
    println!("{}", r1.is_match(&lower));
    println!("{}", r2.is_match(upper));
    println!("{}", r2.is_match(&lower));
}

What is the actual behavior?

The actual output is:

true
false
false
false

What is the expected behavior?

The expected output is:

true
true
true
true
@BurntSushi
Member

If this was really added in Unicode 14, then this makes sense since I don't believe the regex crate has had its Unicode tables updated to 14 yet. They're still on 13. No real reason for it. Just hasn't been done yet.

@BurntSushi
Member

Also, FWIW, the regex crate does not expose any direct way to access Unicode blocks. But if Unicode's definition of word changed and new casing rules were added, then those will get automatically pulled in by updating to 14.
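
Since the crate does not expose blocks directly, a block membership check can be sketched with the standard library alone, using the code point range quoted in the issue description. The helper names `in_vithkuqi_block` and `all_vithkuqi` are hypothetical and not part of the regex crate. Note that a block range also covers code points that are currently unassigned, so this is broader than "is a Vithkuqi letter".

```rust
// Hypothetical helper (not part of the regex crate): checks membership in the
// Vithkuqi block using the range from the issue, U+10570..=U+105BF.
// Assumption: block membership is enough for the caller; unassigned code
// points inside the block are accepted too.
fn in_vithkuqi_block(c: char) -> bool {
    ('\u{10570}'..='\u{105BF}').contains(&c)
}

// True if the string is non-empty and every character lies in the block.
fn all_vithkuqi(s: &str) -> bool {
    !s.is_empty() && s.chars().all(in_vithkuqi_block)
}

fn main() {
    let upper = "\u{10570}"; // Vithkuqi Capital Letter A
    let lower = "\u{10597}"; // Vithkuqi Small Letter A

    println!("{}", all_vithkuqi(upper));
    println!("{}", all_vithkuqi(lower));
    println!("{}", all_vithkuqi("a"));
}
```

The same range can also be written as an explicit character class in a pattern, for example `[\x{10570}-\x{105BF}]`, which does not depend on the version of the Unicode tables the crate was built with.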

@pemistahl
Author

I see, thanks for your quick reply. Unicode 14.0 was already released in September 2021, so I think it would be a good idea to update the regex crate accordingly.

BurntSushi added a commit that referenced this issue Jul 5, 2022
Vithkuqi support was added to Unicode 14.

Fixes #877
@BurntSushi changed the title from "Vithkuqi Unicode block is not correctly supported" to "Upgrade to Unicode 14 (to support Vithkuqi)" on Jul 5, 2022
@BurntSushi
Member

This has been added in regex 1.6.0 on crates.io.

@pemistahl
Author

Awesome @BurntSushi. Thank you very much!
