Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix encoding bugs and add tests to String::center #1731

Open
wants to merge 5 commits into
base: trunk
Choose a base branch
from

Conversation

lopopolo
Copy link
Member

Followup to #1634 and also big bugfix to String#center.

String#center is an encoding-aware API. Both the receiver and the padding need to be:

  • Encoding compatible; in practice this means the following encoding pairs are allowed:
    • (Encoding::Utf8, Encoding::Utf8)
    • (Encoding::Ascii | Encoding::Binary, Encoding::Ascii | Encoding::Binary)
  • Valid per their encoding.

This means that character counts are encoding-aware. This is in contrast to the current implementation which treats padding as raw bytes.

For example, consider the difference between MRI and Artichoke for these UTF-8 combinations:

MRI:

$ irb
[3.1.1] > "喜欢".center 5, "打球".b
(irb):1:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):1:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.1/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.1/bin/irb:25:in `<main>'
[3.1.1] > "谢谢".center 5, "好吗"
=> "好谢谢好吗"
[3.1.1] > "谢谢".center 5, "12"
=> "1谢谢12"
[3.1.1] > "a".center 5, "好吗"
=> "好吗a好吗"

Artichoke:

$ cargo run --bin airb -q
artichoke 0.1.0-pre.0 (2022-03-11 revision 5352) [x86_64-apple-darwin]
[rustc 1.58.1 (db9d1b20b 2022-01-20) on x86_64-apple-darwin]
>>> "喜欢".center 5, "打球".b
=> "喜欢\xE6"
>>> "谢谢".center 5, "好吗"
=> "谢谢\xE5"
>>> "谢谢".center 5, "12"
=> "谢谢1"
>>> "a".center 5, "好吗"
=> "\xE5a\xE5\xA5"
>>>

Additional tests and comparisons with MRI have revealed that the given padding width is the desired width of the returned string, not the number of padding bytes to add.

@lopopolo lopopolo added A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. S-wip Status: This pull request is a work in progress. C-bug Category: This is a bug. labels Mar 11, 2022
@lopopolo
Copy link
Member Author

Some more tests:

$ irb
[3.1.2] > utf8 = "abc"
=> "abc"
[3.1.2] > utf8.encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb = "你好"
=> "你好"
[3.1.2] > utf8mb.encoding
=> #<Encoding:UTF-8>
[3.1.2] > ascii = "xyz".force_encoding(Encoding::ASCII)
=> "xyz"
[3.1.2] > ascii.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > binary = "\xFF\xFE".b
=> "\xFF\xFE"
[3.1.2] > binary.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > utf8.center(10, utf8)
=> "abcabcabca"
[3.1.2] > utf8.center(10, utf8).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8.center(10, utf8mb)
=> "你好你abc你好你好"
[3.1.2] > utf8.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8.center(10, ascii)
=> "xyzabcxyzx"
[3.1.2] > utf8.center(10, ascii).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8.center(10, binary)
=> "\xFF\xFE\xFFabc\xFF\xFE\xFF\xFE"
[3.1.2] > utf8.center(10, binary).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > utf8.center(10, utf8mb.b)
=> "\xE4\xBD\xA0abc\xE4\xBD\xA0\xE5"
[3.1.2] > utf8.center(10, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(10, utf8)
=> "abcxyzabca"
[3.1.2] > ascii.center(10, utf8).encoding
=> #<Encoding:US-ASCII>
[3.1.2] > ascii.center(10, utf8mb)
=> "你好你xyz你好你好"
[3.1.2] > ascii.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > ascii.center(10, ascii)
=> "xyzxyzxyzx"
[3.1.2] > ascii.center(10, ascii).encoding
=> #<Encoding:US-ASCII>
[3.1.2] > ascii.center(10, binary)
=> "\xFF\xFE\xFFxyz\xFF\xFE\xFF\xFE"
[3.1.2] > ascii.center(10, binary).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(10, ascii.b)
=> "xyzxyzxyzx"
[3.1.2] > ascii.center(10, ascii.b).encoding
=> #<Encoding:US-ASCII>
[3.1.2] > utf8.center(10, ascii.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > ascii.center(11, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(12, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(13, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(14, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(16, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(15, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(15, utf8mb.b)
=> "\xE4\xBD\xA0\xE5\xA5\xBDxyz\xE4\xBD\xA0\xE5\xA5\xBD"
[3.1.2] > "喜欢".center 5, "打球".b
(irb):37:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):37:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > "喜欢".center 10, "打球".b
(irb):38:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):38:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > "喜欢".center 10, utf8mb.b
(irb):39:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):39:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > "喜欢".encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8)
=> "abca你好abca"
[3.1.2] > utf8mb.center(10, utf8).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8.b)
=> "abca你好abca"
[3.1.2] > utf8mb.center(10, utf8.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8mb)
=> "你好你好你好你好你好"
[3.1.2] > utf8mb.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8mb.b)
(irb):47:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):47:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > utf8mb.center(10, ascii)
=> "xyzx你好xyzx"
[3.1.2] > utf8mb.center(10, ascii).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, ascii.b)
=> "xyzx你好xyzx"
[3.1.2] > utf8mb.center(10, ascii.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, binary)
(irb):52:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):52:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > utf8invalid = "\xFF"
=> "\xFF"
[3.1.2] > utf8invalid.encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8)
=> "abca\xFFabcab"
[3.1.2] > utf8invalid.center(10, utf8).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8mb)
=> "你好你好\xFF你好你好你"
[3.1.2] > utf8invalid.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8mb.b)
(irb):60:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from (irb):60:in `<main>'
        from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
        from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'

@lopopolo
Copy link
Member Author

this is necessary but has been stale for a while. Just pushed up a rebase to latest trunk (while there are still no conflicts 😅)

@lopopolo lopopolo self-assigned this May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. C-bug Category: This is a bug. S-wip Status: This pull request is a work in progress.
Development

Successfully merging this pull request may close these issues.

None yet

1 participant