Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specs for symbols encoding preservation #211

Closed
wants to merge 1 commit into from

Conversation

casperisfine
Copy link

This PR only include failing specs because I'm unsure how it could be fixed without being a breaking change.

Ruby Symbols like Ruby Strings have an associated encoding property. The only difference is that when possible it will be "casted" to US_ASCII:

>> "foo".encoding
=> #<Encoding:UTF-8>
>> "foo".to_sym.encoding
=> #<Encoding:US-ASCII>
>> "fée".encoding
=> #<Encoding:UTF-8>
>> "fée".to_sym.encoding
=> #<Encoding:UTF-8>

The problem here is that extension types are stored as binary strings, so Symbol.from_msgpack_ext receive a String with Encoding::BINARY (AKA ASCII-8BIT), so after a roundtrip, any non-ascii symbol will be improperly restored:

>> "fée".force_encoding(Encoding::BINARY).to_sym
=> :"f\xC3\xA9e"
>> "fée".force_encoding(Encoding::BINARY).to_sym.encoding
=> #<Encoding:ASCII-8BIT>

I don't know enough about the msgpack format to figure out what could be done here, and I presume breaking backward compatibility would be an issue, so storing the encoding alongside the symbol string is likely not possible.

One hack I'm thinking about is that we could assume that non-ASCII symbols are UTF-8, e.g.:

  def self.from_msgpack_ext(data)
    unless data.ascii_only?
      data.force_encoding(Encoding::UTF_8)
    end
    data.to_sym
  end

I think that would fix the issue for the vast majority of users, but still wouldn't be quite correct for people using something else than UTF-8 as source encoding.

@tagomoris
Copy link
Member

It is true that encodings of symbols are not preserved now. Breaking compatibility is not a possible option, so, how about adding a new method pair to serialize/deserialize encoding with the content? It can provide the option to store encodings (of course, it causes the larger binary).

Anyway, the original problem is still unclear to me. The difference of Symbol's encoding is causing problems in your case?

@casperisfine
Copy link
Author

Anyway, the original problem is still unclear to me. The difference of Symbol's encoding is causing problems in your case?

Surprisingly no, it didn't. Likely because non-ASCII symbols are very rare, so it somehow works out.

But I figured this was a bug worth reporting.

Breaking compatibility is not a possible option

Agreed.

how about adding a new method pair to serialize/deserialize encoding with the content?

I'm not sure I understand what you are suggesting.

@tagomoris
Copy link
Member

But I figured this was a bug worth reporting.

👍 Indeed. It's an unexpected thing to me, and good to know.
I wrote a small code snippet around it. Symbols with different encoding are not equal to each other (symAscii != symUtf8), but those #hash results are equal. It should be a reason why we didn't pay attention to it.

I'm not sure I understand what you are suggesting.

My idea is to have lib/msgpack/symbol_alt.rb which contains a different implementation of Symbol#to_msgpack_ext and Symbol.from_msgpack_ext to serialize/deserialize encoding in addition to its content. Actually, it sounds a horrible idea (what happens when both are required?)

Anyway, if no one feels that the encoding of symbols is a serious problem, a much larger symbol representation in msgpack w/ encoding should not be welcomed by everyone.

@casperisfine
Copy link
Author

Actually, it sounds a horrible idea (what happens when both are required?)

The last one wins. We could indeed have a flag or something to change that behavior, but then what would people do if they somehow need to support both format?

I agree this is tricky. I'm not even sure what a good way to store the encoding would be. We could probably assume that no associated encoding means UTF-8, and that would work for 99% of users. However I know that it's not uncommon in the Japanese community to have source files in other encodings (CIS-JIS? not sure).

So I'm tempted to suggest data.force_encoding(Encoding::UTF_8).to_sym as default behavior, but again not sure the small benefits for the large majority of users would be worth the fairly large loss for the few users out there not having UTF-8 symbols.

I'm not even sure how we could properly store an associated encoding, like is the ordering of Encoding.list reliable? If so we could store the encoding index, that would be just one Integer, so potentially one byte.

@casperisfine
Copy link
Author

Interestingly this is the cause of ruby-i18n/i18n#606

In short some I18n data files has UTF-8 symbol keys, so after a roundtrip in MessagePack they come back as ASCII-8BIT aka BINARY. I can work around it with the hack suggested above, but I figured it was worth reporting.

@casperisfine
Copy link
Author

So actually the I18n issue wasn't with register_type(.., Symbol) but with the symbolize_keys: true parsing option. I submitted a fix here: #246

casperisfine pushed a commit to Shopify/bootsnap that referenced this pull request Jan 28, 2022
Ref: msgpack/msgpack-ruby#211

The default msgpack Symbol packer/unpacker is not encoding
aware which cause all non-ASCII symbols to be unpacked with
ASCII-8BIT encoding aka BINARY.

So we define a custom packer that prefix the symbol name with the encoding
index.

An alternative could be to simply assume non-ASCII symbols are UTF-8,
but it wouldn't work for people with non UTF-8 source files.
casperisfine pushed a commit to Shopify/bootsnap that referenced this pull request Jan 28, 2022
Ref: msgpack/msgpack-ruby#211

The default msgpack Symbol packer/unpacker is not encoding
aware which cause all non-ASCII symbols to be unpacked with
ASCII-8BIT encoding aka BINARY.

So we define a custom packer that prefix the symbol name with the encoding
index. Note that the encoding index isn't fixed across ruby platforms
and version, but the cache versioning should protect us from that.

An alternative could be to simply assume non-ASCII symbols are UTF-8,
but it wouldn't work for people with non UTF-8 source files.
casperisfine pushed a commit to Shopify/bootsnap that referenced this pull request Jan 28, 2022
Ref: msgpack/msgpack-ruby#211

The default msgpack Symbol packer/unpacker is not encoding
aware which cause all non-ASCII symbols to be unpacked with
ASCII-8BIT encoding aka BINARY.

So we define a custom packer that prefix the symbol name with the encoding
index. Note that the encoding index isn't fixed across ruby platforms
and version, but the cache versioning should protect us from that.

An alternative could be to simply assume non-ASCII symbols are UTF-8,
but it wouldn't work for people with non UTF-8 source files.
casperisfine pushed a commit to Shopify/bootsnap that referenced this pull request Jan 28, 2022
Ref: msgpack/msgpack-ruby#211

The default msgpack Symbol packer/unpacker is not encoding
aware which cause all non-ASCII symbols to be unpacked with
ASCII-8BIT encoding aka BINARY.

So we define a custom packer that prefix the symbol name with `1`
for UTF-8 symbols, and `0` for the others (ASCII or binary)

If `Encoding.default_internal` is set to something MessagePack
doesn't support, we entirely disable the YAML cache.
casperisfine pushed a commit to Shopify/bootsnap that referenced this pull request Jan 28, 2022
Ref: msgpack/msgpack-ruby#211

The default msgpack Symbol packer/unpacker is not encoding
aware which cause all non-ASCII symbols to be unpacked with
ASCII-8BIT encoding aka BINARY.

So we define a custom packer that prefix the symbol name with `1`
for UTF-8 symbols, and `0` for the others (ASCII or binary)

If `Encoding.default_internal` is set to something MessagePack
doesn't support, we entirely disable the YAML cache.
casperisfine pushed a commit to Shopify/bootsnap that referenced this pull request Jan 28, 2022
Ref: msgpack/msgpack-ruby#211

The default msgpack Symbol packer/unpacker is not encoding
aware which cause all non-ASCII symbols to be unpacked with
ASCII-8BIT encoding aka BINARY.

So we define a custom packer that prefix the symbol name with `1`
for UTF-8 symbols, and `0` for the others (ASCII or binary)

If `Encoding.default_internal` is set to something MessagePack
doesn't support, we entirely disable the YAML cache.
@casperisfine
Copy link
Author

So after working on Shopify/bootsnap#398, I think we could actually fix this without much compatibility concerns.

Message Pack strings are UTF-8 only, and msgpack-ruby never return anything else than UTF-8 or BINARY. Since BINARY symbols make no sense at all, I think we can simply cast symbols to UTF-8 and it would be perfectly consistent.

@casperisfine
Copy link
Author

Closing in favor of #248

@casperisfine casperisfine deleted the symbols-encoding branch January 31, 2022 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants