Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protoparse: ParseFiles doesn't escape non-ascii characters in string values #441

Open
WillAbides opened this issue Oct 31, 2021 · 3 comments

Comments

@WillAbides
Copy link
Contributor

Comparing ParseFiles and protoc for the file below shows this difference:

  		"options": protocmp.Message{
  			"@type":                s"google.protobuf.FileOptions",
- 			"java_outer_classname": string("my\xbcvalue"),
+ 			"java_outer_classname": string("my�value"),
  		},

protoc seems to replace characters above 0x7f in option values with their \x escape code, but protoc does not.

syntax = "proto3";

package foo;
option java_outer_classname= "my�value";

The � above is 0xbc

@jhump
Copy link
Owner

jhump commented Nov 1, 2021

I think your example file may have been malformed when you pasted it into a GitHub comment. What I see is that � is the Unicode replacement character (U+FFFD), not 0xbc. (Which is expected and typical when you try to put illegal binary data into a UTF-8-encoded text document.)

Since protobuf source is supposed to be UTF-8, I'm surprised it would accept a literal 0xbc, since that is invalid UTF-8 (and, in fact, I would expect it to be turned into the replacement character, since that is the purpose of the replacement character).

Protoc definitely does not encode such values with an escape code. That is your diff library showing you the raw byte value as an escape sequence. The actual issue is that protoparse is seeing the invalid UTF-8 encoded value and silently converting it to the replacement character (because that is usually how UTF-8 handling is done). Apparently, protoc does not and allows unescaped data that is invalid UTF-8. (TBH, that smells like a bug in protoc's handling of input -- it should probably use the Unicode replacement character or outright reject the input file due to bad encoding.)

In any event, I think I know reasonably simple fix to make protoparse match protoc. But I'll actually file a bug against protoc first, to see what they think the correct behavior should be.

@WillAbides
Copy link
Contributor Author

You're right about it being the diff doing the escaping. That makes some inconsistent behavior I saw make sense.

Given that, it really does seem like it's protoc doing the wrong thing here and protoreflect shouldn't try to emulate it.

@jhump
Copy link
Owner

jhump commented Nov 1, 2021

I've filed a bug with the protobuf project: protocolbuffers/protobuf#9175

Depending on the response to that, I could update protoparse. I can see a couple of possible things I'd need to do:

  1. Reject source with invalid UTF-8 input, instead of silently converting invalid data to Unicode replacement characters.
  2. Even if the source is valid UTF-8 (e.g. is uses the escape sequence \xbc), it should be rejected if data for string options is not also valid UTF-8.

I think all of the runtimes reject string data if it is not UTF-8, so definitely strange that protoc accepts it. As far as it accepting the bad character directly in the source, I suspect that is because the tokenizer actually reads the input byte-by-byte instead of decoding UTF-8. It is ostensibly in UTF-8 since the only allowed byte-order marker prefix is a UTF-8-encoded marker; the rest of the source is likely expected to be 7-bit ASCII...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants