Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON Unmarshal of some Unicode escape sequence results in error #1576

Open
denis-zyk opened this issue Nov 20, 2023 · 8 comments
Open

JSON Unmarshal of some Unicode escape sequence results in error #1576

denis-zyk opened this issue Nov 20, 2023 · 8 comments

Comments

@denis-zyk
Copy link

What version of protobuf and what language are you using?

protoc --version
libprotoc 23.4

protoc-gen-go --version
protoc-gen-go v1.31.0

Golang lib:

google.golang.org/protobuf v1.30.0

What did you do?
Steps to reproduce the behavior:

  1. Have a JSON string containing the following Unicode escape sequence \udca0, e.g. {"name":"'unsafe-eval'; object?src\udca0'none'"}.
  2. Unmarshal byte slice of JSON string in question to protobuf message using google.golang.org/protobuf/encoding/protojson package.
  3. Got error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Code snippet for reproducing issue:

package main

import (
	"encoding/json"
	"fmt"
	"google.golang.org/protobuf/encoding/protojson"
	
	"pkg.my.package/microservices/models"
)

func main() {
	j1 := []byte(`{"name":"'unsafe-eval'; object?src\uc2a0'none'"}`)	# unmarshal is OK
	j2 := []byte(`{"name":"'unsafe-eval'; object?src\udca0'none'"}`)	# unmarshal error in protojson

	// protojson lib

	// models.Client is a golang struct compiled using `protoc` binary, from Client protobuf message.
	// Protobuf message for the sake of reproducing has 1 string field:
	//   Name       string       `protobuf:"bytes,1,opt,name=name,proto3" json:"name"`
	outProto1 := &models.Client{}
	outProto2 := &models.Client{}

	unmarshaler := protojson.UnmarshalOptions{DiscardUnknown: true}

	err1 := unmarshaler.Unmarshal(j1, outProto1)
	err2 := unmarshaler.Unmarshal(j2, outProto2)

	fmt.Println("protojson lib")
	fmt.Println(string(j1), err1, outProto1)
	fmt.Println(string(j2), err2, outProto2)

	// core lib
	outCore1 := make(map[string]interface{})
	outCore2 := make(map[string]interface{})

	err1 = json.Unmarshal(j1, &outCore1)
	err2 = json.Unmarshal(j2, &outCore2)

	fmt.Println("core lib")
	fmt.Println(string(j1), err1, outCore1)
	fmt.Println(string(j2), err2, outCore2)
}

The example code snippet above produces the following output:

protojson lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> name:"'unsafe-eval'; object?src슠'none'"
{"name":"'unsafe-eval'; object?src\udca0'none'"} proto: syntax error (line 1:9): invalid escape code "'none'" in string 
core lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> map[name:'unsafe-eval'; object?src슠'none']
{"name":"'unsafe-eval'; object?src\udca0'none'"} <nil> map[name:'unsafe-eval'; object?src�'none']

What did you expect to see?
Unmarshalled value name:"'unsafe-eval'; object?src�'none'". I.e. same unmarshalled string value as in Golang core JSON lib.

What did you see instead?
Unpopulated message and an error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Anything else we should know about your project / environment?
Reproduced & confirmed using the followed golang versions and environments:

Linux: go version go1.20.11 linux/amd64
macOS: go version go1.21.4 darwin/arm64

@neild
Copy link
Contributor

neild commented Nov 20, 2023

U+DCA0 is a Unicode surrogate, and should not appear in valid UTF-8 text. A conformant UTF-8 decoder is required to reject input containing an unpaired surrogate; see "How do I convert an unpaired UTF-16 surrogate to UTF-8?" here: https://unicode.org/faq/utf_bom.html

I'm not sure about the history that led to encoding/json accepting this invalid input and protojson rejecting it, but I don't think protojson is doing the wrong thing here.

@puellanivis
Copy link
Collaborator

I think this is still generating an improper error message. It seems to be complaining that the invalid escape code is 'none' (which is not at all being used as an escape code), when it should probably be reporting the invalid escape code to be dca0, right?

@neild
Copy link
Contributor

neild commented Nov 20, 2023

Yes, you're right; I wasn't looking at the error message. The error is definitely wrong, or at least confusing.

Also, the protojson parser expects a surrogate to be followed by \uXXXX. I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

@puellanivis
Copy link
Collaborator

Yeah… not sure what the proper behavior should be in error conditions. But also, do we check at all that a Hi Surrogate is followed by a Lo Surrogate, and vice versa, or do we just check that surrogates occur in pairs?

I smell the familiar scent of a rabbit role into a pedantic reading of standards… 😩 At least the minimal solution here seems to be that if a surrogate isn’t followed by a surrogate, we report the escape itself as invalid? Or some sort of message that surrogates must be paired?

@dsnet
Copy link
Member

dsnet commented Nov 20, 2023

The JSON mapping for protobuf follows RFC 7493, which requires strict checking of surrogate halves (see section 2.1 of RFC 7493). The "encoding/json" package only adheres to RFC 8259, which leaves split surrogate halves as undefined behavior (see section 8.2 of RFC 8259). As a side note, the "encoding/json/v2" draft proposal will target compliance with RFC 7293 by default.

I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

A standalone surrogate half is invalid and should be rejected. The validity of a surrogate pair is determine using utf16.DecodeRune.

I don't see anything wrong with what protojson is doing. It correctly identified a surrogate half and was now expecting another escaped surrogate half. The error message says:

proto: syntax error (line 1:9): invalid escape code "'none'" in string

Strictly speaking, this error message is correct as 'none' is literally an invalid escape code.

@dsnet
Copy link
Member

dsnet commented Nov 20, 2023

The proposed future "encoding/json/jsontext" package produces a better error message:

jsontext: invalid surrogate pair `\udca0'none'` within string

as it preserves the context that this is occurring within a surrogate pair.

@puellanivis
Copy link
Collaborator

Interesting, I see what you mean with the “'none' is an invalid escape code” message. It also just coincidentally has the same length one would expect a \uXXXX escape sequence to be.

I do like the error message from jsontext which makes it clear exactly what is actually going wrong.

@lfolger
Copy link
Contributor

lfolger commented Nov 21, 2023

Improving the error message seems reasonasble to use and we are happy to accept a contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants