Add option to allow broken encoding in attribute values #60

Open
ST-DDT opened this issue Aug 9, 2018 · 1 comment

ST-DDT commented Aug 9, 2018

I have to consume messages from a message broker that sometimes have broken encoding in one of their attributes. (They come from legacy software that nobody wants or dares to touch.)

Currently, when trying to parse the messages, I get the following exception:

com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc (at char #736, byte #53)
    at com.fasterxml.jackson.dataformat.xml.util.StaxUtil.throwAsParseException(StaxUtil.java:37)
    at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:657)
    at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:593)
    at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29)
    at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
    ...

If I construct a String from the same bytes directly, it works fine.

It would be nice to have an option that allows broken encodings in my Strings instead of throwing an exception.
(After parsing the input, I usually have enough context to know which messages I have to fix, and how.)

I use jackson-dataformat-xml 2.9.6 + woodstox 5.0.3/5.1 to parse the message.

Currently I use the following workaround to bypass the issue:

byte[] bytes = ...; 
try {
	return xmlMapper.readValue(bytes, StateInfo.class);
} catch (JsonParseException e) {
	try {
		LOG.debug("Attempting fix");
		byte[] bytes2 = new String(bytes, CHARSET_ALT1).getBytes(UTF_8);
		return xmlMapper.readValue(bytes2, StateInfo.class);
	} catch (JsonParseException e1) {
		// Contains special characters from multiple encodings (in different attributes)
		LOG.error("Failed to repair message - Writing message to disk for manual fix");
		writeToDisk(e, bytes);
		throw e;
	}
}
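
A variation on the workaround above is to pick the charset before handing anything to the parser, so the message is parsed only once. The following is a minimal sketch using only the JDK: it tries a strict UTF-8 decode and, if that reports malformed input, re-decodes the whole message with a fallback charset. The class name `LenientDecode`, the method `decode`, and the choice of ISO-8859-1 as the fallback are assumptions for illustration; the actual value of `CHARSET_ALT1` is not given in the issue.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientDecode {

    // Hypothetical stand-in for CHARSET_ALT1; the real charset is not stated in the issue.
    static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    /**
     * Decodes the bytes as strict UTF-8. If any byte sequence is malformed,
     * the whole message is decoded again with the fallback charset.
     */
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy single-byte encoding instead.
            return new String(bytes, FALLBACK);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<data attr=\"Success\" />".getBytes(StandardCharsets.UTF_8);
        bytes[13] = (byte) 0xfc; // same simulated corruption as in the reproduction code
        System.out.println(decode(bytes));
    }
}
```

The resulting String could then be passed to `xmlMapper.readValue(String, Class)`, so the XML parser no longer decodes raw bytes itself. Like the original workaround, this cannot repair a message that mixes several encodings in different attributes.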

As an alternative, I considered binding the attribute to a plain byte[]. Unfortunately, the parser still reads the input as a String so that it can Base64-decode it, and I didn't find a way to tell the parser to just give me the raw bytes without Base64-decoding them first.

Code to reproduce

Data class:

@JacksonXmlRootElement(localName = "data")
public static class Data {

	@JsonProperty("attr")
	public String attr;
	// public byte[] attr;

	@Override
	public String toString() {
		return "Data: "+ attr;
	}

}

Test method:

public static void main(String[] args) throws IOException {
	XmlMapper xmlMapper = new XmlMapper();
	String input = "<data attr=\"Success\" />";
	byte[] bytes = input.getBytes("UTF-8");

	System.out.println(new String(bytes, "UTF-8"));
	System.out.println(xmlMapper.readValue(bytes, Data.class));

	bytes[13] = (byte) 0xfc; // u -> ü // Simulate broken encoding

	System.out.println(new String(bytes, "UTF-8"));
	System.out.println(xmlMapper.readValue(bytes, Data.class)); // Error
}

Output:

<data attr="Success" />
Data: Success
<data attr="S�ccess" />
Exception in thread "main" com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc (at char #14, byte #-1)
	at com.fasterxml.jackson.dataformat.xml.util.StaxUtil.throwAsParseException(StaxUtil.java:37)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:657)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:593)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29)
	at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:857)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3091)
	at example.Test.main(Test.java:67)
Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xfc (at char #14, byte #-1)
	at com.ctc.wstx.io.UTF8Reader.reportInvalidInitial(UTF8Reader.java:304)
	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:190)
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:89)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:995)
	at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:754)
	at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2074)
	at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1175)
	at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:653)
	... 5 more

cowtowncoder (Member) commented

What happens when you construct a String out of broken UTF-8 content? I am guessing the invalid byte gets decoded as the replacement character (usually rendered as a question mark):

https://www.fileformat.info/info/unicode/char/0fffd/index.htm

which then adds garbage to the attribute value.

I don't think this is something Woodstox should really be doing. Although I understand it may be inconvenient, I think the handling of broken content is something the application needs to take care of itself.
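
To make the point about the replacement character concrete, here is a small JDK-only sketch. It decodes two inputs that differ only in one invalid byte and shows that both end up as the same String containing U+FFFD, so the original byte can no longer be recovered after lenient decoding. The class name `ReplacementDemo` and its helper methods are hypothetical names introduced for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {

    /** Lenient decode: String's constructor substitutes U+FFFD for malformed input. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Counts U+FFFD replacement characters in a decoded string. */
    static long countReplacements(String s) {
        return s.chars().filter(c -> c == '\uFFFD').count();
    }

    public static void main(String[] args) {
        // 0xfc and 0xff are never valid UTF-8 start bytes
        // (they would be ü and ÿ in ISO-8859-1).
        byte[] a = {'S', (byte) 0xfc, 'c', 'c', 'e', 's', 's'};
        byte[] b = {'S', (byte) 0xff, 'c', 'c', 'e', 's', 's'};

        String sa = lenient(a);
        String sb = lenient(b);

        System.out.println(countReplacements(sa)); // one replacement character
        System.out.println(sa.equals(sb));         // the two inputs are now indistinguishable
    }
}
```

This is why any repair that depends on the original bytes has to happen before lenient decoding, not after it.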
