Consider reporting errors during encoding in ReaderInputStream #6944

bjmi · 2024-01-26T09:36:02Z

API(s)

ReaderInputStream(reader, charset, bufferSize)

How do you want it to be improved?

Use the default error action (CodingErrorAction.REPORT) for malformed input and unmappable characters when creating the encoder by removing the following lines:

.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE)

and update the corresponding Javadoc.
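
To illustrate the difference between the two error actions, here is a minimal sketch that uses only the JDK's CharsetEncoder. The class and method names (EncoderErrorActions, encodeReplacing, encodeReporting) are made up for this example and are not part of Guava; ISO-8859-1 is chosen only because it cannot represent Cyrillic letters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncoderErrorActions {

    /** Encodes with REPLACE, which is how the two lines quoted above configure the encoder. */
    static byte[] encodeReplacing(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    /** Encodes with a freshly created encoder, whose default error actions are REPORT. */
    static byte[] encodeReporting(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copies the remaining bytes of the buffer into a new array.
    private static byte[] toArray(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }
}
```

For an input such as "Ж", encodeReplacing returns the single replacement byte '?', while encodeReporting throws a CharacterCodingException.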

Why do we need it to be improved?

Data is silently changed and conversion errors are ignored.
This leads to corrupt data, which is particularly bad if it goes undetected for a long period of time.
Therefore, the default behavior should make the user aware of the problem.
Newer APIs (introduced in JDK 11) such as java.nio.file.Files.readString(..) and java.nio.file.Files.writeString(..) also report errors instead of replacing characters.
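
A small sketch of that JDK behavior, for comparison: Files.writeString fails when the text cannot be encoded, whereas the older String.getBytes(Charset) always substitutes the charset's replacement bytes. The class and method names (WriteStringReporting, writeStrictly, encodeLeniently) and the temp-file prefix are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
final class WriteStringReporting {

    /** Returns true if Files.writeString could encode the text, false if it reported a coding error. */
    static boolean writeStrictly(String text, Charset charset) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        try {
            Files.writeString(file, text, charset);
            return true;
        } catch (CharacterCodingException e) {
            // The text cannot be represented in the given charset.
            return false;
        } finally {
            Files.deleteIfExists(file);
        }
    }

    /** String.getBytes(Charset) never reports; it replaces unmappable characters instead. */
    static byte[] encodeLeniently(String text, Charset charset) {
        return text.getBytes(charset);
    }
}
```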

An alternative would be to make ReaderInputStream public and allow passing a Reader and a CharsetEncoder.
The caller can easily create the encoder via charset.newEncoder() and configure it according to the intended use case.
This was requested in #5376.
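
The JDK already follows this pattern elsewhere: OutputStreamWriter has a constructor that accepts a caller-configured CharsetEncoder. The sketch below uses that existing constructor to show what "configured by the caller" could look like; the class and method names (CallerConfiguredEncoder, strictEncoder, encode) are made up for this example, and nothing here is a Guava API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
final class CallerConfiguredEncoder {

    /** Creates an encoder that reports errors; REPORT is already the default, it is spelled out for clarity. */
    static CharsetEncoder strictEncoder(Charset charset) {
        return charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /** Encodes the text with whatever error handling the caller configured on the encoder. */
    static byte[] encode(String text, CharsetEncoder encoder) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, encoder)) {
            writer.write(text);
        }
        return out.toByteArray();
    }
}
```

A caller who prefers the current lenient behavior would pass an encoder configured with CodingErrorAction.REPLACE instead, so both use cases stay possible.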

Example

No code needs to be changed beforehand. It's just about reporting a problem.

Current Behavior

Encoding errors are silently ignored and lead to corrupt text files.

Desired Behavior

Encoding errors should raise an exception so that they become visible and the code or data in question can be fixed.

Concrete Use Cases

We process texts from customers that use German and Cyrillic letters, and it is crucial that the content remains intact during decoding and encoding.

Checklist

I agree to follow the code of conduct.
I have read and understood the contribution guidelines.
I have read and understood Guava's philosophy, and I strongly believe that this proposal aligns with it.


bjmi · 2024-01-29T09:56:49Z

Further ideas

Introduce com.google.common.io.CharStreams.asInputStream(Reader, Charset): InputStream and com.google.common.io.CharStreams.asInputStream(Reader, CharsetEncoder): InputStream
Overload com.google.common.io.CharSource.asByteSource(Supplier<CharsetEncoder>)

Passing a CharsetEncoder allows you to define the behavior on malformed input and on unmappable characters. The existing Charset is actually an encoder factory.
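
A rough sketch of what such an asInputStream(Reader, CharsetEncoder) could look like, built only on JDK classes. This is not Guava's ReaderInputStream and not an existing Guava method: the class name EncodingInputStream, the buffer sizes, and the choice to hand out already-encoded bytes before raising the error are all assumptions made for this example. For brevity it overrides only the single-byte read().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical class, for illustration only.
final class EncodingInputStream extends InputStream {
    private final Reader reader;
    private final CharsetEncoder encoder;
    private final CharBuffer chars = CharBuffer.allocate(1024);
    private final ByteBuffer bytes = ByteBuffer.allocate(1024);
    private boolean endOfInput; // the reader has returned -1
    private boolean finished;   // the encoder has been flushed completely

    /** Factory named after the proposal above; the error handling is whatever the caller configured. */
    static InputStream asInputStream(Reader reader, CharsetEncoder encoder) {
        return new EncodingInputStream(reader, encoder);
    }

    private EncodingInputStream(Reader reader, CharsetEncoder encoder) {
        this.reader = reader;
        this.encoder = encoder.reset();
        chars.flip(); // both buffers start out empty, in "read" mode
        bytes.flip();
    }

    @Override
    public int read() throws IOException {
        while (!bytes.hasRemaining()) {
            if (finished) {
                return -1;
            }
            fill();
        }
        return bytes.get() & 0xFF;
    }

    // Reads more chars if possible and encodes them into the byte buffer.
    private void fill() throws IOException {
        bytes.clear();
        if (!endOfInput) {
            chars.compact(); // keep leftovers, e.g. a high surrogate awaiting its partner
            if (reader.read(chars) == -1) {
                endOfInput = true;
            }
            chars.flip();
        }
        CoderResult result = encoder.encode(chars, bytes, endOfInput);
        if (result.isUnderflow() && endOfInput) {
            finished = encoder.flush(bytes).isUnderflow();
        }
        bytes.flip();
        // With REPORT, the bad input stays unconsumed; raise the error once
        // every byte encoded before it has been handed out.
        if (result.isError() && !bytes.hasRemaining()) {
            result.throwException();
        }
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

With an encoder left at its default REPORT actions, reading an unmappable character surfaces as a CharacterCodingException (an IOException); with an encoder configured for REPLACE, the stream behaves like the current implementation is described in this issue.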

chaoren · 2024-02-13T16:48:58Z

The relevant public API is CharSource.asByteSource(Charset).openStream(), right? I don't see ReaderInputStream being used from anywhere else.

bjmi · 2024-02-14T04:19:35Z

You are right :)

bjmi added the type=enhancement Make an existing feature better label Jan 26, 2024

chaoren added status=triaged package=io P3 labels Feb 13, 2024
