Provide an overloaded version of parse which does not require file encoding (charsetName) #1693

mahozad · 2021-12-21T21:30:07Z

Is it possible to provide a Jsoup.parse(file) method which does not have the charset parameter?
It would make calling code a little more pleasant.

The method could use any one of these approaches:

Assume UTF-8
Determine from http-equiv meta tag, if present
Try to guess the encoding
Use the default charset of the JVM

The first two are the documented behavior of the parse method when null is passed as the charsetName.

Note that the Kotlin standard library has a File::readText extension function for the Java File class, which treats the file as UTF-8 if the caller does not provide a charset.
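To illustrate the "assume UTF-8 when no charset is given" option, here is a minimal Java sketch using only the standard library. The class CharsetFallbackDemo and its methods resolveCharset and readText are hypothetical names made up for this example; they are not part of jsoup, and this is not how jsoup itself detects encodings.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper for illustration only; not part of jsoup.
final class CharsetFallbackDemo {

    // Returns the named charset, or UTF-8 when charsetName is null.
    static Charset resolveCharset(String charsetName) {
        return charsetName == null
                ? StandardCharsets.UTF_8
                : Charset.forName(charsetName);
    }

    // Reads the whole file, decoding with the given charset (null means UTF-8).
    static String readText(File file, String charsetName) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        return new String(bytes, resolveCharset(charsetName));
    }

    // Convenience overload without the charset parameter.
    static String readText(File file) throws IOException {
        return readText(file, null);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("charset-demo", ".html");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readText(tmp));
    }
}
```

The one-argument overload simply delegates with null, so the explicit two-argument form stays available for callers who know the encoding.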


jhy · 2021-12-22T23:37:36Z

My thought process when designing that API was to make it explicit to the caller that they should aim to provide the character set, and that if it wasn't set, that jsoup would have to guess. The goal is to make it more explicit that a possibly incorrect default is going to be used.

One of jsoup's goals is to minimize dependencies and the required jar size, so I don't plan to include the Tika scan/guess.

I'm not clear on how we could both assume UTF-8 but also use the default charset of the JVM (if that were not UTF-8).

mahozad · 2021-12-23T07:56:43Z

My suggestion was to use only one of the options, not to combine them.

So, would it not be possible to provide a method like the one below?

public static Document parse(File file) throws IOException {
    return DataUtil.load(file, null, file.getAbsolutePath());
    // OR
    // return DataUtil.load(file, "UTF-8", file.getAbsolutePath());
}

mahozad · 2021-12-23T14:23:07Z

As for the last option (using the default charset of the JVM), I think the feature scheduled for JDK 18 that makes UTF-8 the default charset (JEP 400, "UTF-8 by Default") may be related.
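Assuming the JDK 18 feature referred to here is JEP 400 ("UTF-8 by Default"), the following small sketch shows how one might check what the JVM default charset actually is on a given runtime. The class name DefaultCharsetCheck and the method defaultIsUtf8 are hypothetical; only standard library calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class DefaultCharsetCheck {

    // True when the JVM default charset is UTF-8.
    // On JDK 18 and later this is expected unless file.encoding is overridden.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Java version:     " + System.getProperty("java.version"));
        System.out.println("Default charset:  " + Charset.defaultCharset());
        System.out.println("Default is UTF-8: " + defaultIsUtf8());
    }
}
```

On older JDKs the default depends on the platform locale, which is why "assume UTF-8" and "use the JVM default" can disagree.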

jhy closed this as completed in 3a6e7fa Dec 28, 2021

jhy added this to the 1.15.1 milestone Dec 28, 2021

This was referenced May 16, 2022

Update jsoup from 1.14.3 to 1.15.1 apache/incubator-stormcrawler#968

Closed

Update jsoup from 1.14.2 to 1.15.1 code4craft/xsoup#54

Merged
