Provide an overloaded version of parse which does not require file encoding (charsetName) #1693

Closed
mahozad opened this issue Dec 21, 2021 · 3 comments

@mahozad

mahozad commented Dec 21, 2021

Is it possible to provide a Jsoup.parse(file) method which does not have the charset parameter?
It will make the code a tiny little bit more pleasant.

The method could use any one of these approaches:

  • Assume UTF-8
  • Determine from http-equiv meta tag, if present
  • Try to guess the encoding
  • Use the default charset of the JVM

The first two are what the parse method is documented to do when it is passed null for the charset.

For comparison, the Kotlin standard library has a File::readText extension function for the Java File class that treats the file as UTF-8 if the caller provides no charset.
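
To illustrate the first two approaches in the list above (use the charset declared in a meta tag if present, otherwise assume UTF-8), here is a minimal sketch in plain Java. It is not jsoup's implementation: the class name, method names, regular expression, and the 4096-byte scan window are assumptions made for this example, and it ignores byte order marks and encodings that are not ASCII-compatible.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jsoup.
public class CharsetFallbackSketch {
    // Finds charset=... inside a <meta> tag. Covers both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumed scan window: only the start of the file is searched for a meta tag.
    private static final int SCAN_LIMIT = 4096;

    /** Returns the charset declared in a meta tag, or UTF-8 if none is found or usable. */
    static Charset detectCharset(byte[] bytes) {
        int n = Math.min(bytes.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable
        // even when the real encoding is not yet known.
        String prefix = new String(bytes, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(prefix);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: fall through to UTF-8.
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Reads a file and decodes it with the detected (or fallback) charset. */
    static String readHtml(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, detectCharset(bytes));
    }
}
```

The point of the sketch is that the fallback is still a guess: a file with no meta tag that is not actually UTF-8 would be decoded incorrectly without any error.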

@jhy
Owner

jhy commented Dec 22, 2021

My thought process when designing that API was to make it explicit to the caller that they should aim to provide the character set, and that if it wasn't set, jsoup would have to guess. The goal is to make it obvious that a possibly incorrect default is going to be used.

One of jsoup's goals is to minimize dependencies and the required jar size, so I don't plan to include the Tika scan/guess.

I'm not clear on how we could both assume UTF-8 but also use the default charset of the JVM (if that were not UTF-8).

@mahozad
Author

mahozad commented Dec 23, 2021

My suggestion was to use only one of the options.

So, is it not possible to provide a method like the one below?

public static Document parse(File file) throws IOException {
    return DataUtil.load(file, null, file.getAbsolutePath());
    // OR
    // return DataUtil.load(file, "UTF-8", file.getAbsolutePath());
}

@mahozad
Author

mahozad commented Dec 23, 2021

As for the last option (using the default charset of the JVM), I think this feature scheduled for JDK 18 may be related.
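
For context on the JVM-default option: the JDK 18 feature mentioned here is presumably JEP 400 (UTF-8 by Default), which makes UTF-8 the default charset of the standard Java APIs; on earlier releases the default typically depends on the operating system and locale. The small check below shows how a program can inspect what the running JVM uses. The class and method names are assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
public class DefaultCharsetCheck {
    /** True when the running JVM's default charset is UTF-8. */
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // The result depends on the JDK version, the platform, and any
        // -Dfile.encoding override, so no fixed output is shown here.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("Default is UTF-8: " + defaultIsUtf8());
    }
}
```

On a JVM where this reports UTF-8, "assume UTF-8" and "use the default charset of the JVM" would behave the same; elsewhere they can differ.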
