Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jsoup 1.15.2 breaks short-reading InputStream #1807

Closed
brettkail-wk opened this issue Jul 11, 2022 · 6 comments
Closed

jsoup 1.15.2 breaks short-reading InputStream #1807

brettkail-wk opened this issue Jul 11, 2022 · 6 comments
Assignees
Labels
bug Confirmed bug that we should fix
Milestone

Comments

@brettkail-wk
Copy link

We have an InputStream implementation that internally chunks data, so read sometimes returns less than the requested length. The changes in #1774 (commit ccbd65f) no longer limit the overall amount across multiple calls to read, so Jsoup.parse fails.

This following test passes with 1.15.1:

@Test
public void test() throws Exception {
  class ShortReadInputStream extends InputStream {
    final InputStream in;
    ShortReadInputStream(InputStream in) { this.in = in; }
    public int read() throws IOException { return in.read(); }
    public int read(byte[] b) throws IOException { return in.read(b, 0, b.length); }
    public int read(byte[] b, int off, int len) throws IOException { return in.read(b, off, Math.min(len, 10)); }
  }
  var in = new ShortReadInputStream(new ByteArrayInputStream(new byte[0x8000]));
  Jsoup.parse(in, null, "", Parser.xmlParser());
}

...but it fails with 1.15.2:

java.io.IOException: Resetting to invalid mark
	at java.base/java.io.BufferedInputStream.reset(BufferedInputStream.java:446)
	at org.jsoup.internal.ConstrainableInputStream.reset(ConstrainableInputStream.java:117)
	at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:149)
	at org.jsoup.helper.DataUtil.load(DataUtil.java:107)
	at org.jsoup.Jsoup.parse(Jsoup.java:197)
	at Test.test(Test.java:21)
@jhy
Copy link
Owner

jhy commented Jul 11, 2022

Arg, sorry about that! Will take a look.

Could you share your chunked InputStream impl? Or point to a work-alike if you know of one? It will help ensure a reliable fix here.

@jhy
Copy link
Owner

jhy commented Jul 12, 2022

Maybe we could add a random-size chunk generating stream as the test case here.

@brettkail-wk
Copy link
Author

brettkail-wk commented Jul 12, 2022

Unfortunately, our impl is proprietary, but the JDK PipedInputStream fails in a very similar way in 1.15.2 since it internally has a fixed-size buffer but works in 1.15.1.

@Test
public void test2() throws Exception {
  PipedOutputStream out = new PipedOutputStream();
  PipedInputStream in = new PipedInputStream(out);
  new Thread() {
    @Override
    public void run() {
      try (out) {
        out.write(new byte[0x8000]);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }.start();
  Jsoup.parse(in, null, "");
}

@Siedlerchr
Copy link
Contributor

Got the same error in JabRef , we create a ProgressInputStream from an HttpConnection:

You can test: org.jabref/org.jabref.logic.importer.fileformat.ACMPortalParser.parseDoiSearchPage(ACMPortalParser.java:79)

JabRef/jabref#8987

    /**
     * Takes the web resource as the source for a monitored input stream.
     */
    public ProgressInputStream asInputStream() throws IOException {
        HttpURLConnection urlConnection = (HttpURLConnection) this.openConnection();

        if ((urlConnection.getResponseCode() == HttpURLConnection.HTTP_NOT_FOUND) || (urlConnection.getResponseCode() == HttpURLConnection.HTTP_BAD_REQUEST)) {
            LOGGER.error("Response message {} returned for url {}", urlConnection.getResponseMessage(), urlConnection.getURL());
            return new ProgressInputStream(new ByteArrayInputStream(new byte[0]), 0);
        }
        long fileSize = urlConnection.getContentLengthLong();
        return new ProgressInputStream(new BufferedInputStream(urlConnection.getInputStream()), fileSize);
    }

This ProgressInputStream is then passed to Jsoup:
The Code for ProgressInputStream is here:
https://github.com/JabRef/jabref/blob/main/src/main/java/org/jabref/logic/net/ProgressInputStream.java

Document doc = Jsoup.parse(stream, null, HOST);



Caused by: java.io.IOException: Resetting to invalid mark
	at java.base/java.io.BufferedInputStream.reset(BufferedInputStream.java:446)
	at org.jsoup@1.15.2/org.jsoup.internal.ConstrainableInputStream.reset(ConstrainableInputStream.java:100)
	at org.jsoup@1.15.2/org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:151)
	at org.jsoup@1.15.2/org.jsoup.helper.DataUtil.load(DataUtil.java:109)
	at org.jsoup@1.15.2/org.jsoup.Jsoup.parse(Jsoup.java:199)
	at org.jabref/org.jabref.logic.importer.fileformat.ACMPortalParser.parseDoiSearchPage(ACMPortalParser.java:70)
	... 72 more

@Siedlerchr

This comment was marked as off-topic.

@jhy jhy self-assigned this Aug 7, 2022
@jhy jhy added the bug Confirmed bug that we should fix label Aug 7, 2022
@jhy jhy added this to the 1.15.3 milestone Aug 7, 2022
@jhy jhy closed this as completed in c58112a Aug 7, 2022
@jhy
Copy link
Owner

jhy commented Aug 7, 2022

Thanks, fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed bug that we should fix
Projects
None yet
Development

No branches or pull requests

3 participants