Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue reproduction for ISO-8859-1 parsing on JRuby #2080

Closed
wants to merge 2 commits into from

Conversation

thbar
Copy link
Contributor

@thbar thbar commented Sep 8, 2020

As commented in this discussion, I've experienced regressions while parsing ISO-8859-1 documents under JRuby, with the recent release candidates (1.11.0.rc1, 1.11.0.rc2, 1.11.0.rc3), something which did not happen with 1.10.10.

I'm creating this first PR as an attempt to show the reproduction to Nokogiri maintainers (hopefully the build will fail, but we'll see!).

I will try to come back with a git bisect test later to isolate the commit which introduced this change.

@codeclimate
Copy link

codeclimate bot commented Sep 8, 2020

Code Climate has analyzed commit 386ebcb and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.3%.

View more on Code Climate.

@AppVeyorBot
Copy link

Build nokogiri 1.0.672 completed (commit 43396896ab by @thbar)

@thbar
Copy link
Contributor Author

thbar commented Sep 8, 2020

The build is catching the failure on JRuby indeed:

  1) Failure:
TestISO#test_iso_content_not_lacking_accents [/tmp/build/4d9a0f57/nokogiri/test/test_iso.rb:7]:
Expected: "Accepté"
  Actual: "Accept"

  2) Failure:
TestISO#test_iso_content_not_truncated [/tmp/build/4d9a0f57/nokogiri/test/test_iso.rb:13]:
Expected: 2
  Actual: 1

1521 runs, 4207 assertions, 2 failures, 0 errors, 24 skips

@flavorjones
Copy link
Member

Thank you for submitting this! Fantastic executable bug report. I appreciate it.

@flavorjones flavorjones added this to the v1.11.0 milestone Sep 8, 2020
@thbar
Copy link
Contributor Author

thbar commented Sep 8, 2020

@flavorjones you welcome :-) I may be able to get back with more information and a git bisect tomorrow, if all goes well.

@flavorjones
Copy link
Member

OK - I ran a git bisect and ba16682 introduced this regression. Tagging @jvshahid for his awareness 💕

nokogiri $ git bisect good
ba16682aae2cbd42c196bd8afdfcfe8a5d82fbdb is the first bad commit
commit ba16682aae2cbd42c196bd8afdfcfe8a5d82fbdb
Author: John Shahid <jvshahid@gmail.com>
Date:   Thu Apr 18 13:58:18 2019 -0400

    Split setInputSource into setIOInputSource and setStringInputSource
    
    This commit also consolidates some of the encoding handling logic that was
    repeated in multiple places.

 ext/java/nokogiri/HtmlDocument.java              |  25 +++---
 ext/java/nokogiri/HtmlSaxParserContext.java      | 100 ++++++-----------------
 ext/java/nokogiri/XmlDocument.java               |  33 +++-----
 ext/java/nokogiri/XmlSaxParserContext.java       |  14 ++--
 ext/java/nokogiri/internals/NokogiriHelpers.java |  12 +++
 ext/java/nokogiri/internals/ParserContext.java   |  71 ++++++----------
 6 files changed, 94 insertions(+), 161 deletions(-)

@thbar
Copy link
Contributor Author

thbar commented Sep 9, 2020

One point of concern is that the document appears to be truncated, without any error. I wonder why this is happening (instead of a hard error in fail-fast mode!).

If I find anything more, I will report back, but so far I'm just discovering the Java part.

jvshahid added a commit that referenced this pull request Sep 15, 2020
This change fixes the tests in #2080, but introduces more errors.  The errors
are mostly unexpected null encoding when parsing an HTML document.
@jvshahid jvshahid mentioned this pull request Sep 15, 2020
@jvshahid
Copy link
Member

I fixed the tests in #2083. @thbar, do you mind trying this branch to see if it fixes the issues you reported?

@thbar
Copy link
Contributor Author

thbar commented Sep 15, 2020

@jvshahid working on it! FWIW, just adding this to my app doesn't work at this point:

gem 'nokogiri', git: 'https://github.com/sparklemotion/nokogiri.git', branch: 'repro-iso-jruby'

I presume some form of compilation must be done locally first, I will try that out.

This would mean (side point) that the advice at https://bundler.io/guides/git.html is out of date!

I'll report back.

@thbar
Copy link
Contributor Author

thbar commented Sep 15, 2020

@jvshahid my app test suite pass ✅. Congrats for your fix!

For the record, here is roughly what I had to do:

git clone https://github.com/sparklemotion/nokogiri.git
cd nokogiri
git checkout repro-iso-jruby
rbenv shell jruby-9.2.5.0
bundle exec rake compile test
bundle exec rake gem:spec

Then use:

gem 'nokogiri', path: '../nokogiri'

and bundle install in my app.

@flavorjones
Copy link
Member

#2083 is green, I'll merge this and that as soon as I get a chance. Will be in the next v1.11.0 release candidate, hopefully within the next few days.

@flavorjones
Copy link
Member

Closing this pull request and merging #2083 which is a superset of these commits. Thanks again, @thbar -- I'll credit you in the changelog!

flavorjones added a commit that referenced this pull request Oct 12, 2020
flavorjones pushed a commit that referenced this pull request Oct 12, 2020
This change fixes the tests in #2080, but introduces more errors.  The errors
are mostly unexpected null encoding when parsing an HTML document.
flavorjones added a commit that referenced this pull request Oct 12, 2020
flavorjones added a commit that referenced this pull request Oct 12, 2020
flavorjones added a commit that referenced this pull request Oct 12, 2020
flavorjones added a commit that referenced this pull request Oct 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants