
Pure Java Nokogiri for JRuby

yokolet edited this page Sep 12, 2010 · 18 revisions

What is “pure Java” Nokogiri?

The pure Java version of Nokogiri is a Java port for JRuby. Currently, the FFI version of Nokogiri runs on JRuby through the FFI library, which requires libxml2 to be installed. The pure Java version, by contrast, uses neither libxml2 nor the FFI library. Nokogiri’s libxml2-dependent methods have been reimplemented with Apache Xerces, nekoHTML, and a couple of other pure Java APIs. This means Nokogiri can be used even in a pure Java environment. Yes, Nokogiri is available on Google App Engine. At the moment, the pure Java version is on its way to a first release, and rake test reports 1 failure and 1 error on the “master” branch. Currently, 1.5.0.beta.2 is out. If you want to use it, install with --pre.

Please note: although rake test reports very few problems, pure Java Nokogiri still behaves oddly in some areas. For example, whitespace handling is not the same as in the cRuby version.

Installation

gem install nokogiri --pre

You need

  • JDK 1.6.0 or later
  • JRuby 1.5.1 or later
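
The requirements above could be checked from inside a running interpreter. The sketch below is only an illustration: version_at_least? is a hypothetical helper, not part of Nokogiri, and it assumes that JRuby exposes its version as JRUBY_VERSION and the Java system properties through ENV_JAVA.

```ruby
require 'rubygems' # harmless on modern Rubies; loads Gem::Version on 1.8-era runtimes

# Hypothetical helper (not part of Nokogiri): true when `actual` is at least `minimum`.
# Only the leading dotted-numeric part is compared, so "1.6.0_20" is read as "1.6.0".
def version_at_least?(actual, minimum)
  numeric = lambda { |v| Gem::Version.new(v.to_s[/\A\d+(\.\d+)*/] || "0") }
  numeric.call(actual) >= numeric.call(minimum)
end

if defined?(JRUBY_VERSION)
  # Assumption: ENV_JAVA is the JRuby-provided hash of Java system properties.
  java_version = ENV_JAVA['java.version']
  puts "JRuby #{JRUBY_VERSION} ok? #{version_at_least?(JRUBY_VERSION, '1.5.1')}"
  puts "Java #{java_version} ok? #{version_at_least?(java_version, '1.6.0')}"
else
  puts "Not running on JRuby, so the pure Java version does not apply."
end
```

Run with jruby, this prints one line per requirement. On any other Ruby it only reports that JRuby is not in use.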

Google App Engine

Nokogiri 1.5.0.beta.2 needs a small hack to run with the google-appengine gem.

1) Comment out the five require xxx.jar lines in .gems/bundler_gems/jruby/1.8/gems/nokogiri-1.5.0.beta.2-java/lib/nokogiri.rb

# -*- coding: utf-8 -*-
# Modify the PATH on windows so that the external DLLs will get loaded.

require 'rbconfig'
ENV['PATH'] = [File.expand_path(
  File.join(File.dirname(__FILE__), "..", "ext", "nokogiri")
), ENV['PATH']].compact.join(';') if RbConfig::CONFIG['host_os'] =~ /(mswin|mingw)/i

if defined?(RUBY_ENGINE) && RUBY_ENGINE == "jruby"
  # require 'isorelax.jar'
  # require 'jing.jar'
  # require 'nekohtml.jar'
  # require 'nekodtd.jar'
  # require 'xercesImpl.jar'
  require 'nokogiri/nokogiri'
else
  require 'nokogiri/nokogiri'
end

2) Remove WEB-INF/lib/gems.jar (if you have this file)
3) Restart the server

After that, Nokogiri will start working. This bug has been fixed in master, so the final 1.5.0 release won’t have this problem.

Please note: pure Java Nokogiri is not yet fully tested on Google App Engine. There might be GAE-specific problems.
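
Step 1 of the workaround could also be scripted instead of edited by hand. The following is a minimal sketch, not an official tool: comment_out_jar_requires is a hypothetical helper, and the path is the one quoted in step 1, so adjust it if your layout differs.

```ruby
# Hypothetical helper: prefixes every uncommented `require '<name>.jar'` line with "# ".
# Lines that are already commented out, or that require non-jar files, are left alone.
def comment_out_jar_requires(source)
  source.gsub(/^([ \t]*)(require\s+['"][^'"]+\.jar['"])/) { "#{$1}# #{$2}" }
end

# Path taken from step 1; the file is rewritten only if it exists.
path = ".gems/bundler_gems/jruby/1.8/gems/nokogiri-1.5.0.beta.2-java/lib/nokogiri.rb"
if File.exist?(path)
  patched = comment_out_jar_requires(File.read(path))
  File.open(path, "w") { |f| f.write(patched) }
end
```

Running the helper twice leaves the file unchanged after the first pass, because commented lines no longer match. Steps 2 and 3 still have to be done separately.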

Pure Java Specific Rules

Porting to Java is not easy. Contributors have struggled with the differences in behavior between libxml2 and Xerces. Almost all of the Nokogiri API is implemented as is, but some parts were very hard to port. As a result, the pure Java version has a few specific rules. Please be aware of the following when you use the pure Java version.

DTD validation

Add the “dtdvalid” option when reading a document.

xml = Nokogiri::XML(File.open(XML_FILE)) {|cfg| cfg.dtdvalid}
list = xml.internal_subset.validate xml

The number of errors is not the same as in the libxml2 version. The Java version doesn’t report errors on attributes whose elements have already had errors reported.

Public ID in DOCTYPE declaration

Don’t forget to write the second parameter (the system identifier), even if it is empty.

<!DOCTYPE foo PUBLIC "bar" "">

Namespace in fragment

The Java API needs a correct namespace declaration even when parsing a fragment. The pure Java version adds the namespace declaration to the given fragment internally if needed. This might cause an error when exactly the same Namespace instance is required. For example, the code below:

doc = Nokogiri::XML <<-EOX
  <root xmlns:foo="http://flavorjon.es/" xmlns:bar="http://google.com/">
    <foo:existing></foo:existing>
  </root>
EOX
ns = doc.root.namespace_definitions.detect { |x| x.prefix == "bar" }
frag = doc.fragment "<bar:newnode></bar:newnode>"
p ns, frag.children.first.namespace

produces:
<#(Namespace:0xa40 { prefix = "bar", href = "http://google.com/" })>
<#(Namespace:0xa46 { prefix = "bar", href = "http://google.com/" })>
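
Since the two objects printed above carry the same prefix and href but are different instances, code that needs to run on the pure Java version could compare namespaces by value rather than by identity. The helper below is a hypothetical sketch, not part of Nokogiri; it relies only on the prefix and href readers shown in the output above.

```ruby
# Hypothetical helper (not part of Nokogiri): treats two namespaces as the same
# when their prefix and href match, instead of relying on object identity.
# It works with any objects that respond to #prefix and #href.
def same_namespace?(a, b)
  # Two missing namespaces count as equal; one missing namespace does not.
  return (a.nil? && b.nil?) if a.nil? || b.nil?
  a.prefix == b.prefix && a.href == b.href
end

# With the variables from the example above:
#   same_namespace?(ns, frag.children.first.namespace)
```

This only sidesteps the identity difference for comparisons; it does not change which Namespace instance the fragment holds.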

Get Involved: how to build

If you want to help with pure Java Nokogiri, you need to build it after cloning the source. Charlie wrote a nice, easy-to-follow blog entry, “Nokogiri Java Port: Help Us Finish It!”, which will definitely help you.

Don’t forget: the codebase of pure Java Nokogiri has been merged into master. You don’t need to check out any branch.