
Pure Java Nokogiri for JRuby

yokolet edited this page Sep 12, 2010 · 18 revisions

What is “pure Java” Nokogiri?

The pure Java version of Nokogiri is a Java port for JRuby. Currently, the FFI version of Nokogiri works on JRuby via the FFI library, and it needs libxml2 installed. The pure Java version, on the other hand, uses neither libxml2 nor the FFI library. Nokogiri’s libxml2-dependent methods have been reimplemented with Apache Xerces, nekoHTML, and a few other pure Java APIs. This means Nokogiri can be used even in a pure Java environment. Yes, Nokogiri will be available on Google App Engine soon. The pure Java version is on its way to its first release, and rake test currently reports about 10 failures and errors in total.

How to build

Since the pure Java version of Nokogiri is not yet finished, there is no downloadable archive so far. You need to clone the source and build it. Charlie wrote a nice, easy-to-follow blog entry, “Nokogiri Java Port: Help Us Finish It!”, which will definitely help you.

You can build on JDK5, but you need JDK6 if you use W3C XML Schema. JRuby 1.5.0 is recommended over older versions.

Quick start

Once you have built pure Java Nokogiri successfully, you can get started either by using Nokogiri’s jars and Ruby libraries directly or by installing the Nokogiri gem.

If you prefer using Nokogiri’s jars and Ruby libraries, you need to have Nokogiri’s lib directory in your $LOAD_PATH. For example, either pass it with -I when starting irb, or append it from inside irb:

$ jruby -I/Users/yoko/Projects/nokogiri/lib -S irb
$ jruby -S irb
irb(main):001:0> $LOAD_PATH << "/Users/yoko/Projects/nokogiri/lib"

or add $LOAD_PATH << "/Users/yoko/Projects/nokogiri/lib" to your Ruby file.
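
When the path is added from a Ruby file, it can help to guard against adding it twice. The following is a minimal sketch, not part of Nokogiri: add_to_load_path is a hypothetical helper name, the path is the example path used above, and it prepends with unshift (instead of appending with <<) so that the directory is searched before the other entries.

```ruby
# Hypothetical helper: put a directory at the front of $LOAD_PATH once.
def add_to_load_path(dir)
  path = File.expand_path(dir)
  # Skip the change if the path is already present, for example when
  # this file is loaded more than once.
  $LOAD_PATH.unshift(path) unless $LOAD_PATH.include?(path)
  path
end

# Example path from this page; replace it with your own checkout.
add_to_load_path("/Users/yoko/Projects/nokogiri/lib")
```

After this runs, require 'nokogiri' should look in that directory first.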

If you choose the gem, build it and install it into your JRuby like any other gem, as shown below:

$ jruby -S rake java:gem
$ jruby -S gem install path-to-nokogiri-home/pkg/nokogiri-1.4.0.20100415101221-java.gem
Successfully installed nokogiri-1.4.0.20100415101221-java
1 gem installed
Installing ri documentation for nokogiri-1.4.0.20100415101221-java...(snip)

Make sure your JRuby has pure Java Nokogiri.

$ jruby -S gem list
*** LOCAL GEMS ***

hoe (2.6.0)
jruby-openssl (0.6)
json_pure (1.2.4)
nokogiri (1.4.0.20100415101221)
rake (0.8.7)
rubyforge (2.0.4)
sources (0.0.1)
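
The same check can be made from Ruby, which also shows the gem platform that gem list omits. This is only a sketch: installed_nokogiri_gems is a hypothetical helper name, and it assumes a RubyGems version that provides Gem::Specification.find_all_by_name, which may be newer than the RubyGems bundled with older JRuby releases.

```ruby
require 'rubygems'

# Hypothetical helper: one line per installed nokogiri gem, formatted as
# "name version (platform)". Returns an empty array if none is installed.
def installed_nokogiri_gems
  Gem::Specification.find_all_by_name('nokogiri').map do |spec|
    "#{spec.name} #{spec.version} (#{spec.platform})"
  end
end

# RUBY_PLATFORM is "java" when the script runs under JRuby.
puts "Ruby platform: #{RUBY_PLATFORM}"

lines = installed_nokogiri_gems
puts(lines.empty? ? 'nokogiri is not installed' : lines)
```

A gem built with rake java:gem should be listed with the java platform.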

Let’s try this simple example.

$ jruby -S irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'nokogiri'
=> true
irb(main):003:0> doc = Nokogiri::XML "<root><foo /><quux /></root>"
=> #<Nokogiri::XML::Document:0x7e4 name="document" children=[#<Nokogiri::XML::Element:0x7e2 name="root" children=[#<Nokogiri::XML::Element:0x7de name="foo">, #<Nokogiri::XML::Element:0x7e0 name="quux">]>]>
irb(main):004:0> doc.to_xml
=> "<?xml version=\"1.0\"?>\n<root>\n  <foo/><quux/>\n</root>\n\n"
irb(main):005:0> node = doc.at_css("foo")
=> #<Nokogiri::XML::Element:0x7de name="foo">
irb(main):006:0> node.next_element
=> #<Nokogiri::XML::Element:0x7e6 name="quux">
irb(main):007:0> 

Worked? If you have a question, go to nokogiri-talk.

Pure Java Specific Rules

Porting to Java is not easy. Contributors have struggled with the behavioral differences between libxml2 and Xerces. Almost all of the Nokogiri API is implemented as is, but some parts were very hard to reproduce. Thus, the pure Java version has a few specific rules. Please be aware of the following when you use the pure Java version.

DTD validation

Add the “dtdvalid” option when the document is read.

xml = Nokogiri::XML(File.open(XML_FILE)) {|cfg| cfg.dtdvalid}
list = xml.internal_subset.validate xml

The number of errors is not the same as in the libxml2 version. The Java version doesn’t report errors for attributes whose elements have already had errors reported.
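
Because the error count differs between the two versions, code that checks for an exact number of errors may behave differently on each. One way to stay portable is to reduce the list to a valid/invalid flag plus the distinct messages. The sketch below is an illustration, not part of Nokogiri: summarize_dtd_errors is a hypothetical helper, and it only assumes that each error responds to to_s.

```ruby
# Hypothetical helper: summarize validation errors without relying on
# how many of them a particular Nokogiri version reports.
def summarize_dtd_errors(errors)
  # Collapse repeated messages so duplicates do not affect the result.
  messages = errors.map { |error| error.to_s.strip }.uniq
  { :valid => messages.empty?, :messages => messages }
end

# Usage with the variables from the example above (requires Nokogiri):
#   list = xml.internal_subset.validate xml
#   report = summarize_dtd_errors(list)
#   puts report[:messages] unless report[:valid]
```

Testing report[:valid] rather than the length of the list should give the same answer on both versions whenever they agree on whether the document is valid.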

Public ID in DOCTYPE declaration

Don’t forget to write the second parameter (the system identifier), even when it is empty.

<!DOCTYPE foo PUBLIC "bar" "">

Namespace in fragment

The Java API needs a correct namespace declaration even when parsing a fragment. The pure Java version adds the namespace declaration to the given fragment internally if needed. This might cause an error when exactly the same Namespace instance is required. For example, the code below:

doc = Nokogiri::XML <<-EOX
  <root xmlns:foo="http://flavorjon.es/" xmlns:bar="http://google.com/">
    <foo:existing></foo:existing>
  </root>
EOX
ns = doc.root.namespace_definitions.detect { |x| x.prefix == "bar" }
frag = doc.fragment "<bar:newnode></bar:newnode>"
p ns, frag.children.first.namespace

produces two distinct Namespace objects (note the different object ids):
<#(Namespace:0xa40 { prefix = "bar", href = "http://google.com/" })>
<#(Namespace:0xa46 { prefix = "bar", href = "http://google.com/" })>
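
If your code only needs to know whether two namespaces are equivalent, comparing them by value rather than by identity avoids this problem. A minimal sketch follows: same_namespace? is a hypothetical helper, not a Nokogiri method, and it relies only on the prefix and href readers that Namespace objects provide.

```ruby
# Hypothetical helper: true when both objects describe the same namespace
# binding, whether or not they are the same Ruby object.
def same_namespace?(a, b)
  return false if a.nil? || b.nil?
  a.prefix == b.prefix && a.href == b.href
end

# Usage with the variables from the example above (requires Nokogiri):
#   same_namespace?(ns, frag.children.first.namespace)
```

For the two objects printed above, this comparison should be true, since both have the prefix "bar" and the href "http://google.com/", while ns.equal?(frag.children.first.namespace) is false.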