Merge pull request #2278 from sparklemotion/flavorjones-introduce-html4-namespace

introduce html4 namespace

---

**What problem is this PR intended to solve?**

As the Nokogumbo merger progresses (see #2204), we now have an `HTML5` module and namespace, but the previous libxml2-based (and NekoHTML-based) functionality is parked under the ambiguous `HTML` module and namespace.

I'd like to disambiguate, and also introduce an opportunity for us to use `HTML` for more general use in the future (e.g., perhaps detection of HTML doc format and choosing the right DOM parser).

This PR moves everything currently under `HTML` to `HTML4`, and makes `HTML` an alias for `HTML4`. It updates doc strings and class names.

Some changes in behavior that I want to note:

- objects will report a class of `Nokogiri::HTML4::XXX` where they previously reported `Nokogiri::HTML::XXX`
- some of the exported C symbols have been renamed (e.g., `mNokogiriHTML` is now `mNokogiriHTML4`) which might impact anyone writing C code and linking against Nokogiri's dylib
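
A minimal sketch of what the alias means for class names, assuming a hypothetical stand-in module `Demo` (this is not Nokogiri's actual source, only an illustration of how a Ruby constant alias behaves). Because `HTML` and `HTML4` refer to the same module object, `is_a?` checks keep working through either name, while comparisons against the old class-name string do not:

```ruby
# Hypothetical stand-in for the namespace layout; not Nokogiri's real code.
module Demo
  module HTML4
    class Document; end
  end

  # The alias: both constants now point at the same module object.
  HTML = HTML4
end

doc = Demo::HTML::Document.new

puts Demo::HTML.equal?(Demo::HTML4)             # true
puts doc.class.name                             # Demo::HTML4::Document
puts doc.is_a?(Demo::HTML::Document)            # true
puts doc.class.name == "Demo::HTML::Document"   # false
```

Code that matches on the string returned by `class.name` is the case most likely to need updating; type checks such as `is_a?` are unaffected by the rename.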


**Have you included adequate test coverage?**

I've left the tests alone (except for the addition of some "HTML/HTML4 equivalence" tests) to demonstrate there's no behavioral breakage.

**Does this change affect the behavior of either the C or the Java implementations?**

Notably, I've updated the Java files to rename classes and variables, and to use the proper module and class names, so that they stay in sync with CRuby despite not having an `HTML5` module/namespace.
flavorjones committed Jun 21, 2021
2 parents 87f7d63 + 4ac9350 commit b022660
Showing 74 changed files with 703 additions and 745 deletions.
3 changes: 3 additions & 0 deletions .yardopts
@@ -1,5 +1,8 @@
--embed-mixins
--main=README.md
--exclude=lib/nokogiri/css/tokenizer.rb
--exclude=lib/nokogiri/css/parser.rb
--exclude=ext/nokogiri/test_global_handlers.c
lib/**/*.rb
ext/nokogiri/*.c
-
21 changes: 17 additions & 4 deletions CHANGELOG.md
@@ -6,17 +6,25 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

## next / unreleased

### Added
### Notable Addition: HTML5 Support (CRuby only)

__HTML5 support__ has been added (to CRuby only) by merging [Nokogumbo](https://github.com/rubys/nokogumbo) into Nokogiri. The Nokogumbo public API has been preserved, so this functionality is available under the `Nokogiri::HTML5` namespace. [[#2204](https://github.com/sparklemotion/nokogiri/issues/2204)]

Please note that HTML5 support is not available for JRuby in this version. However, we feel it is important to think about JRuby and we hope to work on this in the future. If you're interested in helping with HTML5 support on JRuby, please reach out to the maintainers by commenting on issue [#2227](https://github.com/sparklemotion/nokogiri/issues/2227).

Please also note that the `Nokogiri::HTML` parse methods still use libxml2's HTML4 parser in the v1.12 release series. Future releases of Nokogiri may change this behavior, but we'll proceed cautiously to avoid breaking existing applications.

Many thanks to Sam Ruby, Steve Checkoway, and Craig Barnes for creating and maintaining Nokogumbo and supporting the Gumbo HTML5 parser. They're now Nokogiri core contributors with all the powers and privileges pertaining thereto. 🙌

#### Other

### Notable Change: `Nokogiri::HTML4` module and namespace

`Nokogiri::HTML` has been renamed to `Nokogiri::HTML4`, and `Nokogiri::HTML` is aliased to preserve backwards-compatibility. `Nokogiri::HTML` and `Nokogiri::HTML4` parse methods still use libxml2's (or NekoHTML's) HTML4 parser in the v1.12 release series.

Take special note that if you rely on the class name of an object in your code, objects will now report a class of `Nokogiri::HTML4::Foo` where they previously reported `Nokogiri::HTML::Foo`. Instead of relying on the string returned by `Object#class`, prefer `Class#===` or `Object#is_a?` or `Object#instance_of?`.

Future releases of Nokogiri may deprecate `HTML` methods or otherwise change this behavior, so please start using `HTML4` in place of `HTML`.


### Added

* [CRuby] `Nokogiri::VERSION_INFO["libxslt"]["datetime_enabled"]` is a new boolean value which describes whether libxslt (or, more properly, libexslt) has compiled-in datetime support. This is generally going to be `true`, but some distros ship without this support (e.g., some mingw UCRT-based packages, see https://github.com/msys2/MINGW-packages/pull/8957). See [#2272](https://github.com/sparklemotion/nokogiri/issues/2272) for more details.

@@ -38,6 +46,11 @@ Many thanks to Sam Ruby, Steve Checkoway, and Craig Barnes for creating and main
* [CRuby] Speed up (slightly) the compile time of packaged libraries `libiconv`, `libxml2`, and `libxslt` by using autoconf's `--disable-dependency-tracking` option. ("ruby" platform gem only.)


### Deprecated

* Deprecating Nokogumbo's `Nokogiri::HTML5.get`. This method will be removed in a future version of Nokogiri.


### Dependencies

* [CRuby] Upgrade mini_portile2 dependency from `~> 2.5.0` to `~> 2.6.1`. ("ruby" platform gem only.)
2 changes: 1 addition & 1 deletion README.md
@@ -14,7 +14,7 @@ Some guiding principles Nokogiri tries to follow:

## Features Overview

- DOM Parser for XML and HTML4
- DOM Parser for XML, HTML4, and HTML5
- SAX Parser for XML and HTML4
- Push Parser for XML and HTML4
- Document search via XPath 1.0
@@ -18,13 +18,13 @@
import static nokogiri.internals.NokogiriHelpers.getNokogiriClass;

/**
* Class for Nokogiri::HTML::Document.
* Class for Nokogiri::HTML4::Document.
*
* @author sergio
* @author Yoko Harada <yokolet@gmail.com>
*/
@JRubyClass(name = "Nokogiri::HTML::Document", parent = "Nokogiri::XML::Document")
public class HtmlDocument extends XmlDocument
@JRubyClass(name = "Nokogiri::HTML4::Document", parent = "Nokogiri::XML::Document")
public class Html4Document extends XmlDocument
{
private static final String DEFAULT_CONTENT_TYPE = "html";
private static final String DEFAULT_PUBLIC_ID = "-//W3C//DTD HTML 4.01//EN";
@@ -33,19 +33,19 @@ public class HtmlDocument extends XmlDocument
private String parsed_encoding = null;

public
HtmlDocument(Ruby ruby, RubyClass klazz)
Html4Document(Ruby ruby, RubyClass klazz)
{
super(ruby, klazz);
}

public
HtmlDocument(Ruby runtime, Document document)
Html4Document(Ruby runtime, Document document)
{
this(runtime, getNokogiriClass(runtime, "Nokogiri::XML::Document"), document);
}

public
HtmlDocument(Ruby ruby, RubyClass klazz, Document doc)
Html4Document(Ruby ruby, RubyClass klazz, Document doc)
{
super(ruby, klazz, doc);
}
@@ -55,10 +55,10 @@ public class HtmlDocument extends XmlDocument
rbNew(ThreadContext context, IRubyObject klazz, IRubyObject[] args)
{
final Ruby runtime = context.runtime;
HtmlDocument htmlDocument;
Html4Document htmlDocument;
try {
Document docNode = createNewDocument(runtime);
htmlDocument = (HtmlDocument) NokogiriService.HTML_DOCUMENT_ALLOCATOR.allocate(runtime, (RubyClass) klazz);
htmlDocument = (Html4Document) NokogiriService.HTML_DOCUMENT_ALLOCATOR.allocate(runtime, (RubyClass) klazz);
htmlDocument.setDocumentNode(context.runtime, docNode);
} catch (Exception ex) {
throw asRuntimeError(runtime, "couldn't create document: ", ex);
@@ -135,13 +135,6 @@ public class HtmlDocument extends XmlDocument
return parsed_encoding;
}

/*
* call-seq:
* read_io(io, url, encoding, options)
*
* Read the HTML document from +io+ with given +url+, +encoding+,
* and +options+. See Nokogiri::HTML.parse
*/
@JRubyMethod(meta = true, required = 4)
public static IRubyObject
read_io(ThreadContext context, IRubyObject klass, IRubyObject[] args)
@@ -151,13 +144,6 @@ public class HtmlDocument extends XmlDocument
return ctx.parse(context, (RubyClass) klass, args[1]);
}

/*
* call-seq:
* read_memory(string, url, encoding, options)
*
* Read the HTML document contained in +string+ with given +url+, +encoding+,
* and +options+. See Nokogiri::HTML.parse
*/
@JRubyMethod(meta = true, required = 4)
public static IRubyObject
read_memory(ThreadContext context, IRubyObject klass, IRubyObject[] args)
@@ -16,12 +16,12 @@
import org.jruby.runtime.builtin.IRubyObject;

/**
* Class for Nokogiri::HTML::ElementDescription.
* Class for Nokogiri::HTML4::ElementDescription.
*
* @author Patrick Mahoney <pat@polycrystal.org>
*/
@JRubyClass(name = "Nokogiri::HTML::ElementDescription")
public class HtmlElementDescription extends RubyObject
@JRubyClass(name = "Nokogiri::HTML4::ElementDescription")
public class Html4ElementDescription extends RubyObject
{

/**
@@ -38,7 +38,7 @@ public class HtmlElementDescription extends RubyObject
protected HTMLElements.Element element;

public
HtmlElementDescription(Ruby runtime, RubyClass rubyClass)
Html4ElementDescription(Ruby runtime, RubyClass rubyClass)
{
super(runtime, rubyClass);
}
@@ -89,8 +89,8 @@ public class HtmlElementDescription extends RubyObject
return context.nil;
}

HtmlElementDescription desc =
new HtmlElementDescription(context.getRuntime(), (RubyClass)klazz);
Html4ElementDescription desc =
new Html4ElementDescription(context.getRuntime(), (RubyClass)klazz);
desc.element = elem;
return desc;
}
@@ -12,16 +12,16 @@
import org.jruby.runtime.builtin.IRubyObject;

/**
* Class for Nokogiri::HTML::EntityLookup.
* Class for Nokogiri::HTML4::EntityLookup.
*
* @author Patrick Mahoney <pat@polycrystal.org>
*/
@JRubyClass(name = "Nokogiri::HTML::EntityLookup")
public class HtmlEntityLookup extends RubyObject
@JRubyClass(name = "Nokogiri::HTML4::EntityLookup")
public class Html4EntityLookup extends RubyObject
{

public
HtmlEntityLookup(Ruby runtime, RubyClass rubyClass)
Html4EntityLookup(Ruby runtime, RubyClass rubyClass)
{
super(runtime, rubyClass);
}
@@ -41,7 +41,7 @@ public class HtmlEntityLookup extends RubyObject
if (val == -1) { return ruby.getNil(); }

IRubyObject edClass =
ruby.getClassFromPath("Nokogiri::HTML::EntityDescription");
ruby.getClassFromPath("Nokogiri::HTML4::EntityDescription");
IRubyObject edObj = invoke(context, edClass, "new",
ruby.newFixnum(val), ruby.newString(name),
ruby.newString(name + " entity"));
@@ -24,27 +24,27 @@
import static nokogiri.internals.NokogiriHelpers.rubyStringToString;

/**
* Class for Nokogiri::HTML::SAX::ParserContext.
* Class for Nokogiri::HTML4::SAX::ParserContext.
*
* @author serabe
* @author Patrick Mahoney <pat@polycrystal.org>
* @author Yoko Harada <yokolet@gmail.com>
*/

@JRubyClass(name = "Nokogiri::HTML::SAX::ParserContext", parent = "Nokogiri::XML::SAX::ParserContext")
public class HtmlSaxParserContext extends XmlSaxParserContext
@JRubyClass(name = "Nokogiri::HTML4::SAX::ParserContext", parent = "Nokogiri::XML::SAX::ParserContext")
public class Html4SaxParserContext extends XmlSaxParserContext
{

static HtmlSaxParserContext
static Html4SaxParserContext
newInstance(final Ruby runtime, final RubyClass klazz)
{
HtmlSaxParserContext instance = new HtmlSaxParserContext(runtime, klazz);
Html4SaxParserContext instance = new Html4SaxParserContext(runtime, klazz);
instance.initialize(runtime);
return instance;
}

public
HtmlSaxParserContext(Ruby ruby, RubyClass rubyClass)
Html4SaxParserContext(Ruby ruby, RubyClass rubyClass)
{
super(ruby, rubyClass);
}
@@ -68,7 +68,7 @@ public class HtmlSaxParserContext extends XmlSaxParserContext
return parser;
} catch (SAXException ex) {
throw new SAXException(
"Problem while creating HTML SAX Parser: " + ex.toString());
"Problem while creating HTML4 SAX Parser: " + ex.toString());
}
}

@@ -79,7 +79,7 @@ public class HtmlSaxParserContext extends XmlSaxParserContext
IRubyObject data,
IRubyObject encoding)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
CharSequence input = applyEncoding(rubyStringToString(data.convertToString()), javaEncoding);
@@ -231,7 +231,7 @@ static EncodingType get(final int ordinal)
IRubyObject data,
IRubyObject encoding)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(context.runtime, (RubyClass) klass);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass);
ctx.setInputSourceFile(context, data);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
@@ -247,7 +247,7 @@ static EncodingType get(final int ordinal)
IRubyObject data,
IRubyObject encoding)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(context.runtime, (RubyClass) klass);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass);
ctx.setIOInputSource(context, data, context.nil);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
@@ -258,12 +258,12 @@ static EncodingType get(final int ordinal)

/**
* Create a new parser context that will read from a raw input stream.
* Meant to be run in a separate thread by HtmlSaxPushParser.
* Meant to be run in a separate thread by Html4SaxPushParser.
*/
static HtmlSaxParserContext
static Html4SaxParserContext
parse_stream(final Ruby runtime, RubyClass klass, InputStream stream)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(runtime, klass);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(runtime, klass);
ctx.setInputSource(stream);
return ctx;
}
@@ -27,25 +27,25 @@
import org.jruby.runtime.builtin.IRubyObject;

/**
* Class for Nokogiri::HTML::SAX::PushParser
* Class for Nokogiri::HTML4::SAX::PushParser
*
* @author
* @author Piotr Szmielew <p.szmielew@ava.waw.pl> - based on Nokogiri::XML::SAX::PushParser
*/
@JRubyClass(name = "Nokogiri::HTML::SAX::PushParser")
public class HtmlSaxPushParser extends RubyObject
@JRubyClass(name = "Nokogiri::HTML4::SAX::PushParser")
public class Html4SaxPushParser extends RubyObject
{
ParserContext.Options options;
IRubyObject saxParser;

NokogiriBlockingQueueInputStream stream;

private ParserTask parserTask = null;
private FutureTask<HtmlSaxParserContext> futureTask = null;
private FutureTask<Html4SaxParserContext> futureTask = null;
private ExecutorService executor = null;

public
HtmlSaxPushParser(Ruby ruby, RubyClass rubyClass)
Html4SaxPushParser(Ruby ruby, RubyClass rubyClass)
{
super(ruby, rubyClass);
}
@@ -111,7 +111,7 @@ public class HtmlSaxPushParser extends RubyObject
final ByteArrayInputStream data = NokogiriHelpers.stringBytesToStream(chunk);
if (data == null) {
terminateTask(context.runtime);
throw XmlSyntaxError.createHTMLSyntaxError(context.runtime).toThrowable(); // Nokogiri::HTML::SyntaxError
throw XmlSyntaxError.createHTMLSyntaxError(context.runtime).toThrowable(); // Nokogiri::HTML4::SyntaxError
}

int errorCount0 = parserTask.getErrorCount();
@@ -149,12 +149,12 @@ public class HtmlSaxPushParser extends RubyObject

assert saxParser != null : "saxParser null";
parserTask = new ParserTask(context, saxParser, stream);
futureTask = new FutureTask<HtmlSaxParserContext>((Callable) parserTask);
futureTask = new FutureTask<Html4SaxParserContext>((Callable) parserTask);
executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
@Override
public Thread newThread(Runnable r) {
Thread t = new Thread(r);
t.setName("HtmlSaxPushParser");
t.setName("Html4SaxPushParser");
t.setDaemon(true);
return t;
}
@@ -187,14 +187,14 @@ public Thread newThread(Runnable r) {
futureTask = null;
}

private static HtmlSaxParserContext
private static Html4SaxParserContext
parse(final Ruby runtime, final InputStream stream)
{
RubyClass klazz = getNokogiriClass(runtime, "Nokogiri::HTML::SAX::ParserContext");
return HtmlSaxParserContext.parse_stream(runtime, klazz, stream);
RubyClass klazz = getNokogiriClass(runtime, "Nokogiri::HTML4::SAX::ParserContext");
return Html4SaxParserContext.parse_stream(runtime, klazz, stream);
}

static class ParserTask extends XmlSaxPushParser.ParserTask /* <HtmlSaxPushParser> */
static class ParserTask extends XmlSaxPushParser.ParserTask /* <Html4SaxPushParser> */
{

private
@@ -204,10 +204,10 @@ static class ParserTask extends XmlSaxPushParser.ParserTask /* <HtmlSaxPushParse
}

@Override
public HtmlSaxParserContext
public Html4SaxParserContext
call() throws Exception
{
return (HtmlSaxParserContext) super.call();
return (Html4SaxParserContext) super.call();
}

}