How to avoid `Nokogiri::HTML.parse` behavior #2998

YusukeSuzuki · 2023-09-29T05:58:06Z

Nokogiri::HTML.parse('<div>text</div>').to_html creates

"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>test</p></body></html>\n"

but Nokogiri::HTML.parse('text').to_html creates

"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>test</p></body></html>\n"

How do I work around this bug without using Nokogiri::HTML::DocumentFragment.parse?

Answered by flavorjones

flavorjones · 2023-09-29T13:45:45Z

Maintainer

@YusukeSuzuki Thanks for asking this question. What you're seeing is how libxml2 (the underlying HTML4 parser) constructs a document around this fragment.

I'm curious why you don't want to use DocumentFragment, as this is exactly the use case it addresses. Neither <div>text</div> nor text is a Document.

You may also want to use Nokogiri::HTML5, which uses libgumbo instead of libxml2 and follows the precise rules in the HTML5 spec around document structure:

Nokogiri::HTML5.parse('<div>text</div>').to_html
# => "<html><head></head><body><div>text</div></body></html>"

Nokogiri::HTML5.parse('text').to_html
# => "<html><head></head><body>text</body></html>"

But really, again, I suggest you consider using fragment parsing:

Nokogiri::HTML5.fragment('<div>text</div>').to_html
# => "<div>text</div>"

Nokogiri::HTML5.fragment('text').to_html
# => "text"


flavorjones Sep 29, 2023
Maintainer

By the way, I get different results from what you wrote in your original post, which I assume is a copy-paste error. You report seeing <p>:

Nokogiri::HTML.parse('<div>text</div>').to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>test</p></body></html>\n"

but I see <div>:

Nokogiri::HTML4.parse('<div>text</div>').to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html><body><div>text</div></body></html>\n"

Just wanted to make sure I understand what you're asking.
