Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] HTML5 document encoding differs from HTML4 #2801

Open
flavorjones opened this issue Feb 28, 2023 · 3 comments
Open

[bug] HTML5 document encoding differs from HTML4 #2801

flavorjones opened this issue Feb 28, 2023 · 3 comments

Comments

@flavorjones
Copy link
Member

flavorjones commented Feb 28, 2023

Please describe the bug

The encoding of an HTML5 document differs from the encoding of an HTML4 document:

Nokogiri::HTML4::Document.parse(File.open(SHIFT_JIS_HTML)).encoding == "Shift_JIS"
Nokogiri::HTML5::Document.parse(File.open(SHIFT_JIS_HTML)).encoding == "UTF-8"

I haven't had time to dig into why this is (and whether it's intended behavior), so I'm opening this issue to look into it later. cc @stevecheckoway

Help us reproduce what you're seeing

#! /usr/bin/env ruby

$: << "lib"
require 'nokogiri'
require_relative 'test/helper'

class Test < Nokogiri::TestCase
  describe "document encoding" do
    describe "HTML4" do
      describe "given a File" do
        it "should detect shift_jis" do
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.open(SHIFT_JIS_HTML)).encoding,
          )
        end
      end

      describe "given a File and an encoding" do
        it "should detect shift_jis" do
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.open(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end

      describe "given a String" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.read(SHIFT_JIS_HTML, encoding: "Shift_JIS")).encoding,
          )
        end
      end

      describe "given a String and an encoding" do
        it "should detect shift_jis" do
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML4::Document.parse(File.read(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end
    end

    describe "HTML5" do
      describe "given a File" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.open(SHIFT_JIS_HTML)).encoding,
          )
        end
      end

      describe "given a File and an encoding" do
        it "should detect shift_jis" do
          # errors
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.open(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end

      describe "given a String" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.read(SHIFT_JIS_HTML, encoding: "Shift_JIS")).encoding,
          )
        end
      end

      describe "given a String and an encoding" do
        it "should detect shift_jis" do
          # fails
          assert_equal(
            "Shift_JIS",
            Nokogiri::HTML5::Document.parse(File.read(SHIFT_JIS_HTML), nil, "Shift_JIS").encoding,
          )
        end
      end
    end
  end
end

yields

Error:
document encoding::HTML5::given a File and an encoding#test_0001_should detect shift_jis:
TypeError: no implicit conversion of Hash into Integer
    /home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5.rb:266:in `read'
    /home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5.rb:266:in `read_and_encode'
    /home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5/document.rb:119:in `do_parse'
    /home/flavorjones/code/oss/nokogiri/lib/nokogiri/html5/document.rb:95:in `parse'
    ./html5-document-encoding.rb:64:in `block (4 levels) in <class:Test>'

Failure:
document encoding::HTML5::given a File#test_0001_should detect shift_jis [./html5-document-encoding.rb:52]
Minitest::Assertion: Expected: "Shift_JIS"
  Actual: "UTF-8"

Failure:
document encoding::HTML5::given a String and an encoding#test_0001_should detect shift_jis [./html5-document-encoding.rb:82]
Minitest::Assertion: Expected: "Shift_JIS"
  Actual: "UTF-8"

Failure:
document encoding::HTML5::given a String#test_0001_should detect shift_jis [./html5-document-encoding.rb:72]
Minitest::Assertion: Expected: "Shift_JIS"
  Actual: "UTF-8"

Expected behavior

I think these should both be the same?

@flavorjones
Copy link
Member Author

flavorjones commented Feb 28, 2023

There are a couple of things worth digging into here:

  1. why isn't the HTML4 EncodingReader discovering the encoding in the "given a string" case? Never mind! This is because the parser uses the encoding of the String (so long as it's not binary/ascii-8bit).
  2. looks like passing a File and an encoding isn't supported by HTML5 (it raises an exception)
  3. can we make the HTML5 encoding detection match the HTML5 behavior? I think we might be able to use the EncodingReader if it's helpful.

@stevecheckoway
Copy link
Contributor

Since Gumbo doesn't support anything other than UTF-8, it performs the standard encoding detection pre-scan that browsers are supposed to perform to decide on the encoding and then uses that to convert to UTF-8 to pass to Gumbo.

I'm wasn't sure what the encoding property is supposed to return so I set it to the encoding of the strings in the Document itself rather than the encoding of the input document. If we want the latter, that is probably easy to change.

I'm not sure how this is supposed to interact with #to_html or other serialization methods though. Should those produce strings in the Document's encoding? I know Ruby has a concept of internal and external encoding for streams but I don't know how those should interact with this.

Basically, I don't know what the correct behavior around encodings is supposed to be but since Gumbo only supports UTF-8, I punted on it.

@flavorjones
Copy link
Member Author

@stevecheckoway Ah, that's really helpful! Now I'm remembering that we briefly talked about this at #2513

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants