
HTML Entities cause split after the next character #511

Open
roelandvanbatenburg opened this issue Sep 29, 2023 · 4 comments

@roelandvanbatenburg

I noticed this behaviour in https://github.com/rusterlium/html5ever_elixir, but it seems to be part of this excellent project.

What I see is that HTML entities cause a split after the next regular character. The input &quot;ABC&quot;&gt;DE decodes to "ABC">DE, but it comes back as the separate character tokens "\"", "A", "BC", "\"", ">", "D", "E".

I expected a single character token "\"ABC\">DE" instead.

I wrote a program (with ChatGPT, as I am not very familiar with Rust) to demonstrate. If there is an error in there, please let me know.

use tendril::SliceExt;
use html5ever::tokenizer::{BufferQueue, TokenSinkResult, TokenSink, Token};

struct TokenPrinter;

impl TokenSink for TokenPrinter {
    type Handle = (); // This can be unit type if you're not using handles in your implementation

    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<Self::Handle> {
        // Print every token the tokenizer emits.
        println!("{:?}", token);
        TokenSinkResult::Continue
    }
}

fn main() {
    // The HTML string you want to tokenize
    let html_string = r#"&quot;ABC&quot;&gt;DE"#;

    // Tokenize the HTML string and print the tokens
    let mut tokenizer = html5ever::tokenizer::Tokenizer::new(TokenPrinter, Default::default());
    let mut input = BufferQueue::new();
    input.push_back(html_string.to_tendril());
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();
}

gives the output

CharacterTokens(Tendril<UTF8>(inline: "\""))
CharacterTokens(Tendril<UTF8>(inline: "A"))
CharacterTokens(Tendril<UTF8>(inline: "BC"))
CharacterTokens(Tendril<UTF8>(inline: "\""))
CharacterTokens(Tendril<UTF8>(inline: ">"))
CharacterTokens(Tendril<UTF8>(inline: "D"))
CharacterTokens(Tendril<UTF8>(inline: "E"))
EOFToken

If you need any more information, please let me know.

@ryanavella

Interestingly, with just one small change, the string splits at slightly different boundaries.

-    let html_string = r#"&quot;ABC&quot;&gt;DE"#;
+    let html_string = r#"ABC&quot;&gt;DE"#;

Output:

CharacterTokens(Tendril<UTF8>(inline: "ABC"))
CharacterTokens(Tendril<UTF8>(inline: "\""))
CharacterTokens(Tendril<UTF8>(inline: ">"))
CharacterTokens(Tendril<UTF8>(inline: "D"))
CharacterTokens(Tendril<UTF8>(inline: "E"))
EOFToken

This is very unusual; my best explanation is that Tokenizer is more stateful than the docs would lead one to believe. I have tried several combinations of non-default TokenizerOpts and cannot convince the tokenizer to group character tokens together.

I can replicate the same behavior with xml5ever, so we know it is not specific to html5ever.

@ryanavella

I think I see why this is happening.

When processing named entities in do_named, the first character of the BufferQueue is eagerly removed (rather than peeked at) and then appended to self.name_buf_mut():

// peek + discard skips over newline normalization, therefore making it easier to
// un-consume
let c = unwrap_or_return!(tokenizer.peek(input), Stuck);
tokenizer.discard_char(input);
self.name_buf_mut().push_char(c);

Then in finish_name, the character is unconditionally stuffed back into the front of the BufferQueue as a new StrTendril:

input.push_front(StrTendril::from_slice(&self.name_buf()[name_len..]));
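To see the mechanism in isolation, here is a toy model (plain Strings and a VecDeque, not the real StrTendril/BufferQueue API) of why that push_front splits the character tokens, assuming, as the output above suggests, that the tokenizer emits at most one character run per underlying buffer:

```rust
use std::collections::VecDeque;

// Toy model: one character token per buffer, as the fast path appears to do.
// Every push_front of an un-consumed suffix creates a new buffer, so every
// buffer seam becomes a token boundary.
fn emit_character_tokens(mut queue: VecDeque<String>) -> Vec<String> {
    let mut tokens = Vec::new();
    while let Some(buf) = queue.pop_front() {
        if !buf.is_empty() {
            tokens.push(buf);
        }
    }
    tokens
}

fn main() {
    // Contiguous input: a single character token.
    let one: VecDeque<String> = VecDeque::from(vec!["DE".to_string()]);
    assert_eq!(emit_character_tokens(one), vec!["DE"]);

    // After finish_name pushes the un-consumed character back as a NEW
    // buffer, the same text arrives as two buffers, hence two tokens --
    // matching the separate "D" and "E" tokens in the output above.
    let two: VecDeque<String> = VecDeque::from(vec!["D".to_string(), "E".to_string()]);
    assert_eq!(emit_character_tokens(two), vec!["D", "E"]);
    println!("ok");
}
```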

I made a few attempts at fixing this, though in doing so I caused other tests to fail. So I'm afraid I'm not familiar enough with the code to draft a proper fix. I see that @untitaker has most recently interacted with this code so perhaps they have some ideas?

@untitaker
Contributor

untitaker commented Apr 21, 2024

It's not technically wrong for the tokenizer to produce multiple character tokens, and I don't think html5ever takes particular care in general to emit multiple consecutive character tokens as a single token. The behavior is purely oriented around what is most convenient and most efficient for the tokenizer itself.

In the HTML spec, a character token can only contain one character; the fact that a character token in html5ever can contain multiple characters is just a performance optimization for when a large buffer needs to be forwarded as-is.

I believe there are more cases where character tokens are split arbitrarily into multiple tokens, and they might be harder to predict. I think it's better if users of the tokenizer handle this transparently.
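For anyone who needs to handle this in their sink, here is a minimal sketch of one way to coalesce consecutive character tokens before doing per-token work. It uses a simplified stand-in Token enum, not the real html5ever types, purely for illustration:

```rust
// Simplified stand-in for html5ever's token stream (NOT the real
// Token/CharacterTokens types): merge runs of character tokens so that
// downstream logic sees "ABC">DE as one token instead of seven.
#[derive(Debug, PartialEq)]
enum Token {
    Characters(String), // stand-in for CharacterTokens(StrTendril)
    Eof,                // stand-in for EOFToken
}

fn coalesce(tokens: Vec<Token>) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::new();
    for tok in tokens {
        if let Token::Characters(s) = &tok {
            // Extend the previous character token instead of starting a new one.
            if let Some(Token::Characters(buf)) = out.last_mut() {
                buf.push_str(s);
                continue;
            }
        }
        out.push(tok);
    }
    out
}

fn main() {
    // The split stream from the original report.
    let split = vec![
        Token::Characters("\"".into()),
        Token::Characters("A".into()),
        Token::Characters("BC".into()),
        Token::Characters("\"".into()),
        Token::Characters(">".into()),
        Token::Characters("D".into()),
        Token::Characters("E".into()),
        Token::Eof,
    ];
    let merged = coalesce(split);
    assert_eq!(merged, vec![Token::Characters("\"ABC\">DE".into()), Token::Eof]);
    println!("{:?}", merged);
}
```

In a real sink the same idea applies: buffer incoming CharacterTokens and flush the accumulated text whenever a non-character token (or end of input) arrives.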

@ryanavella

ryanavella commented Apr 21, 2024

I'm generally in agreement RE: correctness. To me this is more of a performance concern.

I see two potential sources of overhead:

  • In html5ever itself, some bytes are unnecessarily inspected more than once.
  • In downstream user code, TokenSink::process_token (which could have non-trivial logic) is being dispatched 50% more than is necessary, amortized.

I'm going to play around with the code some more and see if I can get all tests to pass while avoiding extra splitting of the input buffer.
