
HTML Entities cause split after the next character #511

Open
roelandvanbatenburg opened this issue Sep 29, 2023 · 4 comments

@roelandvanbatenburg

I noticed this behaviour in https://github.com/rusterlium/html5ever_elixir, but it seems to be part of this excellent project.

What I see is that HTML entities cause a split after the next regular character. The input &quot;ABC&quot;&gt;DE decodes to "ABC">DE, but it comes back as the separate character tokens "\"", "A", "BC", "\"", ">", "D", "E".

I expected a single character token "\"ABC\">DE" instead.

I wrote a program (with ChatGPT, as I am not very familiar with Rust) to demonstrate. If there is an error in there, please let me know.

use tendril::SliceExt;
use html5ever::tokenizer::{BufferQueue, TokenSinkResult, TokenSink, Token};

struct TokenPrinter;

impl TokenSink for TokenPrinter {
    type Handle = (); // This can be unit type if you're not using handles in your implementation

    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<Self::Handle> {
        // Print every token the tokenizer emits.
        println!("{:?}", token);
        TokenSinkResult::Continue
    }
}

fn main() {
    // The HTML string you want to tokenize
    let html_string = r#"&quot;ABC&quot;&gt;DE"#;

    // Tokenize the HTML string and print the tokens
    let mut tokenizer = html5ever::tokenizer::Tokenizer::new(TokenPrinter, Default::default());
    let mut input = BufferQueue::new();
    input.push_back(html_string.to_tendril());
    let _ = tokenizer.feed(&mut input);
    tokenizer.end();
}

gives the output

CharacterTokens(Tendril<UTF8>(inline: "\""))
CharacterTokens(Tendril<UTF8>(inline: "A"))
CharacterTokens(Tendril<UTF8>(inline: "BC"))
CharacterTokens(Tendril<UTF8>(inline: "\""))
CharacterTokens(Tendril<UTF8>(inline: ">"))
CharacterTokens(Tendril<UTF8>(inline: "D"))
CharacterTokens(Tendril<UTF8>(inline: "E"))
EOFToken

If you need any more information, please let me know.

@ryanavella

Interestingly, with just one small change, the string splits at slightly different boundaries.

-    let html_string = r#"&quot;ABC&quot;&gt;DE"#;
+    let html_string = r#"ABC&quot;&gt;DE"#;

Output:

CharacterTokens(Tendril<UTF8>(inline: "ABC"))
CharacterTokens(Tendril<UTF8>(inline: "\""))
CharacterTokens(Tendril<UTF8>(inline: ">"))
CharacterTokens(Tendril<UTF8>(inline: "D"))
CharacterTokens(Tendril<UTF8>(inline: "E"))
EOFToken

This is very unusual; my best explanation is that Tokenizer is more stateful than the docs would lead one to believe. I have tried several combinations of non-default TokenizerOpts and cannot convince the tokenizer to group character tokens together.

I can replicate the same behavior with xml5ever, so we know it is not specific to html5ever.

@ryanavella

I think I see why this is happening.

When processing named entities in do_named, the first character of the BufferQueue is eagerly removed (rather than peeked at) and then appended to self.name_buf_mut():

// peek + discard skips over newline normalization, therefore making it easier to
// un-consume
let c = unwrap_or_return!(tokenizer.peek(input), Stuck);
tokenizer.discard_char(input);
self.name_buf_mut().push_char(c);

Then in finish_name, the character is unconditionally stuffed back into the front of the BufferQueue as a new StrTendril:

input.push_front(StrTendril::from_slice(&self.name_buf()[name_len..]));
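To see the mechanism in isolation, here is a toy model (plain Strings and a VecDeque, not the real StrTendril/BufferQueue API) of why that push_front splits the character tokens, assuming, as the output above suggests, that the tokenizer emits at most one character run per underlying buffer:

```rust
use std::collections::VecDeque;

// Toy model: one character token per buffer, as the fast path appears to do.
// Every push_front of an un-consumed suffix creates a new buffer, so every
// buffer seam becomes a token boundary.
fn emit_character_tokens(mut queue: VecDeque<String>) -> Vec<String> {
    let mut tokens = Vec::new();
    while let Some(buf) = queue.pop_front() {
        if !buf.is_empty() {
            tokens.push(buf);
        }
    }
    tokens
}

fn main() {
    // Contiguous input: a single character token.
    let one: VecDeque<String> = VecDeque::from(vec!["DE".to_string()]);
    assert_eq!(emit_character_tokens(one), vec!["DE"]);

    // After finish_name pushes the un-consumed character back as a NEW
    // buffer, the same text arrives as two buffers, hence two tokens --
    // matching the separate "D" and "E" tokens in the output above.
    let two: VecDeque<String> = VecDeque::from(vec!["D".to_string(), "E".to_string()]);
    assert_eq!(emit_character_tokens(two), vec!["D", "E"]);
    println!("ok");
}
```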

I made a few attempts at fixing this, though in doing so I caused other tests to fail. So I'm afraid I'm not familiar enough with the code to draft a proper fix. I see that @untitaker has most recently interacted with this code so perhaps they have some ideas?

@untitaker
Contributor

untitaker commented Apr 21, 2024

It's not technically wrong for the tokenizer to produce multiple character tokens, and I don't think html5ever takes particular care in general to emit multiple consecutive character tokens as a single token. The behavior is purely oriented around what is most convenient and most efficient for the tokenizer itself.

In the HTML spec, a character token can only contain one character; the fact that a character token in html5ever can contain multiple characters is just a performance optimization for when a large buffer needs to be forwarded as-is.

I believe there are more cases where character tokens are split arbitrarily into multiple tokens, and they might be harder to predict. I think it's better if users of the tokenizer handle this transparently.
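For anyone who needs to handle this in their sink, here is a minimal sketch of one way to coalesce consecutive character tokens before doing per-token work. It uses a simplified stand-in Token enum, not the real html5ever types, purely for illustration:

```rust
// Simplified stand-in for html5ever's token stream (NOT the real
// Token/CharacterTokens types): merge runs of character tokens so that
// downstream logic sees "ABC">DE as one token instead of seven.
#[derive(Debug, PartialEq)]
enum Token {
    Characters(String), // stand-in for CharacterTokens(StrTendril)
    Eof,                // stand-in for EOFToken
}

fn coalesce(tokens: Vec<Token>) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::new();
    for tok in tokens {
        if let Token::Characters(s) = &tok {
            // Extend the previous character token instead of starting a new one.
            if let Some(Token::Characters(buf)) = out.last_mut() {
                buf.push_str(s);
                continue;
            }
        }
        out.push(tok);
    }
    out
}

fn main() {
    // The split stream from the original report.
    let split = vec![
        Token::Characters("\"".into()),
        Token::Characters("A".into()),
        Token::Characters("BC".into()),
        Token::Characters("\"".into()),
        Token::Characters(">".into()),
        Token::Characters("D".into()),
        Token::Characters("E".into()),
        Token::Eof,
    ];
    let merged = coalesce(split);
    assert_eq!(merged, vec![Token::Characters("\"ABC\">DE".into()), Token::Eof]);
    println!("{:?}", merged);
}
```

In a real sink the same idea applies: buffer incoming CharacterTokens and flush the accumulated text whenever a non-character token (or end of input) arrives.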

@ryanavella

ryanavella commented Apr 21, 2024

I'm generally in agreement RE: correctness. To me this is more of a performance concern.

I see two potential sources of overhead:

  • In html5ever itself, some bytes are unnecessarily inspected more than once.
  • In downstream user code, TokenSink::process_token (which could have non-trivial logic) is being dispatched 50% more than is necessary, amortized.

I'm going to play around with the code some more and see if I can get all tests to pass while avoiding extra splitting of the input buffer.
