Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML blocks are split up in events #803

Open
clarfonthey opened this issue Jan 3, 2024 · 5 comments
Open

HTML blocks are split up in events #803

clarfonthey opened this issue Jan 3, 2024 · 5 comments

Comments

@clarfonthey
Copy link

Given the following test markdown:

Hello! <span>This is a test</span>.

<!--

Testing

-->

Nice!

The following events are generated by the commonmark dingus:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">

<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>Hello</text>
    <text>!</text>
    <text> </text>
    <html_inline>&lt;span&gt;</html_inline>
    <text>This is a test</text>
    <html_inline>&lt;/span&gt;</html_inline>
    <text>.</text>
  </paragraph>
  <html_block>&lt;!--

Testing

--&gt;</html_block>
  <paragraph>
    <text>Nice</text>
    <text>!</text>
  </paragraph>
</document>

Whereas these events are generated by pulldown-cmark:

[
    Start(
        Paragraph,
    ),
    Text(
        Borrowed(
            "Hello! ",
        ),
    ),
    Html(
        Borrowed(
            "<span>",
        ),
    ),
    Text(
        Borrowed(
            "This is a test",
        ),
    ),
    Html(
        Borrowed(
            "</span>",
        ),
    ),
    Text(
        Borrowed(
            ".",
        ),
    ),
    End(
        Paragraph,
    ),
    Html(
        Borrowed(
            "<!--\n",
        ),
    ),
    Html(
        Borrowed(
            "\n",
        ),
    ),
    Html(
        Borrowed(
            "Testing\n",
        ),
    ),
    Html(
        Borrowed(
            "\n",
        ),
    ),
    Html(
        Borrowed(
            "-->\n",
        ),
    ),
    Start(
        Paragraph,
    ),
    Text(
        Borrowed(
            "Nice!",
        ),
    ),
    End(
        Paragraph,
    ),
]

As you'll notice, the spec tries to keep the entire HTML block together as one unit, whereas pulldown-cmark will instead split the HTML at newlines, which isn't exactly ideal.

The main reason why I found this relevant is that it runs into an issue in zola when I try to match a <!-- more --> comment spread across multiple lines. The only way to potentially fix this issue would be to proactively combine these events, which would mean converting the borrowed slices into owned strings.

It seems like it would make the most sense for pulldown-cmark to keep these blocks together when returning them as events, since it seems to correctly parse it, just split it up weirdly.

@Martin1887
Copy link
Collaborator

Hello, text events are not guaranteed to be emitted in only one because the incremental nature of pulldown-cmark, and the same happens here with HTML blocks.

However, it's easy to wrap the event iterator to fix this, as is done for text events in https://github.com/raphlinus/pulldown-cmark/blob/master/src/utils.rs.

An example is also available in the main rustdoc, and it will be reflected in the documentation website when the 0.10 release (the first one with this utility) is launched:

//! Note that consecutive text events can happen due to the manner in which the
//! parser evaluates the source. A utility `TextMergeStream` exists to improve
//! the comfort of iterating the events:
//!
//! ```rust
//! use pulldown_cmark::{Event, Parser, TextMergeStream};
//!
//! let markdown_input = "Hello world, this is a ~~complicated~~ *very simple* example.";
//!
//! let iterator = TextMergeStream::new(Parser::new(markdown_input));
//!
//! for event in iterator {
//!     match event {
//!         Event::Text(text) => println!("{}", text),
//!         _ => {}
//!     }
//! }
//! ```

I hope this solves your issue.

@clarfonthey
Copy link
Author

clarfonthey commented Jan 15, 2024

While it is nice that this is offered as an addition, I'd like to disagree that it fully solves the problem. One of the big benefits of having something built into the library to do this is that it would be able to take advantage of the fact that these strings are genuinely adjacent to each other in the source code and avoid allocating a new string to concatenate them together. In most cases the performance benefits are near-zero, but again, this library already uses Cow to return the strings, so, clearly it doesn't see this as having no benefit.

Additionally, this only does this for Text that's adjacent, but not Html, so, that would need to be added too.

Would you be willing to merge an adapter that takes this into consideration? I'd be more than happy to dig into the code and make one, if you think it'd be good to upstream.

@Martin1887
Copy link
Collaborator

Martin1887 commented Jan 15, 2024

Such a contribution will be welcome!, just take into account it is not trivial due to the incremental nature of the parser. But if you get a smart manner of achieving it, it would be a good enhancement for the library.

Thanks!

@ollpu
Copy link
Collaborator

ollpu commented Mar 23, 2024

The main reason the borrowed lines cannot be given as a single event is that they aren't always adjacent in the source text.

This can happen when the HTML block is indented due to external structure:

- Hello!
  <!--

  Testing

  -->

commonmark.js:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">

<document xmlns="http://commonmark.org/xml/1.0">
  <list type="bullet" tight="true">
    <item>
      <paragraph>
        <text>Hello</text>
        <text>!</text>
      </paragraph>
      <html_block>&lt;!--

Testing

--&gt;</html_block>
    </item>
  </list>
</document>

As you can see, the indentation is trimmed, as per the spec. It would be impossible to give this as a single borrowed Cow.

Another example is if there are CRLF line breaks in the input file.

@BenjaminRi
Copy link

I wrote the TextMergeStream. What @ollpu says is correct. Also see: #507.

The text events don't always map one-to-one to the original source buffer of your input. Hence, at least some allocations are necessary. Of course, any performance improvements are welcome and appreciated - it is indeed true that the TextMergeStream is not fully optimal in terms of minimizing allocations. To improve the performance, tighter integration with the library would be required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants