HTML blocks are split up in events #803

clarfonthey · 2024-01-03T20:30:02Z

Given the following test markdown:

Hello! <span>This is a test</span>.

<!--

Testing

-->

Nice!

The following events are generated by the commonmark dingus:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">

<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>Hello</text>
    <text>!</text>
    <text> </text>
    <html_inline>&lt;span&gt;</html_inline>
    <text>This is a test</text>
    <html_inline>&lt;/span&gt;</html_inline>
    <text>.</text>
  </paragraph>
  <html_block>&lt;!--

Testing

--&gt;</html_block>
  <paragraph>
    <text>Nice</text>
    <text>!</text>
  </paragraph>
</document>

Whereas these events are generated by pulldown-cmark:

[
    Start(
        Paragraph,
    ),
    Text(
        Borrowed(
            "Hello! ",
        ),
    ),
    Html(
        Borrowed(
            "<span>",
        ),
    ),
    Text(
        Borrowed(
            "This is a test",
        ),
    ),
    Html(
        Borrowed(
            "</span>",
        ),
    ),
    Text(
        Borrowed(
            ".",
        ),
    ),
    End(
        Paragraph,
    ),
    Html(
        Borrowed(
            "<!--\n",
        ),
    ),
    Html(
        Borrowed(
            "\n",
        ),
    ),
    Html(
        Borrowed(
            "Testing\n",
        ),
    ),
    Html(
        Borrowed(
            "\n",
        ),
    ),
    Html(
        Borrowed(
            "-->\n",
        ),
    ),
    Start(
        Paragraph,
    ),
    Text(
        Borrowed(
            "Nice!",
        ),
    ),
    End(
        Paragraph,
    ),
]

As you'll notice, the spec tries to keep the entire HTML block together as one unit, whereas pulldown-cmark will instead split the HTML at newlines, which isn't exactly ideal.

The main reason why I found this relevant is that it runs into an issue in zola when I try to match a  comment spread across multiple lines. The only way to potentially fix this issue would be to proactively combine these events, which would mean converting the borrowed slices into owned strings.

It seems like it would make the most sense for pulldown-cmark to keep these blocks together when returning them as events, since it seems to correctly parse it, just split it up weirdly.

The text was updated successfully, but these errors were encountered:

Martin1887 · 2024-01-07T12:00:28Z

Hello, text events are not guaranteed to be emitted in only one because the incremental nature of pulldown-cmark, and the same happens here with HTML blocks.

However, it's easy to wrap the event iterator to fix this, as is done for text events in https://github.com/raphlinus/pulldown-cmark/blob/master/src/utils.rs.

An example is also available in the main rustdoc, and it will be reflected in the documentation website when the 0.10 release (the first one with this utility) is launched:

//! Note that consecutive text events can happen due to the manner in which the
//! parser evaluates the source. A utility `TextMergeStream` exists to improve
//! the comfort of iterating the events:
//!
//! ```rust
//! use pulldown_cmark::{Event, Parser, TextMergeStream};
//!
//! let markdown_input = "Hello world, this is a ~~complicated~~ *very simple* example.";
//!
//! let iterator = TextMergeStream::new(Parser::new(markdown_input));
//!
//! for event in iterator {
//!     match event {
//!         Event::Text(text) => println!("{}", text),
//!         _ => {}
//!     }
//! }
//! ```

I hope this solves your issue.

clarfonthey · 2024-01-15T17:27:06Z

While it is nice that this is offered as an addition, I'd like to disagree that it fully solves the problem. One of the big benefits of having something built into the library to do this is that it would be able to take advantage of the fact that these strings are genuinely adjacent to each other in the source code and avoid allocating a new string to concatenate them together. In most cases the performance benefits are near-zero, but again, this library already uses Cow to return the strings, so, clearly it doesn't see this as having no benefit.

Additionally, this only does this for Text that's adjacent, but not Html, so, that would need to be added too.

Would you be willing to merge an adapter that takes this into consideration? I'd be more than happy to dig into the code and make one, if you think it'd be good to upstream.

Martin1887 · 2024-01-15T17:44:10Z

Such a contribution will be welcome!, just take into account it is not trivial due to the incremental nature of the parser. But if you get a smart manner of achieving it, it would be a good enhancement for the library.

Thanks!

ollpu · 2024-03-23T08:41:34Z

The main reason the borrowed lines cannot be given as a single event is that they aren't always adjacent in the source text.

This can happen when the HTML block is indented due to external structure:

- Hello!
  <!--

  Testing

  -->

commonmark.js:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">

<document xmlns="http://commonmark.org/xml/1.0">
  <list type="bullet" tight="true">
    <item>
      <paragraph>
        <text>Hello</text>
        <text>!</text>
      </paragraph>
      <html_block>&lt;!--

Testing

--&gt;</html_block>
    </item>
  </list>
</document>

As you can see, the indentation is trimmed, as per the spec. It would be impossible to give this as a single borrowed Cow.

Another example is if there are CRLF line breaks in the input file.

BenjaminRi · 2024-04-28T13:46:08Z

I wrote the TextMergeStream. What @ollpu says is correct. Also see: #507.

The text events don't always map one-to-one to the original source buffer of your input. Hence, at least some allocations are necessary. Of course, any performance improvements are welcome and appreciated - it is indeed true that the TextMergeStream is not fully optimal in terms of minimizing allocations. To improve the performance, tighter integration with the library would be required.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML blocks are split up in events #803

HTML blocks are split up in events #803

clarfonthey commented Jan 3, 2024

Martin1887 commented Jan 7, 2024

clarfonthey commented Jan 15, 2024 •

edited

Martin1887 commented Jan 15, 2024 •

edited

ollpu commented Mar 23, 2024

BenjaminRi commented Apr 28, 2024

HTML blocks are split up in events #803

HTML blocks are split up in events #803

Comments

clarfonthey commented Jan 3, 2024

Martin1887 commented Jan 7, 2024

clarfonthey commented Jan 15, 2024 • edited

Martin1887 commented Jan 15, 2024 • edited

ollpu commented Mar 23, 2024

BenjaminRi commented Apr 28, 2024

clarfonthey commented Jan 15, 2024 •

edited

Martin1887 commented Jan 15, 2024 •

edited