Handle inline code spans with multiple backticks #56

mgeisler · 2023-06-03T15:59:48Z

Before, options.code_block_token was used for inline code spans. Now we always use a backtick.

As far as I can tell from the reference and from experimentation, code spans should always use one or more backticks and never use something else such as tildes.

The PR also ensures that we insert the necessary number of backticks to quote any backticks found in the code span itself. Before, only the case with a single backtick was handled; now the code can contain arbitrarily many backticks.
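As a rough illustration of the quoting rule described above (not the code from this PR), a fence one backtick longer than the longest run inside the span is always sufficient. The names `longest_backtick_run` and `quote_code_span` are hypothetical, and the sketch only pads for leading or trailing backticks, not for other whitespace corner cases.

```rust
/// Length of the longest run of consecutive backticks in `code`.
/// Hypothetical helper, not the function used in the crate.
fn longest_backtick_run(code: &str) -> usize {
    // Splitting on every non-backtick character leaves only the backtick runs
    // (plus empty strings), so the longest piece is the longest run.
    code.split(|c: char| c != '`')
        .map(str::len)
        .max()
        .unwrap_or(0)
}

/// Wrap `code` in a backtick fence that cannot be confused with its content.
fn quote_code_span(code: &str) -> String {
    // One backtick more than the longest inner run is a simple, sufficient choice.
    let fence = "`".repeat(longest_backtick_run(code) + 1);
    // If the content touches the fence with a backtick, separate them with a
    // space; CommonMark strips one such space from each side when parsing.
    let pad = if code.starts_with('`') || code.ends_with('`') {
        " "
    } else {
        ""
    };
    format!("{fence}{pad}{code}{pad}{fence}")
}
```

With this sketch, `quote_code_span("a`b")` produces a span fenced by two backticks, and a lone backtick is emitted with a space on each side of it.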

This was found using a fuzz test like the one in #55. The difference is that I restricted the input to a single paragraph of Markdown text, meaning that it only fuzzes inputs which match

[Event::Start(Tag::Paragraph), .., Event::End(Tag::Paragraph)]
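For readers unfamiliar with slice patterns, the filter above can be sketched as a small predicate. The `Tag` and `Event` enums here are simplified stand-ins so the example compiles on its own; the real types come from pulldown-cmark and have more variants. The function name `is_single_paragraph` is an assumption as well.

```rust
// Simplified stand-ins for the pulldown-cmark types (assumption for illustration).
#[derive(Debug, PartialEq)]
enum Tag {
    Paragraph,
    Emphasis,
}

#[derive(Debug, PartialEq)]
enum Event<'a> {
    Start(Tag),
    End(Tag),
    Text(&'a str),
    Code(&'a str),
    SoftBreak,
}

/// True when the event stream starts with a paragraph start and ends with a
/// paragraph end; the `..` accepts any events in between.
fn is_single_paragraph(events: &[Event<'_>]) -> bool {
    matches!(
        events,
        [Event::Start(Tag::Paragraph), .., Event::End(Tag::Paragraph)]
    )
}
```

Note that the pattern only inspects the first and last event, so a fuzz target using it would presumably discard every input that does not begin and end with a paragraph.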

There seem to be some more corner cases that are not handled yet, so I'll try to send more PRs to fix those.

Before, `options.code_block_token` was used for inline code spans. Now we always use a backtick. As far as I can tell from the reference[1] and from experimentation[2], code spans should always use one or more backticks and never use something else such as tildes.

[1]: https://spec.commonmark.org/0.30/#code-spans
[2]: https://spec.commonmark.org/dingus/?text=%60foo%60%0A

We now insert the necessary number of backticks to quote any backticks found in the code span itself. Before, only the case with a single backtick was handled; now the code can contain arbitrarily many backticks. This was found using a fuzz test which fuzzes a single paragraph of Markdown text; that is, it only fuzzes text which matches [Event::Start(Tag::Paragraph), .., Event::End(Tag::Paragraph)].

src/lib.rs

mgeisler · 2023-06-03T16:45:38Z

One of the issues found by the fuzz test is pulldown-cmark/pulldown-cmark#655. The input "`\n`\n`" is parsed like this:

"`\n`\n`" -> [
  Start(Paragraph)
  Code(Borrowed(""))
  SoftBreak
  Text(Borrowed("`"))
  End(Paragraph)
]

This is wrong since the code span should contain " ", not "".
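To spell out why " " is the expected content: the CommonMark spec converts line endings inside a code span to spaces, and then strips one space from each side only if the content both starts and ends with a space and is not made up entirely of spaces. A minimal sketch of that rule, with the hypothetical name `normalize_code_span`:

```rust
/// Apply the CommonMark code span content rules to the raw text found
/// between the opening and closing backtick strings.
fn normalize_code_span(raw: &str) -> String {
    // Step 1: line endings are converted to spaces.
    let text = raw
        .replace("\r\n", " ")
        .replace('\n', " ")
        .replace('\r', " ");
    // Step 2: strip one space from each side, but only when the content is
    // space-delimited on both sides and is not all spaces.
    let all_spaces = text.bytes().all(|b| b == b' ');
    if !all_spaces && text.starts_with(' ') && text.ends_with(' ') {
        text[1..text.len() - 1].to_string()
    } else {
        text
    }
}
```

In the input above, the raw content between the first two backticks is a single line ending, which becomes a single space and is not stripped because it consists only of spaces.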

Byron

Thanks a million for your continued work - I feel that you will chip off some misbehaviour with each PR until this crate finally works correctly!

With that said, I have added some assertions for count_consecutive_backticks just to validate that it is supposed to work that way - my intuition is that it should abort early and no test breaks if it does. Maybe there can be another test to pin down the 'non-breaking' behaviour.

Finally, there is a question about whitespace.

Please let me know what you think - even as is it can be merged as I am sure the fuzzer will force all actual issues out already so it's OK to make mistakes on the way and start with fast broad strokes without need for details like the ones I brought up.


mgeisler · 2023-06-04T11:13:12Z

> Thanks a million for your continued work - I feel that you will chip off some misbehaviour with each PR until this crate finally works correctly!

I'm very happy to be able to help!

> With that said, I have added some assertions for count_consecutive_backticks just to validate that it is supposed to work that way - my intuition is that it should abort early and no test breaks if it does. Maybe there can be another test to pin down the 'non-breaking' behaviour.

Thanks for adding the tests! I don't think it can abort early since it needs to look for backticks through the entire string. Well... there are two small things one could do:

- Iterate over `text.bytes()` instead of `text.chars()`, since we are counting an ASCII character (backtick) and since the UTF-8 encoding happens to preserve ASCII characters.
- I guess the function could stop looking when it sees that `max_backticks` is at least as large as the remaining number of bytes. So if it has found 10 consecutive backticks and if `!in_backticks`, then the function could stop iterating 10 bytes before the end of the string. However, I don't think that will be worth it since the vast majority of strings only contain 1 or 2 backticks 😄
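The two ideas above could look roughly like the following sketch. It is not the crate's `count_consecutive_backticks`; the name `max_consecutive_backticks` and the exact shape of the loop are assumptions made for illustration.

```rust
/// Longest run of consecutive backticks in `text`, scanning bytes and
/// stopping early once no longer run can fit in the remaining input.
fn max_consecutive_backticks(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut max_run: usize = 0;
    let mut current: usize = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        // A backtick is ASCII, and UTF-8 never reuses ASCII byte values inside
        // multi-byte sequences, so comparing raw bytes is safe.
        if byte == b'`' {
            current += 1;
            max_run = max_run.max(current);
        } else {
            current = 0;
            // We are outside a run: a new run can be at most as long as the
            // bytes left after this one, so stop if that cannot beat max_run.
            let remaining = bytes.len() - i - 1;
            if remaining <= max_run {
                break;
            }
        }
    }
    max_run
}
```

The early exit only saves work on inputs with long backtick runs near the start, which matches the observation that it is unlikely to matter for typical strings.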

Byron · 2023-06-04T15:22:32Z

It sounds like making these changes to the backtick counting goes beyond broad strokes 😁, so let's keep the momentum and make round-tripping possible :).

Thanks again, I can't wait to see more PRs just like this one.

mgeisler · 2023-06-04T15:55:32Z

Cool, thanks for merging it! I found what looks like another small issue in pulldown-cmark, see pulldown-cmark/pulldown-cmark#657.

mgeisler · 2023-06-04T20:10:20Z

Cool, thanks for merging it!

I found what looks like another small issue in pulldown-cmark, see pulldown-cmark/pulldown-cmark#657. It seems that combining the two crates like this is a fruitful way to tease out inconsistencies 😄

mgeisler added 2 commits June 3, 2023 17:45


thanks clippy

8f81a30

Byron approved these changes Jun 4, 2023


refactor

1cd68a4

- avoid two allocations at the expense of adding complexity
- add test to assure backtick-counting works as it should

Byron force-pushed the inline-code branch from 0faa3d5 to 1cd68a4 Compare June 4, 2023 06:36

Byron merged commit c2a0113 into Byron:main Jun 4, 2023
1 check passed

mgeisler mentioned this pull request Dec 22, 2023

Is it possible to write back code block backtick count to original? #20

Open
