Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle inline code spans with multiple backticks #56

Merged
merged 4 commits into from
Jun 4, 2023

Conversation

mgeisler
Copy link
Contributor

@mgeisler mgeisler commented Jun 3, 2023

Before, options.code_block_token was used for inline code spans. Now we always use a backtick.

As far as I can tell from the reference and from experimentation, code spans should always use one or more backticks and never use something else such as tildes.

The PR also ensures that we insert the necessary number of backticks to quote any backticks found in the code span itself. Before, only the case with a single backtick was handled, now the code can contain arbitrarily many backticks.

This was found using a fuzz test like the one in #55. The difference is that I restricted the input to a single paragraph of Markdown text, meaning that it only fuzzes inputs which match

[Event::Start(Tag::Paragraph), .., Event::End(Tag::Paragraph)]

There seems to be some more corner cases that are not handled already, so I'll try to send more PRs to fix those.

Before, `options.code_block_token` was used for inline code spans. Now
we always use a backtick.

As far as I can tell from the reference[1] and from
experimentation[2], code spans should always use one or more backticks
and never use something else such as tildes.

[1]: https://spec.commonmark.org/0.30/#code-spans
[2]: https://spec.commonmark.org/dingus/?text=%60foo%60%0A
We now insert the necessary number of backticks to quote any backticks
found in the code span itself. Before, only the case with a single
backtick was handled, now the code can contain arbitrarily many
backticks.

This was found using a fuzz test which tries to fuzz a single
paragraph of Markdown text. that is, it only fuzzes text which matches

    [Event::Start(Tag::Paragraph), .., Event::End(Tag::Paragraph)]
src/lib.rs Show resolved Hide resolved
@mgeisler
Copy link
Contributor Author

mgeisler commented Jun 3, 2023

One of the issues found by the fuzz test is pulldown-cmark/pulldown-cmark#655. The input "`\n`\n`" is parsed like this:

"`\n`\n`" -> [
  Start(Paragraph)
  Code(Borrowed(""))
  SoftBreak
  Text(Borrowed("`"))
  End(Paragraph)
]

This is wrong since the code span should contain " ", not "".

Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a million for your continued work - I feel that you will chip off some misbehaviour with each PR until this crate finally works correctly!

With that said, I have added some assertions for count_consecutive_backticks just to validate that it is supposed to work that way - my intuition is that it should abort early and no test breaks if it does. Maybe there can be another test to pin down the 'non-breaking' behaviour.

Finally, there is a question about whitespace.

Please let me know what you think - even as is it can be merged as I am sure the fuzzer will force all actual issues out already so it's OK to make mistakes on the way and start with fast broad strokes without need for details like the ones I brought up.

src/lib.rs Show resolved Hide resolved
src/lib.rs Show resolved Hide resolved
- avoid two allocations at the expense of adding complexity
- add test to assure backtick-counting works as it should
@mgeisler
Copy link
Contributor Author

mgeisler commented Jun 4, 2023

Thanks a million for your continued work - I feel that you will chip off some misbehaviour with each PR until this crate finally works correctly!

I'm very happy to be able to help!

With that said, I have added some assertions for count_consecutive_backticks just to validate that it is supposed to work that way - my intuition is that it should abort early and no test breaks if it does. Maybe there can be another test to pin down the 'non-breaking' behaviour.

Thanks for adding the tests! I don't think it can abort early since it needs to look for backticks through the entire string. Well... there are two small things one could do:

  • Iterate over text.bytes() instead of text.chars() since we are counting an ASCII character (backtick) and since the UTF-8 encoding happens to preserve ASCII characters.
  • I guess the function could stop looking when it sees that max_backticks is larger than the remaining number of bytes. So if it has found 10 consequtive backticks and if !in_backticks, then the function could stop iterating 10 bytes before the end of the string. However, I don't think that will be worth it since the vast majority of strings only contain 1 or 2 backticks 😄

@Byron Byron merged commit c2a0113 into Byron:main Jun 4, 2023
1 check passed
@Byron
Copy link
Owner

Byron commented Jun 4, 2023

This sounds like making these changing to counting backticks goes beyond broad strokes 😁, so let's keep the momentum and make round-tripping possible :).

Thanks again, I can't wait to see more PRs just like this one.

@mgeisler
Copy link
Contributor Author

mgeisler commented Jun 4, 2023

Cool, thanks for merging it! I found what looks like another small issue in pulldown-cmark, see pulldown-cmark/pulldown-cmark#657.

@mgeisler
Copy link
Contributor Author

mgeisler commented Jun 4, 2023

Cool, thanks for merging it!

I found what looks like another small issue in pulldown-cmark, see pulldown-cmark/pulldown-cmark#657. It seems that combining the two crates like this is a fruitful way to tease out inconsistencies 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants