Implement differential fuzzer for pandoc #673

Draft
wants to merge 26 commits into
base: master
from

Conversation

notriddle
Collaborator

No description provided.

@Martin1887
Collaborator

Thanks for your contribution.

The goal of the project is to support CommonMark and GitHub Flavored Markdown; a Pandoc target is far outside that scope. May this fuzzer provide help to catch errors in CommonMark+GFM? The only case I can find is when both pulldown-cmark and commonmark.js are wrong and Pandoc gets it right.

On the other hand, this code is independent of the final binary and only a dev tool. What do you think, @raphlinus?

@notriddle
Collaborator Author

> May this fuzzer provide help to catch errors in CommonMark+GFM?

That's what I'm thinking, yeah. Pandoc lets you select your extensions, such as commonmark+footnotes or commonmark+task_lists.
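
As a rough illustration of selecting extensions: pandoc's `-f` option takes a base format followed by `+extension` suffixes. The sketch below builds such a format string and the matching command line without running it. The helper names `pandoc_format` and `pandoc_command` are hypothetical and not taken from this PR, and the `-t html` output target is an assumption.

```rust
use std::process::Command;

/// Build a pandoc reader format such as `commonmark+footnotes+task_lists`.
/// Hypothetical helper for illustration; not part of the PR.
fn pandoc_format(base: &str, extensions: &[&str]) -> String {
    let mut format = String::from(base);
    for ext in extensions {
        format.push('+');
        format.push_str(ext);
    }
    format
}

/// Prepare (but do not run) a pandoc invocation for the given extensions.
/// The HTML output target is an assumption made for this sketch.
fn pandoc_command(extensions: &[&str]) -> Command {
    let mut cmd = Command::new("pandoc");
    cmd.arg("-f")
        .arg(pandoc_format("commonmark", extensions))
        .arg("-t")
        .arg("html");
    cmd
}

fn main() {
    let cmd = pandoc_command(&["footnotes", "task_lists"]);
    // Collect the arguments that would be passed to pandoc and print them.
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    println!("pandoc {}", args.join(" "));
}
```

Keeping the format string in one helper would let a fuzz target enable or disable a single extension when comparing against pulldown-cmark with the corresponding option.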

fuzz/src/lib.rs Outdated
@@ -186,6 +664,42 @@ pub fn xml_to_events(xml: &str) -> anyhow::Result<Vec<Event>> {
    Ok(events)
}

pub fn normalize_pandoc(events: Vec<Event<'_>>) -> Vec<Event<'_>> {
Collaborator

I think this is cool! I would rename `normalize` below to `normalize_commonmarkjs` or similar.

@notriddle notriddle force-pushed the notriddle/fuzz-pandoc branch 2 times, most recently from 26d9ebb to c459690 Compare June 27, 2023 18:34
@notriddle notriddle force-pushed the notriddle/fuzz-pandoc branch 3 times, most recently from 64de992 to 567866b Compare October 13, 2023 01:11
@notriddle notriddle force-pushed the notriddle/fuzz-pandoc branch 4 times, most recently from 69b81e6 to d03e618 Compare October 26, 2023 01:29
use pulldown_cmark::{Event, Tag, TagEnd};
match event {
Event::Start(Tag::FootnoteDefinition(id)) => {
if id.starts_with("\n") || id.ends_with("\n") || id.starts_with("\r") || id.ends_with("\r") || id.starts_with(" ") || id.starts_with("\t") || id.contains(" ") || id.contains("\t ") || id.contains(" \t") || id.contains("\t\t") || id.ends_with(" ") || id.ends_with("\t") { return };
Collaborator

Perhaps it would be simpler to use the slice variants of the `starts_with` and `ends_with` patterns:

Suggested change
- if id.starts_with("\n") || id.ends_with("\n") || id.starts_with("\r") || id.ends_with("\r") || id.starts_with(" ") || id.starts_with("\t") || id.contains(" ") || id.contains("\t ") || id.contains(" \t") || id.contains("\t\t") || id.ends_with(" ") || id.ends_with("\t") { return };
+ let whitespace = &['\n', '\r', ' ', '\t'];
+ if id.starts_with(whitespace)
+     || id.ends_with(whitespace)
+     || id.contains("\t ")
+     || id.contains(" \t")
+     || id.contains("\t\t")
+ {
+     return;
+ };
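
To make the suggestion easier to try in isolation, here is a minimal sketch that wraps the same check in a standalone function. The function name `has_unsupported_whitespace` is hypothetical. The sketch follows the suggested version, so it does not include the `id.contains(" ")` test that appears in the original one-liner as shown above.

```rust
/// Returns true when `id` starts or ends with whitespace, or contains one of
/// the listed whitespace pairs. Hypothetical helper; not the PR's code.
fn has_unsupported_whitespace(id: &str) -> bool {
    // A `&[char]` slice is a string pattern that matches any one of its chars.
    let whitespace: &[char] = &['\n', '\r', ' ', '\t'];
    id.starts_with(whitespace)
        || id.ends_with(whitespace)
        || id.contains("\t ")
        || id.contains(" \t")
        || id.contains("\t\t")
}

fn main() {
    let samples = ["note", " note", "note\n", "a\t\tb", "a b"];
    for id in samples.iter() {
        println!("{:?} -> {}", id, has_unsupported_whitespace(id));
    }
}
```

Pulling the check into a named function would also let the fuzz target skip such ids with a single call, rather than repeating the condition inline.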

ollpu and others added 2 commits April 17, 2024 18:37
Based on pulldown-cmark#622 and
copied from https://github.com/ollpu/pulldown-cmark/tree/alt-math.

Co-authored-by: rhysd <lin90162@yahoo.co.jp>
This feature is loosely based on what 63a29a1
described, but copies [commonmark-hs] more closely (the balanced braces
feature is added).

[commonmark-hs]: https://github.com/nschloe/github-math-bugs

It largely ignores GitHub, because its math parsing [is very buggy].

[is very buggy]: https://github.com/nschloe/github-math-bugs
notriddle and others added 24 commits April 17, 2024 18:37
This approach, based on @ollpu's suggestion, tracks single `$`s
in the inline tree, and merges them later. It avoids having
to merge and unmerge them in some corner cases.
The essential problem is: every time you write `$$x$}`, you get another
entry added to a hash table. Even if it's not [theoretically] *quadratic*,
it's still slow. Hard limiting it to 255 entries makes this not a problem.
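
The idea of a hard cap can be sketched as follows. This is not the parser's actual data structure: the type name `BoundedTable`, the key and value types, and the choice to silently reject new keys once full are all assumptions made for illustration. Only the limit of 255 comes from the commit message.

```rust
use std::collections::HashMap;

/// Cap taken from the commit message; everything else here is hypothetical.
const MAX_ENTRIES: usize = 255;

/// A hash table that refuses new keys once it holds `MAX_ENTRIES` entries.
struct BoundedTable {
    entries: HashMap<usize, usize>,
}

impl BoundedTable {
    fn new() -> Self {
        BoundedTable { entries: HashMap::new() }
    }

    /// Insert or update an entry. Returns false if the table is full and the
    /// key is new, in which case nothing is stored.
    fn insert(&mut self, key: usize, value: usize) -> bool {
        if self.entries.len() >= MAX_ENTRIES && !self.entries.contains_key(&key) {
            return false;
        }
        self.entries.insert(key, value);
        true
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}

fn main() {
    let mut table = BoundedTable::new();
    for key in 0..1000 {
        table.insert(key, key);
    }
    // The table never grows past the cap, however long the input is.
    println!("entries stored: {}", table.len());
}
```

With a cap like this, any per-entry work stays bounded by a constant regardless of how many times the triggering pattern repeats in the input.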

Interestingly enough, when I tried to write an analogous torture test
for code spans, I couldn't find a way to do it because code spans are
keyed by their *length* instead of their *position*. In order to get
N entries in the hash table, I basically had to write N `` ` `` in a
row, forcing me to write quadratic amounts of input text.
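
A small sketch of the size difference described above, under the assumption that each repetition of `$$x$}` contributes one table entry while code spans need one distinct backtick-run length per entry. The generator names and the space separator between runs are hypothetical.

```rust
/// Input in the style of the math torture test: `n` repetitions of `$$x$}`,
/// so the input grows linearly (5 bytes per repetition).
fn math_torture_input(n: usize) -> String {
    "$$x$}".repeat(n)
}

/// Hypothetical code-span analogue: backtick runs of lengths 1..=n, each
/// followed by a space, so that every run has a distinct length.
fn code_span_torture_input(n: usize) -> String {
    let mut out = String::new();
    for len in 1..=n {
        out.push_str(&"`".repeat(len));
        out.push(' ');
    }
    out
}

fn main() {
    for n in [10usize, 100, 1000].iter() {
        println!(
            "n = {:>4}: math input {:>6} bytes, code-span input {:>7} bytes",
            n,
            math_torture_input(*n).len(),
            code_span_torture_input(*n).len()
        );
    }
}
```

The code-span input needs 1 + 2 + ... + n = n(n+1)/2 backticks (plus n separators in this sketch), which is the quadratic growth the commit message refers to.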

Comparison:

```
michaelhowell@Michael-Howells-Macbook-Pro pulldown-cmark % python3 -c 'print("$$x$}"*5000)' | time target/release/pulldown-cmark.old -M > /dev/null
target/release/pulldown-cmark.old -M > /dev/null  2.63s user 0.02s system 99% cpu 2.673 total
michaelhowell@Michael-Howells-Macbook-Pro pulldown-cmark % python3 -c 'print("$$x$}"*5000)' | time target/release/pulldown-cmark.new -M > /dev/null
target/release/pulldown-cmark.new -M > /dev/null  0.01s user 0.00s system 6% cpu 0.109 total
```

[theoretically]: http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
Co-authored-by: Linda_pp <rhysd@users.noreply.github.com>
- Disallow $$ matching a closing $ and then marching delimiters in
  `make_math_span`. Instead, retry scanning at the second position.

- Remove the `seen_first` optimization from `MathDelims`. It doesn't
  work with the retry strategy.
Co-authored-by: Michael Howell <michael@notriddle.com>
4 participants