Implement differential fuzzer for pandoc #673

notriddle · 2023-06-22T23:47:53Z

No description provided.

Martin1887 · 2023-06-23T11:24:04Z

Thanks for your contribution.

The goal of the project is to support CommonMark and GitHub Flavored Markdown; targeting Pandoc is well outside that scope. May this fuzzer provide help to catch errors in CommonMark+GFM? The only case I can think of is when both pulldown-cmark and commonmark.js are wrong and Pandoc gets it right.

On the other hand, this code is independent of the final binary and is only a dev tool. What do you think, @raphlinus?

notriddle · 2023-06-23T14:19:31Z

> May this fuzzer provide help to catch errors in CommonMark+GFM?

That's what I'm thinking, yeah. Pandoc lets you select your extensions, such as `commonmark+footnotes` or `commonmark+task_lists`.
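As a rough illustration of the extension selection mentioned above, the sketch below builds a pandoc format string and pipes Markdown through the `pandoc` binary. The helper names `pandoc_format` and `pandoc_to_html` are hypothetical and are not taken from this PR; the sketch assumes `pandoc` is on `PATH` and uses only its `-f` and `-t` options. The fuzzer in this PR may talk to pandoc differently.

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Hypothetical helper: build a pandoc format string such as
/// "commonmark+footnotes+task_lists" from a base format and extensions.
fn pandoc_format(base: &str, extensions: &[&str]) -> String {
    let mut format = String::from(base);
    for ext in extensions {
        format.push('+');
        format.push_str(ext);
    }
    format
}

/// Hypothetical helper: render `markdown` to HTML by piping it through
/// the `pandoc` binary. Assumes `pandoc` is installed and on PATH.
fn pandoc_to_html(markdown: &str, format: &str) -> std::io::Result<String> {
    let mut child = Command::new("pandoc")
        .args(["-f", format, "-t", "html"])
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // Write the input; the temporary stdin handle is dropped at the end of
    // this statement, so pandoc sees end-of-file.
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(markdown.as_bytes())?;
    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

A differential check could then call something like `pandoc_to_html(input, &pandoc_format("commonmark", &["footnotes"]))` and compare the result with pulldown-cmark's output after normalizing both sides.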

mgeisler · 2023-06-24T10:07:35Z

fuzz/src/lib.rs

@@ -186,6 +664,42 @@ pub fn xml_to_events(xml: &str) -> anyhow::Result<Vec<Event>> {
    Ok(events)
 }

+pub fn normalize_pandoc(events: Vec<Event<'_>>) -> Vec<Event<'_>> {


I think this is cool! I would rename `normalize` below to `normalize_commonmarkjs` or similar.

mgeisler · 2023-11-03T13:30:04Z

fuzz/fuzz_targets/pandoc.rs

+        use pulldown_cmark::{Event, Tag, TagEnd};
+        match event {
+            Event::Start(Tag::FootnoteDefinition(id)) => {
+                if id.starts_with("\n") || id.ends_with("\n") || id.starts_with("\r") || id.ends_with("\r") || id.starts_with(" ") || id.starts_with("\t") || id.contains("  ") || id.contains("\t ") || id.contains(" \t") || id.contains("\t\t") || id.ends_with(" ") || id.ends_with("\t") { return };


Perhaps it would be simpler to use the slice variants of the `starts_with` and `ends_with` patterns:

Suggested change

-                if id.starts_with("\n") || id.ends_with("\n") || id.starts_with("\r") || id.ends_with("\r") || id.starts_with(" ") || id.starts_with("\t") || id.contains("  ") || id.contains("\t ") || id.contains(" \t") || id.contains("\t\t") || id.ends_with(" ") || id.ends_with("\t") { return };
+                let whitespace = &['\n', '\r', ' ', '\t'];
+                if id.starts_with(whitespace)
+                    || id.ends_with(whitespace)
+                    || id.contains("\t ")
+                    || id.contains(" \t")
+                    || id.contains("\t\t")
+                {
+                    return;
+                };
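For reference, here is a minimal standalone sketch of the char-slice pattern idea. The function names `rejected_verbose` and `rejected_compact` are hypothetical and operate on a plain `&str` rather than the event type used in the fuzz target. Note that the original condition also checks `id.contains("  ")` (two spaces), so this sketch keeps that check in the compact form to preserve behavior.

```rust
/// Whitespace characters that may not start or end the label.
const WHITESPACE: &[char] = &['\n', '\r', ' ', '\t'];

/// The original condition from the diff: one string pattern per check.
fn rejected_verbose(id: &str) -> bool {
    id.starts_with("\n") || id.ends_with("\n")
        || id.starts_with("\r") || id.ends_with("\r")
        || id.starts_with(" ") || id.starts_with("\t")
        || id.contains("  ") || id.contains("\t ")
        || id.contains(" \t") || id.contains("\t\t")
        || id.ends_with(" ") || id.ends_with("\t")
}

/// The same condition with a char slice as the pattern: a `&[char]`
/// pattern matches if any one of the listed characters matches.
fn rejected_compact(id: &str) -> bool {
    id.starts_with(WHITESPACE)
        || id.ends_with(WHITESPACE)
        || id.contains("  ") // kept from the original condition
        || id.contains("\t ")
        || id.contains(" \t")
        || id.contains("\t\t")
}
```

The two functions should agree on every input, since the char slice only folds the eight single-character prefix and suffix checks into two calls.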


This feature is loosely based on what 63a29a1 described, but copies [commonmark-hs] more closely (the balanced braces feature is added). It largely ignores GitHub, because its math parsing [is very buggy].

[commonmark-hs]: https://github.com/jgm/commonmark-hs
[is very buggy]: https://github.com/nschloe/github-math-bugs


The essential problem is: every time you write `$$x$}`, you get another entry added to a hash table. Even if it's not [theoretically] *quadratic*, it's still slow. Hard limiting it to 255 entries makes this not a problem.

Interestingly enough, when I tried to write an analogous torture test for code spans, I couldn't find a way to do it because code spans are keyed by their *length* instead of their *position*. In order to get N entries in the hash table, I basically had to write N `` ` `` in a row, forcing me to write quadratic amounts of input text.

Comparison:

```
michaelhowell@Michael-Howells-Macbook-Pro pulldown-cmark % python3 -c 'print("$$x$}"*5000)' | time target/release/pulldown-cmark.old -M > /dev/null
target/release/pulldown-cmark.old -M > /dev/null  2.63s user 0.02s system 99% cpu 2.673 total
michaelhowell@Michael-Howells-Macbook-Pro pulldown-cmark % python3 -c 'print("$$x$}"*5000)' | time target/release/pulldown-cmark.new -M > /dev/null
target/release/pulldown-cmark.new -M > /dev/null  0.01s user 0.00s system 6% cpu 0.109 total
```

[theoretically]: http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
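The hard limit described above can be pictured with a small sketch. Everything here is hypothetical: the type `BoundedDelims`, the `u32` key, and the `insert` signature are illustrative and are not pulldown-cmark's actual data structure. Only the figure of 255 entries comes from the commit message.

```rust
use std::collections::HashMap;

/// Cap taken from the "255 entries" figure in the commit message.
const MAX_ENTRIES: usize = 255;

/// Hypothetical lookup table from a delimiter key to the position
/// where that delimiter was last seen.
struct BoundedDelims {
    seen: HashMap<u32, usize>,
}

impl BoundedDelims {
    fn new() -> Self {
        BoundedDelims { seen: HashMap::new() }
    }

    /// Record `pos` for `key`. Returns `false`, and stores nothing, when
    /// the table is full and `key` is not already present, so adversarial
    /// input cannot grow the table without bound.
    fn insert(&mut self, key: u32, pos: usize) -> bool {
        if self.seen.len() >= MAX_ENTRIES && !self.seen.contains_key(&key) {
            return false;
        }
        self.seen.insert(key, pos);
        true
    }

    /// Number of distinct keys currently stored.
    fn len(&self) -> usize {
        self.seen.len()
    }
}
```

With a cap like this, input that would otherwise create thousands of distinct keys stops adding entries after the 255th, while updates to keys that are already present still go through.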


This changes things so that `$$ $ $$` is not parsed as display math. Doing that doesn't actually make sense, since it's going to make a parse error at the end anyway.

https://pandoc.org/try/?params=%7B%22text%22%3A%22%24%24+%24+%24%24%22%2C%22to%22%3A%22html5%22%2C%22from%22%3A%22commonmark_x%22%2C%22standalone%22%3Afalse%2C%22embed-resources%22%3Afalse%2C%22table-of-contents%22%3Afalse%2C%22number-sections%22%3Afalse%2C%22citeproc%22%3Afalse%2C%22html-math-method%22%3A%22plain%22%2C%22wrap%22%3A%22auto%22%2C%22highlight-style%22%3Anull%2C%22files%22%3A%7B%7D%2C%22template%22%3Anull%7D



notriddle mentioned this pull request Jun 22, 2023

HTML has higher priority than block quotes #674

Closed

notriddle force-pushed the notriddle/fuzz-pandoc branch from 3b43baf to f28b622 Compare June 22, 2023 23:53

mgeisler reviewed Jun 24, 2023


notriddle force-pushed the notriddle/fuzz-pandoc branch 2 times, most recently from 26d9ebb to c459690 Compare June 27, 2023 18:34

notriddle force-pushed the notriddle/fuzz-pandoc branch 3 times, most recently from 64de992 to 567866b Compare October 13, 2023 01:11

notriddle force-pushed the notriddle/fuzz-pandoc branch 4 times, most recently from 69b81e6 to d03e618 Compare October 26, 2023 01:29

notriddle force-pushed the notriddle/fuzz-pandoc branch from 0e44799 to 56b45e0 Compare October 30, 2023 21:12

mgeisler reviewed Nov 3, 2023


notriddle force-pushed the notriddle/fuzz-pandoc branch from 8d3923e to 2098f86 Compare November 14, 2023 22:48

notriddle force-pushed the notriddle/fuzz-pandoc branch 2 times, most recently from 7cc429e to e6caf75 Compare November 24, 2023 22:54

notriddle force-pushed the notriddle/fuzz-pandoc branch 4 times, most recently from c8a7098 to 9b5cd31 Compare January 21, 2024 18:40

notriddle force-pushed the notriddle/fuzz-pandoc branch from 9b5cd31 to 30bbeb4 Compare January 23, 2024 20:23

Martin1887 force-pushed the master branch from cc49653 to 2c0aebf Compare March 1, 2024 17:11

notriddle force-pushed the notriddle/fuzz-pandoc branch 2 times, most recently from 99ca650 to 7c97616 Compare March 5, 2024 19:23

notriddle force-pushed the notriddle/fuzz-pandoc branch from 7c97616 to a17b615 Compare March 25, 2024 17:46

ollpu and others added 2 commits April 17, 2024 18:37

Initial math spec

370ac6b

Based on pulldown-cmark#622 and copied from https://github.com/ollpu/pulldown-cmark/tree/alt-math. Co-authored-by: rhysd <lin90162@yahoo.co.jp>

notriddle and others added 24 commits April 17, 2024 18:37

Clean up some minor nits from code review

e2bcc97

Use a better parsing strategy for display

b2cf23d

This approach, based on @ollpu's suggestion, tracks single `$`s in the inline tree, and merges them later. It avoids having to merge and unmerge them in some corner cases.

Clean up (and add another test) for $$x$x$$

552e0d1

Stop clear()ing the map after successful matches

d47ce0c

Use convenience function

8a503c9

Co-authored-by: Linda_pp <rhysd@users.noreply.github.com>

Clean up indentation

747d923

Adapt tests for html5ever normalization being removed

46f5da4

Fix parsing power bug where math doesn't block lists

ded8707

Avoid incorrect brace matching on $inline$$-shaped problems

39fbe58

Avoid some incorrect treatment of display math lookups

bfeadd5

Improve brace overflow heuristics

80fd33f

Fix bugs in handling of block structure

c405ef4

Add math- to HTML classes inline and display

2cacc65

docs: minor docstring error in display math variant

875cf51

chore: minor comment fix

13ddab0

fix(clippy): <= 0 comparison in a usize

e91b65c

Fix headers in math.txt

a278054

Keep populating math_delims after invalid delim

b547d00

Math refactor

48aed10

- Disallow $$ matching a closing $ and then marching delimiters in `make_math_span`. Instead, retry scanning at the second position.
- Remove the `seen_first` optimization from `MathDelims`. It doesn't work with the retry strategy.

Disambiguate regression test

3d8bb1f

Co-authored-by: Michael Howell <michael@notriddle.com>

Pandoc fuzzer

a59f6ea

Update lockfile

4e02fac

notriddle force-pushed the notriddle/fuzz-pandoc branch from a17b615 to 4e02fac Compare April 18, 2024 16:16
