Implement differential fuzzer for pandoc #673
base: master
Thanks for your contribution. The goal of the project is to support CommonMark and GitHub Flavored Markdown; Pandoc as a target is far outside that scope. Would this fuzzer help catch errors in CommonMark+GFM? The only case I can find is when both pulldown-cmark and commonmark.js are wrong and Pandoc gets it right. On the other hand, this code is independent of the final binary and is only a dev tool. What do you think, @raphlinus?
That's what I'm thinking, yeah. Pandoc lets you select your extensions, such as |
fuzz/src/lib.rs
@@ -186,6 +664,42 @@ pub fn xml_to_events(xml: &str) -> anyhow::Result<Vec<Event>> {
    Ok(events)
}

pub fn normalize_pandoc(events: Vec<Event<'_>>) -> Vec<Event<'_>> {
I think this is cool! I would rename `normalize` below to `normalize_commonmarkjs` or similar.
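To illustrate why one normalizer per reference implementation keeps the comparison readable, here is a minimal sketch of the normalize-then-compare step. Everything in it is hypothetical: `ToyEvent` stands in for `pulldown_cmark::Event`, and merging adjacent text events is only an assumed example of a normalization, not necessarily what `normalize_pandoc` or `normalize` actually do.

```rust
/// Toy stand-in for `pulldown_cmark::Event` (assumption, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum ToyEvent {
    Text(String),
    SoftBreak,
}

/// Hypothetical normalization: merge runs of adjacent `Text` events so that
/// two parsers that split text differently still compare equal.
fn merge_adjacent_text(events: Vec<ToyEvent>) -> Vec<ToyEvent> {
    let mut out: Vec<ToyEvent> = Vec::new();
    for ev in events {
        if let ToyEvent::Text(next) = &ev {
            if let Some(ToyEvent::Text(prev)) = out.last_mut() {
                // Extend the previous text event instead of pushing a new one.
                prev.push_str(next);
                continue;
            }
        }
        out.push(ev);
    }
    out
}

/// Differential check: both streams are normalized before being compared.
fn streams_agree(ours: Vec<ToyEvent>, reference: Vec<ToyEvent>) -> bool {
    merge_adjacent_text(ours) == merge_adjacent_text(reference)
}
```

With a separate, clearly named normalizer for each reference parser, a mismatch reported by the fuzzer can be traced to the quirks of that specific parser.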
use pulldown_cmark::{Event, Tag, TagEnd};
match event {
    Event::Start(Tag::FootnoteDefinition(id)) => {
        if id.starts_with("\n") || id.ends_with("\n") || id.starts_with("\r") || id.ends_with("\r") || id.starts_with(" ") || id.starts_with("\t") || id.contains(" ") || id.contains("\t ") || id.contains(" \t") || id.contains("\t\t") || id.ends_with(" ") || id.ends_with("\t") { return };
Perhaps it would be simpler to use the slice variants of the `starts_with` and `ends_with` patterns:
- if id.starts_with("\n") || id.ends_with("\n") || id.starts_with("\r") || id.ends_with("\r") || id.starts_with(" ") || id.starts_with("\t") || id.contains(" ") || id.contains("\t ") || id.contains(" \t") || id.contains("\t\t") || id.ends_with(" ") || id.ends_with("\t") { return };
+ let whitespace = &['\n', '\r', ' ', '\t'];
+ if id.starts_with(whitespace)
+     || id.ends_with(whitespace)
+     || id.contains("\t ")
+     || id.contains(" \t")
+     || id.contains("\t\t")
+ {
+     return;
+ };
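For reference, the suggested check can be exercised on its own. The sketch below is a standalone version; the function name `has_edge_or_tab_whitespace` is hypothetical and does not appear in the PR, and the pattern is written as an explicit `&[char]` slice, which `str::starts_with` and `str::ends_with` accept as "matches any of these characters".

```rust
/// Hypothetical helper mirroring the suggested condition: returns true when
/// `id` starts or ends with whitespace, or contains one of the tab
/// combinations listed in the suggestion.
fn has_edge_or_tab_whitespace(id: &str) -> bool {
    // A `&[char]` pattern matches if the character at that position is any
    // one of the listed characters.
    const WHITESPACE: &[char] = &['\n', '\r', ' ', '\t'];
    id.starts_with(WHITESPACE)
        || id.ends_with(WHITESPACE)
        || id.contains("\t ")
        || id.contains(" \t")
        || id.contains("\t\t")
}
```

This form only covers the conditions that appear in the suggestion itself; any check present in the original one-liner but absent from the suggestion would need to be added back separately.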
Based on pulldown-cmark#622 and copied from https://github.com/ollpu/pulldown-cmark/tree/alt-math.

Co-authored-by: rhysd <lin90162@yahoo.co.jp>
This feature is loosely based on what 63a29a1 described, but copies [commonmark-hs] more closely (the balanced braces feature is added).

[commonmark-hs]: https://github.com/nschloe/github-math-bugs

It largely ignores GitHub, because its math parsing [is very buggy].

[is very buggy]: https://github.com/nschloe/github-math-bugs
This approach, based on @ollpu's suggestion, tracks single `$`s in the inline tree, and merges them later. It avoids having to merge and unmerge them in some corner cases.
The essential problem is: every time you write `$$x$}`, you get another entry added to a hash table. Even if it's not [theoretically] *quadratic*, it's still slow. Hard limiting it to 255 entries makes this not a problem.

Interestingly enough, when I tried to write an analogous torture test for code spans, I couldn't find a way to do it because code spans are keyed by their *length* instead of their *position*. In order to get N entries in the hash table, I basically had to write N `` ` `` in a row, forcing me to write quadratic amounts of input text.

Comparison:

```
michaelhowell@Michael-Howells-Macbook-Pro pulldown-cmark % python3 -c 'print("$$x$}"*5000)' | time target/release/pulldown-cmark.old -M > /dev/null
target/release/pulldown-cmark.old -M > /dev/null  2.63s user 0.02s system 99% cpu 2.673 total
michaelhowell@Michael-Howells-Macbook-Pro pulldown-cmark % python3 -c 'print("$$x$}"*5000)' | time target/release/pulldown-cmark.new -M > /dev/null
target/release/pulldown-cmark.new -M > /dev/null  0.01s user 0.00s system 6% cpu 0.109 total
```

[theoretically]: http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
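The hard limit described above can be sketched as a table that refuses new keys once it is full. This is an illustration under stated assumptions, not pulldown-cmark's actual data structure: `BoundedTable`, its `usize` key and value types, and the `insert` behavior are all hypothetical, and only the figure 255 comes from the commit message.

```rust
use std::collections::HashMap;

/// Cap taken from the "255 entries" figure in the commit message.
const MAX_ENTRIES: usize = 255;

/// Hypothetical position-keyed table with a hard size limit.
#[derive(Default)]
struct BoundedTable {
    entries: HashMap<usize, usize>,
}

impl BoundedTable {
    /// Stores `value` under `key` and returns true. Once the table already
    /// holds MAX_ENTRIES distinct keys, a new key is rejected and false is
    /// returned; updating an existing key is still allowed.
    fn insert(&mut self, key: usize, value: usize) -> bool {
        if self.entries.len() >= MAX_ENTRIES && !self.entries.contains_key(&key) {
            return false;
        }
        self.entries.insert(key, value);
        true
    }

    /// Number of distinct keys currently stored.
    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a cap like this, pathological input such as thousands of repetitions of `$$x$}` can no longer grow the table without bound, so the work done per lookup stays constant in this sketch.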
Co-authored-by: Linda_pp <rhysd@users.noreply.github.com>
This changes things so that `$$ $ $$` is not parsed as display math. Doing that doesn't actually make sense, since it's going to make a parse error at the end anyway.

https://pandoc.org/try/?params=%7B%22text%22%3A%22%24%24+%24+%24%24%22%2C%22to%22%3A%22html5%22%2C%22from%22%3A%22commonmark_x%22%2C%22standalone%22%3Afalse%2C%22embed-resources%22%3Afalse%2C%22table-of-contents%22%3Afalse%2C%22number-sections%22%3Afalse%2C%22citeproc%22%3Afalse%2C%22html-math-method%22%3A%22plain%22%2C%22wrap%22%3A%22auto%22%2C%22highlight-style%22%3Anull%2C%22files%22%3A%7B%7D%2C%22template%22%3Anull%7D
- Disallow $$ matching a closing $ and then marching delimiters in `make_math_span`. Instead, retry scanning at the second position.
- Remove the `seen_first` optimization from `MathDelims`. It doesn't work with the retry strategy.
Co-authored-by: Michael Howell <michael@notriddle.com>