
fuzzgen: Statistics framework #4868

Merged 5 commits on Sep 27, 2022
Conversation

afonso360 (Contributor)

👋 Hey,

This PR introduces a statistics module for fuzzgen. It lets us measure various events and get some insight into what is going on.

I don't want to commit this in its current form, since it has some issues; see the comments below.

Here's the output after 50000 inputs:

== FuzzGen Statistics  ====================
Total Inputs: 50000
Valid Inputs: 29130 (58.3%)
Total Runs: 42358
Successful Runs: 17564 (41.5% of Total Runs)
Timed out Runs: 12445 (29.4% of Total Runs)
Traps:
        user code: heap_oob: 0 (0.0% of Total Runs)
        user code: icall_null: 0 (0.0% of Total Runs)
        user code: int_ovf: 0 (0.0% of Total Runs)
        user code: int_divz: 12349 (29.2% of Total Runs)
        user debug: 0 (0.0% of Total Runs)
        user code: heap_misaligned: 0 (0.0% of Total Runs)
        user code: bad_sig: 0 (0.0% of Total Runs)
        user code: bad_toint: 0 (0.0% of Total Runs)
        user code: interrupt: 0 (0.0% of Total Runs)
        user code: unreachable: 0 (0.0% of Total Runs)
        user code: table_oob: 0 (0.0% of Total Runs)
        resumable: 0 (0.0% of Total Runs)
        user code: stk_ovf: 0 (0.0% of Total Runs)

Comment on lines 173 to 184
fuzz_target!(|bytes: &[u8]| {
    STATISTICS.print();
    STATISTICS.total_inputs.fetch_add(1, Ordering::SeqCst);

    let mut unstructured = Unstructured::new(bytes);
    let testcase = match TestCase::arbitrary(&mut unstructured) {
        Ok(t) => t,
        Err(_) => {
            return;
        }
    };


afonso360 (Contributor, Author) commented Sep 4, 2022


I think doing this loses us the nice cargo fuzz fmt output, which is not great. Do we have some other way of getting the number of inputs the fuzzer tried while using the Arbitrary version of the macro?

Without total_inputs we can't calculate how many inputs we are dropping while generating a TestCase, which is currently the most important metric to optimize. But I really don't want to lose cargo fuzz fmt.

Contributor


As a crude hack, we could duplicate the logic in the libfuzzer crate that's triggered by the RUST_LIBFUZZER_DEBUG_PATH environment variable: https://github.com/rust-fuzz/libfuzzer/blob/63b922685a0e714d2e7d2af9050e8860d5f6dc09/src/lib.rs#L203-L218

I've been trying to work out how to gather statistics like this from all the fuzz targets, and the other problem that bothers me is how to report results at the end of the run. At the moment the option I find most appealing is if somebody sorts out rust-fuzz/libfuzzer#46. As I noted over there, it's also possible to use libc::atexit, but only if you're careful not to use any Rust I/O functions in the at-exit handler.
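
For illustration, here's a minimal sketch of that atexit route, assuming a libc dependency and a made-up TOTAL_INPUTS counter standing in for the real statistics; the handler avoids Rust I/O by writing raw bytes to stderr via libc::write:

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical global counter; stands in for this PR's STATISTICS.
static TOTAL_INPUTS: AtomicU64 = AtomicU64::new(0);

extern "C" fn report_at_exit() {
    // Rust's stdout/stderr handles may already be torn down when at-exit
    // handlers run, so format into a String and emit it with raw libc::write.
    let msg = format!("total inputs: {}\n", TOTAL_INPUTS.load(Ordering::SeqCst));
    unsafe {
        libc::write(2, msg.as_ptr() as *const libc::c_void, msg.len());
    }
}

fn install_exit_hook() {
    unsafe {
        libc::atexit(report_at_exit);
    }
}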

Member


In a variety of fuzz targets I've written in the past, we have something like

use std::sync::atomic::{AtomicUsize, Ordering};

static ITER: AtomicUsize = AtomicUsize::new(0);

fuzz_target!(|data| {
    let n = ITER.fetch_add(1, Ordering::Relaxed);
    // Dump statistics every 1024th iteration, only when Info logging is on.
    if log::log_enabled!(log::Level::Info) && n % 1024 == 0 {
        dump_statistics();
    }

    // ...
});

and this has worked out pretty well, with adjustments to 1024 as needed for the fuzz target's throughput, of course.

Contributor


That's approximately what Afonso has done in this PR, and what Andrew did in the differential fuzz target. It's a pattern that has already helped us find important issues in both fuzzers, so that's great!

An alternative is what the metrics_printer crate does: kick off a thread and dump metrics at a periodic interval. That avoids having to separately tune for each fuzz target's throughput.
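
For concreteness, a minimal sketch of that thread-based pattern, reusing this PR's STATISTICS.print(); the 30-second interval and the Once guard are assumptions:

use std::sync::Once;
use std::time::Duration;

static START_REPORTER: Once = Once::new();

// Called from the fuzz target body; Once guarantees we spawn exactly one
// reporting thread, no matter how many inputs are executed.
fn start_periodic_reporting() {
    START_REPORTER.call_once(|| {
        std::thread::spawn(|| loop {
            std::thread::sleep(Duration::from_secs(30));
            // Hypothetical global from this PR's statistics module.
            STATISTICS.print();
        });
    });
}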

But one thing I'd like to be able to do is get these statistics for OSS-Fuzz runs. I had the thought that instead of slowing down fuzzing, maybe we could fetch the corpus from OSS-Fuzz and replay it locally with stats collection turned on. Those corpora currently range from under 2k up to 25k tests. It's possible each corpus is heavily biased in terms of the statistics we care about, compared to the full run which produced it. So maybe this isn't an effective alternative to just dumping stats to the OSS-Fuzz stderr log. But if it is: I don't see any way right now to make metrics printing useful with so few tests, except by setting an environment variable with the number of inputs and having the fuzz target print after it has processed that many inputs.

Afonso's recent testing has a related issue. I noticed the test plan was "ask libfuzzer to run 100,010 inputs" and the statistics are printed after a hardcoded number of inputs (this PR says 50k but I'm guessing it was 100k for these experiments), so there are about 10 tests that get run at the end without being counted in the stats. If you try to make these two numbers equal, it's tedious to reason about whether the current test is included in the count or not, so adding a few extra tests on the end just to be sure makes total sense.

Again, an environment variable could help since we know how many tests we're trying to run. But in both cases I'd really rather just be able to print the statistics when libFuzzer is done with its run, and not have to tune anything at all.
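
For what it's worth, the environment-variable fallback could be as small as this sketch (FUZZGEN_PRINT_AFTER is a made-up name, and STATISTICS is this PR's statistics module):

use std::sync::atomic::{AtomicUsize, Ordering};

static INPUTS_SEEN: AtomicUsize = AtomicUsize::new(0);

// The runner would set the hypothetical FUZZGEN_PRINT_AFTER variable to the
// number of inputs it asked libFuzzer to execute; we print exactly once,
// after processing that many inputs.
fn maybe_print_stats() {
    let n = INPUTS_SEEN.fetch_add(1, Ordering::SeqCst) + 1;
    if let Ok(target) = std::env::var("FUZZGEN_PRINT_AFTER") {
        if target.parse::<usize>() == Ok(n) {
            STATISTICS.print();
        }
    }
}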

afonso360 (Contributor, Author) commented Sep 23, 2022


> As a crude hack, we could duplicate the logic in the libfuzzer crate that's triggered by the RUST_LIBFUZZER_DEBUG_PATH environment variable: https://github.com/rust-fuzz/libfuzzer/blob/63b922685a0e714d2e7d2af9050e8860d5f6dc09/src/lib.rs#L203-L218

Huh, so that's how they do it! I never knew. I think there's a better solution; see below.

> I've been trying to work out how to gather statistics like this from all the fuzz targets, and the other problem that bothers me is how to report results at the end of the run.

I think switching to iter % n == 0 for printing is a great idea! And when I'm measuring stuff for other PRs I can always adjust how many runs I do so that it prints at the end of the run, e.g. pick n=10k and run 50k when benchmarking, or something like that.

It doesn't fix that exact issue, but at least for my use case it works, and it sounds like it would be useful for OSS-Fuzz as well.


And I've also been thinking that maybe we can do without the total_inputs metric. total_inputs is only used for the "Valid Inputs %", and we can get that metric by other means. When looking at the fuzz log, we can check the run number that printed these statistics and do some math (run 50k has 25k valid inputs, so it's 50% valid). When benchmarking it's also easy to calculate manually when presenting results.

Additionally, we should soon be at 100% valid inputs on this target (only 1 PR left, I hope!), so it would become really obvious if we were missing some runs: Valid Inputs would no longer be 50000 or 100000, it would be something like 41273, and that would stand out.

It would still be nice to have 100% there, but while we figure out a solution I think this is worth merging. (I'm also tired of cherry-picking these 3 commits everywhere 😄.)

Edit:

Here's an example of what I'm proposing:

#46795  NEW    cov: 31167 ft: 170205 corp: 5611/1461Kb lim: 1024 exec/s: 74 rss: 631Mb L: 804/1024 MS: 1 InsertRepeatedBytes-
== FuzzGen Statistics  ====================
Valid Inputs: 20000
Total Runs: 199404
Successful Runs: 197351 (99.0% of Total Runs)
Timed out Runs: 189 (0.1% of Total Runs)
Traps:
        user code: int_divz: 1864 (0.9% of Total Runs)
#46979  NEW    cov: 31167 ft: 170206 corp: 5612/1461Kb lim: 1024 exec/s: 74 rss: 631Mb L: 294/1024 MS: 4 ChangeBinInt-EraseBytes-CopyPart-PersAutoDict- DE: "\xff\xff\xff\xff\xff\xff\xff\xff"-

With this we know we had 20k valid inputs at run ~46795 (there is some uncertainty here) so we have roughly a 42.7% valid rate.

Now that I'm thinking about it, this might not work out as I thought, since we have no guarantee that this prints at the end. So it wouldn't work very well for benchmarking unless we start measuring in terms of valid inputs, which is also doable.

Contributor


I think removing the "total inputs" metric is a good plan. It at least lets us get something merged, even if it isn't perfect for all our use cases. I'll be happy to review and approve that!

The github-actions bot added the cranelift (Issues related to the Cranelift code generator) and fuzzing (Issues related to our fuzzing infrastructure) labels on Sep 4, 2022.
github-actions bot commented Sep 4, 2022

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "cranelift", "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


afonso360 (Contributor, Author) commented Sep 24, 2022

I've made some changes since this was introduced:

  • We now filter out traps that were never triggered, and sort by trap count (see the sketch after this list).
  • As mentioned above, we now print based on "Valid Inputs" instead of "Total Inputs". It's not ideal, but hopefully a temporary solution.
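
For illustration, the filtering and sorting could look roughly like this (the trap_counters map and the print_traps helper are assumptions, not the PR's exact types):

use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

// Keep only traps that actually occurred and list the most frequent first.
fn print_traps(trap_counters: &HashMap<String, AtomicU64>, total_runs: u64) {
    let mut traps: Vec<(&str, u64)> = trap_counters
        .iter()
        .map(|(code, n)| (code.as_str(), n.load(Ordering::SeqCst)))
        .filter(|&(_, n)| n > 0)
        .collect();
    // Sort descending by count.
    traps.sort_by_key(|&(_, n)| std::cmp::Reverse(n));
    for (code, n) in traps {
        println!("\t{}: {} ({:.1}% of Total Runs)", code, n,
                 100.0 * n as f64 / total_runs as f64);
    }
}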

Here's the example output:

#99422  NEW    cov: 32426 ft: 189477 corp: 6858/3354Kb lim: 20182 exec/s: 50 rss: 912Mb L: 264/20154 MS: 2 PersAutoDict-InsertRepeatedBytes- DE: "=\x00I\x0a"-
== FuzzGen Statistics  ====================
Valid Inputs: 50000
Total Runs: 752563
Successful Runs: 736535 (97.9% of Total Runs)
Timed out Runs: 1076 (0.1% of Total Runs)
Traps:
        user code: int_divz: 10656 (1.4% of Total Runs)
        user code: int_ovf: 3131 (0.4% of Total Runs)
        user code: bad_toint: 1165 (0.2% of Total Runs)
#99528  NEW    cov: 32426 ft: 189488 corp: 6859/3359Kb lim: 20182 exec/s: 49 rss: 912Mb L: 4759/20154 MS: 1 CopyPart-

afonso360 marked this pull request as ready for review on September 24, 2022 05:47
jameysharp (Contributor) left a comment


Looks good to me! Just needs a rebase because fuzz/Cargo.toml changed.

jameysharp enabled auto-merge (squash) on September 27, 2022 16:01
jameysharp merged commit 65a3af7 into bytecodealliance:main on Sep 27, 2022