Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project Metabug: Kill Breakpad #375

Closed
15 of 42 tasks
Gankra opened this issue Apr 29, 2021 · 20 comments
Closed
15 of 42 tasks

Project Metabug: Kill Breakpad #375

Gankra opened this issue Apr 29, 2021 · 20 comments

Comments

@Gankra
Copy link
Contributor

Gankra commented Apr 29, 2021

High-level goal: improve rust-minidump (and related libraries) to the point that it can replace all the uses of breakpad in Sentry and Firefox.

For now this should be restricted to the scope of:

  • x86, x64, ARM, ARM64
  • Windows, Android, MacOS, Linux, (iOS?)

Subtasks

NOTE: Larger tasks are checked off even if they have incomplete subtasks to indicate that they are complete for the purposes of the current milestone.


The Context

Minidumps are a Microsoft-designed format for more compact dumps of a process's state when it crashes, notably including full memory dumps of every thread's stacks/registers and mapped code modules (libraries that are linked in and what addresses they were mapped to).

Windows has native APIs for generating minidumps, but this is a feature that's desirable on other platforms, so google-breakpad was created to generate "fake" minidumps on other platforms and process them all uniformly. The most important output of this process is backtraces for every thread, but additional context stored in minidumps may be useful for debugging weird stuff like "the user's antivirus DLL-injected itself into out process and messed everything up" or "oh look the last syscall failed right before we crashed".

Both Firefox and Sentry rely on breakpad for minidump generation and handling. Unfortunately, breakpad is written in dangerous C++ and basically abandoned by google. Mozilla doesn't bother upstreaming our patches anymore, and it's too much work to maintain it.

Usecases

Here's the places where we use breakpad now that should work with a replacement. Each has a codename so that tasks/milestones can reference them.

Mozilla Usecases

  • minidump-stackwalk: On the server-side, Mozilla uses breakpad in minidump-stackwalk to process minidump-based crash reports for socorro.

  • moz-breakpad-client: On the client-side, Mozilla uses breakpad in our crash-reporter to generate minidumps. For content-process (~tab) crashes, the main process does this work out-of-crashing-process. For main-process (full browser) crashes, the main-process does this work in-crashing-process. Ideally we would have a separate crash-reporting process on the side that monitors the others so that all our handling can be out-of-crashing-process.

  • minidump-analyzer: On the client-side, Mozilla uses breakpad in our minidump-analyzer to try to analyze the contents of the minidump using the client machine's knowledge of its own system libraries and any local debuginfo we ship with firefox. This allows us to get more accurate symbolication/unwinding. (This also includes some of our own adhoc symbolication/unwinding code which is Buggy) and ideally would be replaced

  • moz-stackwalk: As a stretch-goal, this work would ideally also replace the need for moz-stackwalk (our own runtime backtracer for debug build backtraces and profiler probing) and fix-stacks (cleans up moz-stackwalk's output using native symbols).

Sentry Usecases

  • symbolicator: On the server-side, Sentry uses breakpad inside of symbolicator to process minidumps and extract a meaningful stack trace.

  • sentry-breakpad-client: On the client-side, sentry-native uses crashpad or alternatively breakpad to create minidumps of the crashing process to send over for server-side post-processing.

Microsoft Usecases?

TBD!

Current Roadmap

Milestone 1 - minidump-stackwalk

Metabug: rust-minidump/rust-minidump#153

Mozilla would like to first get the minidump-stackwalk usecase working, as it's the simplest but also very high traffic (performance matters), and processing user-provided data on our servers (security matters).

minidump-stackwalk only needs to handle pre-processed symbols from our symbol servers (i.e. the breakpad text format), and is operating completely offline from where the minidump was generated.

The hardest part will be generating backtraces for all the threads, which requires a complete offline unwinder.

Milestone 2 - minidump-analyzer

TBD, may choose different goal based on how Milestone 1 goes

Milestone 3 - symbolicator

TBD, may choose different goal based on how Milestone 2 goes

@gabrielesvelto
Copy link
Contributor

Under the first point we could add making rust-minidump print out as many error codes as possible in a readable format. We have patches on top of breakpad in bug 1704034 and bug 1703248. There's another one coming in bug 1705018 which would give complete coverage of Windows errors. The goal here would be to make all crash exception codes and GetLastError() values readable.

@gabrielesvelto
Copy link
Contributor

Also yeah, AArch64/Windows unwinding information is special mostly because it didn't match Microsoft documentation. There were quite a few holes and errors that Nathan had to fill when he implemented it. Our implementation in bug 1529355 is probably the gold standard for it.

@Gankra
Copy link
Contributor Author

Gankra commented Apr 29, 2021

One thing I didn't have a chance to mention in the meeting: does anyone know if gimli's dead unwind-rs has any meat to extract for our purposes?

See also:

Need to read through the former to get a sense of what to salvage from both our impls.

Edit: oh wait this is just shelling out to breakpad, ok cool, thought you had started writing more actual stackwalking code.

@Gankra
Copy link
Contributor Author

Gankra commented Apr 30, 2021

Oh fun the perf team has another unwinder for linux and android64 profiling: https://blog.mozilla.org/jseward/2014/05/13/lul-a-lightweight-unwinder-library-for-profiling-gecko/

Apparently it is really slow to start up: https://bugzilla.mozilla.org/show_bug.cgi?id=1635810

@gabrielesvelto
Copy link
Contributor

On the topic of forked-Breakpad-parity Steven Michaud just added code to read and write macOS-specific __crash_info structures, see bug 1577886. rust-minidump would only need the "reading" part since the "writing" one is part of the exception handler and minidump writer.

@luser
Copy link
Contributor

luser commented Apr 30, 2021

A few notes:

  • TODO: investigate major missing features to flesh out this list

rust-minidump/rust-minidump#22 is one. rust-minidump/rust-minidump#25 would be another, but it's a big chunk of work. Probably OK to punt on it for a first pass.

  • Support grabbing symbols from a symbol server

https://github.com/jrmuizel/pdb-downloader

  • Potentially create a new binary intermediate representation to replace breakpad's text format

    • Double-check if the binary breakpad format is right for this (would rather not by default)

Yeah, don't do that. I think the right thing to do is to move minidump-processor over to using symbolic directly. Then you could use its symcache format for this.

@gabrielesvelto
Copy link
Contributor

luser/rust-minidump#22 is one. luser/rust-minidump#25 would be another, but it's a big chunk of work. Probably OK to punt on it for a first pass.

Yeah those are just busywork. I'll probably tackle them myself in the coming months while I wait for builds and tests to complete.

@Gankra
Copy link
Contributor Author

Gankra commented May 1, 2021

I think next week I'm going to start trying to rewrite minidump-stackwalk in rust just to properly use rust-minidump in anger and feel out what it's actually missing, and potentially get some preliminary performance numbers.

@Swatinem
Copy link
Member

Swatinem commented May 3, 2021

Yeah, don't do that. I think the right thing to do is to move minidump-processor over to using symbolic directly. Then you could use its symcache format for this.

Another goal that we have is to revise the symcache format btw.
There are some long standing issues related to too conservative limits and other bugs.

In the end, integrating a compact and fast format for CFI is another goal. IMO, apples compact unwind format is a good inspiration, and I will catch up on https://blog.mozilla.org/jseward/2013/09/03/how-compactly-can-cfiexidx-stack-unwinding-info-be-represented/ as well.

@luser
Copy link
Contributor

luser commented May 3, 2021

I think next week I'm going to start trying to rewrite minidump-stackwalk in rust just to properly use rust-minidump in anger and feel out what it's actually missing, and potentially get some preliminary performance numbers.

FWIW, the minidump_stackwalk binary in the minidump-processor crate is pretty well modeled after the one in Breakpad (as you can tell). The simplest path to success there would be to pull in serde and make ProcessState serializable so you could write it out to JSON that matches the Socorro implementation (which I also wrote, naturally, so happy to answer any questions on that).

@Gankra
Copy link
Contributor Author

Gankra commented May 3, 2021

The simplest path to success there would be to pull in serde and make ProcessState serializable so you could write it out to JSON that matches the Socorro implementation (which I also wrote, naturally, so happy to answer any questions on that).

So I roughed that out but the socorro format and the ProcessState's structure differ a fair amount as far as a naive derive(Serialize) goes. To what extent is it reasonable to warp the structure of the crate to match the desired output for Firefox's specific tooling? (Of course, we can write manual serialize impls, but at that point I don't know if there's much benefit over just having all the JSON writing logic in minidump-stackwalk itself.)

@willkg
Copy link

willkg commented May 3, 2021

I (Socorro maintainer) am happy to change Socorro as needed. If it turns out there are scripts downstream for Socorro that need a certain structure, Socorro's processor can deal with supporting and migrating them.

@Gankra
Copy link
Contributor Author

Gankra commented May 4, 2021

I've made a metabug to specifically track what's needed for the first milestone: rust-minidump/rust-minidump#153

@luser
Copy link
Contributor

luser commented May 4, 2021

So I roughed that out but the socorro format and the ProcessState's structure differ a fair amount as far as a naive derive(Serialize) goes. To what extent is it reasonable to warp the structure of the crate to match the desired output for Firefox's specific tooling? (Of course, we can write manual serialize impls, but at that point I don't know if there's much benefit over just having all the JSON writing logic in minidump-stackwalk itself.)

That's fair. I don't have a real answer here. The existing ProcessState API is mostly a direct port from Breakpad. The Socorro JSON format was intentionally designed to contain all the information we thought would be useful to display in Socorro. The minidump-processor APIs haven't had any real-world usage outside of the binaries in my repo, AFAIK, so you should feel free to redesign them in any way you deem suitable.

@gabrielesvelto
Copy link
Contributor

I (Socorro maintainer) am happy to change Socorro as needed. If it turns out there are scripts downstream for Socorro that need a certain structure, Socorro's processor can deal with supporting and migrating them.

Something just occurred to me while I was thinking about the JSON output. Unfortunately Socorro is not the only downstream user. The minidump-analyzer which we run on user machines also emits Socorro-like JSON but with all the redundant fields stripped away. It's a subset basically. The generated stack trace is sent via crash pings and documented here.

This format has downstream users: Jim Mathies wrote a tool to symbolicate the crash pings and used Socorro's signature generation library to aggregate crash pings, see here.

@gabrielesvelto
Copy link
Contributor

And here's another thought I had in the context of the minidump-analyzer. Since we ship it on user machines it currently contains a lot of redundant code because Breakpad is all or nothing. For example we have to ship ARM unwinding support even in binaries that only run on x86 machines (and thus process only x86 minidumps).

If support for the various architectures and debuginfo formats could be chosen at compile time it would make the binaries quite a bit smaller.

Note this isn't a feature request, it's more like wishful thinking for when everything else will have already been done.

@willkg
Copy link

willkg commented May 5, 2021

It's possible that no one is looking at the minidump-related parts of crash pings other than Jimm's team. They're using fx-crash-sig (which I maintain) to take a crash ping, symbolicate it with Tecken (which I maintain), and run signature generation on it using siggen (which I maintain). So I can probably handle some amount of change depending on the details. Even so, @gabrielesvelto is right in that it's harder to absorb changes in the crash ping structure than what minidump-stackwalk generates that Socorro consumes.

@Gankra
Copy link
Contributor Author

Gankra commented Jun 10, 2021

We've made a lot of progress, so I've taken the opportunity to cleanup the metabug description and marked off everything that's done! We're getting surprisingly close to completing Milestone 1!

I've also given every usecase a proper bold label so that subtasks can more clearly indicate that they are only needed for a particular usecase.

@Swatinem
Copy link
Member

We removed breakpad from our own symbolicator service 🎉 , and the code in this repo is now behind a feature flag, and will be removed completely on the next major.

@luser
Copy link
Contributor

luser commented Apr 13, 2022

Awesome work, everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants