Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Use rustdoc JSON instead of parsing with syn #906

Open
madsmtm opened this issue Dec 7, 2023 · 5 comments
Open

Proposal: Use rustdoc JSON instead of parsing with syn #906

madsmtm opened this issue Dec 7, 2023 · 5 comments

Comments

@madsmtm
Copy link

madsmtm commented Dec 7, 2023

Parsing Rust code with syn is useful for macro authors, but has its limitations for whole-crate tools such as cbindgen, as is apparent in that it does not properly support namespacing since it has no access to type information and such (and the support for macro expansion is also a bit spotty).

In an ideal world, I think cbindgen should be implemented using rustc_driver, and distributed alongside the Rust compiler in rustup (perhaps with a cargo cbindgen subcommand included); this would give it all the power of rustc, and allow it to do full semantic analysis of the code; weird quirks begone!

But I recognize that that's a huge ask from several teams, and that it may not be desired for the Rust project to take ownership over this project (which is effectively a blocker for using rustc_driver, as doing that out-of-tree / without support from upstream is a huge hazzle).


Luckily, there exists something else we can use! Enter rustdoc's JSON output (documentation for the format here, can be access conveniently via. the rustdoc-types crate). This fully describes a crate's public API, which is more than enough for cbindgen's needs (it only needs to know all pub extern "C" fn, and their dependent types).

While the format itself is yet unstable, it is already depended on in the ecosystem, prominently cargo-semver-checks, so I'd say there is a high likelihood that it will not outright go away without some sort of replacement1.

So: I propose that cbindgen transitions to using this instead of syn, which would solve all of the aforementioned problems with namespacing (and likely make the implementation of cbindgen smaller, and capable of more advanced semantic type analysis in the future (example of something: knowledge of auto traits)).

What do you think?

1: If cbindgen ends up going this route, we should of course state so on the tracking issue.

@obi1kenobi
Copy link

While the format itself is yet unstable, it is already depended on in the ecosystem, prominently cargo-semver-checks

cargo-semver-checks author and maintainer here 👋 It's important to note that cargo-semver-checks invested heavily in infrastructure to isolate us from breaking changes in the rustdoc format, which is why format changes are now not such a big deal for us.

The infrastructure we use is all open-source and based on the Trustfall query engine and the trustfall-rustdoc and trustfall-rustdoc-adapter crates. It's discussed more in this 15min talk I gave at the P99 CONF conference.

If you're interested in reusing some of that work, I'd be happy to point you in the right direction and help you get started.

@emilio
Copy link
Collaborator

emilio commented Apr 14, 2024

Luckily, there exists something else we can use! Enter rust-lang/rust#76578 (documentation for the format here, can be access conveniently via. the rustdoc-types crate). This fully describes a crate's public API, which is more than enough for cbindgen's needs (it only needs to know all pub extern "C" fn, and their dependent types).

Is this true? We also need to know #[repr()] attributes, and non-public members of structs etc as well.

If rustdoc provides all that information, that might be an interesting approach. I'd be happy to take a patch that prototypes it assuming no regressions...

cc #287 for things that would allow us to handle macros.

@obi1kenobi
Copy link

cargo-semver-checks uses rustdoc JSON for all the data points you describe: it has lints that cover types' #[repr()] attributes, extern "C" and other ABIs on functions, and also needs to be able to see non-public members of structs since they are often semver-relevant in non-obvious ways.

We generate rustdoc JSON with the --document-private-items --document-hidden-items flags, since private and hidden items are not included in the JSON by default.

@emilio
Copy link
Collaborator

emilio commented Apr 15, 2024

Cool! Curious, do you know how fast / slow rustdoc happens to be compared to syn?

Also, thinking a bit about what cbindgen features we use in Firefox I came up with:

  • The ability to reuse the cargo metadata across cbindgen invocations to speed up the process. We run cargo metadata on the whole workspace, and then call cbindgen a bunch of times with the same metadata, rather than once per crate. This is presumably supported by rustdoc in a way?
    • At least it seems we could use cargo doc --no-deps <crate> to avoid global work for each invocation. But ideally running cbindgen on let's say, 10 crates of the same workspace would be reasonably fast / not scale linearly with the number of crates.
  • The fact that it runs quite fast.
    • At a glance rustdoc seems probably ok speed-wise? But would be useful to have some profiles if we get to implement this before shipping it.
  • The ability to track the dependencies (which files we read) to make incremental builds fast and correct when you haven't touched relevant files.
    • I don't know if rustdoc does something like this. I guess we'd need to assume "any file touched in a crate makes the crate dirty" or something if it does not? That'd be somewhat unfortunate... cc @jschwe who contributed depfile support, iirc.
  • The fact that it doesn't need to compile. This is particularly useful when paired with bindgen (which does require its C++ to compile). This way, for tightly coupled rust / c++ components, you can run cbindgen first to generate the C++ header, and run bindgen on the C++ code.
    • Not sure how well rustdoc deals with e.g. a missing module that's generated via build.rs... Presumably it does need to run build.rs scripts?
    • This is kind of an abuse of cbindgen, but it is really really useful for us. If nobody else uses it we could drop this, I guess.

@obi1kenobi
Copy link

Rustdoc is a component of rustc, so to my knowledge it behaves very similarly: it runs build scripts, resolves dependencies, requires the code to type-check, handles incremental builds, etc.

I've never benchmarked it versus syn but I'd expect it's often slower because syn isn't doing nearly as much work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants