Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robust message parsing #106

Merged
merged 1 commit into from May 9, 2020
Merged

Robust message parsing #106

merged 1 commit into from May 9, 2020

Conversation

matklad
Copy link
Contributor

@matklad matklad commented Apr 4, 2020

would appreciate a release with this one!

@matklad
Copy link
Contributor Author

matklad commented Apr 4, 2020

See also upstream issue: serde-rs/json#647

Though, because cargo guarantees ndjson format, I feel like this lines-based impl is actually more proper.

@matklad matklad force-pushed the robust-streams branch 2 times, most recently from 27bbd83 to 1a3220f Compare April 4, 2020 10:20
@ehuss
Copy link
Contributor

ehuss commented Apr 4, 2020

See also #102 (comment) for a discussion about this. I personally think it is better to check if the line starts with { and pass-through all others. Unfortunately Cargo doesn't say when to stop parsing, so things like cargo test and cargo run can be tricky, though fixing that would be easy. Also, Cargo itself doesn't emit json errors (something I've wanted to fix for a long time), so mapping to a io::Error means you lose the actual message, which is probably not a pattern I would want to encourage.

@matklad
Copy link
Contributor Author

matklad commented Apr 4, 2020

so mapping to a io::Error means you lose the actual message, which is probably not a pattern I would want to encourage.

I believe the message is preserved, and that the original serde error can be recovered via downcasting as well:

Err(err) => {
    eprintln!("err = {:?}", err);
    eprintln!("err = {}", err);
    let err = err.get_ref().unwrap().downcast_ref::<serde_json::Error>();
    eprintln!("err = {:?}", err);
    bad += 1
},
-------
running 1 test
err = Custom { kind: InvalidData, error: Error("expected value", line: 1, column: 1) }
err = expected value at line 1 column 1
err = Some(Error("expected value", line: 1, column: 1))
test parse_stream_is_robust ... ok

This does sound like a perfect use-case for io::ErrorKind::InvalidData.

Checking for { is only optimization I think? Like, proc macro can print {}, and we'd still fail to deserialize it as Message. So i'd rather just lean of serde's error reporting here.

Separating messages from other output properly is an interesting, hard problem, which I think doesn't have a perfect solution. Like, you can't even assume utf-8 encoding. So I'd rather keep this API as the ideal solution for dealing with cargo check / cargo build, and, for anything more complex, leave the implementation to the consumer. They probably will need a really complicated machinery for full generality, like the infamous read2 copy.

@matklad
Copy link
Contributor Author

matklad commented Apr 4, 2020

More specifically, I can imagine a tad more general interface, which returns an io::Result<Either<Message, String>>, but I believe that:

  • it will make check & build cases more complicated without any benefit
  • it won't solve the run case, as, for run, you'd need to handle stderr (and non-utf8 output).
  • it won't be a part of perfect solution for the run case, as that involves read2 to drain both stdout and stderr, and read2 interface (and, in general, the whole posix-style IO) doesn't compose well with blocking iterators

@ehuss
Copy link
Contributor

ehuss commented Apr 5, 2020

Checking for { is a convenience to differentiate between "this is definitely not JSON" and "this is maybe JSON, but didn't parse correctly", where in the former case should just be displayed as-is to the user, whereas the later you may want to show an error.

I don't see how to get the original line from the error object.

I'm not sure why it needs to be particularly complex or specialized. It just needs to return a message or a string. I think rustc's bootstrap is a good use case example (code), and shows the code doesn't need to be too complex.

BTW, I posted rust-lang/cargo#8069 as a solution to know when to stop parsing the output. Would appreciate if you have any comments on it.

@matklad
Copy link
Contributor Author

matklad commented Apr 5, 2020

I don't see how to get the original line from the error object.

Ah, sorry, I've completely misunderstood what do you mean by " you lose the actual message", I was thinking about serde's error message.

It just needs to return a message or a string.

Hm, what Item type do you envision for the Iterator then? I think it has to be io::Result<Something>, as we need to report the outer reader errors. For something, we need to handle three possibilities:

  1. It's definitelly not a JSON message
  2. It's a JSON message
  3. It looks like JSON message, but fails to parse.

And in the last case we can't re-use serde_json::Error as is, as it doesn't contain the original string. That gives us something like this:

io::Result<Result<Message, (String, Option<serde_json::Error>)>>

I don't see nice ways to simplify that substantionaly. It seems like just writing the loop manually as is done in the rustc's bootstrap code might actually be more readable.

Writing this all down, I am not actually sure if this streamp-parsing functionality even should exist in this crate, but, if it does, I think it should use the line splitting approach.

@roblabla
Copy link
Contributor

roblabla commented Apr 16, 2020

What's the motivation for this? Does cargo ever send us non-JSON output? Shouldn't that be considered a cargo bug and fixed there, instead of awkwardly worked around here?

EDIT: NVM, it's linked to #102.

I'm asking because I know cargo build can also create non-JSON messages if proc macros call println, which I'm not sure if it should be considered a cargo bug.

@matklad
Copy link
Contributor Author

matklad commented Apr 16, 2020

1a3220f#diff-8294540e38a75a7ad333709397296910R480-R481

We hit this in real world with rust-analyzer

@oli-obk
Copy link
Owner

oli-obk commented Apr 29, 2020

I'm sorry, I'm not sure I get what's going on in this discussion. I seem to be missing information from other places.

@ehuss @roblabla do you have any concerns about this impl that will be problematic for us in the future? If so, do you have any concrete modification suggestions to this PR?

@ehuss
Copy link
Contributor

ehuss commented Apr 29, 2020

Yea, sorry, I've been meaning to respond. I'll try to come up with a concrete suggestion soon.

@roblabla
Copy link
Contributor

I've been thinking about this for a bit. I think this PR is fine for inclusion as-is. There are two separate problems with cargo-metadata and JSON parsing that it helps with:

  1. During the cargo build, there will sometimes be non-JSON output interspersed in stdout, despite using message-format=json. IMO this is a cargo/rustc bug - when --message-format=json is passed to cargo, stdout shouldn't have any non-JSON output at least during the build. I opened a cargo issue, Proc macros can cause non-json output in stdout despite message-format=json rust-lang/cargo#8179

    Still, even though it's a cargo bug, having this PR in is a good idea: it allows us to gracefully recover when a stray print happens. It work on cargo as it exists today, it avoids problems if cargo regresses later, and it allows cargo-metadata to work on outdated cargo versions when the problem is fixed.

    The current PR doesn't provide a mechanism to recover the message that failed to parse as JSON. IMO this is fine, since ultimately this is caused by a cargo bug that needs to be fixed upstream. I don't think the lost message will cause all that much problem - I'm highly doubtful there's any tools relying on the stdout output of proc-macros.

  2. After the build is done, there isn't any guarantee that the stdout will contain JSON, e.g. when doing cargo run. At this point, the rest of the data can be just about anything. This is linked to Error deserialising output of cargo test --doc #102.

    In this use-case, this PR avoids a crash, but it doesn't provide a mechanism to recover the data that failed to parse to JSON, which might be necessary for some tools to parse. For instance, the rest of the data could be another JSON stream (for instance, when running cargo test -- --format json) or it could be binary data/invalid utf8, etc...

    Attempting to recover the "lost" message in the error type here doesn't make sense. The output might not be a line, it might be of arbitrary size, it might be invalid utf8.

    To help deal with this, ehuss added a build-finished message type to cargo (Add "build-finished" JSON message. rust-lang/cargo#8069), signaling the end of a cargo build and the rest of the output stream is arbitrary. IMO, the ideal solution would be for cargo-metadata to stop deserializing messages after the marker, and allow the user to extract the std::io::Read instance passed to parse_stream. The user would then be free to do what they need to do with the std::io::Read (write the rest of the stream to disk, parse it, etc...).

    This PR lays some groundwork for this. First of all, parse_stream introduces its own type that can have arbitrary logic, so it allows us to add some logic to stop parsing messages after the build-finished marker is reached, and adding our own functions to recover the BufRead after iteration is done, without introducing a breaking change.

So, no, I don't think this PR is problematic.

@ehuss
Copy link
Contributor

ehuss commented May 8, 2020

Taking another look at this, what do you think about just adding another variant to Message that indicates a raw stdout message? That can be used if the line is not valid JSON.

@matklad matklad force-pushed the robust-streams branch 3 times, most recently from e60e52c to 78e9115 Compare May 8, 2020 21:49
@matklad
Copy link
Contributor Author

matklad commented May 8, 2020

@ehuss I like your solution, implemented it!

The drawback is that we now can in theory mask a serde error, but this bug seems unlikely in practice, and, given that we control both producer and consumer of messages here, shouldn't be hard to fix.

@@ -110,23 +110,51 @@ pub enum Message {
/// This is emitted at the end of the build as the last message.
/// Added in Rust 1.44.
BuildFinished(BuildFinished),
/// A line of text which isn't a cargo or compiler message.
/// Line separator is not included
TextLine(String),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please apply #[serde(skip)] to this variant, we should never see a text-line from cargo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, just slapping skip broke serde: matklad@0b1c8d8

I think it intrracts badly with #[serde(other)] below. I'll try to read on other and figure out how to convince serde do what we want to do here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that was genuine serde bug: serde-rs/serde#1804

Bumped serde-derive req in Cargo.toml

@oli-obk oli-obk merged commit c6f7e8a into oli-obk:master May 9, 2020
@oli-obk
Copy link
Owner

oli-obk commented May 9, 2020

Published 0.10.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants