Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify HTTP Car Requests #195

Closed
hannahhoward opened this issue Apr 21, 2023 · 4 comments
Closed

Verify HTTP Car Requests #195

hannahhoward opened this issue Apr 21, 2023 · 4 comments
Assignees

Comments

@hannahhoward
Copy link
Collaborator

Goals

Building on #193 , add verification of CAR blocks as we download.

Implementation

My recommendation for implementation:

  • open a linear reader on the incoming CAR file from the response body
  • start a selector traversal based on the parameters of the request
    • for the block read opener:
      • each time it is called, read the next block from the CARv1 stream (obviously a blocking read if not data is not yet available from the HTTP response)
      • hash it, verify the bytes match the CID passed to the block read opener
        • if matched, write the block in the into the outgoing HTTP CarV1 from Lassie and return the bytes as a reader to the selector traversal (TeeReader probably works well here)
        • if not matched, error
@rvagg
Copy link
Member

rvagg commented Apr 24, 2023

The way I imagined this being done is similar to how I did the go-car stdin extractor: https://github.com/ipld/go-car/blob/3476971d97cd992991ade75087d9273adcf659e6/cmd/car/extract.go#L378-L444

But the big catch with this approach is buffering.

  1. Currently that code doesn't delete used blocks, I don't think it can because I think it's possible for a Get() to be called for the same CID during a UnixFS unpack; but in our case I think we can ditch blocks as we've read & checked them.

  2. Out-of-order blocks are a big problem that we need to figure out how to deal with. If an upstream provider is giving us blocks in the wrong order, we either need to reject it as "malformed" (I'd love to be able to do this but I can imagine resistance to such an approach), or we need to buffer the blocks until we get the next one we expect. If we allow out-of-order responses then we can reformat the CAR we're sending out to the user based on our own traversal logic so they're always well-formed regardless of the upstream provider, but there's an OOM risk involved in doing this.

  3. What's the current status of discussion on well-formedness of HTTP CARs @willscott? I was hoping that we'd get to a spec with some really clear rules and associated strictness around ordering such that we could even have fixtures that resolve to output CARs that are byte-for-byte perfect matches. It's one of the reasons I was working on feat(test): file/dir name generator that's more realistic #185, to get to some nice complex but shareable fixtures for the various cases.

@willscott
Copy link
Contributor

Ideally this should use the same validation that the bifrost-gateway uses.
I think that's happening in ipfs/bifrost-gateway#75

@rvagg rvagg mentioned this issue Apr 26, 2023
@willscott
Copy link
Contributor

coming back to elaborate on 2/ 3:

My understanding of the intention of the trustless gateway spec is to require a strict traversal ordering. That should remove potential for out of order blocks / should allow for deterministic / byte-for-byte output matches.

@aschmahmann
Copy link

I think that's happening in ipfs/bifrost-gateway#75

I'm not planning to handle this in that PR, but that'd be a follow up (see ipfs/bifrost-gateway#62, it's currently listed as step 4 but maybe 5 will end up coming first).

My understanding of the intention of the trustless gateway spec is to require a strict traversal ordering. That should remove potential for out of order blocks

This should make verification simpler to implement. idk if users should be allowed to ask for more data + hope gzip saves them in order to reduce memory usage from buffering as Rod alluded to. Something to figure out in the spec PR for what people want here.

should allow for deterministic / byte-for-byte output matches.

That's not necessarily true, there are a few more things that might need to be specified beyond traversal order see ipfs/specs#402 (comment) and the related thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants