Create IPIP with Gateway spec for partial CAR exports #348

Closed
2 of 4 tasks
lidel opened this issue Mar 7, 2022 · 15 comments · Fixed by #402
Labels
effort/hours Estimated to take one or several hours kind/enhancement A net-new feature or an improvement to an existing feature P1 High: Likely tackled by core team if no one steps up status/blocked Unable to be worked further until needs are met

Comments

@lidel
Member

lidel commented Mar 7, 2022

Context

ipfs/kubo#8758 adds support for CAR export via Gateway.
It exports entire dag as a CAR stream, which does not cover all use cases.

For example, thin clients may want to export unixfs directory root block + its immediate children, or progressively fetch a big DAG from multiple gateway endpoints.

Why we need selector support

  • Verifiable HTTP Gateway Responses (in-web-browsers#128)
    • for mobile web browsers (content integrity without battery drain caused by full p2p)
      • mobile browser should be able to traverse huge unixfs directory tree without having to fetch everything (only root block + root blocks of immediate children are needed for generating useful dir listing)
    • for IoT devices and other thin clients
      • fetching bigger DAGs progressively, load-balancing/falling back if some gateways are too slow/unreliable – makes HTTP more useful and pushes back the moment when an expensive p2p retrieval has to be spawned

Scope

  • query param
  • HTTP header
  • TBD configurable size budget for CAR stream + UnixFS downloads
  • TBD allow selectors everywhere? (UnixFS? dag-cbor/json?)

Proposed design (A) 💢

The go-car library supports passing selectors, the idea is to add a parameter to do just that.

We have to URL-escape selector somehow, either way,
so the choice is between encodeURIComponent and multibase encoding:

Text (JSON) representation:

/ipfs/{cid}?format=car&selector.json=encodeURIComponent({json serialization of selector})

Binary (CBOR) representation:

/ipfs/{cid}?format=car&selector.cbor=multibase({cbor serialization of selector})
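For illustration, here is a minimal Python sketch of the two encodings in (A). The helper names are made up for this example, the JSON selector passed in is whatever the caller has, and the CBOR bytes are treated as an opaque, already-serialized input (producing them would need a CBOR library). The multibase variant shown is base64url (prefix `u`, no padding); other multibase encodings would work the same way.

```python
import base64
import json
from urllib.parse import quote


def selector_json_param(selector: dict) -> str:
    """Percent-encode a JSON selector like JavaScript's encodeURIComponent."""
    compact = json.dumps(selector, separators=(",", ":"))
    # encodeURIComponent leaves A-Z a-z 0-9 - _ . ! ~ * ' ( ) unescaped.
    return quote(compact, safe="-_.!~*'()")


def selector_cbor_param(cbor_bytes: bytes) -> str:
    """Multibase-encode serialized CBOR bytes as base64url ('u', no padding)."""
    encoded = base64.urlsafe_b64encode(cbor_bytes).rstrip(b"=")
    return "u" + encoded.decode("ascii")


def car_url_json(cid: str, selector: dict) -> str:
    """Build the text (JSON) form of the request path from design (A)."""
    return f"/ipfs/{cid}?format=car&selector.json={selector_json_param(selector)}"


def car_url_cbor(cid: str, cbor_bytes: bytes) -> str:
    """Build the binary (CBOR) form of the request path from design (A)."""
    return f"/ipfs/{cid}?format=car&selector.cbor={selector_cbor_param(cbor_bytes)}"
```

Even a tiny selector grows noticeably once percent-encoded, which is one of the reasons the JSON form is awkward in a URL bar.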

Proposed design (B) 💢

/ipfs/{cid}?format=car&selector={cid2}

Here {cid2} is a CID representing selector data. It could be dag-cbor, dag-json.
Small ones could be inlined (with identity hash), bigger ones could be fetched once and reused efficiently.
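To make "inlined with identity hash" concrete, here is a rough Python sketch that wraps arbitrary bytes in a CIDv1 with the identity multihash and prints it as base32 multibase. The function names are hypothetical; the codec codes used (0x0129 for dag-json, 0x55 for raw, 0x00 for identity) come from the multicodec table. What the selector bytes should contain is left to the caller.

```python
import base64

DAG_JSON = 0x0129  # multicodec code for dag-json
RAW = 0x55         # multicodec code for raw bytes
IDENTITY = 0x00    # multihash code for the identity "hash"


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def identity_cid(data: bytes, codec: int) -> str:
    """Return a CIDv1 string (base32, 'b' prefix) that inlines `data`."""
    cid = (
        encode_varint(1)            # CID version
        + encode_varint(codec)      # content codec
        + encode_varint(IDENTITY)   # multihash function
        + encode_varint(len(data))  # digest length
        + data                      # the "digest" is the data itself
    )
    return "b" + base64.b32encode(cid).decode("ascii").rstrip("=").lower()


def selector_url(cid: str, selector_bytes: bytes) -> str:
    """Build the design (B) request path with an inlined dag-json selector."""
    return f"/ipfs/{cid}?format=car&selector={identity_cid(selector_bytes, DAG_JSON)}"
```

Because the whole selector travels inside the CID, this only suits small selectors; anything larger would need to be published as a regular block and referenced by its hash-based CID.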

Proposed design (C) 🤏

Over time, we realized this is the most useful and safest way.
No selector CIDs, only predefined, most useful "partial CAR export scope" parameters for now:

/ipfs/{cid}/some/subpath/file?format=car&dag-depth=1&include-path=true
  • dag-depth=1 means "root+direct children only" – good for fetching UnixFS dir listing with file sizes / types, or splitting bigger DAGs into partial retrievals over multiple gateways / threads
  • include-path=true will also include blocks for all parent nodes on the content path (/ipfs/{cid}/some/subpath, /ipfs/{cid}/some, and /ipfs/{cid}) – allows light clients to save round trips and get everything in a single request-response.
  • leaves and bytes, proposed by Hannah (see #348 (comment))
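A client-side helper for (C) could look roughly like the Python sketch below. The parameter names `dag-depth` and `include-path` are taken from the example URL above and were still a proposal at this point; the function name and its arguments are made up for illustration.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def partial_car_url(
    cid: str,
    subpath: str = "",
    dag_depth: Optional[int] = None,
    include_path: bool = False,
) -> str:
    """Build a gateway path for a partial CAR export as proposed in (C)."""
    path = f"/ipfs/{cid}"
    if subpath:
        # quote() keeps "/" by default, so path segments stay separated.
        path += "/" + quote(subpath.strip("/"))
    params = [("format", "car")]
    if dag_depth is not None:
        if dag_depth < 0:
            raise ValueError("dag_depth must be >= 0")
        params.append(("dag-depth", str(dag_depth)))
    if include_path:
        params.append(("include-path", "true"))
    return path + "?" + urlencode(params)
```

Omitting both optional arguments yields a plain `?format=car` request, which matches the existing "export the entire DAG" behaviour.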

Proposed design (D) 🙏

Better ideas would be really welcome here 👀
Please comment below.


My initial thought was to have "single way of passing selectors", but if you find each approach brings value to different use cases, we could support both.

👉 NOTE: whatever we come up with here, we most likely want Kubo to support the same convention in ipfs dag CLI (and RPC API at /api/v0/dag/*) – see ipfs/kubo#8239

@lidel lidel added kind/enhancement A net-new feature or an improvement to an existing feature P1 High: Likely tackled by core team if no one steps up effort/hours Estimated to take one or several hours labels Mar 7, 2022
@willscott
Contributor

I'm not personally a huge fan of selector.<codec>. I wonder if instead of multibase({cbor serialization of selector}) it could be a cid with identity hash, so specifying codec, and multibase

@lidel
Member Author

lidel commented Mar 8, 2022

I like the idea of it being a CID!
Small ones could be inlined, bigger ones could be fetched once and reused efficiently.
Added it as (B)

@lidel
Member Author

lidel commented Mar 14, 2022

  • Note on cache control: DAG walk implemented by IPLD is deterministic, so we could indicate that the response can be cached (TBD whether it is revalidated in the background).

  • Note on resuming partial downloads (think: IoT device on poor wifi).
    HTTP Range requests require knowing total size of CAR upfront, and we are unable to do that without fetching entire thing first.

    • This is why we should have CAR+selector based resume logic in place
    • Q: "entire dag" selector is expensive. Should we refuse to handle requests with no selector and always require people to provide one, with some predefined ones in docs, like "root+one level deep" before "full dag"?

@warpfork
Member

  • confirm traversal walks (and thus selectors) have a deterministic canonical order (and if that's not easy enough to point at in a specific heading in our specs and docs, that's a bug in the specs and docs).
    • ... mind that CAR order is not deterministic per the CAR spec; CARs are just a bag of blocks. But it should be clear enough for some system to itself declare "this CAR must use the standard order" (and in practice right now I think all of our implementations already emit CARs that do so). Just a subtle distinction about who owns that decision, and which things validate or are strict about that.
  • fwiw, we did get some resumable selector features lately! See ipld/go-ipld-prime#358 ("Implement option to start traversals at a path")
  • fwiw, I think HTTP Range Requests would still be neat to try to support, if possible. I think a "dumb" HTTP cache around an IPFS Gateway being able to support Range requests on a CAR sounds like a nice-to-have. (But this isn't to detract from the comments we should have resumable selectors too, etc.)

@aschmahmann
Contributor

fwiw, we did get some resumable selector features lately! ipld/go-ipld-prime#358

My understanding is that this requires basically stored context on the node you are retrieving from, so is more like extra state for resuming a broken connection than resumable selectors.

fwiw, I think HTTP Range Requests would still be neat to try to support, if possible. I think a "dumb" HTTP cache around an IPFS Gateway being able to support Range requests on a CAR sounds like a nice-to-have. (But this isn't to detract from the comments we should have resumable selectors too, etc.)

IMO range requests for CAR files seem like an iffy thing to support on gateways. In the general case they're costly to create, so asking for bytes 1000MB-1001MB of a CAR file seems like a small request but in reality is very costly on the server. Since clients and servers may be run and developed by different parties, it wouldn't be great to encourage client developers to build tooling around range requests.

Sometimes they're a good idea, for example IIUC https://github.com/filecoin-project/boost/ plans to allow for ingesting data as CAR files with range requests. However, IIUC they have a few benefits

  1. the user they're downloading the data from must have computed the full CAR file ahead of time anyway (to get a CommP for a Filecoin deal)
  2. the user in any event needs to keep serving the data indefinitely until the transactions are completed because they are the ones requesting the download
  3. there is a built in expiration time for how long to keep the CAR file around which is "until the user is done uploading it to the relevant providers"

However, I suspect in our case having range requests all the time is a bad idea and having it only some of the time is more likely to cause confusion than not. I'm by no means an expert in the various HTTP tools that exist out there though, so maybe this "sometimes range request" pattern is common enough to be worth supporting.

Q: "entire dag" selector is expensive. should we refuse handling requests with no selector, and require people to provide one, always + have some predefined ones in docs, like "root+one level deep" before "full dag"?

I don't know that I'd do this long before we put other limits on gateway usage like not downloading 100GB files over public gateways. If we want to allocate some configurable size budget for CAR + UnixFS downloads though that sounds pretty sane to me.

Yes, we should definitely have some recipes of common selectors or patterns of use. It's going to be a whole new way of people accessing data and therefore of confusing people. It's possible a few will be so common that it'll be worth considering aliasing them to something easier to read in a URL bar.

/ipfs/{cid}?format=car&selector={cid2}

This mostly makes sense to me, although there are a few footguns I think we should watch out for here. These aren't blockers and people will hopefully do mostly sane things, but IMO when writing new specs here it's better not to leave too much undefined as then you start having to assume the worst case scenario everywhere.

  1. Sane CID limits, I don't know what the magic number is here, but there's some number. Maybe the number isn't relevant here since URL limits might hit us first, but either way there is going to be some maximum CID size we're allowed. If it's relevant we should document it.
  2. I do think it's nice that unlike just sending the selector as a parameter there's a way to actually do the request even with larger selectors. However, a) magic numbers again, there's probably a maximum size of selector we're willing to deal with and if we don't decide then something else (e.g. the block size limit) will kick in here since IIUC the selector has to be a single block unless we start being able to pass selectors into the selector parameter 😄.
  3. Some consumers of the gateway API will be unable to advertise content which means that actually moving your "slightly too big" selector to a place where it can be consumed by gateway requests might be a big pain.

Perhaps off topic and related to ipfs/in-web-browsers#182, and if so lmk and we can resume there.

@lidel this issue mentions CAR export with a selector like /ipfs/{cid}?format=car&selector.cbor=multibase({cbor serialization of selector})

  1. What happens if it's /ipfs/{cid}/some/path?format=car&selector.cbor=multibase({cbor serialization of selector})? Do we do the path resolution before the selector, or just error?
  2. Is there a reason selector usage has to be restricted to CAR export? Any reason we wouldn't want to do this for regular UnixFS rendering at least for files (i.e. if the output of the selector presents as bytes)? In theory this would then allow you to do something like /ipfs/{cid}?selector.cbor=multibase({cbor selector for an ADL interpreting BitTorrent infohash links as bytes}) and get a result on the gateway. Directories seem potentially more complicated though.

@lidel
Member Author

lidel commented Mar 16, 2022

asking for bytes 1000MB-1001MB of a CAR file seems like a small request but in reality is very costly on the server

Agree, there is dangerous resource usage asymmetry here, and no clear benefit when compared to progressive download with shallow selectors. I updated ipfs/kubo#8758 – it now returns CAR stream with Accept-Ranges: none to avoid any confusion and incentivize people to use selectors instead.

If we want to allocate some configurable size budget for CAR + UnixFS downloads though that sounds pretty sane to me.

Yep, added to the TBD scope, we may extract it to separate issue.

Yes, we should definitely have some recipes of common selectors or patterns of use. [..] It's possible a few will be so common that it'll be worth considering aliasing them to something easier to read in a URL bar.

/ipfs/{cid}?format=car&selector={s} [..] Do we do the path resolution before the selector [..]

yes

Is there a reason selector usage has to be restricted to CAR export?

no reason to restrict. as we discussed earlier this week, selector could be something we apply to default responses, in which case it would return stream of bytes + we already have means of customizing content-disposition filename for that: /ipfs/{cid}?selector={cid2}&download=true&filename=selector-output.bin

TBD if we want to allow that in this mvp, or add later.

@lidel
Member Author

lidel commented Mar 23, 2022

This turns out to be more involved, as we are lacking support for dag-json and dag-cbor in various places (e.g. ipfs/go-cid#137, ipfs/kubo#8568). We can't ask users to provide selector CID in any of these formats if we do not support them correctly in our stack.

Blocked until we have dag-cbor and dag-json support story cleaned up in ipfs cid command and go-cid library.

@lidel lidel added the status/blocked Unable to be worked further until needs are met label Mar 23, 2022
@3456091

3456091 commented Apr 3, 2022

I'm working on a project that will want to use this work around verifiable gateway responses. From the discussion above, am I to understand that resuming downloads of CARs will require parsing the CAR as it's downloading, keeping track of the CIDs we want but have yet to receive, then, if the download is interrupted, constructing a new request containing the missing CIDs in a selector?

Especially in the low-powered servers use case, download resumption is going to be important, and if the CAR is to be served with Accept-Ranges: none, I'm curious about how we can address this efficiently.

@willscott
Contributor

there's some work ongoing for more ergonomic selectors to support parts of this. There's recently been selector support added for representing the blocks that constitute a range of a unixfs file.

@hannahhoward - do you have thoughts on where in go-ipfs we need to respect the unixfs reifier / LargeBytes feature detection to get the same behavior as in graphsync?

@lidel
Member Author

lidel commented Jul 19, 2022

In my mind, CAR resumption will not be sending the same request again. The idea is for the client to be smart to import as many blocks as possible, and then send follow-up requests for DAG branches which are missing.
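A rough Python sketch of that client-side bookkeeping follows. It assumes the client has already decoded the blocks it imported and can list each block's child links; the `links` mapping, the `have` set and the function name are hypothetical stand-ins for whatever the client's blockstore provides.

```python
from typing import Dict, List, Set


def missing_branches(root: str, links: Dict[str, List[str]], have: Set[str]) -> List[str]:
    """Return CIDs to request next after an interrupted CAR download.

    `links` maps each imported block's CID to the CIDs it links to, and
    `have` is the set of CIDs already imported and verified. The result is
    the frontier: blocks we lack whose parent we already hold.
    """
    if root not in have:
        return [root]  # nothing usable arrived, so start over from the root
    missing: List[str] = []
    seen = {root}
    stack = [root]
    while stack:
        cid = stack.pop()
        for child in links.get(cid, []):
            if child in seen:
                continue
            seen.add(child)
            if child in have:
                stack.append(child)    # keep walking below blocks we hold
            else:
                missing.append(child)  # sub-DAG root for a follow-up request
    return missing
```

Each returned CID can then be fetched with its own request, possibly from a different gateway, instead of re-sending the original request.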

@lidel
Member Author

lidel commented Jul 19, 2022

Dropping some notes after IPFS Thing 2022:

  • feels like we may want to do more UX work before we pull the trigger on this one
  • subjective temperature check: ?selector=<selector-as-dag-json-cid> raises eyebrows, not the best UX-wise
    • ?selector= opens pandora's box of allowing arbitrary selectors, so we would only safelist a few initially:
      • root+n-levels deep, n-levels without root, a leaf child along with all parents required for resolving it
      • @mikeal suggested hardcoding common selectors in form of predefined URI params.
        • I think we would need at least ?dag-depth=n to unblock use cases that need shallow CARs (n=1 would fetch only the root+child blocks)
  • new open questions about adding /ipld/ and ipld:// appeared, and ways we could signal things like ADLs, schemas, and selectors in more intuitive, user-friendly way (cc @RangerMauve)
    • one idea was to flesh out IPLD signaling around this new namespace, and then reuse it on /ipfs/ using ?ipld= parameter.
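To illustrate what a `?dag-depth=n` parameter would select, here is a small Python sketch of a depth-limited walk over an in-memory DAG. The `links` mapping and function name are assumptions for the example, and the breadth-first order is chosen for simplicity; it is not claimed to match the block order a gateway would emit.

```python
from collections import deque
from typing import Dict, List


def blocks_within_depth(root: str, links: Dict[str, List[str]], depth: int) -> List[str]:
    """List CIDs reachable from `root` in at most `depth` link hops."""
    order = [root]
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        cid, level = queue.popleft()
        if level == depth:
            continue  # do not descend below the requested depth
        for child in links.get(cid, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append((child, level + 1))
    return order
```

With `depth=1` this returns the root plus its direct children, which is the "shallow CAR" case mentioned above.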

I am afraid this is blocked until we figure out some unified UX strategy for IPLD signaling (selectors, ADLs).

@lidel lidel added this to the Best Effort Track milestone Jul 19, 2022
@lidel lidel transferred this issue from ipfs/kubo Nov 24, 2022
@lidel lidel changed the title Gateway: CAR export with selector Gateway: spec for partial CAR export s (selectors?) Nov 24, 2022
@lidel lidel changed the title Gateway: spec for partial CAR export s (selectors?) Gateway: spec for partial CAR export (dynamic/predefined selectors?) Dec 14, 2022
@lidel
Member Author

lidel commented Jan 18, 2023

A very relevant proposal was presented by @hannahhoward today during 5th Move the Bytes Call.

  • fetching CARs from /ipfs/ paths optimized around UnixFS domain/application
    • non-Unixfs could be fetched as raw blocks, or have own namespace (such as /ipld/ proposed in IPIP-293)
  • proposed parameters
    • response includes blocks from the full path by default
      • /ipfs/cid/a/b includes all blocks for b, but also ones to traverse from cid to b
      • not the current behavior in Kubo/Specs, but we could change it (and/or add a flag that controls the behavior)
    • leaves controls if leaves are sent (good for quickly learning about DAG structure, and then fetching leaves in parallel)
      • complements depth (as we don't always know the depth of a DAG)
      • we may end up having parents and leaves flags, as we should not send parents twice
    • bytes=N-M when root CID is a file limits returned blocks to ones that contain requested byte range
  • another good idea was to end every CAR with a “tombstone”, allowing clients (incl. browser JS ones) to identify when CAR stream ended due to error.
    • for CARv1 we could
      • use zero-length raw block, or an identity CID bafkqaaa
      • as a fallback, when the tombstone is not present, retry when hash of last block is invalid (means it got truncated) – better than nothing
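As a sketch of how a client might check for such a tombstone, the Python below walks the length-prefixed sections of a fully buffered CARv1 payload and reports whether the final section is exactly the empty identity CID (bafkqaaa, bytes 01 55 00 00) with no data. The tombstone convention is only the proposal discussed above, the function names are made up, and the header is skipped as opaque bytes rather than decoded.

```python
from typing import Tuple

# Binary form of bafkqaaa: CIDv1, raw codec, identity multihash, length 0.
TOMBSTONE_SECTION = bytes([0x01, 0x55, 0x00, 0x00])


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint at `pos`; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise EOFError("stream ended inside a varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def car_ends_with_tombstone(car: bytes) -> bool:
    """Return True if the CARv1 bytes are well-framed and end with the tombstone."""
    try:
        header_len, pos = read_varint(car, 0)
        pos += header_len  # skip the dag-cbor header without decoding it
        if pos > len(car):
            return False
        last_section = None
        while pos < len(car):
            section_len, pos = read_varint(car, pos)
            end = pos + section_len
            if end > len(car):
                return False  # final section was cut short
            last_section = car[pos:end]  # CID bytes followed by block data
            pos = end
        return last_section == TOMBSTONE_SECTION
    except EOFError:
        return False
```

A streaming client would apply the same framing logic incrementally; older clients that do not know about the tombstone can simply ignore the extra empty block.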

@lidel lidel changed the title Gateway: spec for partial CAR export (dynamic/predefined selectors?) Create IPIP with Gateway spec for partial CAR exports Jan 24, 2023
@lidel
Member Author

lidel commented Feb 10, 2023

Re: detecting truncated CAR stream, there was a proposal to use CARv2 instead of CARv1, below details so we avoid revisiting it:

  • if we switch response to CARv2, we can make this less hacky. instead of fake block, more elegant way of doing this is CARv2 and introducing a new Index type:
    https://ipld.io/specs/transport/car/carv2/#index-format (it could include things like total count and size of streamed blocks, and act as an explicit tombstone/checksum)
    • Downside: index position offset is not known when streaming, which means we need to modify the CARv2 spec and all libraries for v2 to not only support the new index type, but also allow index offset to be -1. This breaks old clients, and using CARv1 feels safer (works everywhere, extra tombstone can be discarded, no breakage of old clients).

@BigLep
Contributor

BigLep commented Feb 16, 2023

As part of Project Rhea, this is critical for improving performance when working with untrusted nodes so we can do better than requesting block-by-block.

Initial design is happening in https://www.notion.so/pl-strflt/HTTP-Gateway-Requests-for-Graphs-as-CARs-001d2a9f5a35418bb0fb7d9d182d24ec?d=8d44d17f00344834b9b72798ca1ea117

@vmx
Member

vmx commented Feb 16, 2023


7 participants