Create IPIP with Gateway spec for partial CAR exports #348

lidel · 2022-03-07T18:35:21Z

Context

ipfs/kubo#8758 adds support for CAR export via Gateway.
It exports the entire DAG as a CAR stream, which does not cover all use cases.

For example, thin clients may want to export a UnixFS directory root block plus its immediate children, or progressively fetch a big DAG from multiple gateway endpoints.

Why we need selector support

Verifiable HTTP Gateway Responses (ipfs/in-web-browsers#128)
- for mobile web browsers (content integrity without battery drain caused by full p2p)
  - mobile browser should be able to traverse huge unixfs directory tree without having to fetch everything (only root block + root blocks of immediate children are needed for generating useful dir listing)
- for IoT devices and other thin clients
  - fetching bigger DAGs progressively, load-balancing/falling back if some gateways are too slow/unreliable – makes HTTP more useful and pushes back the moment when an expensive p2p retrieval has to be spawned

Scope

- query param
- HTTP header
- TBD: configurable size budget for CAR stream + UnixFS downloads
- TBD: allow selectors everywhere? (UnixFS? dag-cbor/json?)

Proposed design (A) 💢

The go-car library supports passing selectors; the idea is to add a parameter to do just that.

Either way, we have to URL-escape the selector somehow,
so the choice is between encodeURIComponent and multibase encoding:

Text (JSON) representation:

/ipfs/{cid}?format=car&selector.json=encodeURIComponent({json serialization of selector})

Binary (CBOR) representation:

/ipfs/{cid}?format=car&selector.cbor=multibase({cbor serialization of selector})
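To make the two encodings in (A) concrete, here is a minimal Python sketch of how a client might build each URL. The helper names and the example CID are hypothetical, and no gateway implements these parameters as written. `quote(..., safe="")` is only a rough stand-in for `encodeURIComponent`, and base64url (multibase prefix `u`) is just one of several multibase encodings that could be picked. The CBOR bytes are taken as input because Python's standard library has no CBOR encoder.

```python
import base64
import json
from urllib.parse import quote


def selector_json_url(cid: str, selector: dict) -> str:
    """Design (A), text form: percent-encode the JSON-serialized selector."""
    # Compact separators keep the URL short; safe="" also escapes "/".
    text = json.dumps(selector, separators=(",", ":"))
    return f"/ipfs/{cid}?format=car&selector.json={quote(text, safe='')}"


def selector_cbor_url(cid: str, selector_cbor: bytes) -> str:
    """Design (A), binary form: multibase-encode the CBOR-serialized selector."""
    # Multibase base64url: prefix "u" followed by unpadded URL-safe base64.
    body = base64.urlsafe_b64encode(selector_cbor).rstrip(b"=").decode("ascii")
    return f"/ipfs/{cid}?format=car&selector.cbor=u{body}"


# Placeholder payloads, not real selectors.
print(selector_json_url("bafyexample", {"a": 1}))
print(selector_cbor_url("bafyexample", b"\xa1\x61\x61\x01"))
```

The JSON form stays readable but grows quickly once escaped; the multibase form is opaque but needs no further escaping.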

Proposed design (B) 💢

/ipfs/{cid}?format=car&selector={cid2}

Here {cid2} is a CID representing selector data. It could be dag-cbor or dag-json.
Small ones could be inlined (with identity hash), bigger ones could be fetched once and reused efficiently.
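For the "inlined" case, a rough sketch of how a small selector could be wrapped into a CIDv1 with an identity multihash, so the selector bytes travel inside the CID itself. The function names are made up for illustration; the multicodec codes (raw 0x55, dag-cbor 0x71, dag-json 0x0129, identity 0x00) and the base32 multibase prefix `b` come from the multiformats tables.

```python
import base64

RAW, DAG_CBOR, DAG_JSON = 0x55, 0x71, 0x0129  # multicodec codes
IDENTITY = 0x00  # multihash code meaning "digest is the data itself"


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used throughout multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def identity_cid(payload: bytes, codec: int) -> str:
    """Build a CIDv1 string whose 'digest' is the payload itself."""
    raw = varint(1) + varint(codec) + varint(IDENTITY) + varint(len(payload)) + payload
    # Multibase base32: prefix "b" + lowercase RFC 4648 base32, no padding.
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


# Placeholder dag-json payload, not a real selector.
cid2 = identity_cid(b'{"a":1}', DAG_JSON)
print(f"/ipfs/{{cid}}?format=car&selector={cid2}")
```

Since the whole selector ends up in the URL, this only makes sense for small ones; bigger selectors would need a regular hashed CID and a way to fetch the block.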

Proposed design (C) 🤏

Over time, we realized this is the most useful and safest way.
No selector CIDs, only predefined, most useful "partial CAR export scope" parameters for now:

/ipfs/{cid}/some/subpath/file?format=car&dag-depth=1&include-path=true

- dag-depth=1 means "root + direct children only": good for fetching a UnixFS dir listing with file sizes / types, or splitting bigger DAGs into partial retrievals over multiple gateways / threads
- include-path=true will also include blocks for all parent nodes on the content path (/ipfs/{cid}/some/subpath, /ipfs/{cid}/some, and /ipfs/{cid}), which allows light clients to save round trips and get everything in a single request-response
- leaves and bytes parameters were proposed by Hannah in a later comment on this issue (#348)
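To illustrate what such a response might contain, here is a toy sketch that walks an in-memory DAG and lists the blocks a gateway could return for a given path, depth, and parent-path setting. The `Dag` shape, block names, and function are all hypothetical, and the exact semantics (for example block order) are assumptions rather than anything specified here.

```python
from collections import deque

# Toy DAG: block id -> {link name: child block id}
Dag = dict[str, dict[str, str]]


def partial_car_blocks(dag: Dag, root: str, path: list[str],
                       dag_depth: int, include_path: bool) -> list[str]:
    """Blocks for /ipfs/{root}/{path...} limited to dag_depth levels."""
    parents = []
    node = root
    for segment in path:  # resolve the content path first
        parents.append(node)
        node = dag[node][segment]
    blocks = list(parents) if include_path else []
    seen = set(blocks)
    queue = deque([(node, 0)])
    while queue:  # breadth-first walk below the resolved node
        current, level = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        blocks.append(current)
        if level < dag_depth:
            for child in dag[current].values():
                queue.append((child, level + 1))
    return blocks


dag = {
    "root": {"some": "A", "other": "X"},
    "A": {"subpath": "B"},
    "B": {"file": "F"},
    "F": {"0": "L0", "1": "L1"},
    "L0": {"deep": "G"},
    "L1": {}, "G": {}, "X": {},
}
print(partial_car_blocks(dag, "root", ["some", "subpath", "file"], 1, True))
```

With depth 1 the walk stops at `L0` and `L1`, so `G` is left for a follow-up request, possibly to a different gateway.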

Proposed design (D) 🙏

Better ideas would be really welcome here 👀
Please comment below.

My initial thought was to have a "single way of passing selectors", but if you find that each approach brings value to different use cases, we could support both.

👉 NOTE: whatever we come up with here, we most likely want Kubo to support the same convention in ipfs dag CLI (and RPC API at /api/v0/dag/*) – see ipfs/kubo#8239


willscott · 2022-03-07T21:01:09Z

I'm not personally a huge fan of selector.<codec>. I wonder if instead of multibase({cbor serialization of selector}) it could be a cid with identity hash, so specifying codec, and multibase

lidel · 2022-03-08T03:26:56Z

I like the idea of it being a CID!
Small ones could be inlined, bigger ones could be fetched once and reused efficiently.
Added it as (B)

lidel · 2022-03-14T16:57:52Z

Note on cache control: DAG walk implemented by IPLD is deterministic, so we could indicate that response can be cached + (TBD if revalidated in the background).
Note on resuming partial downloads (think: IoT device on poor wifi).
HTTP Range requests require knowing total size of CAR upfront, and we are unable to do that without fetching entire thing first.
- This is why we should have CAR+selector based resume logic in place
- Q: "entire dag" selector is expensive. should we refuse handling requests with no selector, and require people to provide one, always + have some predefined ones in docs, like "root+one level deep" before "full dag"?

warpfork · 2022-03-14T17:18:00Z

confirm traversal walks (and thus selectors) have a deterministic canonical order (and if that's not easy enough to point at in a specific heading in our specs and docs, that's a bug in the specs and docs).
- ... mind that CAR order is not deterministic per the CAR spec; CARs are just a bag of blocks. But it should be clear enough for some system to itself declare "this CAR must use the standard order" (and in practice right now I think all of our implementations already emit CARs that do so). Just a subtle distinction about who owns that decision, and which things validate or are strict about that.
fwiw, we did get some resumable selector features lately! See "Implement option to start traversals at a path" (ipld/go-ipld-prime#358)
fwiw, I think HTTP Range Requests would still be neat to try to support, if possible. I think a "dumb" HTTP cache around an IPFS Gateway being able to support Range requests on a CAR sounds like a nice-to-have. (But this isn't to detract from the comments we should have resumable selectors too, etc.)

aschmahmann · 2022-03-14T23:20:02Z

fwiw, we did get some resumable selector features lately! ipld/go-ipld-prime#358

My understanding is that this requires basically stored context on the node you are retrieving from, so is more like extra state for resuming a broken connection than resumable selectors.

fwiw, I think HTTP Range Requests would still be neat to try to support, if possible. I think a "dumb" HTTP cache around an IPFS Gateway being able to support Range requests on a CAR sounds like a nice-to-have. (But this isn't to detract from the comments we should have resumable selectors too, etc.)

IMO range requests for CAR files seem like an iffy thing to support on gateways. In the general case they're costly to create, so asking for bytes 1000MB-1001MB of a CAR file seems like a small request but in reality is very costly on the server. Since clients and servers may be run and developed by different parties, it wouldn't be great to encourage client developers to build tooling around range requests.

Sometimes they're a good idea; for example, IIUC https://github.com/filecoin-project/boost/ plans to allow for ingesting data as CAR files with range requests. However, IIUC they have a few benefits:

- the user they're downloading the data from must have computed the full CAR file ahead of time anyway (to get a CommP for a Filecoin deal)
- the user in any event needs to keep serving the data indefinitely until the transactions are completed because they are the ones requesting the download
- there is a built-in expiration time for how long to keep the CAR file around, which is "until the user is done uploading it to the relevant providers"

However, I suspect in our case having range requests all the time is a bad idea and having it only some of the time is more likely to cause confusion than not. I'm by no means an expert in the various HTTP tools that exist out there though, so maybe this "sometimes range request" pattern is common enough to be worth supporting.

Q: "entire dag" selector is expensive. should we refuse handling requests with no selector, and require people to provide one, always + have some predefined ones in docs, like "root+one level deep" before "full dag"?

I don't know that I'd do this long before we put other limits on gateway usage like not downloading 100GB files over public gateways. If we want to allocate some configurable size budget for CAR + UnixFS downloads though that sounds pretty sane to me.

Yes, we should definitely have some recipes of common selectors or patterns of use. It's going to be a whole new way of people accessing data and therefore of confusing people. It's possible a few will be so common that it'll be worth considering aliasing them to something easier to read in a URL bar.

/ipfs/{cid}?format=car&selector={cid2}

This mostly makes sense to me, although there are a few footguns I think we should watch out for here. These aren't blockers and people will hopefully do mostly sane things, but IMO when writing new specs here it's better not to leave too much undefined as then you start having to assume the worst case scenario everywhere.

Sane CID limits, I don't know what the magic number is here, but there's some number. Maybe the number isn't relevant here since URL limits might hit us first, but either way there is going to be some maximum CID size we're allowed. If it's relevant we should document it.
I do think it's nice that unlike just sending the selector as a parameter there's a way to actually do the request even with larger selectors. However, a) magic numbers again, there's probably a maximum size of selector we're willing to deal with and if we don't decide then something else (e.g. the block size limit) will kick in here since IIUC the selector has to be a single block unless we start being able to pass selectors into the selector parameter 😄.
Some consumers of the gateway API will be unable to advertise content which means that actually moving your "slightly too big" selector to a place where it can be consumed by gateway requests might be a big pain.

Perhaps off topic and related to ipfs/in-web-browsers#182, and if so lmk and we can resume there.

@lidel this issue mentions CAR export with a selector like /ipfs/{cid}?format=car&selector.cbor=multibase({cbor serialization of selector})

What happens if it's /ipfs/{cid}/some/path?format=car&selector.cbor=multibase({cbor serialization of selector})? Do we do the path resolution before the selector, or just error?
Is there a reason selector usage has to be restricted to CAR export? Any reason we wouldn't want to do this for regular UnixFS rendering at least for files (i.e. if the output of the selector presents as bytes)? In theory this would then allow you to do something like /ipfs/{cid}?selector.cbor=multibase({cbor selector for an ADL interpreting BitTorrent infohash links as bytes}) and get a result on the gateway. Directories seem potentially more complicated though.

lidel · 2022-03-16T23:54:52Z

asking for bytes 1000MB-1001MB of a CAR file seems like a small request but in reality is very costly on the server

Agree, there is dangerous resource usage asymmetry here, and no clear benefit when compared to progressive download with shallow selectors. I updated ipfs/kubo#8758 – it now returns CAR stream with Accept-Ranges: none to avoid any confusion and incentivize people to use selectors instead.

If we want to allocate some configurable size budget for CAR + UnixFS downloads though that sounds pretty sane to me.

Yep, added to the TBD scope, we may extract it to separate issue.

Yes, we should definitely have some recipes of common selectors or patterns of use. [..] It's possible a few will be so common that it'll be worth considering aliasing them to something easier to read in a URL bar.

/ipfs/{cid}?format=car&selector={s} [..] Do we do the path resolution before the selector [..]

yes

Is there a reason selector usage has to be restricted to CAR export?

no reason to restrict. as we discussed earlier this week, selector could be something we apply to default responses, in which case it would return stream of bytes + we already have means of customizing content-disposition filename for that: /ipfs/{cid}?selector={cid2}&download=true&filename=selector-output.bin

TBD if we want to allow that in this mvp, or add later.

lidel · 2022-03-23T22:44:51Z

This turns out to be more involved, as we are lacking support for dag-json and dag-cbor in various places (e.g. ipfs/go-cid#137, ipfs/kubo#8568). We can't ask users to provide selector CID in any of these formats if we do not support them correctly in our stack.

Blocked until we have dag-cbor and dag-json support story cleaned up in ipfs cid command and go-cid library.

3456091 · 2022-04-03T11:26:41Z

I'm working on a project that will want to use this work around verifiable gateway responses. From the discussion above, am I to understand that resuming downloads of CARs will require parsing the CAR as it's downloading, keeping track of the CIDs we want but have yet to receive, then, if the download is interrupted, constructing a new request containing the missing CIDs in a selector?

Especially in the low-powered servers use case, download resumption is going to be important, and if the CAR is to be served with Accept-Ranges: none, I'm curious about how we can address this efficiently.

willscott · 2022-04-03T12:40:24Z

there's some work ongoing for more ergonomic selectors to support parts of this. There's recently been selector support added for representing the blocks that constitute a range of a unixfs file.
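As a rough picture of what "the blocks that constitute a range of a unixfs file" means, the sketch below maps a byte range onto the leaf blocks that overlap it, given the leaf sizes in file order. The function name and the flat list of sizes are simplifying assumptions; a real UnixFS file may have several layers of intermediate nodes, and this is not how any selector implementation is written.

```python
def leaves_for_range(leaf_sizes: list[int], first: int, last: int) -> list[int]:
    """Indices of leaves that overlap the inclusive byte range [first, last]."""
    hits = []
    offset = 0  # file offset of the current leaf's first byte
    for index, size in enumerate(leaf_sizes):
        leaf_first, leaf_last = offset, offset + size - 1
        # Keep non-empty leaves whose span intersects the requested range.
        if size > 0 and leaf_first <= last and leaf_last >= first:
            hits.append(index)
        offset += size
    return hits


# Three 100-byte leaves; bytes 150-250 touch the second and third.
print(leaves_for_range([100, 100, 100], 150, 250))
```

Only the matching leaves, plus whatever parent blocks are needed to verify them, would have to be sent.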

@hannahhoward - do you have thoughts on where in go-ipfs we need to respect the unixfs reifier / LargeBytes feature detection to get the same behavior as in graphsync?

lidel · 2022-07-19T21:52:13Z

In my mind, CAR resumption will not be sending the same request again. The idea is for the client to be smart to import as many blocks as possible, and then send follow-up requests for DAG branches which are missing.
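A minimal sketch of that client-side logic, assuming the client keeps a map of links decoded from the blocks it has already imported. The names are hypothetical and the "CIDs" are plain strings; the point is only that the walk stops at the first missing block on each branch, which gives the roots for follow-up requests.

```python
def missing_branches(links: dict[str, list[str]], root: str,
                     received: set[str]) -> list[str]:
    """Roots of DAG branches still missing after an interrupted download."""
    missing = []
    seen = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if cid not in received:
            # We cannot look inside a block we do not have, so request
            # this whole branch again and do not descend further.
            missing.append(cid)
            continue
        # Push children in reverse so they are visited in link order.
        stack.extend(reversed(links.get(cid, [])))
    return missing


links = {"root": ["a", "b"], "a": ["a1", "a2"]}  # links of received blocks
print(missing_branches(links, "root", {"root", "a", "a1"}))
```

Each returned entry could become its own `?format=car` request, possibly against a different gateway.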

lidel · 2022-07-19T21:53:26Z

Dropping some notes after IPFS Thing 2022:

feels like we may want to do more UX work before we pull the trigger on this one
subjective temperature check: ?selector=<selector-as-dag-json-cid> raises eyebrows, not the best UX-wise
- ?selector= opens pandora's box of allowing arbitrary selectors, so we would only safelist a few initially:
  - root+n-levels deep, n-levels without root, a leaf child along with all parents required for resolving it
  - @mikeal suggested hardcoding common selectors in form of predefined URI params.
    - I think we would need at least ?dag-depth=n to unblock use cases that need shallow CARs (n=1 would fetch only the root+child blocks)
new open questions about adding /ipld/ and ipld:// appeared, and ways we could signal things like ADLs, schemas, and selectors in more intuitive, user-friendly way (cc @RangerMauve)
- one idea was to flesh out IPLD signaling around this new namespace, and then reuse it on /ipfs/ using ?ipld= parameter.

I am afraid this is blocked until we figure out some unified UX strategy for IPLD signaling (selectors, ADLs).

lidel · 2023-01-18T19:14:55Z

A very relevant proposal was presented by @hannahhoward today during 5th Move the Bytes Call.

fetching CARs from /ipfs/ paths optimized around UnixFS domain/application
- non-Unixfs could be fetched as raw blocks, or have own namespace (such as /ipld/ proposed in IPIP-293)
proposed parameters
- response includes blocks from the full path by default
  - /ipfs/cid/a/b includes all blocks for b, but also ones to traverse from cid to b
  - not the current behavior in Kubo/Specs, but we could change it (and/or add a flag that controls the behavior)
- leaves controls if leaves are sent (good for quickly learning about DAG structure, and then fetching leaves in parallel)
  - complements depth (as we don't always know the depth of a DAG)
  - we may end up having parents and leaves flags, as we should not send parents twice
- bytes=N-M when root CID is a file limits returned blocks to ones that contain requested byte range
another good idea was to end every CAR with a “tombstone”, allowing clients (incl. browser JS ones) to identify when CAR stream ended due to error.
- for CARv1 we could
  - use zero-length raw block, or an identity CID bafkqaaa
  - as a fallback, when the tombstone is not present, retry when the hash of the last block is invalid (means it got truncated), which is better than nothing
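A sketch of how a client might check for such a tombstone in a CARv1 byte stream, assuming the tombstone is a final section holding the identity CID bafkqaaa with empty data. This is only the idea floated above, not an agreed format; the function names and status strings are made up. The sketch skips the header without decoding it and does not verify block hashes, so the hash-based fallback is out of scope.

```python
TOMBSTONE_CID = bytes([0x01, 0x55, 0x00, 0x00])  # binary form of bafkqaaa


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise EOFError("stream ended inside a varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def cid_length(section: bytes) -> int:
    """Length in bytes of the CID at the start of a CAR section."""
    if section[:2] == b"\x12\x20":  # CIDv0: bare sha2-256 multihash
        return 34
    pos = 0
    for _ in range(3):  # CID version, codec, multihash code
        _, pos = read_varint(section, pos)
    digest_len, pos = read_varint(section, pos)
    return pos + digest_len


def car_stream_status(car: bytes) -> str:
    """Return "complete", "truncated", or "no-tombstone" for a CARv1 stream."""
    try:
        header_len, pos = read_varint(car, 0)
        pos += header_len  # skip the dag-cbor header without decoding it
        if pos > len(car):
            return "truncated"
        last = None
        while pos < len(car):
            section_len, pos = read_varint(car, pos)
            if pos + section_len > len(car):
                return "truncated"  # stream ended mid-section
            section = car[pos:pos + section_len]
            pos += section_len
            split = cid_length(section)
            last = (section[:split], section[split:])
    except EOFError:
        return "truncated"
    return "complete" if last == (TOMBSTONE_CID, b"") else "no-tombstone"


# Fake 3-byte header, one identity block carrying b"hi", then the tombstone.
header = b"\x03abc"
block = b"\x08" + bytes([0x01, 0x55, 0x00, 0x02]) + b"hi" + b"hi"
tombstone = b"\x04" + TOMBSTONE_CID
print(car_stream_status(header + block + tombstone))
```

A stream that ends cleanly on a section boundary but lacks the tombstone comes back as "no-tombstone", which is the case where the hash-based fallback would have to take over.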

lidel · 2023-02-10T14:17:52Z

Re: detecting truncated CAR stream, there was a proposal to use CARv2 instead of CARv1; details below so we avoid revisiting it:

if we switch response to CARv2, we can make this less hacky. instead of fake block, more elegant way of doing this is CARv2 and introducing a new Index type:
https://ipld.io/specs/transport/car/carv2/#index-format (it could include things like total count and size of streamed blocks, and act as an explicit tombstone/checksum)
- Downside: index position offset is not known when streaming, which means we need to modify the CARv2 spec and all libraries for v2 to not only support the new index type, but also allow index offset to be -1. This would break old clients, and using CARv1 feels safer (works everywhere, extra tombstone can be discarded, no breakage of old clients).

BigLep · 2023-02-16T02:12:11Z

As part of Project Rhea, this is critical for improving performance when working with untrusted nodes so we can do better than requesting block-by-block.

Initial design is happening in https://www.notion.so/pl-strflt/HTTP-Gateway-Requests-for-Graphs-as-CARs-001d2a9f5a35418bb0fb7d9d182d24ec?d=8d44d17f00344834b9b72798ca1ea117

vmx · 2023-02-16T12:19:31Z

Public link is https://pl-strflt.notion.site/HTTP-Gateway-Requests-for-Graphs-as-CARs-001d2a9f5a35418bb0fb7d9d182d24ec

lidel added kind/enhancement A net-new feature or an improvement to an existing feature P1 High: Likely tackled by core team if no one steps up effort/hours Estimated to take one or several hours labels Mar 7, 2022

lidel mentioned this issue Mar 7, 2022

IPLD support on Gateways ipfs/in-web-browsers#182

Open

lidel mentioned this issue Mar 8, 2022

Gateway improvements ipfs/in-web-browsers#180

Open

13 tasks

BigLep assigned lidel Mar 8, 2022

This was referenced Mar 8, 2022

feat(gateway): Block and CAR response formats ipfs/kubo#8758

Merged

Gateway: fast check if CID is in local datastore cache (only-if-cached) ipfs/kubo#8783

Closed

thibmeu mentioned this issue Mar 17, 2022

Gateway: DNS resolution export with DNSSEC records ipfs/kubo#8799

Closed

3 tasks

BigLep mentioned this issue Mar 18, 2022

Way to get CIDs of intermediate objects when querying with a path ipfs/kubo#8526

Open

3 tasks

lidel added the status/blocked Unable to be worked further until needs are met label Mar 23, 2022

This was referenced Jun 6, 2022

Add HTTP Gateway Specs #283

Merged

Lightweight RFC Process #286

Closed

Make ipfs.dag.export built-in feature of HTTP gateways ipfs/in-web-browsers#170

Closed

lidel mentioned this issue Jun 20, 2022

Saturn L2 V0 filecoin-saturn/L2-node#22

Merged

lidel added this to the Best Effort Track milestone Jul 19, 2022

lidel transferred this issue from ipfs/kubo Nov 24, 2022

lidel changed the title ~~Gateway: CAR export with selector~~ Gateway: spec for partial CAR export s (selectors?) Nov 24, 2022

lidel changed the title ~~Gateway: spec for partial CAR export s (selectors?)~~ Gateway: spec for partial CAR export (dynamic/predefined selectors?) Dec 14, 2022

This was referenced Jan 19, 2023

IPIP-359: Multi gateway client #359

Draft

IPIP: include ipns-record in Gateway CAR responses #369

Open

lidel changed the title ~~Gateway: spec for partial CAR export (dynamic/predefined selectors?)~~ Create IPIP with Gateway spec for partial CAR exports Jan 24, 2023

olizilla mentioned this issue Feb 10, 2023

validator assumes unixfs encoded blocks web3-storage/pickup#97

Closed

5 tasks

BigLep assigned aschmahmann Feb 16, 2023

lidel mentioned this issue Apr 6, 2023

Rename ?format=car URL params to match IPIP-402 ipfs/bifrost-gateway#80

Closed

14 tasks

lidel mentioned this issue Apr 17, 2023

IPIP-402: Partial CAR Support on Trustless Gateways #402

Merged

lidel closed this as completed in #402 Jul 27, 2023
