Merge pull request #288 from ipfs/feat/gateway-tar
IPIP-288: TAR Gateway Response Format
lidel committed Nov 9, 2022
2 parents c839886 + 8fe745a commit 4cfabca
Showing 2 changed files with 166 additions and 14 deletions.
141 changes: 141 additions & 0 deletions IPIP/0288-gateway-tar-response-format.md
@@ -0,0 +1,141 @@
# IPIP-288: TAR Response Format on HTTP Gateways

- Start Date: 2022-06-10
- Related Issues:
- [ipfs/specs/pull/288](https://github.com/ipfs/specs/pull/288)
- [ipfs/go-ipfs/pull/9029](https://github.com/ipfs/go-ipfs/pull/9029)
- [ipfs/go-ipfs/pull/9034](https://github.com/ipfs/go-ipfs/pull/9034)

## Summary

Add TAR response format to the [HTTP Gateway](../http-gateways/).

## Motivation

Currently, the HTTP Gateway only allows for UnixFS deserialization of a single
UnixFS file. Directories have to be downloaded one file at a time, using
multiple requests, or as a CAR, which requires deserialization in userland,
via additional tools like [ipfs-car](https://www.npmjs.com/package/ipfs-car).

This illustrates a functional gap: users currently cannot leverage a trusted
HTTP gateway to deserialize a UnixFS directory tree. We would like to remove
the need for dealing with CARs when a gateway is trusted (e.g., a localhost
gateway).

An example use case is the IPFS Web UI, which currently allows users to
download directories using a workaround. The workaround relies on a proprietary
Kubo RPC API that only supports `POST` requests, and the Web UI has to store the
entire directory in memory before the user can download it.

By introducing TAR responses on the HTTP Gateway, we provide a vendor-agnostic
way of downloading entire directories in deserialized form, which increases the
utility and interoperability of HTTP gateways.

## Detailed design

The solution is to allow the Gateway to support producing TAR archives
by requesting them using either the `Accept` HTTP header or the `format`
URL query.
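The two request forms described above can be sketched as follows. This is a minimal illustration, not part of the specification: the function names are hypothetical, and the gateway address is assumed to be a local gateway such as `http://127.0.0.1:8080`. The requests are only constructed here, not sent.

```python
from urllib.request import Request

# Media type used for TAR responses when requested via the Accept header.
TAR_MEDIA_TYPE = "application/x-tar"


def tar_request_via_query(gateway: str, cid: str) -> Request:
    """Build a GET request that selects TAR using the `format` URL query."""
    return Request(f"{gateway}/ipfs/{cid}?format=tar", method="GET")


def tar_request_via_accept(gateway: str, cid: str) -> Request:
    """Build a GET request that selects TAR using the `Accept` HTTP header."""
    return Request(
        f"{gateway}/ipfs/{cid}",
        headers={"Accept": TAR_MEDIA_TYPE},
        method="GET",
    )
```

Either object can be passed to `urllib.request.urlopen` to perform the request against a gateway that supports this IPIP.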

## Test fixtures

Existing `curl` and `tar` tools can be used by implementers for testing.

Providing static test vectors has little value here, as different TAR libraries
may produce files that are not byte-for-byte identical, due to the unspecified
ordering of files and directories inside.

However, certain behaviors detailed in the [security section](#security)
should be handled. To test such behaviors, the following fixtures can be used:

- [`bafybeibfevfxlvxp5vxobr5oapczpf7resxnleb7tkqmdorc4gl5cdva3y`][inside-dag]
is a UnixFS DAG that contains a file with a name that looks like a relative
path that points inside the root directory. Downloading it as a TAR must
work.

- [`bafkreict7qp5aqs52445bk4o7iuymf3davw67tpqqiscglujx3w6r7hwoq`][inside-dag-tar]
is an example TAR file that corresponds to the aforementioned UnixFS DAG. Its
structure can be inspected in order to check if new implementations conform
to the specification.

- [`bafybeicaj7kvxpcv4neaqzwhrqqmdstu4dhrwfpknrgebq6nzcecfucvyu`][outside-dag]
is a UnixFS DAG that contains a file with a name that looks like a relative
path that points outside the root directory. Downloading it as a TAR must
error.

## Design rationale

The current gateway already supports different response formats via the
`Accept` HTTP header and the `format` URL query. This IPIP proposes adding
one more supported format to that list.

### User benefit

Users will be able to directly download deserialized UnixFS directories from
the gateway. A single TAR stream saves resources on both the client and the
HTTP server, and removes complexity related to redundant buffering or CAR
deserialization when the gateway is trusted.

In the Web UI, for example, we will be able to create a direct link to download
a directory, instead of using the API to put the whole file in memory before
downloading it.

CLI users will be able to download a directory with existing tools like `curl` and `tar` without
having to talk to implementation-specific RPC APIs like `/api/v0/get` from Kubo.

Fetching a directory from a local gateway will be as simple as:

```console
$ export DIR_CID=bafybeigccimv3zqm5g4jt363faybagywkvqbrismoquogimy7kvz2sj7sq
$ curl "http://127.0.0.1:8080/ipfs/$DIR_CID?format=tar" | tar xv
bafybeigccimv3zqm5g4jt363faybagywkvqbrismoquogimy7kvz2sj7sq
bafybeigccimv3zqm5g4jt363faybagywkvqbrismoquogimy7kvz2sj7sq/1 - Barrel - Part 1 - alt.txt
bafybeigccimv3zqm5g4jt363faybagywkvqbrismoquogimy7kvz2sj7sq/1 - Barrel - Part 1 - transcript.txt
bafybeigccimv3zqm5g4jt363faybagywkvqbrismoquogimy7kvz2sj7sq/1 - Barrel - Part 1.png
```

### Compatibility

This IPIP is backwards compatible: it adds a new opt-in response type and does
not modify preexisting behaviors.

The existing content type `application/x-tar` is used when a request is made with an `Accept` header.

### Security

Third-party UnixFS file names may include unexpected values, such as `../`.

Manually created UnixFS DAGs can be turned into malicious TAR files. For example,
if a UnixFS directory contains a file that points at a relative path outside
its root, the unpacking of the TAR file may overwrite local files outside the expected
destination.

In order to prevent this, the specification requires implementations to do
basic sanitization of paths returned inside a TAR response.

If the UnixFS directory contains a file whose path
points outside the root, the TAR file download **should** fail by force-closing
the HTTP connection, leading to a network error.

To test this, we provide some [test fixtures](#test-fixtures). Users who want
to download the raw files should be advised to use a CAR file instead.
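The text above does not prescribe how the path check is implemented. One possible approach, shown here only as a sketch under that assumption, is to normalize each entry path relative to the TAR root item and reject anything that resolves outside it. The function name `escapes_root` and the sample root name are hypothetical.

```python
import posixpath


def escapes_root(root: str, entry_path: str) -> bool:
    """Return True if entry_path, placed under root, resolves outside root.

    `root` is the name of the TAR root item (assumed to have no trailing
    slash); `entry_path` is a UnixFS-derived path relative to that root.
    """
    # Join and collapse "." and ".." segments using POSIX path rules.
    # An absolute entry_path replaces root entirely, so it is also rejected.
    resolved = posixpath.normpath(posixpath.join(root, entry_path))
    return resolved != root and not resolved.startswith(root + "/")
```

A gateway using a check like this would abort the TAR stream as soon as `escapes_root` returns `True` for any entry, while names that merely look like relative paths but stay inside the root (such as `dir/../file.txt`) would still be served.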

### Alternatives

One discussed alternative would be to support uncompressed ZIP files. However,
TAR and TAR-related libraries are already supported by some IPFS
implementations, and are easier to work with on the command line. TAR provides
a simpler abstraction, and layering compression on top of a TAR stream allows
for greater flexibility than alternatives that come with their own, opinionated
approaches to compression.

In addition, we considered supporting [Gzipped TAR](https://github.com/ipfs/go-ipfs/pull/9034) out of the box,
but decided against it as gzip or alternative compression may be introduced on the HTTP transport layer.

### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

[inside-dag]: https://dweb.link/ipfs/bafybeibfevfxlvxp5vxobr5oapczpf7resxnleb7tkqmdorc4gl5cdva3y?format=car
[inside-dag-tar]: https://dweb.link/ipfs/bafkreict7qp5aqs52445bk4o7iuymf3davw67tpqqiscglujx3w6r7hwoq?format=car
[outside-dag]: https://dweb.link/ipfs/bafybeicaj7kvxpcv4neaqzwhrqqmdstu4dhrwfpknrgebq6nzcecfucvyu?format=car
39 changes: 25 additions & 14 deletions http-gateways/PATH_GATEWAY.md
@@ -1,6 +1,6 @@
# Path Gateway Specification

![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square)
![reliable](https://img.shields.io/badge/status-reliable-green.svg?style=flat-square)

**Authors**:

@@ -181,6 +181,7 @@ For example:

- [application/vnd.ipld.raw](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw) – disables [IPLD codec deserialization](https://ipld.io/docs/codecs/), requests a verifiable raw [block](https://docs.ipfs.io/concepts/glossary/#block) to be returned
- [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car) – disables [IPLD codec deserialization](https://ipld.io/docs/codecs/), requests a verifiable [CAR](https://docs.ipfs.io/concepts/glossary/#car) stream to be returned
- [application/x-tar](https://en.wikipedia.org/wiki/Tar_(computing)) – returns UnixFS tree (files and directories) as a [TAR](https://en.wikipedia.org/wiki/Tar_(computing)) stream. Returned tree starts at a root item whose name is the same as the requested CID. Produces 400 Bad Request for content that is not UnixFS.
<!-- TODO: https://github.com/ipfs/go-ipfs/issues/8823
- application/vnd.ipld.dag-json OR application/json – requests IPLD Data Model representation serialized into [DAG-JSON format](https://ipld.io/docs/codecs/known/dag-json/)
- application/vnd.ipld.dag-cbor OR application/cbor - requests IPLD Data Model representation serialized into [DAG-CBOR format](https://ipld.io/docs/codecs/known/dag-cbor/)
@@ -194,7 +195,6 @@ blocks.
Gateway implementations SHOULD be smart enough to require only the minimal DAG subset
necessary for handling the range request.


NOTE: for more advanced use cases such as partial DAG/CAR streaming, or
non-UnixFS data structures, see the `selector` query parameter
[proposal](https://github.com/ipfs/go-ipfs/issues/8769).
@@ -250,13 +250,14 @@ This is a URL-friendly alternative to sending
`Accept: application/vnd.ipld.<format>` header, see [`Accept`](#accept-request-header)
for more details.

In case of `Accept: application/x-tar`, the `?format=` equivalent is `tar`.

<!-- TODO Planned: https://github.com/ipfs/go-ipfs/issues/8769
- `selector=<cid>` can be used for passing a CID with [IPLD selector](https://ipld.io/specs/selectors)
- Selector should be in dag-json or dag-cbor format
- This is a powerful primitive that allows for fetching subsets of data in specific order, either as raw bytes, or a CAR stream. Think “HTTP range requests”, but for IPLD, and more powerful.
-->


# HTTP Response

## Response Status Codes
@@ -354,7 +355,7 @@ and CDNs, implementations should base it on both CID and response type:

- By default, etag should be based on requested CID. Example: `Etag: "bafy…foo"`

- If a custom `format` was requested (such as a raw block or a CAR), the
- If a custom `format` was requested (such as a raw block, CAR), the
returned etag should be modified to include it. It could be a suffix.
- Example: `Etag: "bafy…foo.raw"`

@@ -365,14 +366,16 @@ and CDNs, implementations should base it on both CID and response type:
- Example: `Etag: "DirIndex-2B423AF_CID-bafy…foo"`

- When a gateway can’t guarantee byte-for-byte identical responses, a “weak”
etag should be used. For example, if CAR is streamed, and blocks arrive in
non-deterministic order, the response should have `Etag: W/"bafy…foo.car"`
etag should be used.
- Example: If CAR is streamed, and blocks arrive in non-deterministic order,
the response should have `Etag: W/"bafy…foo.car"`.
- Example: If a TAR stream is generated by traversing a UnixFS directory in non-deterministic
order, the response should have `Etag: W/"bafy…foo.x-tar"`.

- When responding to [`Range`](#range-request-header) request, a strong `Etag`
should be based on requested range in addition to CID and response format:
`Etag: "bafy..foo.0-42"`
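The Etag rules in this list can be summarized with a small helper. This is a sketch only: the function name `build_etag` and the placeholder CID are hypothetical, and the suffix convention (`.raw`, `.car`, `.x-tar`) is taken from the examples above rather than from any mandated algorithm.

```python
from typing import Optional


def build_etag(cid: str, fmt: Optional[str] = None, weak: bool = False) -> str:
    """Build an Etag value from a CID, an optional format suffix, and strength.

    `fmt` is the suffix used in the examples above ("raw", "car", "x-tar").
    `weak` should be True when byte-for-byte identical responses cannot be
    guaranteed, e.g. a CAR or TAR stream produced in non-deterministic order.
    """
    value = cid if fmt is None else f"{cid}.{fmt}"
    etag = f'"{value}"'
    return f"W/{etag}" if weak else etag
```

For a TAR response generated by a non-deterministic directory traversal, `build_etag(cid, "x-tar", weak=True)` yields a value of the form `W/"<cid>.x-tar"`.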


### `Cache-Control` (response header)

Used for HTTP caching.
@@ -433,6 +436,7 @@ or optional [`filename`](#filename-request-query-parameter) parameter)
and magic bytes to improve the utility of produced responses.

For example:

- detect plain text file
and return `Content-Type: text/plain` instead of `application/octet-stream`
- detect SVG image
@@ -446,6 +450,7 @@ Returned when `download`, `filename` query parameter, or a custom response
The first parameter passed in this header indicates if content should be
displayed `inline` by the browser, or sent as an `attachment` that opens the
“Save As” dialog:

- `Content-Disposition: inline` is the default, returned when request was made
with `download=false` or a custom `filename` was provided with the request
without any explicit `download` parameter.
@@ -457,13 +462,14 @@ The remainder is an optional `filename` parameter that will be prefilled in the

NOTE: when the `filename` includes non-ASCII characters, the header must
include both ASCII and UTF-8 representations for compatibility with legacy user
agents and existing web browsers.
agents and existing web browsers.

To illustrate, `?filename=testтест.pdf` should produce:
`Content-Disposition: inline; filename="test____.pdf"; filename*=UTF-8''test%D1%82%D0%B5%D1%81%D1%82.pdf`
- ASCII representation must have non-ASCII characters replaced with `_`
- UTF-8 representation must be wrapped in Percent Encoding ([RFC 3986, Section 2.1](https://www.rfc-editor.org/rfc/rfc3986.html#section-2.1)).
- NOTE: `UTF-8''` is not a typo – see [Examples in RFC5987](https://datatracker.ietf.org/doc/html/rfc5987#section-3.2.2)

- ASCII representation must have non-ASCII characters replaced with `_`
- UTF-8 representation must be wrapped in Percent Encoding ([RFC 3986, Section 2.1](https://www.rfc-editor.org/rfc/rfc3986.html#section-2.1)).
- NOTE: `UTF-8''` is not a typo – see [Examples in RFC5987](https://datatracker.ietf.org/doc/html/rfc5987#section-3.2.2)
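The two filename representations described above can be derived as in the following sketch. The function name `content_disposition_value` is hypothetical, and the sketch assumes the filename contains no double quotes or backslashes, which would need extra escaping inside the quoted `filename` parameter.

```python
from urllib.parse import quote


def content_disposition_value(filename: str, disposition: str = "inline") -> str:
    """Build a Content-Disposition value with ASCII and UTF-8 filenames."""
    # ASCII representation: every non-ASCII character is replaced with "_".
    ascii_name = "".join(ch if ord(ch) < 128 else "_" for ch in filename)
    # UTF-8 representation: percent-encode the UTF-8 bytes of the filename.
    utf8_name = quote(filename, safe="")
    return f"{disposition}; filename=\"{ascii_name}\"; filename*=UTF-8''{utf8_name}"
```

Calling `content_disposition_value("testтест.pdf")` returns `inline; filename="test____.pdf"; filename*=UTF-8''test%D1%82%D0%B5%D1%81%D1%82.pdf`.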

`Content-Disposition` must be also set when a binary response format was requested:

@@ -510,8 +516,9 @@ This header is more widely used in [SUBDOMAIN_GATEWAY.md](./SUBDOMAIN_GATEWAY.md

Gateway MUST return a redirect when a valid UnixFS directory was requested
without the trailing `/`, for example:

- response for `https://ipfs.io/ipns/en.wikipedia-on-ipfs.org/wiki`
(no trailing slash) will be HTTP 301 redirect with
(no trailing slash) will be HTTP 301 redirect with
`Location: /ipns/en.wikipedia-on-ipfs.org/wiki/`

### `X-Ipfs-Path` (response header)
@@ -588,7 +595,9 @@ Data sent with HTTP response depends on the type of requested IPFS resource:
- Raw block
- Opaque bytes, see [application/vnd.ipld.raw](https://www.iana.org/assignments/media-types/application/vnd.ipld.raw)
- CAR
- CAR file or stream, see [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car)
- Arbitrary DAG as a verifiable CAR file or a stream, see [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car)
- TAR
- Deserialized UnixFS files and directories as a TAR file or a stream, see [application/x-tar](https://en.wikipedia.org/wiki/Tar_(computing))
<!-- TODO: https://github.com/ipfs/go-ipfs/issues/8823
- dag-json / dag-cbor
- See [https://github.com/ipfs/go-ipfs/issues/8823](https://github.com/ipfs/go-ipfs/issues/8823)
@@ -614,7 +623,7 @@ IPLD data, starting from that data which the CID identified.
**Note:** Other types of gateway may allow for passing CID by other means, such
as `Host` header, changing the rules behind path splitting.
(See [SUBDOMAIN_GATEWAY.md](./SUBDOMAIN_GATEWAY.md)
and [DNSLINK_GATEWAY.md](./DNSLINK_GATEWAY.md)).
and [DNSLINK_GATEWAY.md](./DNSLINK_GATEWAY.md)).

### Traversing remaining path

@@ -628,6 +637,7 @@ low level logical pathing from IPLD:
### Handling traversal errors

Gateway MUST respond with HTTP error when it is not possible to traverse the requested content path:

- [`404 Not Found`](#404-not-found) should be returned when the root CID is valid and traversable, but
the DAG it represents does not include content path remainder.
- Error response body should indicate which part of immutable content path (`/ipfs/{cid}/path/to/file`) is missing
@@ -655,6 +665,7 @@ Implementations are encouraged to support pluggable denylists to allow IPFS
node operators to opt into not hosting previously flagged content.

Gateway MUST respond with HTTP error when requested CID is on any of active denylists:

- [410 Gone](#410-gone) returned when CID is denied for non-legal reasons, or when the exact reason is unknown
- [451 Unavailable For Legal Reasons](#451-unavailable-for-legal-reasons) returned when denylist indicates that content was blocked on legal basis
