Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gateway/dir-index-html: switch dir listing sizes to Tsize #9058

Closed
lidel opened this issue Jun 24, 2022 · 5 comments · Fixed by #9481
Closed

gateway/dir-index-html: switch dir listing sizes to Tsize #9058

lidel opened this issue Jun 24, 2022 · 5 comments · Fixed by #9481
Assignees
Labels
kind/enhancement A net-new feature or improvement to an existing feature need/analysis Needs further analysis before proceeding need/maintainers-input Needs input from the current maintainer(s) need/triage Needs initial labeling and prioritization topic/gateway Topic gateway topic/perf Performance

Comments

@lidel
Copy link
Member

lidel commented Jun 24, 2022

This was inspired by #8178, #8455

go-ipfs 0.13 shipped with #8853 which introduced a bad-aid where we hide "size" column in big directories.
This allows us to skip child block resolution for directories bigger than 100 items, making the entire thing load really fast.

Sadly, directories smaller than 100 are still slow.
They load slower than a directory with 101 items.

Why showing size and type is expensive

The root node of every UnixFS DAG includes information about node type and the size of raw data inside of it (without metadata).
It also has links to other DAGs representing files and directories, and the total size of DAGs they represent (data + metadata)

The "size" on directory listing is the data without metadata.
To know the exact size and type of items in a directory listing, every item triggers additional block fetch, which baloons the time it takes to return response with a directory listing.

Proposed Change

Replace "size" based on raw dfata with "DAG size" based on Tsize value already present in the root UnixFS node (see logical format). The interface will look the same.

2022-06-24_23-23

What is user benefit?

Every directory will load fast, as soon the root UnixFS node is available.
This makes an extreme difference on the first load, when the directory is not present in local cache.

Loading bafybeiggvykl7skb2ndlmacg2k5modvudocffxjesexlod2pfvg5yhwrqm (10k items)
will be as fast as a directory with 100.

Won't this be causing issues?

These values are provided in HTML dir listing only for quick eyeballing, and the difference between raw data and Tsize will usually be small enough to not impact this purpose:

  • small files that fit into a single raw block will have
  • big file size won't be significantly impacted by ipld metadata

But just to be sure, we will add on-hover tooltip explanation of the value (right now, there is none).

Is this worth it?

To illustrate, I started a new, empty repo each time, and listed a directory with 1864 items.

go-ipfs 0.12 (which fetched every UnixFS child block to read the size) took nearly 3 minutes:

$ time ipfs ls -s --size=true /ipfs/QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm
0.59s user 0.13s system 0% cpu 2:37.03 total

Same directory, but listed with change to Tsize:

$ time ipfs dag get --output-codec=dag-json QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm
0.44s user 0.33s system 126% cpu 0.611 total

From nearly 3 minutes to under 1s. I'd say worth it.

@lidel lidel added kind/enhancement A net-new feature or improvement to an existing feature topic/gateway Topic gateway topic/perf Performance need/triage Needs initial labeling and prioritization need/analysis Needs further analysis before proceeding need/maintainers-input Needs input from the current maintainer(s) labels Jun 24, 2022
@Jorropo
Copy link
Contributor

Jorropo commented Jun 25, 2022

This assume that we also disable resolving types (if we didn't already I'm not sure).
Types are also just stored in the first block and if we have to fetch anyway that wont solve anything.

This could also be used in the fast mode (where size and type are disabled), to still give a size estimation, but then it's a feature request not a perf improvement (as the perf come from not fetching direct links).

@lidel
Copy link
Member Author

lidel commented Jun 26, 2022

Yes, the idea is to disable resolving types too. An equivalent of

ipfs ls -s --size=false --resolve-type=false  QmdmQXB2mzChmMeKY47C43LxUdg1NDJ5MWcKMKxDu7RgQm

@aschmahmann
Copy link
Contributor

Related to my question in ipfs/specs#283 (comment) wondering if this kind of thing is covered by the spec, or if it's a per-implementation "do whatever you want" kind of thing.

If the latter than sure 🤷, we can always make more changes if people complain given that it's basically just an abstract size hint that's not super usable for anything other than rough estimates (i.e. a) it's not how big the DAG is on disk, even when the data structure is correct, since it doesn't account for deduplication b) for files its not how big the file would be on disk since it's not loading the first file block with the size).

@aschmahmann
Copy link
Contributor

An aside, although not sure where else to put this. From what I can tell the less and more abstract we make the ways UnixFS directories load on gateways the easier it becomes to push for non-UnixFS data being loaded by gateways.

For example, what's the difference between rendering a map or list as a dir-index HTML directory vs loading in a UnixFS directory? Is it just TSize and some potential ability to export TAR files of directories? If so those seem a little niche to justify the special status of "UnixFS-only" for gateways. While ADLs don't really have any signaling mechanisms at the moment for gateway URLs which means we'd be stuck with just IPLD Data Model maps and lists, that's a separate issue that could be tackled.

This probably plays on some tensions between where it's worth changing /ipfs and where something like /ipld (ipfs/specs#293) would be better suited.

lidel added a commit to ipfs/specs that referenced this issue Jun 28, 2022
@lidel
Copy link
Member Author

lidel commented Jun 28, 2022

Yes, in my mind this is per-implementation "do whatever you want" kind of thing. I've added "Generated HTML with directory index" section to ipfs/specs@9fc9a9c and included basic suggestions for implementers, but also noted they are free to make their own decisions and innovations here.

As for Kubo:

  • after we add dag-json and dag-cbor support (Gateway support for dag-json and dag-cbor responses #8823), we could design "default HTML response" for these types, similar to dir-index-html
  • if we implement Tsize proposal from this thread, we should change the implicit default for Gateway.FastDirIndexThreshold to be 0
    • i think implementing this is an easy win, making things fast + decreasing load put on gateways

lidel added a commit to ipfs/specs that referenced this issue Jul 1, 2022
* feat: initial HTTP gateway specs

This adds gateway specs under ./http-gateways directory.

The aim is to document _current_ behavior (implementation in go-ipfs 0.13)
and switch the way we do the gateway work to be specs-driven.

Long term goal is to provide language and implementation agnostic
specification that anyone can use to implement compatible gateways.

* gateway: add Content-Range

* gateway: registerProtocolHandler uri router

* CODEOWNERS: add lidel for ./http-gateways

* gateway: resolving an advanced DNSLink chain

* gateway: only-if-cached HEAD behavior

* gateway: suggestions from reviewers

Co-authored-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
Co-authored-by: Vasco Santos <vasco.santos@moxy.studio>
Co-authored-by: Oli Evans <oli.evans@gmail.com>

* gateway: include CIDv1 node in summary

* gateway: reorder URI router section

As suggested in #283 (comment)

* gateway: add Denylists section

* gateway: switch only-if-cached miss to 412

Rationale: ipfs/kubo#8783 (comment)

* gateway: apply suggestions from review

Co-authored-by: Thibault Meunier <thibmeu@users.noreply.github.com>

* gateway: apply suggestions from Cloudflare

#283 (review)

* gateway: add X-Content-Type-Options

* gateway: simplify dnslink summary

https://github.com/ipfs/specs/pull/283/files#r898709569

* gateway: document 412 Precondition Failed

https://github.com/ipfs/specs/pull/283/files#r898686654

* gateway: link to ipld codecs explainer

https://github.com/ipfs/specs/pull/283/files#r898687052

* gateway: stub about handling traversal errors

https://github.com/ipfs/specs/pull/283/files#r892845860

* gateway: expand HTTP caching considerations

* gateway: editorial fixes

Co-authored-by: Steve Loeppky <stvn@loeppky.com>

* gateway: expand on Host header parsing

https://github.com/ipfs/specs/pull/283/files#r898703765

* gateway: editorial fixes

* gateway: X-Forwarded-Proto and X-Forwarded-Host

* gateway: editorial fixes

* gateway: X-Trace-Id

optional header suggested in:
#283 (comment)

rationale: having specific name as a suggestion of 'best practice' in
the specs will simplify debugging across ecosystem

* gateway: Generated HTML with directory index

Synthesis of ideas from:
ipfs/kubo#8455
and
ipfs/kubo#9058

Co-authored-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
Co-authored-by: Vasco Santos <vasco.santos@moxy.studio>
Co-authored-by: Oli Evans <oli.evans@gmail.com>
Co-authored-by: Thibault Meunier <thibmeu@users.noreply.github.com>
Co-authored-by: Steve Loeppky <stvn@loeppky.com>
@hacdias hacdias self-assigned this Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement A net-new feature or improvement to an existing feature need/analysis Needs further analysis before proceeding need/maintainers-input Needs input from the current maintainer(s) need/triage Needs initial labeling and prioritization topic/gateway Topic gateway topic/perf Performance
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants