IPIP-359: Multi gateway client #359

Draft
wants to merge 2 commits into
base: main

@markg85 markg85 commented Dec 16, 2022

A spec to describe how multi gateway clients - formerly known as racing gateways - should behave. This is very much a companion spec to #356.

To refine its place:
This spec describes how multi gateway clients work and should be implemented.
#356 would have this in its implementation.


## Motivation

When developing an application with IPFS functionality, you'd ideally want more than one gateway and distribute the requests among N gateways. This spec relies on IPIP-0280 (gateways file).

Leaving a note to remind you to link to IPIP-0280 once it is merged


### Keeping the usable gateway list fresh in the background

Getting this list of gateways and keeping track of whether each one should be used can take quite some time. The adviced approach here is to run each request in an async manner, where the async flow follows the same flow as the above flowchart.

adviced => advised*


### Configuration options

`concurrent requests` Defaults to 10. There must be a way to specify how many concurrent requests the `multi gateway client` does per IPFS request.

`max simultaneous cids` Defaults to 5. There must be a way to define how many simultaneous IPFS requests the `multi gateway client` can handle at any given time.

`max total gateways in use` Defaults to 25. There must be a way to specify how many total gateways can be used for the `multi gateway client` as a whole.

What is the reason for having a limit here? Seems to me that more would always be better


`racing` Defaults to false. There must be a way to specify if `racing` should be used. Racing means the `multi gateway client` will ask up to `concurrent requests` gateways to all download the same data. The one that downloads it first is the one whose output is used; the rest are ignored.
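A minimal sketch of the racing behavior described above, assuming an injected `fetch(gateway, cid)` coroutine (a hypothetical name, not part of the draft). It starts at most `concurrent_requests` downloads, returns the first one that succeeds, and cancels the rest. Verification of the returned bytes is deliberately left to the caller here.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical signature: fetch(gateway, cid) -> bytes
Fetcher = Callable[[str, str], Awaitable[bytes]]


async def race(cid: str, gateways: Sequence[str], fetch: Fetcher,
               concurrent_requests: int = 10) -> bytes:
    """Ask up to `concurrent_requests` gateways for the same CID; first success wins."""
    tasks = [asyncio.create_task(fetch(gw, cid))
             for gw in gateways[:concurrent_requests]]
    try:
        # as_completed yields awaitables in the order the downloads finish.
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this gateway failed, wait for the next one
        raise RuntimeError("no gateway returned data for " + cid)
    finally:
        # The losers' output is ignored, so stop their downloads.
        for task in tasks:
            task.cancel()
```

Cancelling the slower requests is a design choice of this sketch; the draft only says their output is ignored.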

`verify raw` Defaults to true. This tells the `multi gateway client` implementation to verify RAW data as wel as CAR data. Setting this option to true (the default) means the `multi gateway client` is guaranteed to only give back valid data. If this option is set to false then raw data is returned as-is, unverified.

typo: wel => well
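For reference, the configuration options quoted above can be collected into one structure. This is only a sketch: the class and field names are assumptions, the defaults are the ones stated in the draft, and `verify_raw` is included only because the draft lists it (reviewers in this thread suggest removing it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiGatewayConfig:
    """Hypothetical container for the draft's options; names are illustrative."""
    concurrent_requests: int = 10    # gateways asked per IPFS request
    max_simultaneous_cids: int = 5   # IPFS requests handled at the same time
    max_total_gateways: int = 25     # gateways in use by the client as a whole
    racing: bool = False             # ask several gateways for the same data
    verify_raw: bool = True          # verify raw data as well as CAR data

    def __post_init__(self) -> None:
        # Reject limits that would leave the client unable to do any work.
        for name in ("concurrent_requests", "max_simultaneous_cids",
                     "max_total_gateways"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
```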


The data retrieval for a given CID must adhere to the configuration options.

There must be an async way to get the data represented by that CID. While the `multi gateway client` can handle any CID data, in its default settings all data is verified. If `verify raw` is set to false then raw data is passed back as-is. CAR data is always verified.

Is there any use case for returning CAR data without verifying it? Probably not, but if so we should include an option for that as well.

Member

I would remove this, and clearly state the spec should always verify received bytes against expected CIDs.
There should not be any footgun that allows MITM/spoofing of user data.


`concurrent requests` Defaults to 10. There must be a way to specify how many concurrent requests the `multi gateway client` does per IPFS request.

`max simultaneous cids` Defaults to 5. There must be a way to define how many simultaneous IPFS requests the `multi gateway client` can handle at any given time.

Is there a reason "5" was chosen here? It may make sense to set to 6 also for the reasons above

@meandavejustice

I think there should be more info on the defined behavior when racing is set to false, especially since it is the default.

@meandavejustice

Leaving a note so that we remember to capture #356 (comment)

@lidel lidel changed the title IPIP-0000: Multi gateway client IPIP-359: Multi gateway client Jan 19, 2023
integrations/MULTI_GATEWAY_CLIENT.md (outdated)

### Finding new gateways

The `gateways` file is parsed to know the initial - bootstrap - gateways. Each line in this file is a single gateway. This list of gateways should be stored internally in this `multi gateway client` implementation.
Member

  • How are lines separated? Make it clear that both \n and \r\n are supported.
  • How should each line be parsed? (Trim whitespace and parse as a URL per https://url.spec.whatwg.org ?)
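One possible reading of the two questions above, as a sketch: accept both `\n` and `\r\n`, trim whitespace, and keep only lines that parse as HTTP(S) URLs. The function name is hypothetical, and Python's `urllib.parse` is used as a stand-in; it is not a WHATWG URL parser, and silently skipping invalid lines is an assumption rather than something the draft states.

```python
from typing import List
from urllib.parse import urlsplit


def parse_gateways_file(text: str) -> List[str]:
    """Return one gateway base URL per valid line of a `gateways` file."""
    gateways = []
    # Normalize \r\n to \n so both line endings are supported.
    for line in text.replace("\r\n", "\n").split("\n"):
        candidate = line.strip()
        if not candidate:
            continue  # blank line
        parts = urlsplit(candidate)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # not an absolute HTTP(S) URL (assumed to be skipped)
        gateways.append(candidate.rstrip("/"))
    return gateways
```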

```
https://ipfs.io
```

From this point on the client should iterate over those gateways and request each of them to give a list of [gateways that it knows](#Gateway-returns-list-of-gateways-it-knows). Based on the return, this should result in a vastly bigger list of potentially usable gateways:
Member

Flagging that there is no protocol for this atm.

FYSA there is a vaguely similar proposal for ambient discovery of HTTP content routers (IPIP-342), we also talk about HTTP transport based on gateway MTB5.

Unless you plan to hold this IPIP until we have something, consider removing "gateway discovery" and limiting the scope to manual management done by client implementations.

```
G --> H[Store gateway];
```

The `200ms` threshold here was picked arbitrarily. From a decentralized point of view, 200ms allows you to go roughly halfway across the globe assuming your internet connection is stable. From a data retrieval point of view, 200ms can be slow but can also be just fine. For example, if a site loads over one connection at a time and each connection has a 200ms latency, the site will feel slow to load. But if you load the same site over multiple concurrent connections, of which only some hit the 200ms threshold, you won't see much difference.
Member

200ms allows you to go roughly halfway across the globe assuming your internet connection is stable

A big part of the planet dreams of having latency this low.
I suggest replacing it with a dynamic value based on the median latency across all gateways, plus some arbitrary timeout.
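The dynamic threshold suggested above could be sketched as follows. The function name and the `margin_ms` default of 100 are assumptions chosen for illustration; the input is a mapping from gateway to a measured latency in milliseconds.

```python
from statistics import median
from typing import Dict, List


def usable_gateways(latencies_ms: Dict[str, float],
                    margin_ms: float = 100.0) -> List[str]:
    """Keep gateways no slower than the median latency plus a fixed margin."""
    if not latencies_ms:
        return []
    threshold = median(latencies_ms.values()) + margin_ms
    fast = [gw for gw, ms in latencies_ms.items() if ms <= threshold]
    # Fastest gateways first, so callers can simply take a prefix.
    return sorted(fast, key=lambda gw: latencies_ms[gw])
```

Because the cut-off follows the measured median, a client on a slow link still keeps roughly the better half of its gateways instead of discarding all of them.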


### Request method

There must be a method to allow IPFS data retrieval. The input for this method must be an IPFS url in these forms: `ipfs://<cid>` and `ipns://<cid>`.
Member

If we introduce ipns:// we need to add some paragraphs that answer the questions below:

  • How is IPNS resolved? Does it support DNSLink and IPNS records, or only one of them?
    • For IPNS records, add a dependency on IPIP-351 for end-to-end verification of IPNS.
    • For DNSLink, how should the client resolve TXT records? OS resolver? DNS-over-HTTPS? Oblivious DNS?
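For the request method's input, a sketch of splitting an `ipfs://` or `ipns://` URL into its parts is shown below. The type and function names are hypothetical, and no IPNS or DNSLink resolution is attempted, since the draft leaves that open.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class IpfsTarget(NamedTuple):
    namespace: str  # "ipfs" or "ipns"
    root: str       # CID, IPNS name, or DNSLink host
    path: str       # optional sub-path, "" when absent


def parse_ipfs_url(url: str) -> IpfsTarget:
    parts = urlsplit(url)
    if parts.scheme not in ("ipfs", "ipns"):
        raise ValueError(f"unsupported scheme in {url!r}")
    if not parts.netloc:
        raise ValueError(f"missing CID or name in {url!r}")
    # Use netloc rather than hostname: hostname is lowercased, which would
    # corrupt case-sensitive identifiers such as CIDv0 strings.
    return IpfsTarget(parts.scheme, parts.netloc, parts.path)


def gateway_path(target: IpfsTarget) -> str:
    """Path to append to a path-gateway base URL."""
    return f"/{target.namespace}/{target.root}{target.path}"
```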


### Security

N/A
Member

  • Note that clients must verify received blocks before using them, and discard ones which do not match expected CID.
  • If ipns:// is to be supported, note if / how to handle DNSLinks
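The first point above, verifying a received block against the expected CID, can be sketched with the standard library alone. Assumptions: only base32 CIDv1 strings (multibase prefix `b`) with a sha2-256 multihash are handled, and all function names are illustrative. A real client would use a CID library and support more encodings and hash functions.

```python
import base64
import hashlib
from typing import Tuple

SHA2_256 = 0x12  # multihash code for sha2-256


def _read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_cidv1(cid: str) -> Tuple[int, int, bytes]:
    """Return (codec, hash code, digest) of a base32 CIDv1 string."""
    if not cid.startswith("b"):
        raise ValueError("only base32 CIDv1 strings are handled in this sketch")
    body = cid[1:].upper()
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    version, pos = _read_varint(raw, 0)
    if version != 1:
        raise ValueError("not a CIDv1")
    codec, pos = _read_varint(raw, pos)
    hash_code, pos = _read_varint(raw, pos)
    length, pos = _read_varint(raw, pos)
    digest = raw[pos:]
    if len(digest) != length:
        raise ValueError("digest length does not match multihash header")
    return codec, hash_code, digest


def verify_block(cid: str, data: bytes) -> bool:
    """True only if `data` hashes to the digest embedded in `cid`."""
    _codec, hash_code, digest = decode_cidv1(cid)
    if hash_code != SHA2_256:
        raise ValueError("only sha2-256 is handled in this sketch")
    return hashlib.sha256(data).digest() == digest
```

A block for which `verify_block` returns False should be discarded, and the same bytes requested from another gateway.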


### Compatibility

N/A
Member

Refer to

Member

@lidel lidel Jan 19, 2023

Maybe mention https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#only-if-cached-head-behavior as a mechanism for prioritizing gateways which already have the data? Shotgunning fetch requests to 5 gateways and getting the same data back 5 times is super wasteful.
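A sketch of the probing idea above: send a `HEAD` request with `Cache-Control: only-if-cached` to each gateway and try the ones that answer positively first. The function names are hypothetical, and treating any non-200 answer (including the 412 described in the linked section, as I read it) as "not cached" is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, List, Sequence


def build_cache_probe(gateway: str, cid: str) -> urllib.request.Request:
    """HEAD request asking the gateway to answer from its cache only."""
    return urllib.request.Request(
        f"{gateway.rstrip('/')}/ipfs/{cid}",
        method="HEAD",
        headers={"Cache-Control": "only-if-cached"},
    )


def has_cached(gateway: str, cid: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(build_cache_probe(gateway, cid),
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP errors, timeouts, and connection failures all count as "no".
        return False


def prioritise(gateways: Sequence[str], cid: str,
               probe: Callable[[str, str], bool] = has_cached) -> List[str]:
    """Gateways that report the data as cached come first; order is otherwise kept."""
    return sorted(gateways, key=lambda gw: not probe(gw, cid))
```

The `probe` parameter exists so the ordering logic can be exercised without network access; probing sequentially is a simplification, and a real client would likely probe concurrently.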


Is 2 requests. These count toward `max simultaneous cids`, where the default maximum is 5. If there are more than `max simultaneous cids` requests, those that can't be handled yet are put on a queue and handled as soon as a slot becomes available.
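The queueing behavior described above maps naturally onto a semaphore. The following is a sketch, assuming an injected `fetch_one(url)` coroutine (a hypothetical name): at most `max_simultaneous_cids` requests run at once, and the rest wait until a slot frees up.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def fetch_all(urls: Sequence[str],
                    fetch_one: Callable[[str], Awaitable[bytes]],
                    max_simultaneous_cids: int = 5) -> List[bytes]:
    """Fetch every URL, never running more than the configured number at once."""
    slots = asyncio.Semaphore(max_simultaneous_cids)

    async def guarded(url: str) -> bytes:
        # Requests beyond the limit wait here until a slot becomes available.
        async with slots:
            return await fetch_one(url)

    # gather preserves the order of `urls` in its result list.
    return await asyncio.gather(*(guarded(url) for url in urls))
```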

Internally that CID is represented by N different CIDs (each block). Say `bafyA` consists of 100 blocks (simplified depiction):
Member

@lidel lidel Jan 19, 2023

Does this mean the client always sends the first request for a single block, deserializes it, and then sends a CAR request for its branches? This is fine for an MVP I guess, but it is hard to make a good decision about when to switch from block to CAR for a deeper DAG.

Hannah made a demo during MTB5 and had some good ideas about adding an option to fetch a CAR with non-leaf blocks first (metadata), and then fetching the leaves with the actual data at the end. I wrote some notes in #348 (comment). It also included byte range requests, which are important for use cases like video seeking.

I feel we should strongly consider adding these parameters to CAR requests, before this IPIP is finalized.
(Ok to PoC implementation with naive Block/full-CAR for now, but we want better spec and implementation at the end of the road).


### CAR verification file

Besides verifying response headers, we should also define which blob we actually expect, such as "Hello world" or "Hello IPFS".
Member

For a quick heartbeat check, a CAR with a single root for a zero-length block will be enough, and won't waste much bandwidth.

Status: 🏃 In Progress
3 participants