
Providing content with a pin request #73

Open
Gozala opened this issue Mar 3, 2021 · 36 comments

Comments

@Gozala
Collaborator

Gozala commented Mar 3, 2021

We were evaluating protocol/web3-dev-team#58 in the context of protocol/web3-dev-team#62, and the subject of "where is the pinning service going to get content from" came up. The assumption that the pinning service will fetch content from the IPFS network raises some concerns:

  1. What if the node that just added the content is behind a NAT that the pinning service can't punch through?
  2. What if the service is used from a web browser, in which case it is highly unlikely the pinning service will be able to dial it?
  3. If all you want is to add some content to IPFS and get it pinned, spinning up a full IPFS node and waiting until the content is fetched from it is overkill.

I remember @lidel telling me about a de facto hack of encoding content in an identity-hashed CID, which might overcome some of the concerns listed above but raises whole new ones:

  1. Given that it is not specified, how reasonable is it to expect this to even be supported by a pinning service?
  2. Is this supposed to work in multi-block scenarios?

    I thought that was the case, but the more I think about it, the less sense it makes to me.

  3. Is there a CID size limit / request payload size limit to consider?

Either way, uploading content as an identity-hashed CID encoded as a base64 string in JSON feels like a very impractical way to meet these requirements. It seems we need to consider extending this specification to support this use case, or it will not be practical for cases where just putting content on IPFS is desired.

It is also worth pointing out that e.g. Pinata has its own API for such a use case: https://pinata.cloud/documentation#PinFileToIPFS

/cc @alanshaw @mikeal @jnthnvctr

@mikeal

mikeal commented Mar 3, 2021

We should just PUT a CAR file with a single root (a standard Filecoin CAR file). It’s pretty clean: we already have code to pin a CAR file in Go, and the JS client libraries were just updated to be smaller and faster. We have nice library infrastructure to leverage here, and we end up with some really thin client libraries you can build without IPFS in the client at all.

@aschmahmann

This is a duplicate of ipfs/in-web-browsers#18 and ipfs/in-web-browsers#22 which are already closed and resolved.

Problems 1 and 2 are solved if you assume the remote pinning service is dialable which IMO is both fair and is a prerequisite for this solution to work.

Problem 3 I think is a non-issue. Given that you have a blockstore (FlatFS, badger, S3, a CAR file, etc.) spinning up a libp2p node that supports Bitswap and only makes a single connection to the pinning service node is not a huge ask and helps us plan for the future.

@mikeal

mikeal commented Mar 3, 2021

spinning up a libp2p node that supports Bitswap and only makes a single connection to the pinning service node is not a huge ask and helps us plan for the future.

To a Web developer working in the browser, this is a huge ask.

@aschmahmann

aschmahmann commented Mar 3, 2021

To a Web developer working in the browser, this is a huge ask.

I mean we could just provide a library that does this. If asking developers to include a library is a huge ask then I don't see how they get any real benefit out of IPFS, since they can't get the IPLD data/CAR file to send to the pinning service without using some library for working with IPLD data.

@mikeal

mikeal commented Mar 3, 2021

Applications are a little more complicated than that. The client/user that needs to upload the content has different needs than the consumers of that data who need it available in a decentralized network.

In some NFT use cases an artist needs to put their content into a website and then never touch the site again. There are other users who might bid, buy, trade and do other authentication against the NFT data.

IPFS is still very much a critical part of this application, but the thing standing in the way of getting more content into IPFS is that loading content you’re just trying to hand to a remote provider to run IPFS for you requires such a substantial client.

@aschmahmann

the thing standing in the way of getting more content into IPFS is that loading content you’re just trying to hand to a remote provider to run IPFS for you requires such a substantial client.

To be convinced of any value in this argument I'd need to see some evidence that the "substantial client" of a minimal libp2p client is much more burdensome than the IPLD (and likely UnixFS) libraries needed to process the data to the point that it would have a noticeable impact on a user's choices.

Additionally, while there are some nice aspects in terms of ease of implementation using HTTP to send the data to be pinned we have issues like ipfs/in-web-browsers#9 and ipfs/in-web-browsers#7 which become impossible when you start adding in HTTP specific features (e.g. PUT for a CAR file) into this spec. You also lose out on any deduplication benefits of a protocol like Bitswap or GraphSync.

Overall, IMO this proposal is pushing us in the wrong direction. If there is some major hurdle (e.g. with using js-libp2p in the browser) with the existing API's support for this use case, then I could potentially get behind this but otherwise I don't think it's worthwhile.

Curious if @lidel has a different perspective here.

@lidel
Member

lidel commented Mar 3, 2021

Bit late to this party, but here are my condensed thoughts:

  • "hack of encoding content in an identity hashed CID" is not feasible in real life. It works only for "helloworld"-like strings used in CI tests and was just a way to make mock-ipfs-pinning-service deterministic and run faster on CI (removing the data transfer step).
  • Like @aschmahmann noted, data transfer from a client behind NAT or other limiting network topologies is solved by provider hints in origins and delegates (both sides attempting to connect to each other using those addrs in best-effort fashion).
    • This is a simple but really powerful primitive: it removes the need for any DHT lookup from the picture, and works out of the box when pin remote commands are used in an IPFS node, because the node fills in origins and pre-connects to delegates automatically.
    • Moreover, if pinning service returns /wss/ multiaddr in delegates then js-ipfs running on a web page will be able to connect to the service and start bitswap immediately. This means content routing is no longer an issue when pinning happens in the browser (assuming Remote pinning service implementation in JS-IPFS protocol/web3-dev-team#58 happens).
  • Getting data into IPFS is out of scope of this API, so I am closing this issue.
    • This API operates on CIDs + optional provider hints and is not concerned with HOW data is imported into IPFS.
    • In practice, I see two solutions for importing data to IPFS:
      • Web app running full js-ipfs node and import happening via ipfs.add + pin.remote.add for true persistence and content routing (needs Remote pinning service implementation in JS-IPFS protocol/web3-dev-team#58)
      • Web app delegating import to some backend service running go-ipfs, and getting CID in return OR running js-ipfs-http-client in web app itself and doing pin.add with go-ipfs that way
    • "PUT a CAR file" or "PUT file to HTTP Gateway" would be the implementation detail that falls under one of above.
      • IMO only "HTTP PUT file to HTTP Gateway and get a CID" improves anything, because it does not require ANY library.
      • "Put a CAR" means I still need to run IPFS-aware chunker and CAR generator, which most likely won't be smaller than running full js-ipfs on a page.

@lidel lidel closed this as completed Mar 3, 2021
@Gozala Gozala reopened this Mar 3, 2021
@Gozala
Collaborator Author

Gozala commented Mar 3, 2021

I have spent a little more time thinking about this, and I came to the following conclusions that I would like to get feedback on:

There are two very different user groups that are potential pinning service API users:

  1. Users who run IPFS nodes and want things to be pinned remotely.
  2. Users who just want to put content on IPFS.

I think the current API does a poor job of meeting either group's needs. For the group operating an IPFS node, a libp2p-based API would be more effective and efficient, avoiding a lot of HTTP round trips, etc.

For the group that just wants to add a file to the IPFS network, having to run an IPFS node just to add a file is a huge burden, and sometimes constraints of the runtime also get in the way (serverless is a good example).

I think there is an opportunity to enable that second group and significantly reduce the upfront costs (education, etc.) of getting them into IPFS. I think doing it as part of pinning services is a good idea because:

  1. It makes services swappable, removing platform lock-in.
  2. It makes IPFS a dead-simple choice for storing data.

@mikeal

mikeal commented Mar 3, 2021

I’d like to pause this discussion for now.

It would seem that the entire purpose of a remote pinning API would be to hire a third party to run IPFS for you, and that effectively requiring a client to run IPFS in order to get content into that remote would be a barrier to maximizing the usefulness of the remote pinning service.

That said, you’re right to question how useful this would be and compare it to other work streams. I don’t think it would be a productive use of your time to send you all the NFT user research we’ve done and invite you to several more meetings where we’ve been discussing such things.

If we’re confident enough of the user need we can just build something to satisfy it ourselves. Once it’s deployed and used by these target users we can iterate on it and take what we’ve learned back to you and recommend changes to the pinning API with much more confidence in their usefulness. Then we can update anything we have already deployed to match what ends up being formally specified.

@Gozala Gozala closed this as completed Mar 3, 2021
@lidel
Member

lidel commented May 11, 2021

Revisiting: providing DAG archive with a pin request

We marinated on this for a few weeks, and it seems that adding the ability to upload a precomputed DAG archive to a pinning service is something that, if implemented in a thoughtful way, does not go against the raison d'être of this spec. @rvagg makes a good point that as long as we talk about DAG archives, there is no change in the existing separation of concerns:

In the Pinning API we have the notion of "provider hints" and I'm wondering, [..] what the distinction might be between "go and find it over there and pin it" vs "here it is, just pin this" (i.e. the provider hint is essentially just "it's right here!").

"Uploading" a DAG archive to the pinning service is the ultimate "provider hint". It operates at the same abstraction level as a bitswap session and does not introduce any complexity related to directory handling, chunking, or hashes, which greatly simplifies things on the service end.


Prior art: DAG import APIs

Right now, we have two preexisting "DAG import" endpoints:

  • go-ipfs: /api/v0/dag/import is provided on API port of every IPFS node
  • ipfs-cluster: /add is /api/v0/add augmented with support for &format=car parameter

We want to make self-hosting of pinning services easier by adding support for this spec to ipfs-cluster (ipfs-cluster/ipfs-cluster#1213), but even when that happens, we will most likely need a separate namespace/port for "service" endpoints guarded by bearer access-token.

Requirements and Constraints

Main ones:

  • Avoid magic: one should be able to use the pinning service with generic HTTP clients like curl: do ipfs dag export, rsync the archive to a different machine, then pin remotely via curl
    • This is solved in prior art by sending the DAG archive produced by ipfs dag export as multipart/form-data
  • /pins supports pinning of a single CID at a time
    • Most likely this means DAG import will also be limited to a single file upload
  • Data from imported DAG should not get garbage collected before user pins it.
    At the same time, service should not pay the cost of keeping data that is not pinned by anyone.
    • We could require pinning service to protect manually imported data from GC for some time, but this shifts complexity to implementer (suddenly, one needs to track import time and expiration)
    • Better approach may be "import" operation that is atomic "import+pin"
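As a concrete illustration of the multipart/form-data convention the prior-art import endpoints use, here is a minimal Python sketch that builds such a body with only the standard library. The field name `file` and the CAR content type are assumptions for illustration, not anything this spec defines; this simply mirrors what `curl -F file=@dag.car` would send:

```python
import io
import uuid

def build_multipart_body(field_name: str, filename: str, payload: bytes,
                         content_type: str = "application/vnd.ipld.car"):
    """Build a multipart/form-data body containing a single file part.

    Returns (boundary, body); the request would then be sent with the
    header: Content-Type: multipart/form-data; boundary=<boundary>.
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    buf.write(f"Content-Type: {content_type}\r\n\r\n".encode())
    buf.write(payload)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()

boundary, body = build_multipart_body("file", "dag.car", b"\x0aCARBYTES")
```

The framing overhead is small, but as noted later in this thread, a raw POST body avoids even this parsing step on the service side.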

Initial API idea (looking for feedback 👀 )

  • To keep things easy for everyone (users, services) import+pin operation should be atomic.
    • Request is multipart/form-data (follow behavior of existing import endpoints)
    • The response returned after DAG import should be JSON with a PinStatus object, the same behavior as in the existing "Add pin object" operation. PinStatus.status maps nicely:
      • If the import was successful, the response should have status set to pinned immediately
      • If some blocks were missing from the archive, it should be queued, and the service should try collecting them via DHT
      • If the archive is invalid or pinning is not possible for regular reasons, failed
    • The created pin is nameless; the user can customize name/metadata via the inexpensive "Replace pin object" operation if needed (no need to complicate the import operation).
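The status mapping proposed above can be sketched as a tiny decision function; the function name and parameters here are illustrative only, not part of the spec:

```python
def status_for_import(archive_valid: bool, missing_blocks: int,
                      pinning_possible: bool = True) -> str:
    """Map the outcome of a DAG-archive import onto PinStatus.status.

    Follows the proposed semantics:
    - invalid archive / pin impossible -> "failed"
    - some blocks missing              -> "queued" (service may fetch via DHT)
    - complete archive, pin succeeded  -> "pinned"
    """
    if not archive_valid or not pinning_possible:
        return "failed"
    if missing_blocks > 0:
        return "queued"
    return "pinned"
```

Note that the "queued" branch is exactly the behavior contested later in this thread, where atomic semantics are argued for instead.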

Obligatory 🏚️ 🚲 question: how extensible do we want this to be?

  • (A) introduce /imports[/dag] endpoint that accepts multipart/form-data POST with a single DAG archive
    • This makes sense if we want to keep the door open for extending import capabilities in near future (multiple archives, maybe even directories and files like in /api/v0/add etc)
  • (B) make /pins accept POST multipart/form-data
    • Pretty clean solution, works well with existing scope, but is not extensible, and based on my OpenAPI experience having more than one accepted content type may cause issues with docs and/or code generation.

(A) sounds like the more pragmatic approach: the YAML will produce good docs and we could extend it if we wanted, but let me know if I missed something important at any stage of this exploration.

cc @mikeal @olizilla @obo20 @hsanjuan @rvagg @ipfs/wg-pinning-services

@rvagg
Member

rvagg commented May 11, 2021

/pins supports pinning of a single CID at a time

  • Most likely means DAG import will be also limited to a single file upload

The pinning service is supposed to support the CID as the root of a graph:

Content Identifier (CID) points at the root of a DAG that is pinned recursively.

So it should be fine, since a CAR can contain that whole graph. Which in theory should support the pattern nft.storage is leaning in to with their cluster API library: nftstorage/ipfs-cluster@054063e#diff-13876b4beb64b9f156474dc78f9c923952a7ca210d4507b6b3135bbe244f8a60 as long as dag-cbor is supported by the endpoint.

So a few additional questions are raised:

  1. This probably means a CAR should only have a single root (they can have multiple roots in theory, but we don't have practical cases with more than one).
  2. How to handle incomplete graphs? What is the intention of the Pinning API if it can't resolve all links of a DAG for a given CID? It's entirely possible to make a CAR with an incomplete graph but maybe that's OK and it just means you only pin the partial graph? Or should they be rejected?
  3. How to handle unknown codecs that can't be traversed for the purpose of figuring out the entire graph. Say I upload a graph encoded entirely with dag-cose blocks and the pinning service doesn't support that codec. Is it rejected? Are all the blocks just blindly pinned and it's left to the fetcher to resolve the graph and the pinning service just serves them up blindly by CID? Or? (related to question 2 but also I assume these kinds of questions have come up previously for the pinning API unless we've been blindly DAG-PB/UnixFS focused?).

@lidel
Member

lidel commented May 11, 2021

  1. This probably means a CAR should only have a single root

I think so too. Is there an easy way to tell that a CAR has more than one root before ingesting the entire thing?
Ideally, the service would fail fast with info that a DAG archive used for pinning has to have a single root.
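There is such a way: a CARv1 file begins with a varint-length-prefixed DAG-CBOR header containing the roots, so a service can count roots without reading any blocks. The following is a rough Python sketch assuming definite-length CBOR encoding in the header; a real implementation would of course use a proper CAR/CBOR library rather than this hand-rolled parser:

```python
import io

def _read_varint(stream) -> int:
    # Unsigned LEB128 varint, as used for CAR section length prefixes.
    shift, result = 0, 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

def _cbor_head(data: bytes, i: int):
    # Decode a CBOR item head: (major_type, argument, next_index).
    # Indefinite-length items are not handled in this sketch.
    major, info = data[i] >> 5, data[i] & 0x1F
    i += 1
    if info < 24:
        return major, info, i
    n = {24: 1, 25: 2, 26: 4, 27: 8}[info]
    return major, int.from_bytes(data[i:i + n], "big"), i + n

def _skip_item(data: bytes, i: int) -> int:
    major, arg, i = _cbor_head(data, i)
    if major in (2, 3):              # byte / text string
        return i + arg
    if major == 4:                   # array
        for _ in range(arg):
            i = _skip_item(data, i)
        return i
    if major == 5:                   # map
        for _ in range(2 * arg):
            i = _skip_item(data, i)
        return i
    if major == 6:                   # tag (e.g. tag 42 wrapping a CID)
        return _skip_item(data, i)
    return i                         # ints etc. carry no payload

def count_car_roots(car_bytes: bytes) -> int:
    """Count roots by reading only the CARv1 header (a varint length
    prefix followed by a DAG-CBOR map with "roots" and "version"),
    so a service can fail fast on multi-root archives."""
    stream = io.BytesIO(car_bytes)
    header = stream.read(_read_varint(stream))
    major, pairs, i = _cbor_head(header, 0)
    assert major == 5, "CAR header must be a CBOR map"
    for _ in range(pairs):
        _, klen, i = _cbor_head(header, i)
        key, i = header[i:i + klen], i + klen
        if key == b"roots":
            _, count, _ = _cbor_head(header, i)
            return count
        i = _skip_item(header, i)    # skip e.g. the "version" value
    raise ValueError("no roots field in CAR header")
```

Since only the header is parsed, the service can reject a multi-root archive before ingesting a single block.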

  1. How to handle incomplete graphs?

I am tempted to say this is up to the pinning service to decide, because it is "content routing details" of sorts. The service could return a queued or pinning status instead of pinned and try to find the missing blocks by other means, or return failed with info explaining that only complete graphs are supported.

  1. [..] Say I upload a graph encoded entirely with dag-cose blocks and the pinning service doesn't support that codec. Is it rejected?

iiuc, a workaround for pinning encrypted DAGs that can't be traversed by a pinning service would be to create an "envelope DAG" with opaque raw blocks as leaves.

I believe paid services like Pinata will reject DAGs that can't be traversed, because they (1) only track root CIDs and pin recursively (2) calculate total size via ipfs dag stat for billing purposes (@obo20 please clarify, if I got any of this wrong).

@obo20

obo20 commented May 11, 2021

@lidel Yes, you're correct with your understanding of things.

Our biggest requirement for anything that we ingest is that we need to be able to calculate the size of it.

The sooner we can do this in our upload pipeline the better, as it helps us make decisions on whether or not the content is allowed to be added to our systems.

@Gozala
Collaborator Author

Gozala commented May 11, 2021

* Most likely means DAG import will be also limited to a single file upload

Can we lift this requirement? We are already running into issues in serverless infrastructure that has memory limits per request. The current plan to overcome these limitations is to chunk up the content and pin it over multiple requests.

If the CAR had to contain a single file, that would be problematic, as we may not be able to fit the whole file in one CAR file.

P.S. I realize the thinking was 1 file max, but I think we'd be better off not having such restrictions, especially since we'll have non-file blocks as well.

@Gozala
Collaborator Author

Gozala commented May 11, 2021

To keep things easy for everyone (users, services) import+pin operation should be atomic.

  • Request is multipart/form-data (follow behavior of existing import endpoints)

What is the rationale here? If you import a single CAR file, why wrap it in all the extra multipart framing that also needs to be parsed on the host? Seems like extra overhead for an unclear benefit.

I would suggest that a POST with the raw content of the CAR file as the body would be a better option here.

  • Response returned after DAG import should be JSON with PinStatus object, same behavior as in existing "Add pin object" operation. PinStatus.status maps nicely:

    • If import was successful, the response should have status set to pinned immediately
    • If some blocks were missing from archive, it should be queued, and the service should try collecting them via DHT
    • If archive is invalid or pinning is not possible due to regular reasons, failed

I think this would be a big mistake. The whole point of import is that it is an atomic operation; introducing queuing and fetching from the DHT is going to make it non-atomic. Also, I would argue that missing blocks are either:

  1. Intended by the user (maybe they do not care about the subgraph)
  2. Or a mistake in encoding. By fetching those blocks we would make discovering that more complicated.

Obligatory 🏚️ 🚲 question: how extensible we want this to be?

  • (A) introduce /imports[/dag] endpoint that accepts multipart/form-data POST with a single DAG archive

I do not think generalizing this endpoint is a good idea. I would much rather add a number of endpoints, or use something like the content type header to extend the interface, than make a very generic API.

My arguments are:

  1. It is harder to optimize things when inputs are arbitrary.
  2. You can have services that implement specific endpoints but not others. Combining everything under a single endpoint makes that a lot more difficult.

@Gozala
Collaborator Author

Gozala commented May 11, 2021

2. How to handle incomplete graphs? What is the intention of the Pinning API if it can't resolve all links of a DAG for a given CID? It's entirely possible to make a CAR with an incomplete graph but maybe that's OK and it just means you only pin the partial graph? Or should they be rejected?

I would prefer having an explicit parameter to say whether an incomplete graph is OK or not, as there are valid cases for both. I would even go as far as making it a non-optional parameter, so the user has to think and make a conscious decision.

@Gozala
Collaborator Author

Gozala commented May 11, 2021

3. How to handle unknown codecs that can't be traversed for the purpose of figuring out the entire graph. Say I upload a graph encoded entirely with dag-cose blocks and the pinning service doesn't support that codec. Is it rejected? Are all the blocks just blindly pinned and it's left to the fetcher to resolve the graph and the pinning service just serves them up blindly by CID? Or? (related to question 2 but also I assume these kinds of questions have come up previously for the pinning API unless we've been blindly DAG-PB/UnixFS focused?).

As with the second question, I think this should be explicit! If the user opted into incomplete graph import, then the answer is obvious. If the user opted into full-graph-only, then I would expect the host to error, as it has no means to validate that invariant.

@Gozala
Collaborator Author

Gozala commented May 11, 2021

I think so too. Is there an easy way to tell that CAR has more than one root, before ingesting entire thing?
Ideally, service would fail fast with info that DAG archive used for pinning has to have a single root.

I am not sure roots are useful beyond verifying that all blocks made it. I would expect the service to pin all the blocks that were in the CAR file, whether root or not. If I did not want a block pinned, I would not have included it in the CAR file in the first place.

@Gozala
Collaborator Author

Gozala commented May 11, 2021

Here is my general feedback consolidated together:

  1. Please let's not try to be future-proof and generalize the API too much. In practice such APIs are harder to work with because:
    1. User mistakes can form valid inputs, making errors harder to fix.
    2. General APIs are also harder to optimize.
    3. It is harder to deprecate things (a lot easier to say "we no longer support this endpoint" than "oh, that type of body is no longer ok").
    4. It is harder to have a service that implements a subset of the API (or rather, harder to communicate that subset).
    5. If we can extend the API to support a wider set of payloads, we could just as well extend the API with more endpoints.
  2. Please let's make this an atomic operation, because:
    1. Non-atomicity of the current API is one of its major limitations.
    2. That is the expectation web2 devs come with, and we fail to meet it.
    3. There is no way for the user/client to act on a queued pin.
  3. If a CAR file contains blocks, those are for pinning.
    1. I really think the service should not try to fetch any blocks that aren't in the CAR file. If the user wants that, a separate pin request can / should be issued for it.
    2. If we find a need for behavior where some blocks need to be fetched and others need to be imported, let's make a separate endpoint for that. Let's not complicate the "pin these blocks" case, please.
  4. The pinning service should treat roots as explicit pins and other blocks as links.
    1. In other words, I expect that a block in the CAR file not referenced from the roots is an error, and the service should reject the whole thing.
    2. Roots are the pins I expect to see in the pin list; all other blocks are just part of the subgraph.
    3. If I unpin the root later, it will unpin all the linked nodes unless they are linked from another root / pin.
  5. No guessing, no defaults; make the user communicate intent, please.
    1. The user may want to pin a subgraph or may have intended the full graph; let's not guess but ask instead (as in, the user should tell).
    2. This also addresses the whole fetch-from-network thing. If the user said to pin a subgraph, there is nothing to fetch; if the user said to pin the full graph but failed to provide it, then we should fail that too.

Please note that I understand there are valid use cases where you want to upload a subgraph and have the service fetch the remaining blocks for you, or where you have already uploaded that part of the subgraph. But I would suggest not supporting such use cases yet, because:

  1. The user could handle that by importing the subgraph first and then sending a request to do a recursive pin, which I'd assume would attempt to fetch the remaining pieces.
  2. If I want to chunk up a graph and upload it, I would much rather work with an API that does atomic import of chunks and then verifies that the graph is complete.
  3. If we find the existing pieces impractical, we can take another iteration and design a specific API for this use case instead of complicating all the other use cases.
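The "verify that the graph is complete" step from point 2 can be sketched as a simple traversal over the uploaded block set. Link extraction from real codecs is abstracted into a plain dict here, and CIDs are plain strings for illustration:

```python
from collections import deque

def missing_blocks(root: str, links: dict) -> set:
    """Walk the DAG from `root` and return CIDs that are referenced but
    not present in the uploaded block set.

    `links` maps each *present* block's CID to the CIDs it links to.
    An empty result means the graph is complete and safe to pin whole.
    """
    missing, seen, queue = set(), set(), deque([root])
    while queue:
        cid = queue.popleft()
        if cid in seen:
            continue
        seen.add(cid)
        if cid not in links:
            missing.add(cid)      # referenced but never uploaded
            continue
        queue.extend(links[cid])
    return missing
```

A "full graph only" service would reject the pin when this set is non-empty; with an explicit "subgraph OK" flag it would simply pin what is present.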

@aschmahmann

I am highly concerned about the ramifications of modifying this API to work with non-recursively pinned data which @Gozala is suggesting.

I think this would be a big mistake. The whole point of import is that it is an atomic operation

This expresses a desire for atomic pin operations.

We are already running into issues in serverless infrastructure that has memory limits per request. Current plan to overcome these limitations is to chunk up content and pin it over multiple requests.

This expresses a desire to break up the atomic pin operation into multiple atomic pin operations. This implies either that pin operations are for incomplete graphs or that we are not pinning data graphs but just sending bundles of raw blocks that are referenced in a graph-like fashion.

Non-recursive Pins

The current API only works with pinning full graphs; it has no concept of a "best-effort" or partial DAG pin. This has been previously discussed, and we removed even the semblance of partial pinning since nobody was working with data like that #17 (comment). If we feel there is value in expanding how we pin data to include non-recursive pins (e.g. pinning by selector, direct pins, depth pins, etc.) we can do that, but it seems both out of scope and something that should probably be explored in go-ipfs (or at least ipfs-cluster) before foisting it onto the community to deal with, since there should be some compliant reference implementation first.

Storing bundles of blocks

We can do this and sometimes we might even have to do this. In the case of unknown codecs this might be the only option available. However, it has a few tradeoffs that are unfortunate.

  • Pin objects become somewhat meaningless to the end user
    • Instead of storing a DAG with pin name myGraph, we store many block bundles with names like myGraph-1-of-10.
  • Making updates (e.g. updating a file in a directory) results in lots of added complexity + overhead and the pin names become even more meaningless
    • Figuring out which myGraph-X-of-Y to replace and which blocks to replace it with is hard. Also if I want to remove/replace one block inside of that bundle I have to download + reupload the whole thing (recall you're doing this whole thing to run away from any smart syncing so none of that can help you)
    • Pin names are even more meaningless. For example, users end up having to use pin names like myGraph-v2-1-of-5 since they can no longer use the Pin Replace operation to replace the DAG backing myGraph
  • The upload library becomes more complex as it takes your DAG and turns it into a bunch of DAG-CBOR DAGs that reference the bundles of raw blocks

In the Pinning API we have the notion of "provider hints" and I'm wondering, [..] what the distinction might be between "go and find it over there and pin it" vs "here it is, just pin this" (i.e. the provider hint is essentially just "it's right here!").

The obvious distinctions are that there is no way to do partial sync in just sending a group of blocks as a CAR file and that sending the data itself takes away the providers choice to not read the data.

However, one more subtle distinction is that with provider hints the pinning service is still separating out the "pinning" and "fetching" operations whereas this proposal combines them both. A true "here it is" equivalent of a provider hint would be allowing the user to send the pinning service a CAR file filled with blocks that they can choose to use or discard that would occur after a remote pin add operation had already been started. In this "provider hint" model it literally would not matter what data was in the CAR files as long as they were in a readable format (one/many roots, in/complete DAGs, un/supported codecs are all irrelevant).

No this does not match the "atomic" property that some here are looking for, but the desire for this property is where the complexity is coming from.

@mikeal

mikeal commented May 12, 2021

I think there’s a simpler path that we’re missing here. If we conceptualize this feature in simpler terms:

  1. Write the blocks in this CAR file to your Blockstore.
  2. Pin the root.
  3. Return when the pin is available

we can defer most of this complexity. Maybe we do need a way to pin partial graphs and a bunch of other features, but let’s not complicate this feature with a lot of new complex behavior. Let’s just avoid changing the behavior of pinning at all.

If you send a partial graph then the request won’t return until we’ve pulled the rest of that graph out of the network. Because that’s how the pinner already works, this feature just lets you load some blocks in first.

If the codecs aren’t available the pin call will fail, because that’s how the pinner already works.

Pinning partial graphs would be cool, but let’s have that conversation in a new feature request because that may involve a bigger rethinking of the pinner, or maybe not, but we can make progress here without taking that on.

@Gozala
Collaborator Author

Gozala commented May 12, 2021

  • Write the blocks in this CAR file to your Blockstore.
  • Pin the root.
  • Return when the pin is available

This is exactly what I am asking for; however, there are a few nuances that I think need to be defined.

we can defer most of this complexity. Maybe we do need a way to pin partial graphs and a bunch of other features, but let’s not complicate this feature with a lot of new complex behavior. Let’s just avoid changing the behavior of pinning at all.

So it’s not about complex behavior, it’s about what the pinning service does if some blocks under the root are missing.

It can either:

  1. not care (which is your partial graph pinning)
  2. error (only full graphs allowed)
  3. try to look for the blocks on the network.

I think we should define the expected behavior regardless of the choice. I also think the 3rd is the worst option, and I can see reasonable arguments in favor of the 1st and 2nd. That is why I suggest letting the user specify the desired behavior between 1 and 2. If that is too much, I’d say the 1st is less limiting; it puts a bit more burden on the user to assemble the CAR file properly.

If you send a partial graph then the request won’t return until we’ve pulled the rest of that graph out of the network. Because that’s how the pinner already works, this feature just lets you load some blocks in first.

I think this is a really bad option. What if it can’t find the blocks? If the user omits blocks, that is on the user, and there might be good reasons to do so: maybe they’re just coming with the next request, or were already pinned by a previous request.

This is what introduces complexity here. It also makes untraversable graphs unpinnable. On the other hand, just pinning the provided blocks simplifies this and makes the service codec-agnostic: you give it blocks, it pins them, that’s all.

Pinning partial graphs would be cool, but let’s have that conversation in a new feature request because that may involve a bigger rethinking of the pinner, or maybe not, but we can make progress here without taking that on.

I think this is the wrong framing: it is not about partial graph support but rather about making the service graph-agnostic. You give it blocks, it pins them; it doesn’t need to know or care about graphs, codecs, or any of that.

@Gozala
Collaborator Author

Gozala commented May 12, 2021

Pin objects become somewhat meaningless to the end user

  • Instead of storing a DAG with pin name myGraph, we store many block bundles with names like myGraph-1-of-10.

I share the general concern here: multiple pins that collectively form a single graph are a real problem, as the relationships between them are not encoded. Originally my thinking was that subgraphs could be temporary until the rest of the graph is imported, at which point they could be dropped, but now I realize that cannot be accomplished without the service understanding graphs, which defeats my original motivation.

@Gozala
Collaborator Author

Gozala commented May 12, 2021

I have thought a bit more about this; here are a few not fully fleshed out notes:

  1. Most of the problems in "Storing bundles of blocks" have to do with the fact that it introduces "myGraph-X-of-Y" style pins. Maybe that could be addressed by merging imports that share a root. As in, if I import n CAR files and they all declare root bafy...graph, all the imported blocks could be made part of it. This would:
    1. 💚 Allow chunked graph uploads and overcome the mentioned limitations in constrained environments like Cloudflare Workers.
    2. 💚 Avoid myGraph-X-of-Y bundle names & the required management.
    3. 💚 Provide atomic transactions.
    4. 💚 Not require the service to know codecs or do any graph traversal.
    5. 💔 Not provide a good sync primitive, as in "pin this new version of MFS with some files edited".
  2. There are valid use cases for importing CAR files that omit blocks because those blocks are known to be available, which option 1 will not cover. But I think that serves a sync use case, and maybe a separate future endpoint could provide an adequate API for it. In the meantime a combination of existing APIs and import as described in option 1 could probably fill that gap.
  3. I think @aschmahmann is proposing something really interesting here

    A true "here it is" equivalent of a provider hint would be allowing the user to send the pinning service a CAR file filled with blocks that they can choose to use or discard that would occur after a remote pin add operation had already been started. In this "provider hint" model it literally would not matter what data was in the CAR files as long as they were in a readable format (one/many roots, in/complete DAGs, un/supported codecs are all irrelevant).
    It addresses:

    1. 💚 Allow chunked graph uploads and overcome the mentioned limitations in constrained environments like Cloudflare Workers.
    2. 💚 Avoid myGraph-X-of-Y stuff.
    3. 💔 No transactional guarantees; things may fail, and the only thing the user can do is keep retrying.
    4. ❤️‍🩹 Would require service to support specific codecs.
    5. 💜 Might provide a reasonable sync solution.

I wonder if some hybrid of 1 and 3 could provide a reasonable compromise with some transactional guarantees. E.g. what if:

  1. CAR imports would associate all blocks with their roots, so multiple imports would just pile up blocks for a specific CID.
  2. We had a "pin this CID or fail" API. It would assume that clients have already imported or otherwise pinned all the necessary blocks ahead of time. With this endpoint the service would not queue or search the network: it either has all the blocks and can do the pin, or it errors, pointing out the missing blocks, which would allow the client to import those blocks and try again.

This would

  1. 💚 Allow chunked graph uploads and overcome the mentioned limitations in constrained environments like Cloudflare Workers.
  2. 💚 Avoid myGraph-X-of-Y stuff.
  3. 💚 Provide transactional guarantee.
  4. 💛 Import would not require codec support. Pinning would, but this limitation could be overcome by pinning additional index blocks with links to the relevant nodes.
  5. 💜 Might provide a reasonable sync solution.
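
A hypothetical request flow for this hybrid (the `/car/import` and `/pins/commit` paths and the `$SERVICE` variable are invented for illustration; nothing here is part of the spec):

```shell
# Upload CAR chunks; the service just accumulates blocks per declared root.
curl -X POST "$SERVICE/car/import" \
  -H "Content-Type: application/vnd.ipld.car" \
  --data-binary @chunk-1.car
curl -X POST "$SERVICE/car/import" \
  -H "Content-Type: application/vnd.ipld.car" \
  --data-binary @chunk-2.car

# Commit: pin the root if every block is present, otherwise fail with a
# list of missing CIDs so the client can upload them and retry.
curl -X POST "$SERVICE/pins/commit" \
  -H "Content-Type: application/json" \
  -d '{"cid": "bafy...graph"}'
```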

@rvagg
Member

rvagg commented May 12, 2021

It seems the options might boil down to something like this, at a really rough level:

Option 1 - "atomic" CAR uploads, "pin this thing please", root(s) say what to pin, graph is contained and complete, incomplete graph is a failure

  • Fails for situations where the user can't put it all in one CAR, so they either have to fall back to a standard provider hint or, more likely if they want the simpler bundle+send workflow, will just send multiple smaller CARs to form a larger graph and pin recursively, pinning more "roots" than they should and having a single graph pinned at multiple intermediate nodes. This makes administration annoying and will gum up any DHT/provider publishing of "here's the roots I have" with lots of small roots.

Option 2 - allow incomplete graphs in a CAR, and solve for completeness <in some manner> (perhaps find it in the DHT or add some other funky mechanism that lets me upload additional CARs until we achieve "completeness" and then this is my singular "pin" operation)

  • No atomicity, messy, complex to implement, error prone, lots of state to deal with, kind of yuk

Option 3 - allow incomplete graphs in a CAR and make pinning atomic (i.e. pinning incomplete graphs now supported)

  • A change in scope/design of the pinning API spec (from what I understand from this discussion)
  • Maybe adds additional complexity to running a pinning operation because it breaks assumptions (billing? error/sanity checking?)
  • Kind of yuk, but also not entirely unreasonable, because a world in which you can't link arbitrarily to other graphs is going to leave us in a very constrained place - but this is the kind of thing that should be solved with explicitness, i.e. selectors or similar to define the edges, and maybe not "here's a CAR and its completeness is my definition of the graph I care about"

Option 4 - keep this out of the Pinning API; make it easier to temporarily park content elsewhere and use provider hints as they are - could involve making ipfs.io and other public gateways writable with a reasonable TTL or LRU - /dag/import lets you import as many CARs as you like and get them into the DHT, available for collection by a pinning service

  • Not so "atomic"
  • Multiple calls required to get a graph pinned - but if you're in a constrained environment and need to do multiple CARs then you're already doing this
  • Multiple endpoints to talk to - but perhaps we could also have an adjunct to the Pinning API that suggests /dag/import should also be made available, purely for this purpose?
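
With today's tooling, option 4 might look roughly like this (assuming a local go-ipfs node whose addresses are reachable enough to serve as provider hints):

```shell
# Park the blocks on a node the pinning service can reach.
ipfs dag import mygraph.car

# Collect this node's multiaddrs to use as "origins" provider hints...
ipfs id --format="<addrs>"

# ...then create the remote pin via the existing Pinning Service API:
#   POST /pins  {"cid": "bafy...root", "origins": ["/ip4/..."]}
```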

@mikeal

mikeal commented May 12, 2021

it (pinner) can either

It can only do the current behavior if we refrain from changing the pinner. There are lots of good ideas in here, but they should be their own feature requests against the pinner. As a first step we should just add the blocks and then call the existing pinner, with all of its current behavior and limitations.

@olizilla
Member

olizilla commented May 12, 2021

I think most people would tell us what they want is "here, let me POST you some files, you give me a CID or fail, take my money". Pinata and Infura offer that already, but through custom APIs... we could pave that cowpath, but we've become focused on CAR files.

To verify what feels implicit in this thread, I think we are trying to nudge people to use car files here:

  • as having users create their own CIDs is good practice for trustless block shuffling.
  • as a workaround for fitting in with the original vision of the pinning service API.

I'm just about sold on using CAR files here, but we should be clear that we'd be offering users what we think they really need, not what they would tell us they want. We should be very clear about why we would do that and how we intend to message it. Can we state why POSTing files is out of scope?

@olizilla
Member

We're writing the thing to CAR up your files. We'll get that working nicely and then report back on whether it's something we would want to make a minimum requirement for playing the pinning service game.

@obo20

obo20 commented May 12, 2021

From reading through this thread, it seems like the original desire here was to allow users an easier way to directly upload their NFT data to a pinning service in a way that doesn't rely on async pinning through the network.

I'm seeing a lot of added functionality/complexity being discussed here, and while I think there's a place for a lot of this, I worry that it's overcomplicating things in the near term.

In addition, I'm also a little worried that we won't even be able to support something like this if it gets too complex. We might be able to, but the more complex things get, the harder things are going to be for us to work around in order to get everything to work with our existing infra.

I would really like to see a simple "golden path" where "user has a file, uploads it to pinning service, gets CID". My guess is that most users/devs aren't going to know what a .car file is, or care about anything such as partial pinning. And I don't think they should have to know.

I don't necessarily care one way or the other if things like CAR files are used for the file format, but anything that's created is going to need to be massively automated to work "automagically" with easy-to-use libraries in order for this to be successful.

@Gozala
Collaborator Author

Gozala commented May 12, 2021

I think most people would tell us what they want is "here let me post you some files, you give me a CID or fail, take my money". Pinata and Infura offer that already, but through custom apis.... we could pave that cowpath, but we've become focused on car files.

This is definitely the sentiment I had when opening this issue. And I absolutely agree that if what you have is a file or a set of them, nothing beats a simple multipart/form-data POST. That is pretty much what Pinata's https://api.pinata.cloud/pinning/pinFileToIPFS provides.

I think the reason CAR files got pulled into the discussion is that, building nft.storage, we found ourselves needing to pin not just files but also non-file DAGs, and CAR files seem to provide a reasonable and simple way to upload those.

That said, maybe it should be a separate discussion / API extension, because I do not see a good reason to complicate the simple case of uploading files with CARs.

To verify what feels implicit in this thread, I think we are trying to nudge people to use car files here:

  • as having users create their own CIDs is good practice for trustless block shuffling.

I think the primary reason is that it supports DAGs beyond UnixFS.

There is a bit of extra utility as it eliminates inconsistencies that could arise from different chunking or hashing preferences.

  • as a workaround for fitting in with the original vision of the pinning service API.

I'm a just about sold on using car files here, but we should be clear that we'd be offering users what we think they really need, not what they would tell us they want. We should be very clear about why we would do that and how we intend to message it. Can we state why POSTing files is out of scope?

I think it would be best:

  1. Address the simple case by POSTing files.
  2. Have a separate thread / effort to address non-file use cases with CARs.

@lidel
Member

lidel commented May 12, 2021

Reducing complexity

I share concerns around making DAG archive handling too complex.

In my mind import+pin should be very simple: in case of DAG archive expect single root, complete DAG and instant pinned status.

Everything else should return an error. If the DAG is bigger than makes sense for a single upload, that should be solved by either regular bitswap or userland sharding (importing subgraphs, then pinning the true root and unpinning the subroots).

Ack that even if we keep DAG import simple, at the end of the day, people will still ask why there is no file import, so let's revisit...

Why we had no FILE import in v1 of this spec

Can we state why POSTing files is out of scope?

iirc original reasons were to:

  • keep API surface small due to time and resourcing constraints
  • incentivize people to run IPFS nodes and dogfood bitswap for data transfer
  • avoid ossification of the way DAGs are constructed
  • avoid feature creep (users asking for ability to control every aspect of ipfs add)

I believe those reasons may no longer hold as strongly as they did last year; the ecosystem looks a bit different (Filecoin shipped, Brave and Opera shipped ipfs://, NFTs, etc.).

Why we might revisit this and support FILE imports

I would really like to see a simple "golden path" where "user has a file, uploads it to pinning service, gets CID". My guess is that most users/devs aren't going to know what a .car file is, or care about anything such as partial pinning. And I don't think they should have to know.

Ack. Personally, I hoped to fill this void by making writable gateways a thing, but there are unknowns around GC, and expecting people to do two operations is less friendly than an atomic "import+pin".

Have separate thread / effort to address non file use cases with cars.

We may produce better API if we design it around more than one import type or use case.

I suggested use of multipart/form-data in POST /import because it enables sending data + optional metadata.

This way we could have a POST /import that supports two types (below in broad strokes):

  • FILE – similar to PinFileToIPFS, but with a file field and an optional pin field that includes a Pin object minus the cid field, because that will be filled in after the file is imported.
  • DAG – similar to FILE but expects a DAG archive in a dag field (or we could have a separate /dag), and the pinned cid is taken from the root in the DAG archive.

The nice thing here is that we solve for both simple and advanced use cases,
and we avoid feature creep by saying "if you want more advanced import, custom hash or chunker, create DAG archive yourself".

Would this be acceptable? I could open PR if this is not too controversial.
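
In curl terms, the FILE variant of the proposed POST /import might look like this (field names follow the sketch above; `$SERVICE` is a placeholder, and none of this is a shipped API):

```shell
# Upload a file plus an optional Pin object (minus cid) as JSON metadata.
curl -X POST "$SERVICE/import" \
  -F "file=@photo.jpg" \
  -F 'pin={"name":"my-photo","origins":[]};type=application/json'
```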

Think about import+pin of JSON/CBOR

Even if we have FILE and DAG import+pin operations, working with JSON/CBOR is still painful (it requires creation of a DAG archive).
We should design the import API so we can extend it at some point and add a thin porcelain for pinning CBOR documents. It could be as simple as:

  • CBOR – expect json or cbor data and returns CID with dag-cbor codec

If people could send JSON to a pinning service, get a CID, and (soon) load it via a gateway, we would remove a lot of the complexity that blocks people from using advanced IPLD features in the long term (cc @warpfork).
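
As a sketch, the hypothetical CBOR porcelain could be as small as the following (the path and the response shape are illustrative only):

```shell
# POST a JSON document; the service re-encodes it as dag-cbor and pins it.
curl -X POST "$SERVICE/import" \
  -H "Content-Type: application/json" \
  -d '{"name": "spec-draft", "links": []}'
# → e.g. {"cid": "bafyrei..."}  (a dag-cbor CID for the document)
```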

@Gozala
Collaborator Author

Gozala commented May 13, 2021

  • FILE – similar to PinFileToIPFS, but with file and optional pin field that includes Pin object minus cid field, because it will be filled up after file is imported.

I like the idea of a POST /import with Content-Type: multipart/form-data for uploading a set of files. That said, I am not sure I understand what you mean by:

optional pin field that includes Pin object minus cid field, because it will be filled up after file is imported.

A few things that would need to be clarified there:

  1. Do all files end up in the same pin as a unixfs directory so you get one CID, or do you get a CID per file, etc.?
    • I personally like the idea of creating a node with pointers to all the files.
  2. What hashing and chunking algorithms are used? (Ideally the same payload would result in the same CID(s) regardless of pinning service.)
    • My preference here would be to make hashing and chunking parameters mandatory on the API endpoint but optional in clients. That would guarantee that the same client will produce the same CIDs regardless of pinning service, or fail altogether.

DAG – similar to FILE but expects DAG archive in dag field (or we could have separate /dag), and pinned cid is taken the root in DAG archive

I would like this to be either a separate endpoint altogether, e.g. /dag/import, or at least Content-Type: application/car. I do not like the idea of wrapping the CAR format in multipart/form-data. Instead I think we should evolve the CAR (or possibly an alternative) data format so it can contain the optional metadata needed.

I also would still like the following to be specified:

  1. What happens to blocks in the CAR file that are not referenced from the root(s)?
    • Are they dropped?
    • Does the import fail?
  2. What happens with blocks that are encoded via a codec not available to the pinning service?
    • Does this cause an import error?
    • Do they get pinned?
    • Something else?

In an effort to keep things simple while making DAG import codec-agnostic, I would like to propose the following requirements:

  1. Root node(s) should be encoded via a supported encoding (e.g. cbor, pb, identity). If a root is encoded via an unsupported format, the import fails.
  2. If a block in the CAR file isn't referenced from a root (or from its traversable subgraph), the import fails with an error.
    • Alternatively we could specify that such blocks are dropped and the response points out which blocks were dropped.
  3. Each root in the CAR file creates a separate pin.
    • Alternatively we could mandate a single root, but I'm not sure that would make things much simpler.
  4. The pinning service does not care if there are links to blocks outside of what's in the CAR; it just calls those out in the response. (I think this was called out as a complicating factor, but I think it is quite the opposite, as the pinning service is not required to do anything. I think it also simplifies billing, as pin size is basically the CAR size. If I'm overlooking complicating factors here, please point them out.)

I believe this would enable the DAG import API:

  1. To support arbitrary codecs (the client just needs to provide an index with links to all such blocks from a root).
  2. To let clients chunk graphs as desired without having to ensure that every chunk is a complete graph.

The nice thing here is that we solve for both simple and advanced use cases,
and we avoid feature creep by saying "if you want more advanced import, custom hash or chunker, create DAG archive yourself".

I love it! Yet I still want swapping pinning services to be zero-cost, and unless we make chunking & hashing predictable, that is not going to be the case. Still, I think for the end user the simple case can remain simple, as those options would be encoded in the client itself.

CBOR – expect json or cbor data and returns CID with dag-cbor codec

This is interesting. I imagine multipart/form-data could be used to do more or less what CAR does, but omitting CIDs and encoding as JSON. However, anything with links would get complicated immediately. Either way, that could be yet another endpoint, evaluated separately. I don't think overloading the same endpoint makes things simpler or easier to evolve; it just hides the true size of the API surface.

@hsanjuan

ok, I'll throw my 2 cents on some discussion points:

Regarding atomicity

In practice, when adding a CAR you can pin before adding. This protects whatever you are about to add from GC, I guess, but it's just a trick. As of now Cluster would fail at this part, because IPFS provides no way of warding off automatic GC while you are doing something like adding blocks.

Anyway, it is obvious that the pinning service should not be GC-ing stuff that is supposed to be pinned. For me, when integrating with go-ipfs, that is more an operational practice than an implementation requirement, given the state-of-the-art GC that it brings and the limited control available as a "client".

What happens to blocks in car file that are not referenced from the root(s) ?

This should be unspecified. In practice, they get added to the blockstore and do not get pinned; pinning is recursive from a root. They get added because the pinning service has no interest in taking on the complexity of doing anything differently. We do not know in which order the blocks arrive (unspecified in the CAR spec, I think). In theory we do not even need to parse them, let alone interpret their links; they can go straight into the blockstore.

What happens with blocks that are encoded via codec not available to pinning service ?

Unspecified. Pin error likely. Import error perhaps. An error in all cases.


I don't think the specification of this should be blocked on things affecting 1% of people, like:

  • Unknown codecs (sorry, but a pinning service should not have to jump through a hundred hoops to support arbitrary closed formats)
  • Partial CARs with missing blocks, and weird CARs with extra blocks (use at your own risk)

The API supports pinning things one by one. Therefore the semantics of pinning something with an "ultimate hint" (CAR attached) should be limited to adding one root CID. Adding multiple things at once can share semantics with pinning multiple things at once, if the API is ever extended in that direction, and that will likely require new specific endpoints.


To summarize. For now:

  • Decide on an endpoint name
  • Decide whether form/multipart or a single POST body for a single CAR file with one DAG and one root. Everything else unspecified.
  • Decide on response format

For later:

  • What are the right semantics for CARs with multiple roots, multipart uploads with multiple CARs, multipart uploads with multiple non-CARs etc.
  • Whether the pinning service will ever need a normal chunking/dag-building endpoint (à la Pinata). I would like clients to do this and just send CARs. But clients cannot chunk-dagbuild and convert to CAR in a single stream, since the CAR header comes up front, and to write it the client needs to have finalized the dag-building process. A new CAR format with a footer would be better suited for "stream to CAR" operations.
  • How the pinning service can expose compatibility features, like the supported IPLD-formats that it can handle, max block size, default add-settings etc. (/about endpoint of sorts).
  • How go-ipfs can support transactional operations to better support pinning services (multi-block imports that cannot be GCed, batches of pins, performant GC that does not freeze the whole service, pinsets, partial pins, depth-limited pins).
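
The header-up-front problem mentioned above can be seen with plain go-ipfs commands: the DAG must be fully built (so the root CID is known) before the CAR can be written. `myfile.bin` is a placeholder input:

```shell
# Chunk + dag-build locally first (no pin), capturing only the root CID...
ROOT=$(ipfs add --pin=false -Q myfile.bin)

# ...and only then serialize the finished DAG as a CAR, since the CAR
# header that comes first must already contain that root.
ipfs dag export "$ROOT" > myfile.car
```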

@lidel
Member

lidel commented Jun 15, 2021

A relevant project proposal for adding "Chunked CAR Uploads" to nft.storage is at protocol/web3-dev-team#111
Ideally, we would come up with something that is generic enough to be included in this spec.

@sudeepdino008

Any updates on this?

@lidel
Member

lidel commented May 22, 2022

Afaik nobody is working on this at the moment, but if someone proposes the API in a PR I'm happy to review it.

To get things started, what we need is a POST /ipfs upload endpoint (ideally compatible with the writable gateway, so we have a single way of uploading things) that accepts:

  • multipart/form-data for single file upload (curl -v -F upload=@image.jpg $URL)
  • application/vnd.ipld.car for CAR upload (for more advanced DAGs, directories, etc)
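
For the CAR variant, the counterpart to the curl -F example above would be something like this ($URL is again a placeholder for the upload endpoint):

```shell
curl -X POST "$URL" \
  -H "Content-Type: application/vnd.ipld.car" \
  --data-binary @mydag.car
```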
