Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: opt-in GRAPH_BACKEND #61

Merged
merged 29 commits into from
Apr 3, 2023
Merged

feat: opt-in GRAPH_BACKEND #61

merged 29 commits into from
Apr 3, 2023

Conversation

aschmahmann
Copy link
Contributor

@aschmahmann aschmahmann commented Mar 15, 2023

Based on ipfs/boxo#176, and ipfs/boxo#245, part of #62


Update 3-27-23:

The general implementation iteration plan is:

  1. Fetch CAR into per-request memory blockstore and serve response
  2. Fetch CAR into shared memory blockstore and serve response along with a blockservice that does block requests for missing data
  3. Start doing the walk locally and then if a path segment is incomplete send a request for blocks and upon every received block try to continue
  4. Start doing the walk locally and keep a list of "plausible" blocks, if after issuing a request we get a non-plausible block then report them and attempt to recover by redoing the last segment
  5. Don't redo the last segment fully if it's part of a UnixFS file and we can do range requests

We have completed the first and most of the second.

Blockers for merging + testing this PR more widely are:

  • There's an OOM issue
  • Adding metrics tracking the different types of graph requests the frontend makes

Finishing the basics of phase 2 will likely also happen and will help with memory pressure.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(first pass, will take it for a spin on Friday)

backend/backend.exe Outdated Show resolved Hide resolved
backend/handlers.go Outdated Show resolved Hide resolved

// Setup header for the output car
err = car.WriteHeader(&car.CarHeader{
Roots: []cid.Cid{rootCid},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 is ipfspath always /ipfs/cid and never /ipfs/cid/sub/path?

Producing a CAR that has /ipfs/cid listed as root, and not the CID of /ipfs/cid/sub/path, means it will over-fetch when someone tried to import it via ipfs dag import.

If subpaths are supported here, we either need to list CID of /path and not /cid, or list no roots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely also /ipfs/cid/path/to/stuff. It's unclear to me what is supposed to go in the roots here though. IIUC some of the CAR splitting tools that exist in dag-house/filecoin/etc. for when they want to take a DAG and split it across CAR files put in the root of an incomplete graph at the front of the CAR files.

Maybe we could just remove the roots entirely, but not sure if some tooling expects them to exist. This (perhaps out of date?) document basically leaves all this as 🤷 https://ipld.io/specs/transport/car/carv1/#unresolved-items.

@rvagg what roots are you emitting from Lassie for partial DAGs?

Copy link
Member

@rvagg rvagg Mar 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always the requested CID - based on the assumption that since we're working with "verifiable" CARs that you want to start from the root and navigate into your content and that the root is the user's trust anchor that we have to revolve around.

It also gives us a nice workflow for these partial DAGs where we can pipe them into plain UnixFS tooling that assumes the root is where everything starts. So right now you can do things like lassie fetch -o - <cid>/path/to/thing | car extract <cid>/path/to/thing - and you have verified download and extraction without needing to do anything else! (Latest car can also skip the whole <cid>/path/to/thing argument, it'll just fish around in your incomplete DAG underneath the root of the CAR and find whatever's in there to extract and lassie will only give you the minimum, which is still verified).

Without additional metadata to embed "/path/to/thing" in the CAR header, I can kind of see an argument for wanting to put an additional "root" for each step of the DAG along that path (which could get quite large!), but that seems like a bad fix for something that's not really essential—we only ever give you just enough to reconstitute what you need to get from the root to your content so I think our tooling should be able to deal with these cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and I forgot to mention also wrt multiple roots for each step on the path, even if we just only wrote the root that was at the path terminus - we don't have that information when we start writing the CAR and it'd be ideal to not have to buffer and delay until we're deep enough to get that information, because again these path DAGs could get quite large!

And aside, there still remains open the possibility of abusing CARv2 to squish in some kind of metdata if we really want it: ipld/go-car#322, ipld/js-car#89. But I think first we should explore what we're really solving for—if it's just for ipfs dag import then let's either (a) fix that, or (b) provide adjacent tooling to make the problems go away. (or more radically, (c) consider that a legacy problem?)

Copy link
Member

@lidel lidel Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my worry is the ipfs dag import UX becomes a footgun when we introduce partial CARs.

@rvagg my vote would be to (a) fix partial CAR use case by changing default to --pin-roots=false.
Filled ipfs/kubo#9765, would appreciate support/feedback there.

lib/graph_gateway.go Outdated Show resolved Hide resolved
lib/pubsub.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
aschmahmann and others added 4 commits March 24, 2023 07:20
Co-authored-by: Marcin Rataj <lidel@lidel.org>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
Comment on lines 95 to 98
if !isCar {
http.Error(w, "only car format supported", http.StatusBadRequest)
return
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Also support single block requests (without paths)

Comment on lines +131 to +139
/*
Implementation iteration plan:

1. Fetch CAR into per-request memory blockstore and serve response
2. Fetch CAR into shared memory blockstore and serve response along with a blockservice that does block requests for missing data
3. Start doing the walk locally and then if a path segment is incomplete send a request for blocks and upon every received block try to continue
4. Start doing the walk locally and keep a list of "plausible" blocks, if after issuing a request we get a non-plausible block then report them and attempt to recover by redoing the last segment
5. Don't redo the last segment fully if it's part of a UnixFS file and we can do range requests
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out that this is the basic plan. At the moment 1 is done. 2 is mostly done. We don't need to land all of this before merging or doing testing with mirror traffic but we do need 4 for safety DoS purposes when being used with actually untrusted backends.

Copy link
Member

@lidel lidel Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 @aschmahmann when we start doing the walk locally, and only fetch missing parts of graph, caboose will be able to see that the requested path is different than the original Content Path requested by the user that we pass in context.

My thinking is that for such "internal" block/CAR requests, we will want to forward the parent content path to the backend HTTP server in Ipfs-Path-Affinity header.

Forwarding this hint does not cost us much, but will allow for caching/content routing optimizations (e.g. if the root of CAR i requested is not on DHT, Lassie will still be able to find it because it was able to find root CID from Ipfs-Path-Affinity, if present). This is not specific to Rhea, RAPIDE, or any other block+car client like IPFS in Chromium would benefit for this (I plan to IPIP Ipfs-Path-Affinity header at some point and add it as optional to https://specs.ipfs.tech/http-gateways/trustless-gateway/)

bsrv blockservice.BlockService
}

func NewGraphGatewayBackend(f CarFetcher, blockFetcher exchange.Fetcher, opts ...GraphGatewayOption) (*GraphGateway, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this as a thread anchor for any misc questions, but overall this code is very much WIP (with a bunch of copy-paste) to try and get this all to work reasonably.

There's a lot of cleaning up and improvements to do here. Here are a few, but feel free to recommend more.

  1. Incorrect internal abstraction usage
    • Caboose is using a blockstore, which means we have a proxy blockstore, but these are both the wrong abstraction. At best they are exchange.Fetcher's. This starts bleeding through the codebase here. Let's wrap them up nicely and isolate some of the crazy.
  2. Copy pasted code from the BlocksGateway
  • We should separate out some of the blocks gateway components (e.g. the mutability pieces have nothing to do with blocks and should be separately usable)

Some todo item's to consider in other refactors that were noticed here:
- go-path's reliance on go-fetcher to do sessions is .... bad. It effectively forces you to break up the session into "path" + other.
- go-fetcher's requirement on using a blockservice is overkill, I basically had to copy-paste it so I could give it a NodeGetter, which is fine for now. Although even that should probably be an exchange.Fetcher rather than a NodeGetter

redundant / caused 404 on some gateways
Makes it easy to test both locally and on staging.
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Pushed small fix (details below), added some debug logs, and switched to env variable GRAPH_BACKEND=true.
  • It is possible to test this against Saturn without caboose by leveraging DNS router at https://l1s.strn.pl.
    • this one-liner works pretty well for local testing: GOLOG_LOG_LEVEL="info,bifrost-gateway=debug,caboose=debug" PROXY_GATEWAY_URL="https://l1s.strn.pl" GRAPH_BACKEND=true go run .
  • The code delta is too big to assume we are still protected by Kubo sharness tests.
    Until we have a decent test coverage in gateway-conformance test suite (being added in Run new gateway-conformance tests in CI #63) we will be running blind, with potential bugs.

Planning test on Rhea staging

While this is by no means safe to run on production,
I feel the GRAPH_BACKEND=true experience with https://l1s.strn.pl is good enough to give this a try on Rhea staging next week.

For that, we need to double check wiring up in caboose.
My understanding of that work is:

  • decide if we need to clean up CarFetcher API, or leave as-is
  • inspect how caboose implements and exposes CarFetcher interface from lib/graph_gateway.go (poc in blockstore_proxy.go)
  • add/inspect car fetch metrics for CarFetcher
  • add/inspect car fetch metrics to grafana so we can observe things like size/TTFB/TTLB of CAR fetches in grafana

blockstore_proxy.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
type DataCallback = caboose.DataCallback

type CarFetcher interface {
Fetch(ctx context.Context, path string, cb DataCallback) error
Copy link
Member

@lidel lidel Mar 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 path is concatenation of IPFS Content Cath with CAR format params. This smells funny, may lead to issues, but maybe ok for now – we can revisit if gateway tests ever start failing for percent-encoded things.

Potential way to clean this up, remove URL params ?format=.. from path and have a separate carParams url.Values with optional depth and bytes (and any future ones).

This way path is always Content Path, carParams is unambiguous, and format param becomes implicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is pretty messy. Also I got it wrong here (not that it matters because there's no behavior change) since Caboose's implementation of Fetch uses the CAR accept header anyway.

As described above we should clean up and define the abstractions we want here and then pass them onto Caboose or our HTTP proxy calls.

We can use url.Values or be more specific with types, likely doesn't matter much 🤷.

Copy link
Member

@lidel lidel Mar 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think url.Values is good enough, because Values.Encode() will preserve order.

Also fine to have own type, but the serialization is important: requests for the same thing will have parameters in the same order, and caches that use URL as cache key will work as expected.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a memory related hiccup:

  • memory issue during load test
    • using caboose (STRN_ORCHESTRATOR_URL="https://orchestrator.strn.pl/nodes/nearby?count=1000" GRAPH_BACKEND=true BLOCK_CACHE_SIZE=2) and running ab -k -n 20000 -c 1000 -w "http://en.wikipedia-on-ipfs.org.ipns.localhost:8081/wiki/Science" shows that there is a memory issue, after ~18k requests ab hangs and dies:
    Completed 18000 requests
    apr_pollset_poll: The timeout specified has expired (70007)
    Total of 19949 requests completed
      0.34s user 0.71s system 0% cpu 4:53.76 total
    

If bifrost-gateway is not killed by OOM, it continues to run, new requests work fine, but ~10GiB of memory is still used by something.

@aschmahmann aschmahmann mentioned this pull request Mar 27, 2023
13 tasks
@aschmahmann aschmahmann changed the title WIP: local backend to test against and dag fetcher local backend to test against and dag fetcher Mar 27, 2023
@lidel lidel changed the title feat: GRAPH_BACKEND feat: opt-in GRAPH_BACKEND Mar 29, 2023
@lidel
Copy link
Member

lidel commented Mar 29, 2023

EOD update:

TODO before we merge and release to staging

Plan for Tue:

  • first off all, check with George/Cameron if Grafana is stabilized and what is the window of baseline metrics (do we need to wait till Fri/Mon?)
  • finish metrics (@lidel's local changes, with fresh eyes on Tue)
  • add CAR-related metrics to bifrost-gw staging grafana
  • merge this PR and deploy to staging with GRAPH_BACKEND=true
    • IF things go wrong, restart with GRAPH_BACKEND=false
      • IF we have really bad day (due to refactor in boxo), revert to main-2023-03-09-9eac2e7 (last version before boxo refactor, before rewriting gateway library, last one we had any confidence in sharness)

Comment on lines 275 to 276
//TODO: interfaces here aren't good enough so we're getting around the problem this way
runtime.SetFinalizer(gr, func(_ *gateway.GetResponse) { closeFn() })
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidel some of this is due to the implementation concerns, but others are from the gateway API. Will raise a PR in boxo to discuss

- IPFSBackend metrics from ipfs/boxo#245
- GraphGateway metrics from #61
- Version for tracking rollouts
@lidel
Copy link
Member

lidel commented Mar 31, 2023

EOD update:

Next

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Range requests are fixed, metrics are added.
This can be merged on Monday if we want to unblock deployment to staging,
but would like someone to look at the upstream ipfs/boxo#245 if possible.

@lidel lidel marked this pull request as ready for review April 1, 2023 01:09
@lidel lidel requested a review from hacdias April 1, 2023 01:10
@lidel lidel assigned lidel and unassigned aschmahmann Apr 1, 2023
@lidel
Copy link
Member

lidel commented Apr 3, 2023

My plan for today (after meetings):

  • switch to boxo 0.8.0-rc4 and run load test just to be safe
  • then merge this one so Docker image builds
  • then bump caboose to latest version so a second image builds (in case we have regression therE)
  • deploy to staging with GRAPH_BACKEND=true
    • IF issues, evaluate if we deploy older version, or disable graph api.
  • go over this PR and feat(gateway): IPFSBackend metrics and HTTP range support boxo#245 and fill tickets for any remaining issues

This is the updated tag after undesired binary got merged.
@lidel
Copy link
Member

lidel commented Apr 3, 2023

GitHub says there are conflicts, disables mergebutton:

2023-04-03_19-46

But AFAIK there are none:

$ git merge origin/main
Already up to date.

Update: I've done manual merge via CLI, worked fine, so looks like a GitHub hiccup 🤷

@lidel lidel merged commit 3f22e43 into main Apr 3, 2023
19 checks passed
@lidel lidel deleted the feat/dag-fetcher branch April 3, 2023 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants