CAR offset writer #290

Closed · dirkmc wants to merge 1 commit from feat/car-offset-writer

Conversation

@dirkmc commented Jan 24, 2022

This PR introduces two classes:

  • CarOffsetWriter writes the CAR for a blockstore + root CID to an io.Writer. The "offset" part is to allow for resumption: it can skip forward some number of bytes in the CAR before it starts writing (see the sketch below).
  • CarReaderSeeker wraps CarOffsetWriter to provide an io.ReadSeeker over it.
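A rough sketch of the resumption use case, not taken from the PR itself: the Write(ctx, w, offset) shape below matches the diff excerpts quoted later in this review, but the NewCarOffsetWriter constructor name, its arguments and the import paths are assumptions, since that part of the diff isn't quoted here.

package example

import (
    "context"
    "io"

    "github.com/ipfs/go-cid"
    blockstore "github.com/ipfs/go-ipfs-blockstore"
    car "github.com/ipld/go-car/v2"
)

// resumeCarTransfer streams the CAR for root out of bstore, skipping the
// bytes a previous, interrupted attempt already delivered. The constructor
// name and argument order are assumptions about the PR branch; this API was
// never merged into a go-car release.
func resumeCarTransfer(ctx context.Context, bstore blockstore.Blockstore,
    root cid.Cid, w io.Writer, alreadySent uint64) error {
    cow := car.NewCarOffsetWriter(root, bstore)
    return cow.Write(ctx, w, alreadySent)
}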


var _ io.ReadSeeker = (*CarReaderSeeker)(nil)

func NewCarReaderSeeker(ctx context.Context, payloadCid cid.Cid, bstore blockstore.Blockstore, size uint64) *CarReaderSeeker {
Member

are you confident you'll know the overall size of the car at construction time here?

Author

We are for the Estuary use case - if we don't know it then seeking from the end becomes infeasible

Member

Do you need to seek from the end? The current read-seeker in this library disallows seeking from the end since it doesn't know the total length in advance.

Author

http.ServeContent() gets the size of the file by seeking to the end of it. I couldn't see a way to get around that without copying a whole bunch of code.
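For context, this is roughly what net/http does under the hood when it isn't told the size: a minimal stdlib-only reproduction of the seek-to-end behaviour referenced above, not code from this PR.

package example

import (
    "fmt"
    "io"
)

// sizeBySeeking mirrors how http.ServeContent discovers the content length:
// it seeks to the end, records the offset, and seeks back to the start. This
// is why the ReadSeeker handed to it must support io.SeekEnd.
func sizeBySeeking(rs io.ReadSeeker) (int64, error) {
    size, err := rs.Seek(0, io.SeekEnd)
    if err != nil {
        return 0, fmt.Errorf("seeking to end: %w", err)
    }
    if _, err := rs.Seek(0, io.SeekStart); err != nil {
        return 0, fmt.Errorf("seeking back to start: %w", err)
    }
    return size, nil
}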

Author

Potentially we could do something like providing the size of the file as an option to CarReaderSeeker, and if it's not set we disallow seeking from the end?

Member

makes sense. I guess worth including then
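A minimal sketch of what the option floated above could look like inside Seek. The struct fields and the "zero means unknown" convention are illustrative assumptions, not the PR's actual layout.

package example

import (
    "errors"
    "fmt"
    "io"
)

// carReadSeeker is a stand-in for the PR's CarReaderSeeker with only the
// fields needed for this sketch.
type carReadSeeker struct {
    offset int64
    size   int64 // 0 is treated as "size not provided" here, for illustration only
}

// Seek rejects io.SeekEnd when the total CAR size was never supplied, which
// is the behaviour proposed above.
func (c *carReadSeeker) Seek(offset int64, whence int) (int64, error) {
    switch whence {
    case io.SeekStart:
        c.offset = offset
    case io.SeekCurrent:
        c.offset += offset
    case io.SeekEnd:
        if c.size == 0 {
            return 0, errors.New("cannot seek from end: total CAR size unknown")
        }
        c.offset = c.size + offset
    default:
        return 0, fmt.Errorf("unsupported whence %d", whence)
    }
    if c.offset < 0 {
        return 0, errors.New("seek resulted in a negative offset")
    }
    return c.offset, nil
}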

@aarshkshah1992 (Contributor) left a comment

@dirkmc Some comments from a first review

v2/car_offset_writer.go (resolved review thread)
go func() {
    err := c.cow.Write(writeCtx, pw, uint64(c.offset))
    if err != nil && !xerrors.Is(err, context.Canceled) {
        pw.CloseWithError(err)
Contributor

I think we should return an error even if the context is cancelled: this error gets propagated to the read side, and if we don't pass an error on context cancellation, the read side will wrongly interpret it as an EOF.

Author

This is a little convoluted but:

  • Read and Seek should not be called concurrently
  • The context is only cancelled by Seek
  • Therefore any reads will have finished when the context is cancelled
  • Therefore there should not be a read in progress anyway. We just want to discard any bytes that have been written.

Contributor

It's not the case that the context can only be cancelled by Seek: you also pass in a parent context there, which could be cancelled by the creator of the ReadSeeker. Therefore, context cancellation errors on write should be propagated to the reader.

Member

^ I think I agree with this: you're accepting a parent context and it's making its way down here; the cancellation could be for any number of reasons very far from this API, so having that bubble back up would seem to be the appropriate behaviour.

Member

Oh... actually, now I'm looking at Seek(), it's not that simple, is it? We're explicitly using the cancel on the new context created for this in the case of a seek that comes in while we're writing. So the cancellation could come from internal, in Seek(), or external, from parentCtx.

But the cancellation from Seek() shouldn't ever show up from the CloseWithError() here, since the reader is discarded there. So it shouldn't matter if we pass it on, since there's no check for it. But since it is passed on with a standard Read(), the parentCtx cancellation error will propagate, but will currently be missed.

So it's still correct to pass the error on regardless of where it comes from, as long as we keep it clear in the API docs that this isn't thread-safe (which you've done, but we need to retain that when we add more docs here).
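A small standalone demonstration of the point this thread settles on: whatever error is handed to pw.CloseWithError is exactly what the read side gets back from Read, so swallowing context.Canceled turns a cancelled parent context into a silent EOF. Stdlib only, not code from this PR.

package main

import (
    "context"
    "fmt"
    "io"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    pr, pw := io.Pipe()

    go func() {
        cancel() // simulate the parent context being cancelled mid-write
        // Propagate the cancellation to the reader rather than hiding it;
        // passing nil (or calling Close) would make the reader see a clean EOF.
        pw.CloseWithError(ctx.Err())
    }()

    _, err := io.ReadAll(pr)
    fmt.Println(err) // prints "context canceled", not a silent EOF
}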

v2/car_reader_seeker.go (two resolved review threads)
@BigLep (Contributor) commented Jun 29, 2022

What are the next steps for getting this merged?

@rvagg (Member) commented Jun 30, 2022

The two main things I can see are:

  1. Docs, even just basic descriptions for the public API pieces.
  2. Resolution of the review discussion about errors: CAR offset writer #290 (comment) - as I've said there, I agree with Aarsh (unless I'm missing some important detail).

Something else I'd like to discuss is the naming - CarOffsetWriter isn't a great name for what this is. It certainly didn't match my expectations when I started reading the code and seeing what it did! I'd normally interpret an "offset" as a read thing, and if I'm writing then it's a seek+write where you're seeking on the writer side, not the source, i.e. not "swallow the first offset bytes of the write" but "skip ahead before you start writing all your bytes". Maybe there's Go prior-art on this kind of API I'm not aware of? This API is really a CarWriterFromBlockstore or BlockstoreToCarWriter or something like that, the emphasis being on the blockstore bit. It might be pretty useful even without the offset, so I'm wondering if we should introduce an Option here for the Write call? Or maybe better: two calls - Write(context.Context, io.Writer) and WriteFrom(context.Context, io.Writer, uint64).
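For clarity, the two-call shape suggested above might look something like this; the interface name and method set are purely hypothetical and nothing here exists in go-car.

package example

import (
    "context"
    "io"
)

// BlockstoreCarWriter is a hypothetical name matching the "emphasis on the
// blockstore" framing above; the two methods mirror the suggested split.
type BlockstoreCarWriter interface {
    // Write streams the full CAR for the configured root to w.
    Write(ctx context.Context, w io.Writer) error
    // WriteFrom streams the CAR but skips the first offset bytes, which is
    // what a resumed transfer needs.
    WriteFrom(ctx context.Context, w io.Writer, offset uint64) error
}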

The merkledag.Walk is a bit sad, but I guess we already have a bit of that in here, and it's efficient for a full graph traversal. I could imagine this being extended with a selector so it becomes a bit like SelectiveCar; then you could do CAR retrievals with a selector and even ADLs. But that could be added later.

@willscott (Member)

My question is whether we need this, or whether we can use #291 as a replacement for this functionality while staying closer to the ipld-prime pattern. My understanding is that #291 is likely sufficient and doesn't introduce dependencies on ipld-format into this library.

@masih force-pushed the feat/car-offset-writer branch 2 times, most recently from 8171360 to 3d96944 on July 6, 2022 09:34
@BigLep (Contributor) commented Jul 19, 2022

@masih: can you please allocate some time to talk with @rvagg and figure out whether we go with this approach or #291. I'd like to get this resolved given the complications it caused when resolving GHSA-9x4h-8wgm-8xfg. Basically, let's not have Boost depending on a custom branch of go-car :)

@rvagg (Member) commented Aug 1, 2022

This has been raised by @hannahhoward as the possible cause of the problems Estuary is having with making deals and getting consistent CommP. One reason this is plausible is that those problems seem to occur mainly (but not exclusively) for larger deals, which may be more prone to needing resumption in HTTP transfers?

  • The go-merkledag Walk usage should produce a DAG ordering consistent with go-ipld-prime's traversal. It recursively walks down left-first through links, and the code here uses the getLinks callback to decide which blocks have been visited. There's also no concurrency option supplied, so the parallel-walk code isn't getting in the way.
  • However, I am concerned about the treatment of duplicate blocks: https://github.com/ipld/go-car/pull/290/files#diff-5af883f15b3483139a3ab0f7fb9b305913601402e7623071a4613daf57e14f62R91-R102 - I think this code is conflating "I've already seen this block" with deciding where to write, by messing with offset. But it's normal for DAGs to contain links to the same CID throughout the graph, even for UnixFS DAGs. I think it would be appropriate to add seen-CID tracking (like v0 SelectiveCar, and also in writingReader) here on top of the offset caching (or remove the offset caching? I'm not really sure why writeBlocks would be called multiple times; maybe the answer is in Boost). Or, a better way might be to move this logic out of the getLinks (nextCid) callback and into that seen.Visit argument, since that's really where it should be done, with getLinks just getting links - a sketch of that shape follows below.
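A sketch of the "dedup in the visit callback" shape suggested in the second bullet, assuming go-merkledag's Walk and a cid.Set; writeBlock is a hypothetical stand-in for the PR's block-writing logic, not an API from this repository.

package example

import (
    "context"

    "github.com/ipfs/go-cid"
    format "github.com/ipfs/go-ipld-format"
    "github.com/ipfs/go-merkledag"
)

// walkOnce lets getLinks just get links and does seen-CID tracking in the
// visit callback instead, so blocks linked more than once are written only once.
func walkOnce(ctx context.Context, ng format.NodeGetter, root cid.Cid,
    writeBlock func(cid.Cid) error) error {

    seen := cid.NewSet()
    var writeErr error
    visit := func(c cid.Cid) bool {
        if !seen.Visit(c) {
            return false // already written; don't descend into it again
        }
        if err := writeBlock(c); err != nil {
            writeErr = err
            return false // note: only stops descending below this node
        }
        return true
    }
    if err := merkledag.Walk(ctx, merkledag.GetLinksDirect(ng), root, visit); err != nil {
        return err
    }
    return writeErr
}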

@BigLep (Contributor) commented Oct 4, 2022

2022-10-04 IPLD triage conversation: @rvagg is going to write up the current state and next steps.

@rvagg (Member) commented Oct 20, 2022

Closing this now that Boost is using a standard release and not a custom branch off this repo. Boost contains CarOffsetWriter locally, so this functionality is available there.

@willscott and I will work to get #291 sorted, in some form that provides a viable alternative to this. That ball's in my court atm; there's a bit of a limbo state in #327 that I dropped the ball on because I started to question whether the complexity to make it all work properly was worth it. So that's my TODO.

@rvagg closed this Oct 20, 2022