
WIP sketch for discussion: Idempotent/reusable UploadedFile#download approach #329

Closed · jrochkind wants to merge 1 commit

Conversation

jrochkind (Contributor)

This is definitely not complete code; it's a sketch, so we have code to talk about instead of just words.

Use case/situation

I have an S3 cache and an S3 store, and am using direct uploads to S3. I am calling refresh_metadata in the store action, so metadata is extracted on promotion (which I am also backgrounding).

Let's say I have:

add_metadata do |io, context|
  {
    md5: calculate_signature(io, :md5),
    sha1: calculate_signature(io, :sha1)
  }
end

This works great. Calculating a signature of course requires reading the whole file. The first signature will end up streaming the whole file via Down; the second will use Down's cached file rather than retrieving the whole stream from S3 again. Great, this is awesome.

But now let's say I have another metadata block:

add_metadata(:page_count) do |io, context|
  io.download do |file|
    PDF::Reader.new(file.path).page_count
  end
end

This will end up pulling all the bytes from S3 again, even though the Down cache already has them.

If I have another add_metadata block that also does io.download, it will end up pulling all the bytes from S3 yet again. (Why might I have this? Because the blocks were written as decoupled metadata-extraction components that shouldn't have to know about each other or coordinate with each other. Yes, I could always rewrite my metadata extraction to have only one io.download in it -- but maybe I want re-usable metadata-extraction routines shared between uploaders (in a gem?), and the fact that they have to be coordinated complicates things.)

The goal/approach

Is there a way to provide an "idempotent" #download that will cache and always re-use any previously downloaded bytes?

This would probably have to be in UploadedFile#download -- that's the only place that has state to cache and re-use. (It's also the only place that knows about any existing Down::ChunkedIO that may already have cached the file).

If there's already a fully-cached Down::ChunkedIO, can we re-use those bytes on disk to give to the caller of #download? (Problems: Down::ChunkedIO doesn't really give us a public/reliable API to see that we have a fully loaded cache, or to get at the bytes on disk. This sketch uses private API as a demo. It all ends up a bit too tightly coupled between Down's internal implementation and UploadedFile#download, alas.)

If there isn't a fully-cached Down::ChunkedIO, we're just going to have to ask the storage to download; we can't really re-use the partially cached Down::ChunkedIO. (Because storages, in particular S3, have their reasons for implementing #download with a different method that doesn't use Down.) But can we cache and re-use those bytes, so that if a second caller calls download, they get the same bytes? Yes, we can...

It does mean that we can't explicitly close! the Tempfile in question like UploadedFile#download did before -- the whole point is that we want to leave it there so it can be re-used by a second download call. Since it's a Tempfile, it should be removed by Ruby when the object is GC'd (i.e. when the UploadedFile object itself is), which could mean significant latency... but I think that's effectively how the UploadedFile's Down::ChunkedIO cache was getting cleaned up already anyway.
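
To make the idea concrete, here is a minimal sketch of the caching half (not the actual diff in this PR -- the method name and instance variable are illustrative). The first call pays for the storage fetch via the existing #download; subsequent calls reopen the same bytes already on disk:

class Shrine
  class UploadedFile
    # Illustrative sketch only -- names are hypothetical, and there is
    # no locking yet (see the concurrency caveat below).
    def reusable_download
      # First call fetches from storage via the existing #download;
      # we deliberately keep the Tempfile around instead of close!-ing it.
      @reusable_tempfile ||= download

      # Hand each caller an independent handle over the same on-disk bytes,
      # so one caller closing its file doesn't invalidate the cache.
      file = File.open(@reusable_tempfile.path, binmode: true)
      return file unless block_given?

      begin
        yield file
      ensure
        file.close
      end
    end
  end
end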

There are a number of failing tests. I think some of them are 'false' failures -- those tests were just relying on internal implementation details that aren't really part of the public API. But some of the failing tests are probably real.

If this approach makes sense at all (and it may not), perhaps we'd want to leave the existing UploadedFile#download alone, and provide a new UploadedFile#idempotent_download or UploadedFile#download(idempotent: true). (Not really sure "idempotent" is the right word, maybe "reusable").

It's also very unclear to me how to write tests for this new approach. :) It probably also has some concurrency concerns (two threads call #download on the same UploadedFile concurrently and the caching gets confused) -- which could possibly be handled just by putting the whole thing in a mutex.
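
The mutex version might look like this (again illustrative; it assumes the mutex is created in #initialize, since lazily creating it would itself be a race):

class Shrine
  class UploadedFile
    # Assumes #initialize is extended with: @download_mutex = Mutex.new
    def reusable_download
      @download_mutex.synchronize do
        # Only one thread performs the storage fetch; the rest wait here
        # and then re-use the cached Tempfile.
        @reusable_tempfile ||= download
      end
      File.open(@reusable_tempfile.path, binmode: true)
    end
  end
end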

I am not really sure if anything even resembling this approach is feasible, but I was hoping that providing some actual code would make it easier to talk about.

@jrochkind (Contributor, Author)

All in all, the separate download implementation in storage is unfortunate. If a Down::ChunkedIO has already been half-downloaded, ideally we could just "flush" it to fetch the rest (though there's no API for that either) and then re-use those bytes...

@jrochkind (Contributor, Author)

Note also: while downloading a second (or third) time from remote storage is bad, even copying the bytes from one local location to another local location (on #download) is not good. When the file is 150MB, that is still an "expensive" thing to do, and it can plug up the IO channels on an EC2 instance if it's being done a lot. If we've already got the bytes on local disk, we want to use them where they are -- not just avoid pulling them down from the remote network again, but avoid copying them again at all.

@jrochkind (Contributor, Author)

Okay, after fixing a typo... while this still fails tests, it does seem to basically work in my demo app.

@hmistry (Member) commented Nov 27, 2018

Have you tried not downloading io if io is already downloaded in add_metadata(:page_count)?

I seem to recall that the #extract_metadata method iterates through an array of added metadata procs and passes io into each one, so if the first block downloads io, the next one can use it when there are multiple blocks. It does an io.rewind before the next one.
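
Roughly the behavior I mean, paraphrased (not the actual plugin source):

# Paraphrase of add_metadata's iteration, not the real implementation:
# every registered block receives the *same* io object, rewound in between,
# so one block's reads don't break the next block.
metadata_procs.each do |name, block|
  metadata[name.to_s] = block.call(io, context)
  io.rewind # the next block starts reading from the beginning
end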

I'm assuming these are multiple add_metadata blocks in the same uploader.

@jrochkind (Contributor, Author) commented Nov 27, 2018

Have you tried not downloading io if io is already downloaded in add_metadata(:page_count)?

Sorry, I'm not sure what you mean, and/or I think there is no API to do that. Can you give me a code example of what you mean? How could a metadata block see "if an io is already downloaded"? The page_count example needs a file path to pass to PDF::Reader; how could it get one without download?

It would be great to have a solution that doesn't require the user to be very careful about Shrine internals -- where, like Down itself, it just kind of does it for you. But I'm not really sure what you are suggesting.

@hmistry (Member) commented Nov 27, 2018

Try something like this (not tested):

add_metadata(:page_count) do |io, context|
  file = io.respond_to?(:path) ? io : io.download # basically checking if io is a tempfile or an UploadedFile
  PDF::Reader.new(file.path).page_count
end

This code allows for io re-use between subsequent metadata block calls.

@jrochkind (Contributor, Author)

Are you suggesting the class of that io object should be different depending on... what?

As far as I can tell in my demo case, io will never respond_to?(:path), regardless of whether the Down::ChunkedIO has already fully cached the file and/or whether io.download has previously been called.

At least in my demo use case, io is always an UploadedFile; it doesn't change class depending on whether the file has already been fully cached by an internal Down::ChunkedIO, or on whether io.download has previously been called (perhaps by a previous metadata block).

UploadedFile does not have a path method.

io.to_io is a Down::ChunkedIO. This also does not have a path method. It does have a file-backed cache with a path, but that is not exposed via public API. Even if it were, it would not tell you whether the file-backed cache has the complete contents or not.

I double checked in my demo case. Nope, io.respond_to?(:path) is never true.
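
For concreteness, this is the shape of the check I ran in the demo app (illustrative):

add_metadata(:debug_check) do |io, context|
  # Observed in my demo app with direct uploads to S3:
  io.class                      #=> Shrine::UploadedFile
  io.respond_to?(:path)         #=> false
  io.to_io.class                #=> Down::ChunkedIO
  io.to_io.respond_to?(:path)   #=> false
  nil # don't actually store any metadata for this check
end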

@hmistry (Member) commented Nov 28, 2018

It doesn't respond to path because it's a ChunkedIO, so I'd pick a different test.

When I read your post, I understand you're trying to use multiple add_metadata blocks but want the io to use the downloaded file and not have each metadata block re-download the file from S3. (You're seeing it re-download because you're specifically calling it.)

If that's not clear, then I don't follow your setup. For example, is this what you're doing?

class MyUploader < Shrine
  plugin :add_metadata

  add_metadata do |io, context|
    {
      md5: "abc",
      sha1: "xyz"
    }
  end

  add_metadata(:page_count) do |io, context|
   # this will use same io obj as previous metadata block and does not download a new copy
    number_of_pages(io)
  end
end

I haven't played with ChunkedIO so we'll need to account for that.

Another thought:
I just saw the comment above your edits which says the tempfile is deleted after the UploadedFile#download block... which is probably what's happening in your case. I'd solve that by doing something like this:

class MyUploader < Shrine
  plugin :add_metadata

  add_metadata do |io, context|
    io = io.download if io.is_a?(UploadedFile)
    {
      md5: "abc",
      sha1: "xyz"
    }
  end

  add_metadata(:page_count) do |io, context|
   # this will use same io obj as previous metadata block and does not download a new copy
    number_of_pages(io)
  end
end

@jrochkind (Contributor, Author) commented Nov 28, 2018

I don't understand your suggestion. I don't understand what your method number_of_pages(io) is supposed to do.

The point of the example is an operation that requires a local file path, which is what the example demos, using real code from the pdf-reader gem (it's also an example from one of the Shrine tutorial articles). It's possible there is some other way to get the number of pages out of a PDF as an IO stream, I don't know. But that isn't the point of the example; the point is that there will be some metadata operations that need an actual local file with a path. Which is what the io.download function is for, I think?

But as far as I can tell, there is no way to use io.download without making an extra complete fetch from the remote location every time you call it. There is also no other way I can find to get a local file path for a complete download without using io.download.

It is true that if you don't need a local file with a path (such as you do in the pdf-reader example), then the io object (a Down::ChunkedIO) is re-usable without re-downloading any bytes or making any local file copies.

That is not the problem case here.

    io = io.download if io.is_a?(UploadedFile)

As far as I can tell, in an add_metadata block io is always an UploadedFile, at least in my direct-upload example, so I don't think that condition makes any sense. The problem is that io.download will make an extra retrieval of the complete file from the remote location, writing it to a local location, even if a complete retrieval had already been done -- as it would have been in my example by the calculate_signature metadata block.

I'm sorry if I'm not explaining this well enough; I'm not sure how we're talking past each other, and I'm finding this discussion frustrating at this point. I have spent a bunch of time looking at and playing with the code; it feels like you are responding without having done so and without paying attention to the case I am describing. I'm sorry if I am not describing it clearly enough.

@hmistry (Member) commented Nov 28, 2018

No worries then. I'm only trying to help. BTW, I'm not responding without doing my homework... I'll stop since we don't follow each other. No hard feelings. 😃

@jrochkind (Contributor, Author)

Okay, thanks -- sorry I got impatient, long day. Sometimes I need to remember to take a break and let things sit in the brain a bit longer.

@jrochkind (Contributor, Author) commented Nov 28, 2018

Reading more code and thinking more... yeah, this whole approach is probably not good.

Thinking at a higher level in the stack... basically, an UploadedFile has an open method which returns a stream (usually a Down::ChunkedIO), and a download method that returns a Tempfile (although a new one, usually freshly retrieved from storage, every time you call it). It does that by basically calling the respective methods on the underlying storage.

I'm in a situation where I know I'm going to need the whole file (because of signature calculation).

Many parts of shrine default to giving you an io obtained from open, where I want to be able to say "in THIS case, give me an io obtained from download instead, an actual Tempfile or File. And further, if called multiple times re-use the bytes on disk, don't retrieve it from storage again -- make a new File object off the same bytes on disk maybe."

Perhaps there is some custom plugin I could write changing UploadedFile to be able to accommodate this. It's not super clean, but maybe there's a way.

Some of the parts that assume you generally want a streaming IO include the io method on UploadedFile itself (which often gets passed to other things -- including any time an UploadedFile is passed as an IO-like object to a third-party library, since UploadedFile delegates read/rewind/close to io) and the refresh_metadata plugin.

Not sure, but it's probably not what's in this PR. I may try a custom local plugin at some point, because multiple retrievals (or even multiple local copies) of a 150MB+ file during promotion is problematic.
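
A rough shape such a local plugin could take (hypothetical names, untested; FileMethods is the module a Shrine plugin defines to extend UploadedFile):

class Shrine
  module Plugins
    # Hypothetical local plugin sketch: prefer an on-disk file,
    # fetched from storage at most once per UploadedFile instance.
    module CachedDownload
      module FileMethods
        def cached_download
          @cached_download_tempfile ||= download
          File.open(@cached_download_tempfile.path, binmode: true)
        end
      end
    end

    register_plugin(:cached_download, CachedDownload)
  end
end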

@janko (Member) commented Dec 2, 2018

@jrochkind @hmistry Forgive me that I haven't read everything, but I think I got the gist.

It seems there are two problems we want to avoid during metadata extraction:

  • downloading the same file multiple times
  • copying the same file multiple times

I would like to tackle these two problems separately. @jrochkind As you said, having #download of an opened Shrine::UploadedFile retrieve the remote file multiple times from the storage is pretty bad (while only copying is not as bad). So I would like to tackle that problem first.

I think the solution is really simple – remove Storage#download. Whether to keep Storage#download alongside Storage#open is something I struggled with. My main reason for dropping #download was that I want to have Storage#open as a "single source of truth"; I don't want Storage#download to have possibly different behaviour from Storage#open. And it's easier to implement a storage when you have one method less to implement.

My main reason for keeping Storage#download is that, if you call UploadedFile#download when the storage implements only #open using Down::ChunkedIO, the file will be copied twice: once to the resulting Tempfile and once to Down::ChunkedIO's internal cache (which is necessary to make it rewindable). Shrine doesn't know whether Storage#open uses Down::ChunkedIO (or whether Storage#open accepts a :rewindable option that is forwarded to Down::ChunkedIO, like Storage::S3 does), and I still want Shrine not to know that, as I believe it would introduce unnecessary coupling.

If we find a way to avoid copying the file multiple times during metadata extraction, I think it's not a big deal to have the 2nd copy in Down::ChunkedIO. I think the tradeoff is worth it compared to the benefits of getting rid of Storage#download. @jrochkind How does that sound?

Another reason I kept S3#download is that it currently retries failed downloads automatically, while S3#open doesn't. This is due to limitations of aws-sdk-s3: if you give it a Tempfile as the "download destination", in case of a network error it will #truncate it and start the download over, but if you give it a block, it will not retry because a block is not "truncatable". But this can be solved in the aws-sdk-s3 gem – in fact, I already fixed it half-way in aws/aws-sdk-ruby#1617.


After the removal of Storage#download, ticking off the following checkboxes should put us in a good place:

Note that this is just something I would like to do eventually – it wouldn't be blocking the removal of Storage#download nor releasing a new version of Shrine.

@janko janko mentioned this pull request Dec 2, 2018
@janko (Member) commented Dec 2, 2018

Pull request that fixes the issue with multiple downloads is here – #331.

After that we can discuss what to do about multiple copies. Honestly, I would prefer that developers just group metadata extractions that require a downloaded file into the same add_metadata block:

metadata_method :pages, ...

add_metadata do |io, context|
  Shrine.with_file(io) do |file|
    {
      "pages" => extract_pages(file),
      ...
    }
  end
end

def extract_pages(file)
  # ...
end

@hmistry (Member) commented Dec 2, 2018

@janko-m For my knowledge, why wouldn't having the following line in add_metadata work? I think it solves the issue, because the issue is caused by having to use io.download, which deletes the Tempfile after its block terminates, so there's no continuity in io between metadata extraction blocks.

  add_metadata do |io, context|
    io = io.download if io.is_a?(UploadedFile)
    # ...
  end

@janko (Member) commented Dec 2, 2018

@hmistry Because that only changes the local variable io inside the block; it doesn't change what is going to be passed to the other blocks. You can try it and see that it doesn't work:

require "shrine"
require "shrine/storage/memory"

Shrine.storages[:memory] = Shrine::Storage::Memory.new

Shrine.plugin :add_metadata
Shrine.plugin :refresh_metadata

Shrine.add_metadata :foo do |io, context|
  io = io.download if io.is_a?(Shrine::UploadedFile)
  nil
end

Shrine.add_metadata :bar do |io, context|
  p io
  nil
end

uploader = Shrine.new(:memory)
uploaded_file = uploader.upload(StringIO.new("content"))
uploaded_file.refresh_metadata!
#<StringIO:0x00007fe64c04d618>
#<Shrine::UploadedFile:0x00007fe64c04cba0 @data={"id"=>"4f112947455c08b6f45693dd2d77bd82", "storage"=>"memory", "metadata"=>{"filename"=>nil, "size"=>7, "mime_type"=>nil}}, @io=#<StringIO:0x00007fe64c04c920>>

The first one is from the Shrine#upload call, and the second one is from UploadedFile#refresh_metadata!.

That reminds me, @jrochkind: the add_metadata blocks will also be called with raw files (e.g. ActionDispatch::Http::UploadedFile) when you're attaching them directly, i.e. when you're not using direct uploads. That's why you can see io.path in some examples -- in that case io is some kind of file instance -- and you're right that it wouldn't work with a Shrine::UploadedFile object (because they don't respond to #path).

@hmistry But my proposition for resolving the copying part of this issue is going to be along those lines -- re-using the downloaded Tempfile between add_metadata blocks. I'll make a pull request soon, so you and @jrochkind can tell me what you think.

@janko (Member) commented Dec 3, 2018

And the alternative solution for preventing multiple copies is here: #332

@jrochkind Let me know what you think of this. What I disliked about the approach in this PR is that it added another instance variable that can mutate the Shrine::UploadedFile object. I really want to avoid adding new mutable state when possible, because I feel it makes it easier to introduce bugs later on. I think having a mutable @io is more than enough for Shrine::UploadedFile.

@hmistry (Member) commented Dec 3, 2018

@janko-m I did try it in a similar test script before suggesting it and it worked - I got the same object in io and verified the object_id too. I'll check my test script later to see where I may have made a mistake.

Thanks for answering my question. I'll check your PR's tomorrow.

@jrochkind (Contributor, Author) commented Dec 3, 2018

@janko-m I agree that the approach in this PR wasn't quite right, it was just helpful to get the wheels in my brain turning and have something concrete to talk about :)

It does seem preferable to get rid of download if we can -- having only one code path for "getting bytes" is a simpler implementation, which makes it easier for us to deal with any bugs/features/problems/desires than having both open and download. I agree that resumability specifically with S3 probably isn't enough to justify the separate download path -- and even better if resumability can be handled in the open path too, per your PR to the AWS SDK.

So, it's true that avoiding extra fetches from (network) storage location is more important than avoiding extra local file copies.

But my experience shows that avoiding extra local file copies is actually important too. In my current "legacy" app (which does not use Shrine at all), running on AWS EC2, our monitoring has shown that we are actually filling up the EC2 disk channel: we get disk IO backlogs as multiple threads/processes try to read/write more bytes than the EC2 local disk can keep up with (I believe even if we make it a flash drive instead of spinning disk). Our legacy code already makes more duplicate local copies of the original file (for metadata and derivatives) than are really required. This leads me to conclude that minimizing local redundant copies of bytes actually is a significant thing that matters for real world performance/behavior.

Over in a comment on #332 I've suggested another possible approach for discussion -- could it possibly work to make the Down IO object, which in most cases is already caching its bytes on disk, just expose that local temp file through public API? Handle it all with a fundamental utility at the lower level -- which would really limit the number of copies of the bytes to one (so long as they share a Down object) -- instead of trying to make multiple custom APIs for every place you might want a local file (metadata, derivatives, other future uses).
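
Concretely, the kind of API I'm imagining (purely hypothetical -- Down::ChunkedIO exposes neither of these methods today):

chunked_io = uploaded_file.to_io   # a Down::ChunkedIO in my setup

# Both methods below are HYPOTHETICAL, sketching the proposal only:
if chunked_io.cache_fully_loaded?  # "do we already have all the bytes on disk?"
  PDF::Reader.new(chunked_io.cache_path).page_count   # re-use them in place
else
  uploaded_file.download { |file| PDF::Reader.new(file.path).page_count }
end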

@jrochkind (Contributor, Author)

Also, I see no need to hurry on a solution here; it's good to take our time, let it sink into our brains, evaluate various possible approaches, and be very confident in what we actually release.

We're already talking about deprecating UploadedFile#download, which I think was itself not-too-long-ago new API that became recommended instead of something else? (Was Shrine.file_path recommended before? I wonder whether all the docs should have kept using Shrine.file_path() as the suggested path, instead of some currently doing io.path, some doing io.download, and maybe more...) Frequent changes to the recommended way of doing this (or inconsistencies between different docs) are confusing to developers.

@janko (Member) commented Dec 12, 2018

In the end I decided to close #332 -- thanks for helping prevent an unnecessary feature from getting added 👍

For reference, in #332 (comment) I explained why I don't think accessing Down::ChunkedIO's internal Tempfile object is a good idea, and that a new security measure will make it impossible.

This leads me to conclude that minimizing local redundant copies of bytes actually is a significant thing that matters for real world performance/behavior.

I agree. This is one solution for using the same file object for re-extracting metadata and processing versions:

process(:store) do |io, context|
  io.download do |file|
    io.metadata.merge!(extract_metadata(file, context))

    processor = ImageProcessing::Vips.source(file)
    versions  = {}

    versions[:foo] = processor.resize_to_limit!(...)
    versions[:bar] = processor.resize_to_limit!(...)
    # ...

    versions
  end
end

Already we're talking about deprecating UploadedFile#download, which I think itself was not-too-long-ago new API that became recommended instead of something else?

No, we're deprecating the #download method on the storage, not on Shrine::UploadedFile. Shrine::UploadedFile#download will continue to exist as before, but now it will only use the #open storage method.
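
Conceptually, UploadedFile#download then sits entirely on top of Storage#open (a simplified sketch, not the exact source):

# Simplified sketch, not Shrine's exact implementation:
def download
  tempfile = Tempfile.new("shrine")
  tempfile.binmode
  io = storage.open(id)          # the single remaining "read bytes" path
  IO.copy_stream(io, tempfile)   # stream storage -> disk
  io.close
  tempfile.tap(&:open)           # reopen so the caller reads from the start
end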

I'm going to close this issue, as we agreed this solution is not ideal, and e.g. it would break code that calls #download multiple times and deletes the Tempfile.

@janko janko closed this Dec 12, 2018
@janko janko deleted the idempotent_download branch February 17, 2019 22:41