WIP sketch for discussion: Idempotent/reusable UploadedFile#download approach #329
Conversation
all in all, the separate

Note also: While downloading a second (or third) time from remote storage is bad -- really even copying the bytes from a local location to another local location (on
Force-pushed from 2d29520 to db1f6b3
okay, after fixing a typo... while this fails tests, it does seem to basically work in my demo app.

Have you tried not downloading

I kind of recall that the

I'm assuming these are multiple
Sorry, I'm not sure what you mean, and/or think there is no API to do that. Can you give me a code example of what you mean? I may not understand what you mean.

How could a metadata block see "if an io is already downloaded"? The page_count example needs a file path to pass to PDF::Reader -- how could it get one without

It would be great to have a solution that doesn't require the user to be very careful about shrine internals. Like
Try something like this (not tested):
See, this code allows for
Are you suggesting the class of that As far as I can tell in my demo case, At least in my demo use case,
I double-checked in my demo case. Nope,
It doesn't respond to When I read your post, I understand you're trying to use multiple If that's not clear, then I don't follow your setup. For example, is this what you're doing?
I haven't played with ChunkedIO so we'll need to account for that. Another thought:
I don't understand your suggestion. I don't understand what your method

The point of the example is an operation that requires a local file path. Which is what the example demos, using real code from the pdf-reader gem (and is also an example from one of the shrine tutorial articles). It's possible there is some other way to get the number of pages out of a PDF as an IO stream, I don't know. But that isn't the point of the example; the point is that there will be some metadata operations which need an actual local file with a path. Which is what the

But as far as I can tell, there is no way to use

It is true that if you don't need a local file with a path (such as you need in the pdf-reader example), then the

That is not the problem case here.
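To make the general point concrete with only the standard library (using a made-up `page_count_from_path` in place of PDF::Reader): some APIs accept only a filesystem path, so a streamed IO has to be materialized to a local file before they can be used.

```ruby
require "tempfile"
require "stringio"

# Hypothetical stand-in for a path-only API such as PDF::Reader:
# it insists on opening the file by name itself.
def page_count_from_path(path)
  File.foreach(path).count  # pretend every line is a "page"
end

io = StringIO.new("page one\npage two\npage three\n")

# Materialize the IO to disk so the path-only API can be called.
pages = Tempfile.create("pdf-example") do |file|
  IO.copy_stream(io, file)
  file.flush
  page_count_from_path(file.path)
end
```

The same shape applies to the real pdf-reader case: whatever hands the metadata block its `io`, something has to put those bytes on disk and produce a path.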
As far as I can tell, in a

I'm sorry if I'm not explaining this well enough; I'm not sure how we're talking past each other. I'm finding this discussion frustrating at this point. I have spent a buncha time looking at and playing with the code; it feels like you are responding without having done so and without paying attention to the case I am describing. I'm sorry if I am not describing it clearly enough.
No worries then. I'm only trying to help. BTW, I'm not responding without doing my homework... I'll stop since we don't follow each other. No hard feelings. 😃
okay, thanks, sorry I got impatient, long day. Sometimes I need to remember to take a break and let it sit in the brain for a bit longer.
Reading more code and thinking more... yeah, this whole approach is probably not good.

Thinking at a higher level in the stack... Basically, an UploadedFile has an

I'm in a situation where I know I'm going to need the whole file (because of signature calculation). Many parts of shrine default to giving you an

Perhaps there is some custom plugin I could write changing UploadedFile to be able to accommodate this. It's not super clean, but maybe there's a way. Some of the parts that assume you generally want a streaming IO instead include the io method on UploadedFile itself (which often gets passed to other things -- including any time an UploadedFile is passed as an IO-like object to a third-party library, since UploadedFile delegates read/rewind/close to

Not sure, but it's probably not what's in this PR. I may try a custom local plugin at some point, because multiple retrievals (or even multiple local copies) of a 150MB+ file during promotion is problematic.
@jrochkind @hmistry Forgive me that I haven't read everything, but I think I got the gist. It seems there are two problems we want to avoid during metadata extraction:
I would like to tackle these two problems separately.

@jrochkind As you said, having

I think the solution is really simple – remove

My main reason for keeping

If we find a way to avoid copying the file multiple times during metadata extraction, I think it's not a big deal to have the 2nd copy in

Another reason I kept

After the removal of
Note that this is just something I would like to do eventually – it wouldn't be blocking the removal of |
Pull request that fixes the issue with multiple downloads is here – #331. After that we can discuss what to do about multiple copies. Honestly, I would prefer that developers just group metadata extractions that require a downloaded file into the same block:

```ruby
metadata_method :pages, ...

add_metadata do |io, context|
  Shrine.with_file(io) do |file|
    {
      "pages" => extract_pages(file),
      ...
    }
  end
end

def extract_pages(file)
  # ...
end
```
@janko-m For my knowledge, why wouldn't having the following line in the
@hmistry Because that only changes the local variable `io`:

```ruby
require "shrine"
require "shrine/storage/memory"
require "stringio"

Shrine.storages[:memory] = Shrine::Storage::Memory.new

Shrine.plugin :add_metadata
Shrine.plugin :refresh_metadata

Shrine.add_metadata :foo do |io, context|
  io = io.download if io.is_a?(Shrine::UploadedFile)
  nil
end

Shrine.add_metadata :bar do |io, context|
  p io
  nil
end

uploader = Shrine.new(:memory)
uploaded_file = uploader.upload(StringIO.new("content"))
uploaded_file.refresh_metadata!
```
The first one is from the

That reminds me, @jrochkind, the

@hmistry But my proposition for resolving the copying part of this issue is going to be on that track, to reuse the downloaded Tempfile between
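The underlying Ruby behavior can be shown without Shrine at all (the names below are purely illustrative): assigning to a block parameter rebinds only the local name inside the block; the object the caller holds is untouched.

```ruby
# Reassigning a block parameter does not affect the caller's reference.
def with_value(value)
  yield value
  value            # return whatever the caller originally passed in
end

original = "uploaded file"
result = with_value(original) do |io|
  io = "tempfile"  # rebinds the local name `io` only
end
# result is still the original string, not "tempfile"
```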
And the alternative solution for preventing multiple copies is here: #332

@jrochkind Let me know what you think of this. What I disliked about the approach in this PR is that it added another instance variable that can mutate the
@janko-m I did try it in a similar test script before suggesting it and it worked - I got the same object in

Thanks for answering my question. I'll check your PRs tomorrow.
@janko-m I agree that the approach in this PR wasn't quite right; it was just helpful to get the wheels in my brain turning and have something concrete to talk about :)

It does seem to be preferable to get rid of

So, it's true that avoiding extra fetches from (network) storage is more important than avoiding extra local file copies. But my experience shows that avoiding extra local file copies is actually important too. In my current "legacy" app (which does not use shrine at all), running on AWS EC2, our monitoring has shown that it's actually filling up the EC2 disk channel: we get disk IO backlogs as multiple threads/processes try to read/write more bytes faster than the EC2 local disk can handle (I believe even if we make it a flash drive instead of spinning disk). Our legacy code is already making more duplicate local copies of the original file (for metadata and derivatives) than are really required. This leads me to conclude that minimizing redundant local copies of bytes actually is a significant thing that matters for real-world performance/behavior.

Over in a comment on #332 I've suggested for discussion another possible approach -- could it possibly work to make the
Also, I see no need to hurry on the solution here; it's good to take our time, let it sink into our brains, evaluate various possible approaches, and be very confident in what we actually release. Already we're talking about deprecating UploadedFile#download, which I think itself was not-too-long-ago new API that became recommended instead of something else? (Was
In the end I decided to close #332, thanks for helping prevent an unnecessary feature from getting added 👍

For reference, in #332 (comment) I explained why I don't think accessing
I agree. This is one solution for using the same file object for re-extracting metadata and processing versions:

```ruby
process(:store) do |io, context|
  io.download do |file|
    io.metadata.merge!(extract_metadata(file, context))

    processor = ImageProcessing::Vips.source(file)

    versions = {}
    versions[:foo] = processor.resize_to_limit!(...)
    versions[:bar] = processor.resize_to_limit!(...)
    # ...

    versions
  end
end
```
No, we're deprecating the

I'm going to close this issue, as we agreed this solution is not ideal, and e.g. it would break code that calls
This is definitely not complete code; it's a sketch, to have code to talk about instead of just words.
Use case/situation
I have an S3 cache and an S3 store, and am using direct uploads to S3. I am doing `refresh_metadata` at the `store` action to get metadata on promotion (which I am also using backgrounding for).

Let's say I have:
This works great. Calculating a signature of course does require the whole file. The first one will end up streaming the whole file, via `down`. The second one will end up using the `down`-cached file, not retrieving the whole stream from S3 again. Great, this is awesome.

But now let's say I have another metadata block:

This will end up pulling all the bytes from S3 again, even though the `down` cache already had them.

If I have another `add_metadata` block which also does `io.download`, it'll end up pulling all the bytes from S3 again. (Why might I have this? Because they were written as decoupled metadata extraction components that shouldn't have to know about each other, or coordinate with each other. But yes, I could always rewrite my metadata extraction to have only one `io.download` in it -- but maybe I wanna have re-usable metadata extraction routines shared between uploaders (in a gem?); the fact that they have to be coordinated complicates things.)

The goal/approach
Is there a way to provide an "idempotent" `#download` that will cache and always re-use any previously downloaded bytes?

This would probably have to be in `UploadedFile#download` -- that's the only place that has state to cache and re-use. (It's also the only place that knows about any existing Down::ChunkedIO that may already have cached the file.)

If there's already a fully-cached Down::ChunkedIO, can we re-use these bytes on disk to give to the caller of `#download`? (Problems: Down::ChunkedIO doesn't really give us public/reliable API to see that we have a fully loaded cache, or to get access to the bytes on disk. We use private API as a demo. It all does end up a bit too tightly coupled between `down` internal implementation and `UploadedFile#download`, alas.)

If there isn't a fully-cached Down::ChunkedIO, we're just going to have to ask the storage to `download`; we can't really re-use the partially cached Down::ChunkedIO. (Because storages, in particular S3, have their reasons for using a different method to implement `#download` that doesn't use `down`.) But can we cache and re-use these bytes, so that if a second caller calls `download`, they get the same bytes? Yes, we can...

It does mean that we can't explicitly `close!` the Tempfile in question like `UploadedFile#download` did before -- the whole point is we want to leave it there so it can be re-used by a second `download` call. Since it's a Tempfile, it should be removed by ruby when the object is GC'd (when the UploadedFile object itself is), which could be significant latency... but I think it's effectively the way the UploadedFile's Down::ChunkedIO cache was getting cleaned up already anyway.

There are a number of failing tests. I think some of them are 'false' fails -- the tests were just reliant on internal implementation to pass, even though that internal implementation wasn't really relevant to public API. But some failing tests are probably real failing tests.
If this approach makes sense at all (and it may not), perhaps we'd want to leave the existing `UploadedFile#download` alone, and provide a new `UploadedFile#idempotent_download` or `UploadedFile#download(idempotent: true)`. (Not really sure "idempotent" is the right word; maybe "reusable".)

It's also very unclear to me how to write tests for this demonstrated new approach. :) Also, this new approach probably has some concurrency concerns (two threads call `#download` on the same UploadedFile concurrently and the caching gets confused) -- which possibly could be handled just by putting the whole thing in a mutex.

I am not really sure if anything even resembling this approach is feasible, but providing some actual code may make it easier to talk about, I was hoping.
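As a rough illustration of the caching idea, independent of Shrine and Down internals (all names below are hypothetical, not real Shrine API): the first `download` copies the bytes into a Tempfile, later calls rewind and reuse the same Tempfile, and a mutex covers the concurrent-callers concern mentioned above.

```ruby
require "tempfile"

# Hypothetical sketch of a "reusable" download: the first call copies the
# remote bytes into a Tempfile; later calls rewind and reuse that same
# Tempfile. The mutex prevents two threads from downloading concurrently.
class ReusableDownload
  def initialize(&fetcher)
    @fetcher = fetcher  # stands in for a storage round-trip
    @mutex   = Mutex.new
  end

  def download
    @mutex.synchronize do
      @tempfile ||= begin
        tempfile = Tempfile.new("reusable-download")
        tempfile.binmode
        tempfile.write(@fetcher.call)
        tempfile
      end
      @tempfile.rewind
      @tempfile
    end
  end
end

fetches = 0
file = ReusableDownload.new do
  fetches += 1       # count how often "storage" is hit
  "file content"
end

first  = file.download
second = file.download
# first and second are the same Tempfile; "storage" was hit only once.
```

Note the sketch deliberately never calls `close!`: as described above, the Tempfile has to outlive the first call so it can be handed out again, leaving final cleanup to GC/finalization.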