Add `file: true` feature to `add_metadata` #332
Conversation
Force-pushed 4be0616 to fd4b59a
Sometimes to extract metadata the user needs a file object which has a `#path`. Previously the recommended approach was to call `Shrine.with_file` directly:

```ruby
add_metadata :custom do |io, context|
  Shrine.with_file(io) do |file|
    # ...
  end
end
```

Other than being a bit verbose, this introduces a performance issue when multiple metadata blocks call `Shrine.with_file` and the refresh_metadata plugin is used. Because `UploadedFile#download` doesn't implement any kind of memoization, the same file will unnecessarily be copied to disk multiple times.

To solve this, we add the ability to pass `file: true` to `#add_metadata`, in which case Shrine will internally call `Shrine.with_file` around the whole metadata extraction, so the copying to disk will happen only once:

```ruby
add_metadata :custom, file: true do |file, context|
  file # this is now a file object
end
```

This is an alternative fix for the problem described in #329.
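The gain can be illustrated with a simplified stand-in for `Shrine.with_file` — the counting helper below is hypothetical, not Shrine's actual implementation:

```ruby
require "tempfile"
require "stringio"

# Simplified stand-in for Shrine.with_file: copies the IO to a Tempfile,
# counts the copy, and yields the file. (Hypothetical -- not Shrine's code.)
$copies = 0

def with_file(io)
  io.rewind
  file = Tempfile.new("metadata")
  IO.copy_stream(io, file)
  $copies += 1
  file.rewind
  yield file
ensure
  file.close! if file
end

io = StringIO.new("raw bytes")

# Without file: true, each metadata block calls with_file itself,
# so the IO is copied to disk once per block.
with_file(io) { |f| f.read } # block 1
with_file(io) { |f| f.read } # block 2
per_block_copies = $copies

# With file: true, the framework wraps all blocks in a single with_file
# call, so the copy to disk happens only once.
$copies = 0
with_file(io) do |f|
  f.read   # block 1
  f.rewind
  f.read   # block 2
end
wrapped_copies = $copies

puts per_block_copies # => 2
puts wrapped_copies   # => 1
```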
Force-pushed fd4b59a to 84ea8c2
I wonder if a lower-level API solution could be possible, one that's more flexibly composable. I potentially need "local file path" access not just in metadata generation, but in versions/derivatives extraction as well. While one could write a particular versions/derivatives API to support it, similar to what you've done here with "extract metadata", I think it would be ideal if we could support a composable utility that could be used with any API, including custom local APIs, future not-yet-thought-of APIs, etc. Also, ideally one that could prevent even a single extra copy of the file, as well as extra downloads of bytes from the storage location. And would this current architecture result in at least one more extra retrieval for versions/derivatives too? Not sure.

Here's one thing I've been thinking of in response to my own investigations, letting it sit in my brain a bit, and these discussions... What if …

Down can already promise (and the implementation probably already does?) that the tempfile path will remain good as long as the returned Down IO object is in scope. (I believe …)

Additionally, note that at least on any "unixy" file system/OS, if there's a file handle to an on-disk file, even if some other code has deleted that file, the file handle will remain good and the bytes will stay undisturbed on disk until the file handle has closed. So if a …

There might be reasons what I'm suggesting wouldn't work or would be trickier than I'm thinking. But if it could work, I think it has the advantages: …
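The "unixy" behavior described above can be demonstrated directly (POSIX systems only — on Windows the unlink of an open file would fail):

```ruby
require "tempfile"

# On POSIX systems, unlinking a file removes its directory entry, but an
# already-open file handle keeps the bytes readable until it is closed.
file = Tempfile.new("demo")
file.write("still here")
file.flush

path = file.path
File.unlink(path)        # the path is gone...

file.rewind
content = file.read      # ...but the open handle still sees the bytes
puts content             # => "still here"
puts File.exist?(path)   # => false
file.close
```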
Oh, right, I didn't realize this is another place where the file would be copied. Ok, this is not an ideal solution then, though it's still useful for reducing copies during "regular" metadata extraction (which happens if you're attaching a raw file). But you gave me a lot to think about.
I just want to point out that this is an additional, third optimisation that we could add, from our discussions at #329. I think it doesn't affect whether we decide to merge #329, and if we do merge it but don't add this additional optimization, we would still have gained a big performance improvement. The reason I wanted to state this is because I don't think this …

```ruby
IO.copy_stream(Down.open(url), Tempfile.new)
# has same execution time as
IO.copy_stream(Down.open(url, rewindable: false), Tempfile.new)
```

But writing to an additional file does increase disk IO, which can affect real-world performance, as you explained in #329 (comment) (thank you for sharing that insight). However, if …

An additional reason why I don't want to expose …
The Tempfile is closed and unlinked on …
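For reference, `Tempfile#close!` closes the handle and unlinks the file in a single step:

```ruby
require "tempfile"

file = Tempfile.new("report")
path = file.path   # capture the path before close! invalidates it

file.close!        # close the handle and delete the file together

puts file.closed?       # => true
puts File.exist?(path)  # => false
```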
This is an interesting observation, I will keep it in mind.

I still need more time to come up with additional arguments for or against merging this PR. On one hand I like that it reduces the number of copies when assigning raw files. On the other, it only partially solves your use case: when you download the file to make derivatives, you'd still be making an additional download. I'm inclined to suggest this for your use case:

```ruby
process(:store) do |io, context|
  io.download do |file|
    io.metadata.merge!(extract_metadata(file, context))

    processor = ImageProcessing::Vips.source(file)

    versions = {}
    versions[:foo] = processor.resize_to_limit!(...)
    versions[:bar] = processor.resize_to_limit!(...)
    # ...
    versions
  end
end
```

Your solution in #329 definitely covers more ground and removes mental overhead from the user. But I'm still worried that it might lead to potential bugs depending on what users do with the Tempfile.
After some more thinking, I don't like this solution after all. When assigning raw IO objects, if people want to avoid multiple copies, they should just convert their IO object to a file object first if it's not already one (e.g. …). For your use case it would help, but not enough, because you'd still have to make an additional copy either way for processing versions/derivatives, which is undesirable. So I'll close this then and continue the discussion in #329.
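The "convert to a file object first" advice can be sketched with a hypothetical helper — the name `ensure_file` is made up for illustration, not part of Shrine:

```ruby
require "tempfile"
require "stringio"

# Hypothetical helper: if the IO isn't already backed by a file on disk,
# copy it to a Tempfile once up front, so later code that needs #path
# doesn't trigger repeated copies.
def ensure_file(io)
  return io if io.respond_to?(:path) # already a File/Tempfile

  file = Tempfile.new("upload")
  file.binmode
  IO.copy_stream(io, file)
  file.rewind
  file
end

raw  = StringIO.new("some raw bytes")
file = ensure_file(raw)

puts file.respond_to?(:path) # => true
puts file.read               # => "some raw bytes"
```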