Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer file reading until package build time #107

Open
dralley opened this issue Mar 22, 2023 · 9 comments
Open

Defer file reading until package build time #107

dralley opened this issue Mar 22, 2023 · 9 comments

Comments

@dralley
Copy link
Collaborator

dralley commented Mar 22, 2023

with_file() and with_file_async() currently read out the contents of the file into a buffer and persist them inside of RPMFileEntry structs, which are stored in the builder struct, until the package is built. This is simple and convenient but terribly inefficient, as all of the the uncompressed file contents will be stored in memory. Additionally this API allows little possibility of parallelism when building a single package - since each package can be dealing with dozens, hundreds or thousands of files and only a few packages are generally built at a time, this is probably a bad trade.

Instead of storing a collection of eagerly-processed RPMFileEntrys inside the builder, we should probably try to store only the filename and RPMFileOptions, and process all of the files at once during build time, which would make it easier to parallelize using a threadpool or async primitives.

This will allow the files to be read and written directly into the archive using smaller buffers, and we can calculate details like file metadata, digests, signatures and such at the same time.

@dralley
Copy link
Collaborator Author

dralley commented Mar 22, 2023

@drahnr @cmeister2 Does this sound OK to you? Can you see any problems with the idea?

@drahnr
Copy link
Contributor

drahnr commented Mar 22, 2023

My main issue with the outlined approach, is the fact that we're not able to provide a reader but make it internal to the fn build(..). So we hard code files iiuc, which I'd rather not.

We don't necessarily have to read the files when passed to with_file or with_file_async, but could defer the reading to the build invocation, while still maintaining passing impl BufRead (or it's async equiv).

@dralley
Copy link
Collaborator Author

dralley commented Mar 22, 2023

Sure, I'm not tied strongly to specific implementation details, so long as we're not storing the entire contents of files in buffers in the builders.

Could you elaborate on this aspect, I'm not sure I understand what you mean.

we're not able to provide a reader but make it internal to the fn build(..). So we hard code files iiuc, which I'd rather not.

@drahnr
Copy link
Contributor

drahnr commented Mar 22, 2023

My assumption was you wanted to use std::fs::File internally.

@dralley
Copy link
Collaborator Author

dralley commented Mar 22, 2023

If by internally you mean on the builder struct then no, I just want to store the path, so that build() (or presumably build_async()) can use whatever method it wants to read the files.

Instead of storing a collection of eagerly-processed RPMFileEntrys inside the builder, we should probably try to store only the filename and RPMFileOptions

@cmeister2
Copy link
Collaborator

No immediate issue with the design. If I squinted enough with a security hat on I could potentially see an issue with storing a path and then reading from that path at a later time, if there was a sufficient gap between storing and then reading (and if that file can be replaced in the meantime), but I don't think that's a sufficient worry.

@drahnr
Copy link
Contributor

drahnr commented Mar 22, 2023

If by internally you mean on the builder struct then no, I just want to store the path, so that build() (or presumably build_async()) can use whatever method it wants to read the files.

Instead of storing a collection of eagerly-processed RPMFileEntrys inside the builder, we should probably try to store only the filename and RPMFileOptions

That's what I meant, you store the path, but then you already looked into using File in fn build().

@dralley
Copy link
Collaborator Author

dralley commented Mar 22, 2023

That's what I meant, you store the path, but then you already looked into using File in fn build().

In theory you would have a build() and a build_async() (etc.) the same way there is currently with_file_async(). But that is necessary even if you were to pass impl BufRead or the async equivalent because the act of doing the reads is what needs to be async. Also, since the interfaces are different, that approach would pollute the RpmBuilder with additional complexity.


(Separately) I'm still not sure I appreciate the point of async for file IO though.

  • Gets delegated to synchronous calls running inside a threadpool anyway
  • You probably don't want to build or parse RPMs in an async context where you care about latency (e.g. webserver), because calculating / verifying digests, signatures, and particularly level-19 zstd compression (!!!) is computationally heavy, and the long computational events effectively cause "blocking". This is especially true for any single-threaded async runtime.
  • Ergonomically speaking it's not that much of an improvement over spawn_blocking(|| ...)

On point 2, I don't have hard data on this, and I should probably do some profiling so that I'm basing this on evidence. I just want to call out that the tokio docs suggest not doing anything compute-bound inside an async runtime, and I have a feeling this may qualify.

@drahnr
Copy link
Contributor

drahnr commented Mar 23, 2023

fn build_async(..) could just consume a threadpool on which the packaging is placed from an execution perspective.

My main concern is the fact hardcoding std::path::Path pointing to local filesystem paths into the API, I'd like to see impl BufRead/impl Read being the only bound for file contents (or their async variants).

I also believe, we should consider splitting the sync and async builder to some extent, or provide a splitting function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants