Drop "cpio" libraries and write something semi-custom, because RPM doesn't use vanilla CPIO #108

Open
dralley opened this issue Mar 23, 2023 · 11 comments · May be fixed by #109

Comments

@dralley
Collaborator

dralley commented Mar 23, 2023

See the "Payload" section of the website: https://rpm-software-management.github.io/rpm/manual/format.html

Payload

The Payload is currently a cpio archive, gzipped by default. The cpio archive type used is SVR4 with a CRC checksum.

As cpio is limited to 4 GB (32 bit unsigned) file sizes RPM since version 4.12 uses a stripped down version of cpio for packages with files > 4 GB. This format uses 07070X as magic bytes and the file header otherwise only contains the index number of the file in the RPM header as 8 byte hex string. The file metadata that is normally found in a cpio file header - including the file name - is completely omitted as it is stored in the RPM header already.
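For illustration, here is a minimal sketch of how a reader might tell the two entry-header variants apart, based only on the description quoted above. It assumes the stripped header is the six magic bytes `07070X` followed immediately by the file index as eight ASCII hex digits, and that `070702` is the magic for SVR4 cpio with CRC. The names `EntryHeader`, `HeaderError`, and `parse_entry_header` are hypothetical and are not taken from cpio-rs or this crate. Padding, trailers, and the remaining SVR4 header fields are deliberately left out.

```rust
/// Magic for SVR4 ("newc") cpio with CRC checksum.
const MAGIC_SVR4_CRC: &[u8] = b"070702";
/// Magic for RPM's stripped-down variant, as quoted from the format docs.
const MAGIC_STRIPPED: &[u8] = b"07070X";

/// Hypothetical result type: which kind of entry header was found.
#[derive(Debug, PartialEq)]
enum EntryHeader {
    /// A regular SVR4 header; its metadata fields are not parsed in this sketch.
    Svr4Crc,
    /// A stripped header carrying only the index of the file in the RPM header.
    Stripped { file_index: u32 },
}

#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    UnknownMagic,
    BadIndex,
}

/// Inspects the start of `buf` and classifies the entry header.
fn parse_entry_header(buf: &[u8]) -> Result<EntryHeader, HeaderError> {
    let magic = buf.get(..6).ok_or(HeaderError::TooShort)?;
    if magic == MAGIC_SVR4_CRC {
        return Ok(EntryHeader::Svr4Crc);
    }
    if magic != MAGIC_STRIPPED {
        return Err(HeaderError::UnknownMagic);
    }

    // Assumption: the index is the 8 bytes right after the magic, hex-encoded.
    let hex = buf.get(6..14).ok_or(HeaderError::TooShort)?;
    if !hex.iter().all(|b| b.is_ascii_hexdigit()) {
        return Err(HeaderError::BadIndex);
    }
    let text = std::str::from_utf8(hex).map_err(|_| HeaderError::BadIndex)?;
    let file_index = u32::from_str_radix(text, 16).map_err(|_| HeaderError::BadIndex)?;
    Ok(EntryHeader::Stripped { file_index })
}
```

Under these assumptions, `parse_entry_header(b"07070X0000002a")` yields a stripped header with `file_index` 42. Everything else about the file (name, size, mode) would have to come from the RPM header.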

So, we should fork cpio-rs (providing the appropriate credits of course), strip it down to the subset we need, and change the magic bytes constant.

Luckily the CPIO format is pretty simple and the library only a few hundred lines, so it's not a big deal.

Subsequently we need to change the PAYLOADFORMAT tag, but upstream RPM still uses cpio as the name, so we'll have to wait until they pick something.

@dralley dralley linked a pull request Mar 23, 2023 that will close this issue
@drahnr
Contributor

drahnr commented Apr 14, 2023

A real concern: Do we want to support rpms with cpio-like archives larger than 4GB? It feels like we pull in a lot of pain for supporting an antipattern? Are there use-cases that are idiomatic that require rpms larger than 4GB?

@dralley
Collaborator Author

dralley commented Apr 14, 2023

@drahnr The example that typically comes up is games, which often include many large assets, or ML models, or their training data. In practice those are rarely distributed as system packages but it is possible and has been done.

@drahnr
Contributor

drahnr commented Apr 14, 2023

My question: Are we anticipating this crate being used for games, using rpm-rs rather than rpmbuild? Resources are limited, and this doesn't strike me as a good return on investment of those.

@dralley
Collaborator Author

dralley commented Apr 14, 2023

It's not just a matter of writing but also reading. I'm not sure I want to assume that nobody will ever want to use this crate to process the contents of existing such RPMs.

I don't know that it's such a drain on resources. cpio is pretty simple, the code for both reading and writing them is only about 400 lines excluding tests and is pretty stable.

@drahnr
Contributor

drahnr commented Apr 14, 2023

Tbh, I'd prefer we create a separate rpm-cpio in the org, rather than moving it into the codebase, and just replace the dependency. Does that sound fair? We can then go forward and rebase on any upstream changes as needed rather than having to backport code manually.

@newpavlov
Contributor

I also think that a separate crate would be a better approach. Maybe you should create a repository for it?

@dralley
Collaborator Author

dralley commented Jul 14, 2023

I'm a bit lukewarm on having a separate crate, because I can't think of anything apart from an RPM parser which would want to parse RPM payloads. So it would be a separate crate that we would be the only users of, probably ever.

@drahnr
Contributor

drahnr commented Jul 15, 2023

I am mostly thinking operationally: applying upstream changes would be as easy as a git rebase or merge. I couldn't care less if we stay the only user, as long as it simplifies the maintenance.

@dralley
Collaborator Author

dralley commented Jul 15, 2023

I don't think there will be any maintenance: the library is "finished" and hasn't seen any commits in a year. CPIO is very simple, so there are unlikely to be any bugs.

@drahnr
Contributor

drahnr commented Jan 2, 2024

We haven't reached a conclusion here. My preference is still to fork to rpm-rs/cpio-rpm and use that.

@dralley
Collaborator Author

dralley commented Jan 2, 2024

I still have the opposite preference, tbh 🤷‍♂️. It's very difficult for me to imagine the supposed maintenance benefit repaying itself against having a separate crate which nobody but this particular library will ever use.

Since the new payload format removes nearly all of the metadata from the archive (because it's duplicated in the RPM header), you can do very little with the payload without also reading the RPM header. So the obvious thing to do is for us to just provide an API for that directly from this crate, since it would be pretty much the only useful way to use that code.
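To make that dependency concrete, here is a rough sketch of the lookup a reader has to do: a stripped payload entry yields only an index, and the file's path and size come from the RPM header. The `FileEntry` struct and `resolve_entry` function are hypothetical stand-ins for illustration, not this crate's actual API.

```rust
/// Hypothetical per-file metadata, standing in for what the RPM header stores.
#[derive(Debug, Clone, PartialEq)]
struct FileEntry {
    path: String,
    /// 64-bit size, so files larger than 4 GB can be represented.
    size: u64,
}

/// Looks up the metadata for a payload entry using the index from its
/// stripped header. Returns `None` if the index is out of range.
fn resolve_entry(header_files: &[FileEntry], file_index: u32) -> Option<&FileEntry> {
    header_files.get(file_index as usize)
}
```

Without the list of header file entries there is nothing to resolve the index against, which is why an API that takes both the header and the payload seems like the natural shape.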

There has been another development since we last had this discussion: RPMv6 plans to use only the "new" payload scheme. It will no longer be limited to packages with files > 4 GB; eventually it will apply to all packages.

That is mentioned under the "Payload" section here: rpm-software-management/rpm#2374
