Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: METADATA file should be the last file within a wheel and should be uncompressed #397

Open
KOLANICH opened this issue Mar 16, 2021 · 12 comments

Comments

@KOLANICH
Copy link

KOLANICH commented Mar 16, 2021

Rationale: wheels use zip archives. Zip archives don't allow gaps. It is a practically needed thing to edit wheels metadata. It can be done 3 ways:

  1. creating a new archive and copying files from an old one there.
    • waste of computation and disk space and IO ops. For flash - also extended and unneeded burn.
  2. splitting archive into 3 parts: before the file being replaced, the file being replaced, and after it. The part before is kept intact, then file is appended, then after part. Plus recomputation of the index. Closer the file being modified to the end - more we can keep intact. If METADATA file is the last one, then we only need to recompute the index.
  3. When a file is compressed, we cannot easily control its blob size in compressed form. When a file in zip archive is stored uncompressed, there is a bijection between uncompressed size and compressed size. If we keep the blob size intact, then we can replace the file in-place without recalculation of indexes and shifting parts after it. The only thing we need to recalculate is CRC.

So I propose:

About long description ... it may make sense to move it into an own file.

@fochoao
Copy link

fochoao commented Mar 17, 2021

I am sorry if I sound offensive, but You could save code defining something like this x = y = z = "same", or a, b, c = 5, 3.2, "Hello".

Example: archive: zipfile.ZipFile, fileInfo: zipfile.ZipInfo, newContent: typing.Union[bytes, bytearray, "mmap.mmap"]
zipfile1, zipInfo1, newcontent1 = "archive: zipfile.ZipFile", "fileInfo: zipfile.ZipInfo", 'newContent: typing.Union[bytes, bytearray, "mmap.mmap"]"'

Then after that: print(zipfile1, zipinfo1, newcontent1)

Defining the variables in a row, and reuse them. Printing that would do it.

@KOLANICH
Copy link
Author

KOLANICH commented Mar 17, 2021

@fochoao,

I am sorry if I sound offensive

You don't sound offensive.

But

  1. I don't understand what you mean. The only thing I understand is that your advice is related to chain assignment somehow and to my code.

  2. such advices are likely offtop in this issue, and probably should go to the repo where the code is situated, and probably should be bound to lines in commits.

@agronholm
Copy link
Contributor

When a file is compressed, we cannot easily control its blob size in compressed form. When a file in zip archive is stored >uncompressed, there is a bijection between uncompressed size and compressed size. If we keep the blob size intact, then we >can replace the file in-place without recalculation of indexes and shifting parts after it. The only thing we need to recalculate is >CRC.

The wheel standard (PEP 427) recommends that .dist-info is placed last in the archive, specifically for the purpose of easily modifying the metadata, and this is what the wheel project does. Is this not enough?

Please also note that while this is the reference implementation of the standard, it's probably not the only one (I think one or two packaging tools roll their own wheels). Thus you would not be able to fully rely on the ability to do this.

About long description ... it may make sense to move it into an own file.

This is a proposal to change the wheel standard and thus does not belong on this bug tracker. If you feel there is enough justification to do this (and thus breaking wheels for all current tools), please make your proposal on the packaging section of discuss.python.org.

@uranusjr
Copy link
Member

I want to point out if you edit METADATA, you need to edit RECORD as well, so that needs be placed last to achieve your goal.

@KOLANICH
Copy link
Author

KOLANICH commented Mar 17, 2021

I want to point out if you edit METADATA, you need to edit RECORD as well

Thanks! I have forgotten about it.

so that needs be placed last to achieve your goal.

Or just kept uncompressed. Then the hash can be replaced in place, so it can be placed before metadata.

But compression of that file makes sense, and I know how it can be solved. We can replace this file with a binary one storing a trie. In KS notation it is a kind of

seq:
  - id: hash_function
    type: u1
    enum: hash_function
  - id: compression_type
    type: u1
    enum: compression_type
    doc: It is proposed to use `lzma2` for a compression (brotli gives better, but not in standard lib). Compressions used must satisfy the property of being streaming, compressions of 2 strings having a common prefix must have the prefix that corresponds to the uncompressed prefix.
  - id: dictionary_size
    type: u4
  - id: dictionary
    size: dictionary_size
    doc: An opaque blob storing a dictionary for a compression lib.  Can be empty if the algo or an impl doesn't support it. 
  - id: trie_type
    type: u1
    enum: trie_type
  - id: trie
    size-eos: true
    doc: contains a serialized trie, with stripped signatures, mapping compressed filenames to their hashes.

Creation:

  1. generate a dictionary from all the filenames present.
  2. for each file calculate its hash and insert into the trie using compression of filename as a key

Enumeration: Just walk the trie and decompress the keys.
Lookup: compress the key, then lookup compression in the trie.
Modification: just lookup and replace in-place.

Straightforward and extensible.

This is a proposal to change the wheel standard and thus does not belong on this bug tracker. If you feel there is enough justification to do this (and thus breaking wheels for all current tools), please make your proposal on the packaging section of discuss.python.org.

How about a PEP and python-ideas?

@fochoao
Copy link

fochoao commented Mar 17, 2021

This is however a big idea, and one I think should be implemented. Exactly to double-check, with sha25sum can be done, or sha1sum. If it is one it matches if it is not then it shall not be even taken in count. But definetely keeping in count security. I will come up with something too.

@agronholm
Copy link
Contributor

This is however a big idea, and one I think should be implemented. Exactly to double-check, with sha25sum can be done, or sha1sum. If it is one it matches if it is not then it shall not be even taken in count. But definetely keeping in count security. I will come up with something too.

What does it have to do with security?

@fochoao
Copy link

fochoao commented Mar 18, 2021

Remember that a package zipped should have a hash for fact-checking, that You are getting the real package.

@agronholm
Copy link
Contributor

That hash has to come from a secure location though because otherwise the hash could be falsified too. And any changes to the zip would change the hash too.

@KOLANICH
Copy link
Author

The file with hashes can be gpg-signed though. But signing-related issues are IMHO offtop here.

@fochoao
Copy link

fochoao commented May 5, 2021

The file with hashes can be gpg-signed though. But signing-related issues are IMHO offtop here.

Yes, You are right on this one.

@amberkushwaha
Copy link

yes you are riht on this one.the file with hashes can be gp-signed though.But signing issues are IMHO offtop here.Yes You are right in this one.is the main circuit intials of the file in it.as mentioned above in the circuit and also mentioned in the given targated in an as follows to it for thew given time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants