MAINT: Simplify file identifiers generation #2003

Open
wants to merge 8 commits into
base: main

Conversation

exiledkingcc
Contributor

No description provided.

@codecov

codecov bot commented Jul 22, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (ec85a27) 94.54% compared to head (40bb17f) 94.52%.
Report is 1 commits behind head on main.

Files              Patch %   Lines
pypdf/_writer.py   90.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2003      +/-   ##
==========================================
- Coverage   94.54%   94.52%   -0.02%     
==========================================
  Files          43       43              
  Lines        7549     7549              
  Branches     1490     1491       +1     
==========================================
- Hits         7137     7136       -1     
  Misses        253      253              
- Partials      159      160       +1     


@MartinThoma MartinThoma changed the title MAINT: simplify file identifiers generation MAINT: Simplify file identifiers generation Jul 23, 2023
        return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
    def _compute_document_identifier(self) -> ByteStringObject:
        md5 = hashlib.md5()
        md5.update(str(time.time()).encode("utf-8"))
Member


This makes document-generation non-deterministic, right?
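
A minimal sketch of the concern raised here, using only the standard library. The helper names `content_identifier` and `time_identifier` are hypothetical and are not pypdf functions; the second one only mirrors the `hashlib.md5()` / `time.time()` pattern visible in the diff, with the timestamp passed in as a parameter so the effect is easy to see.

```python
import hashlib


def content_identifier(data: bytes) -> bytes:
    """Hypothetical content-based identifier: depends only on the bytes."""
    return hashlib.md5(data).hexdigest().encode("utf-8")


def time_identifier(obj_repr: str, now: float) -> bytes:
    """Hypothetical time-based identifier: depends on the timestamp `now`."""
    md5 = hashlib.md5()
    md5.update(str(now).encode("utf-8"))
    md5.update(obj_repr.encode("utf-8"))
    return md5.hexdigest().encode("utf-8")


# Same input bytes give the same identifier on every run.
stable = content_identifier(b"abc") == content_identifier(b"abc")

# Same object description, different timestamps: the identifiers differ.
varies = time_identifier("f", 1.0) != time_identifier("f", 2.0)
```

Under these assumptions, two writes of the same document at different times would get different identifiers with the time-based scheme, which is the non-determinism the question refers to.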

@MartinThoma
Member

What impact do the file identifiers have? Who/what makes use of them?

@MartinThoma MartinThoma added the on-hold (PR requests that need clarification before they can be merged. A comment must give details) label Jul 29, 2023
@exiledkingcc
Contributor Author

The PDF standard says:

The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be
unique. For example, two implementations of the preceding algorithm might use different formats for the
current time, causing them to produce different file identifiers for the same file created at the same time, but the
uniqueness of the identifier is not affected.

The identifiers are also used for encryption.

@MartinThoma So I think it's OK to make it simple.

@MartinThoma
Member

Having a deterministic way to generate PDFs is valuable to several developers. Does the current deterministic identifier generation cause any issues?

@exiledkingcc
Contributor Author

First of all, it costs too much for big PDF files.
Second, for AES-encrypted PDFs, it is not deterministic.
When PdfWriter.encrypt is called, the identifiers are generated from the unencrypted PDF stream.
Then, when PdfWriter.write is called, the content of the PDF file is encrypted, so the hash changes.
For encrypted PDFs, the identifiers must be generated before writing to the stream, since the identifier is used to calculate the key,
so the identifiers cannot be the hash of the PDF stream content.
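
The circular dependency described above can be illustrated with a toy model. Everything here is an assumption made for illustration: `derive_key` and `toy_encrypt` are hypothetical helpers, the XOR step stands in for a real cipher, and none of this is the PDF standard's key derivation or pypdf's implementation.

```python
import hashlib


def derive_key(password: bytes, file_id: bytes) -> bytes:
    """Toy key derivation: the key depends on the file identifier."""
    return hashlib.md5(password + file_id).digest()


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR stand-in for a real cipher (applying it twice restores the data)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = b"%PDF-1.7 example content"

# Step 1 (encrypt call): the identifier is computed from the unencrypted bytes.
id_from_plain = hashlib.md5(plain).digest()
key = derive_key(b"secret", id_from_plain)

# Step 2 (write call): the content is encrypted with a key derived from that identifier.
encrypted = toy_encrypt(plain, key)

# Hashing what was actually written gives a different value than the identifier.
id_from_encrypted = hashlib.md5(encrypted).digest()
mismatch = id_from_plain != id_from_encrypted
```

In this toy model the identifier has to exist before the content can be encrypted, so it cannot also be a hash of the final written bytes.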

MartinThoma added a commit that referenced this pull request Dec 23, 2023
See #2003

Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
MartinThoma added a commit that referenced this pull request Dec 23, 2023
#2003

Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>
@@ -1246,7 +1244,7 @@ def generate_file_identifiers(self) -> None:
             id2 = self._compute_document_identifier()
         else:
             id1 = self._compute_document_identifier()
-            id2 = id1
+            id2 = ByteStringObject(id1.original_bytes)
Member


id1 is a ByteStringObject already. So .original_bytes just returns id1. Then wrapping it in ByteStringObject doesn't do anything, right?
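
Whether the wrapping does anything can be checked with a small stand-in. `FakeByteString` is a hypothetical class, not pypdf's `ByteStringObject`; it only assumes, as the comment above states, a `bytes` subclass whose `.original_bytes` returns the object itself.

```python
class FakeByteString(bytes):
    """Hypothetical stand-in for a bytes subclass with an original_bytes property."""

    @property
    def original_bytes(self) -> bytes:
        # Mirrors the behaviour described in the review comment: returns self.
        return self


id1 = FakeByteString(b"\x01\x02\x03")

# Old line in the diff: both names refer to the same object.
id2_alias = id1

# New line in the diff: constructing the subclass again builds a new object.
id2_copy = FakeByteString(id1.original_bytes)

is_same_object_alias = id2_alias is id1
is_same_object_copy = id2_copy is id1
is_equal_copy = id2_copy == id1
```

In this stand-in the wrapped value compares equal to `id1` but is a separate object, so the change would only matter if something relies on the two identifiers being distinct objects.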

        return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
        md5 = hashlib.md5()
        md5.update(str(time.time()).encode("utf-8"))
        md5.update(str(self.fileobj).encode("utf-8"))
Member


Is self.fileobj equivalent to self._write_pdf_structure(stream)?
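
One way to see the difference this question points at: `str()` of a stream object yields its repr, not its contents. The sketch below assumes `fileobj` is a stream such as `io.BytesIO` (if it were a path, `str()` would yield the path string instead); `hash_of_str` and `hash_of_content` are hypothetical helpers, not pypdf functions.

```python
import hashlib
import io


def hash_of_str(obj: object) -> str:
    """Hash the str() of an object, as the diff does with self.fileobj."""
    return hashlib.md5(str(obj).encode("utf-8")).hexdigest()


def hash_of_content(stream: io.BytesIO) -> str:
    """Hash the bytes actually held by the stream."""
    return hashlib.md5(stream.getvalue()).hexdigest()


# Two distinct streams holding identical bytes.
a = io.BytesIO(b"same bytes")
b = io.BytesIO(b"same bytes")

same_content = hash_of_content(a) == hash_of_content(b)
same_str = hash_of_str(a) == hash_of_str(b)
```

Here the content hashes match, while the `str()`-based hashes differ because each repr embeds the object's memory address. Under that assumption, hashing `str(self.fileobj)` is not a substitute for hashing the written PDF structure.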

Labels
on-hold: PR requests that need clarification before they can be merged. A comment must give details

2 participants