MAINT: Simplify file identifiers generation #2003

exiledkingcc · 2023-07-22T16:16:42Z

No description provided.

codecov · 2023-07-22T16:30:50Z

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (ec85a27) 94.54% compared to head (40bb17f) 94.52%.
Report is 1 commit behind head on main.

Files	Patch %	Lines
pypdf/_writer.py	90.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2003      +/-   ##
==========================================
- Coverage   94.54%   94.52%   -0.02%     
==========================================
  Files          43       43              
  Lines        7549     7549              
  Branches     1490     1491       +1     
==========================================
- Hits         7137     7136       -1     
  Misses        253      253              
- Partials      159      160       +1


MartinThoma · 2023-07-29T09:24:31Z

pypdf/_writer.py

-        return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
+    def _compute_document_identifier(self) -> ByteStringObject:
+        md5 = hashlib.md5()
+        md5.update(str(time.time()).encode("utf-8"))


This makes document-generation non-deterministic, right?
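
To make the question concrete, here is a minimal sketch of the difference, not the PR's actual code. `time_based_identifier` mirrors the two `md5` lines in the diff above; returning the hex digest as bytes is an assumption, and `content_based_identifier` is a hypothetical deterministic alternative added only for comparison.

```python
import hashlib
import time


def time_based_identifier() -> bytes:
    """Hash the current time, as in the diff above (return format is assumed)."""
    md5 = hashlib.md5()
    md5.update(str(time.time()).encode("utf-8"))
    return md5.hexdigest().encode("utf-8")


def content_based_identifier(content: bytes) -> bytes:
    """Hypothetical deterministic alternative: hash only the document bytes."""
    return hashlib.md5(content).hexdigest().encode("utf-8")


first = time_based_identifier()
time.sleep(0.05)  # let the clock advance between the two calls
second = time_based_identifier()

same_content = b"%PDF-1.7 example bytes"
stable_a = content_based_identifier(same_content)
stable_b = content_based_identifier(same_content)
```

Under these assumptions, `first` and `second` differ between calls, while `stable_a` and `stable_b` are always equal for identical input.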

MartinThoma · 2023-07-29T09:25:22Z

What impact do the file identifiers have? Who/what makes use of them?

exiledkingcc · 2023-07-29T14:50:53Z

The PDF standard says:

The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be
unique. For example, two implementations of the preceding algorithm might use different formats for the
current time, causing them to produce different file identifiers for the same file created at the same time, but the
uniqueness of the identifier is not affected.

The identifiers are also used for encryption.

@MartinThoma So I think it's OK to keep it simple.

MartinThoma · 2023-08-13T07:21:25Z

Having a deterministic way to generate PDFs is valuable to several developers. Does the current deterministic identifier generation cause any issues?

exiledkingcc · 2023-08-14T05:00:46Z

First of all, it costs too much for big PDF files.
Second, for AES-encrypted PDFs, it's not deterministic anyway.
When PdfWriter.encrypt is called, the identifiers are generated from the unencrypted PDF stream.
Then, when PdfWriter.write is called, the content of the PDF file is encrypted, so the hash changes.
For encrypted PDFs, the identifiers must be generated before writing to the stream, since the identifier is used to calculate the key,
so the identifiers cannot be the hash of the PDF stream content.
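
The ordering problem described here can be sketched with toy stand-ins. Everything below is an assumption made for illustration: `derive_key` is not the PDF key-derivation algorithm and `toy_encrypt` is an XOR placeholder for the real cipher. The only point is that the key depends on the identifier, so the identifier cannot be a hash of the bytes that are finally written.

```python
import hashlib


def derive_key(password: bytes, file_id: bytes) -> bytes:
    """Toy key derivation (NOT the PDF algorithm): the key depends on the ID."""
    return hashlib.md5(password + file_id).digest()


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy XOR cipher standing in for the real encryption step."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


plain = b"unencrypted PDF body"

# The identifier has to exist before encryption, so it is taken from the plaintext.
file_id = hashlib.md5(plain).digest()
key = derive_key(b"secret", file_id)
written = toy_encrypt(plain, key)

# Hashing what actually lands in the file gives a different value,
# and a key derived from that value would no longer decrypt the file.
id_from_written = hashlib.md5(written).digest()
key_from_written = derive_key(b"secret", id_from_written)
```

In this sketch, `id_from_written` differs from `file_id`, which is the mismatch the comment above describes.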

MartinThoma · 2023-12-23T20:14:45Z

pypdf/_writer.py

@@ -1246,7 +1244,7 @@ def generate_file_identifiers(self) -> None:
            id2 = self._compute_document_identifier()
        else:
            id1 = self._compute_document_identifier()
-            id2 = id1
+            id2 = ByteStringObject(id1.original_bytes)


id1 is a ByteStringObject already. So .original_bytes just returns id1. Then wrapping it in ByteStringObject doesn't do anything, right?
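
A small sketch of what the two forms do at the Python level, assuming the premise of the comment above: that the class is a `bytes` subclass whose `original_bytes` returns the object itself. `ByteStringStandIn` is a hypothetical stand-in written for this example, not pypdf's class.

```python
class ByteStringStandIn(bytes):
    """Hypothetical stand-in: a bytes subclass whose original_bytes returns itself."""

    @property
    def original_bytes(self) -> bytes:
        return self


id1 = ByteStringStandIn(b"0123456789abcdef")

alias = id1                                   # old form: id2 = id1
copy = ByteStringStandIn(id1.original_bytes)  # new form from the diff

same_value = copy == id1      # both hold the same bytes
same_object = copy is id1     # but the new form is a separate instance
```

Under that assumption, the wrapped value compares equal to `id1` but is a distinct object, whereas `alias` is the very same object. Whether that distinction matters to the writer is not established in this thread.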

MartinThoma · 2023-12-23T20:17:45Z

pypdf/_writer.py

-        return ByteStringObject(_rolling_checksum(stream).encode("utf8"))
+        md5 = hashlib.md5()
+        md5.update(str(time.time()).encode("utf-8"))
+        md5.update(str(self.fileobj).encode("utf-8"))


Is self.fileobj equivalent to self._write_pdf_structure(stream)?
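
One way to look at this question with plain Python, independent of pypdf: `str()` on a stream describes the object rather than its contents. The thread does not say what `self.fileobj` holds, so treating it as an `io.BytesIO` here is an assumption.

```python
import hashlib
import io

content = b"%PDF-1.7 example bytes"
stream_a = io.BytesIO(content)
stream_b = io.BytesIO(content)

# str() on a stream yields a description of the object, not the bytes it holds.
desc_a = str(stream_a)
desc_b = str(stream_b)

hash_of_desc_a = hashlib.md5(desc_a.encode("utf-8")).hexdigest()
hash_of_desc_b = hashlib.md5(desc_b.encode("utf-8")).hexdigest()

# Hashing the contents, by contrast, depends only on the bytes.
hash_of_content_a = hashlib.md5(stream_a.getvalue()).hexdigest()
hash_of_content_b = hashlib.md5(stream_b.getvalue()).hexdigest()
```

In this sketch the two descriptions differ even though both streams hold identical bytes, so hashing `str(fileobj)` is not a substitute for hashing the written structure.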

MAINT: simplify file identifiers generation

5fd1e91

MartinThoma changed the title ~~MAINT: simplify file identifiers generation~~ MAINT: Simplify file identifiers generation Jul 23, 2023

MartinThoma reviewed Jul 29, 2023

MartinThoma added the on-hold label ("PR requests that need clarification before they can be merged. A comment must give details") Jul 29, 2023

exiledkingcc force-pushed the simplify branch 2 times, most recently from 9092a14 to 5fd1e91 Compare September 11, 2023 06:55

exiledkingcc and others added 3 commits September 11, 2023 15:09

Merge remote-tracking branch 'origin/main' into simplify

741185d

Merge branch 'main' into simplify

b1b5b61

Merge branch 'main' into simplify

ffd4407

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: Add PdfWriter._ID attribute

44574a2

See #2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

MartinThoma mentioned this pull request Dec 23, 2023

STY: Add PdfWriter._ID attribute #2361

Merged

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: Add PdfWriter._ID attribute (#2361)

beca111

See #2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

Merge branch 'main' into simplify

7a286b9

MartinThoma reviewed Dec 23, 2023

Update pypdf/_writer.py

f095cdc

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: File identifier generation restructuring

3d84ba8

#2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

MartinThoma mentioned this pull request Dec 23, 2023

STY: File identifier generation restructuring #2362

Merged

MartinThoma added a commit that referenced this pull request Dec 23, 2023

STY: File identifier generation restructuring (#2362)

ec85a27

#2003 Co-authored-by: exiledkingcc <exiledkingcc@gmail.com>

Merge branch 'main' into simplify

f8c7bf6

MartinThoma mentioned this pull request Dec 23, 2023

DOC: Quote specs in generate_file_identifiers #2363

Merged

Merge branch 'main' into simplify

40bb17f

MartinThoma reviewed Dec 23, 2023
