BUG: NameObject doesn't handle all values correctly #2601

Rak424 · 2024-04-16T20:01:40Z

Name object encodes all characters from 1 to 255, so it's not correct to use utf-8.

7.3.5 Name objects
Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0).

pubpub-zz · 2024-04-16T21:16:37Z

@Rak424
on page 18 of PDF32000-1:2008 you will see

The 8bits only encoding will be done with Latin1

pubpub-zz · 2024-04-17T19:31:52Z

@Rak424
please provide a pdf with the issue you are trying to fix. without I can not consider your PR as acceptable

art049 · 2024-04-17T19:56:50Z

Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance. I think it's worth considering in merging this or not.

To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.

codecov · 2024-04-17T20:56:47Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.92%. Comparing base (6d9a7ec) to head (508c82c).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2601      +/-   ##
==========================================
- Coverage   94.92%   94.92%   -0.01%     
==========================================
  Files          50       50              
  Lines        8316     8309       -7     
  Branches     1667     1665       -2     
==========================================
- Hits         7894     7887       -7     
  Misses        261      261              
  Partials      161      161

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tests/test_generic.py

stefan6419846 · 2024-04-18T07:01:51Z

We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance.

If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?

pubpub-zz · 2024-04-18T11:44:58Z

Hey @Rak424 and @pubpub-zz, We've been running the benches on this branch, and it seems these changes are significantly improving the PDFWriter performance. I think it's worth considering in merging this or not.

To obtain those, we installed CodSpeed on a fork synced with this repo. You can have a look at the full report here.

The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions.
Are you dealing with decoding? Encoding?

art049 · 2024-04-18T16:44:36Z

If I read the detailed results correctly, this is a 4% difference on 1s. Can we really consider this a "significant improvement"?

It may seem small, but with over 4.5M downloads per month, I think the cumulated impact of such a change is significant.

The title identifies that this PR is dealing with an invalid handling. I'm concerned about getting the good result and preventing regressions.
Are you dealing with decoding? Encoding?

Yes maybe this speedup means there is a regression somewhere. Overall the bench includes encoding afaik:

Rak424 · 2024-04-18T18:27:48Z

Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:

bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'

The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.

A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:

NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')

utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.

I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.

pubpub-zz · 2024-04-18T19:33:29Z

Sorry to not explain directly what I've done, but I've seen different problems to solve in that part of code and some cases where clearly badly handled, here is a simple example, but not the only one:
bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/#FF"), None).write_to_stream(bio)
print(bio.getvalue)
> b'/#C3#BF'

b'/\xc3\xbf'.decode("utf-8") returns properly ÿ which is equivalent to b'/\xff'.decode("latin-1")

for me there is no error.

The way it was done was not optimal, so I've decided to try to refactor it and I'm not surprised about perf improvement.

A correct implementation should be able to do something like this, since from specs encoding is not defined on Name object itself, but I know that's not easy to make that change:
NameObject(b"hello")
NameObject(b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c (%)')
utf-8 is only for the specific case of creating a name from a text, but doesn't allow to handle all possible values, so I've added a fallback with latin1 who handle all values on a byte.

I often need some days/week and multiple tries to reach an optimal solution, tell me if you see some problems or changes to do.

remember thatthe current code provides with NameObject.CHARSETS a way to add / remove charsets from the possible solutions and the default, based on other PRs uses the utf-8, gbk and then latin-1. all these cases come from actual PDFs. have you experienced any actual issues with some PDF?

Rak424 · 2024-04-19T22:59:25Z

b"/#FF" corresponds to value b'\xff' and b'/#C3#BF' to value b'\xc3\xbf', that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value, you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255, and current implementation of name object only allows str and not bytes in constructor, so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless. I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf. Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.

pubpub-zz · 2024-04-20T07:44:40Z

b"/#FF" corresponds to value b'\xff' and b'/#C3#BF' to value b'\xc3\xbf', that s not the same value and there is no encoding notion here, specs only says that in the specific case of when you try to use string value,

you have to encode it with utf8, but utf8 doesn't allow to use all chars from 1 to 255and current implementation of name object only allows str and not bytes in constructor.

NameObjects are actually string like : the proof is about using the encoding with the '#' character. The byte encoding is just a matter of file encoding.

so my workaround is to fallback to an encoding who allows all values in one byte, latin1 seems a good choice for me. I don t know why gbk is present here but it seems useless.

This fix was introduced years ago. I' trying some archeology to find it

I don't have an example to show but the current implementation doesn't respect specifications, I'm not talking about handling wrong pdf.

Don't hesitate to tell me if I've misunderstood something, pdf format is very very very bad.

I agree that the current implementation does not match the specification:

The historical approach in pypdf is more permissive and as we are rewriting all the names with PdfWriter, there is no issue with readers.
Moving to a full compliant solution may be introduce lots of regression that will be very tricky to detect (NameObjects are a sort of lazy linking)
As long as this PR does not fix an existing issue I'm not fond of it.

@MartinThoma / @MasterOdin / @stefan6419846
can you give your opinions?

stefan6419846 · 2024-04-20T08:12:16Z

I have to admit that it is hard to follow this for me. If I understand it correctly, there are some different interpretations of the PDF specs regarding name objects.

In my understanding, names are just an arbitrary list of bytes. An application has to ensure a consistent handling of them for matching purposes, but as they are mostly irrelevant to the user and primarily about the internal structure, there are no further rules except for predefined sequences, for example as part of dictionary keys.

The newly introduced test sequences from ISO 32000-2:2020, Table 4 pass with the old implementation as well, thus I do not see any real need for changing anything about this unless there is a real example where the current implementation fails and the new one does not.

tests/test_generic.py

Rak424 · 2024-04-29T19:19:27Z

to be more clear, here is two examples that can lead to problems hard to detect:

name value is modified after reading/writing operations:

bio = BytesIO()
NameObject.read_from_stream(BytesIO(b"/DocuSign#AE"), None).write_to_stream(bio)
print(bio.getvalue())  # > b'/DocuSign#C2#AE' != b"/DocuSign#AE"

all delimiter characters should be escaped, but it's not the case:

bio = BytesIO()
NameObject("/#()<>[]{}/%").write_to_stream(bio)
print(bio.getvalue())  # > b'/#23#28#29<>[]{}#2F#25'

pubpub-zz · 2024-04-30T20:55:22Z

@Rak424 can you provide actual PDF and code that shows some issues and not unitary.
About the two examples you are providing,

even if the binary encoding are different, all names will be encoded in the same way and should match
the PDF specification states:

"Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation."
the current implementation is valid

Rak424 · 2024-05-05T07:44:21Z

hi

about example 2, from spec:

c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.

All characters except the white-space characters and delimiters are referred to as regular characters.

It's not a huge problem since it's easy to any parser to handle it, but could be a problem with an extremely strict parser.

The main problem is on example 1:
The original value is b'DocuSign\xae' who has no UTF8 representation and corresponds to 'DocuSign®' in latin1
The final value is b'DocuSign\xc2x\ae' who corresponds to 'DocuSign®' in utf8 and 'DocuSignÂ®' in latin1

So, it means that if you append a pdf, it will creates a modification on these kinds of values. On a signed pdf where you add revocation infos, it will create a modification even if you don't modify the document.

pubpub-zz · 2024-05-05T17:39:34Z

@Rak424
I see your example. However both b'DocuSign\xc2\xae' and b'DocuSign\xae' are translated into 'DocuSign®' NameObject.
Again I agree that the 2 forms are binary different so could be considered as different whereas pypdf will consider them as identical, but I do not see actual use cases where both versions will be used in the same PDF.
If a document uses version 2, all instances will be translated for version 1 and the document will be properly read.

If we do a change to keep the bytes internally, b'DocuSign\xae' will be stored, but without changing the arguments, NameObject('DocuSign®') will get internally b'DocuSign\xc2\xae' and matching will not work anymore.
This will require lots of change and will break the API and induce lots of side effects in all programs. This is not acceptable

up to now you have not reported high level PDF that are failing. without I would consider your report as a acceptable deviation of pypdf to the PDF specification

Rak424 · 2024-05-05T18:58:33Z

do you understand that b'DocuSign\xc2\xae' is not b'DocuSignx\ae' ?

pubpub-zz · 2024-05-06T20:19:52Z

I do if you consider a full compliance to the spec. however as said I do not consider that both can be met at the same time. Also we need a way to build a NameObject from an str that can match one or the other, or both.

Please provide a PDF file showing some issue

francois and others added 4 commits April 16, 2024 20:46

NameObject encoding on one byte only

4d2465a

null in NameObject

f0fb1d6

Merge branch 'py-pdf:main' into name

af50623

NameObject encoding

ba80426

handling utf8 and latin1

3169286

Rak424 changed the title ~~NameObject wrong encoding~~ BUG: NameObject doesn't handle all values correctly Apr 17, 2024

strict name object

cba799b

unused include

9bfecb9

stefan6419846 reviewed Apr 18, 2024

View reviewed changes

tests/test_generic.py Outdated Show resolved Hide resolved

francois added 2 commits April 18, 2024 19:55

simplify null char handling in name object

0e36818

simplify null char handling in name object

48ad44f

francois and others added 4 commits April 18, 2024 20:36

cleaning include

71b885b

Merge branch 'main' into name

2c8a2ae

handling default encoding change on failure

f07a783

ruff

abe5d0c

francois added 3 commits April 21, 2024 19:34

name tests

d489fc4

name tests

ec6d4d1

test deprecated name

1f3853f

reduce change

393dc6e

stefan6419846 reviewed Apr 22, 2024

View reviewed changes

tests/test_generic.py Outdated Show resolved Hide resolved

stefan6419846 reviewed Apr 22, 2024

View reviewed changes

tests/test_generic.py Outdated Show resolved Hide resolved

francois and others added 2 commits April 25, 2024 19:34

cleaning test

de6c202

Merge branch 'main' into name

508c82c

pubpub-zz added the needs-discussion The PR/issue needs more discussion before we can continue label Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: NameObject doesn't handle all values correctly #2601

BUG: NameObject doesn't handle all values correctly #2601

Rak424 commented Apr 16, 2024

pubpub-zz commented Apr 16, 2024

pubpub-zz commented Apr 17, 2024

art049 commented Apr 17, 2024 •

edited

codecov bot commented Apr 17, 2024 •

edited

stefan6419846 commented Apr 18, 2024

pubpub-zz commented Apr 18, 2024 •

edited

art049 commented Apr 18, 2024

Rak424 commented Apr 18, 2024 •

edited

pubpub-zz commented Apr 18, 2024

Rak424 commented Apr 19, 2024

pubpub-zz commented Apr 20, 2024

stefan6419846 commented Apr 20, 2024

Rak424 commented Apr 29, 2024

pubpub-zz commented Apr 30, 2024

Rak424 commented May 5, 2024

pubpub-zz commented May 5, 2024

Rak424 commented May 5, 2024

pubpub-zz commented May 6, 2024

BUG: NameObject doesn't handle all values correctly #2601

Are you sure you want to change the base?

BUG: NameObject doesn't handle all values correctly #2601

Conversation

Rak424 commented Apr 16, 2024

pubpub-zz commented Apr 16, 2024

pubpub-zz commented Apr 17, 2024

art049 commented Apr 17, 2024 • edited

codecov bot commented Apr 17, 2024 • edited

Codecov Report

stefan6419846 commented Apr 18, 2024

pubpub-zz commented Apr 18, 2024 • edited

art049 commented Apr 18, 2024

Rak424 commented Apr 18, 2024 • edited

pubpub-zz commented Apr 18, 2024

Rak424 commented Apr 19, 2024

pubpub-zz commented Apr 20, 2024

stefan6419846 commented Apr 20, 2024

Rak424 commented Apr 29, 2024

pubpub-zz commented Apr 30, 2024

Rak424 commented May 5, 2024

pubpub-zz commented May 5, 2024

Rak424 commented May 5, 2024

pubpub-zz commented May 6, 2024

art049 commented Apr 17, 2024 •

edited

codecov bot commented Apr 17, 2024 •

edited

pubpub-zz commented Apr 18, 2024 •

edited

Rak424 commented Apr 18, 2024 •

edited