pdf.getDocumentInfo().title sometimes None #511

clach04 · 2019-08-09T03:35:55Z

Test case is in #511 (comment). The original test case is no longer available; details were in https://github.com/mstamy2/PyPDF3/issues/13.

pdf = PdfFileReader(f)
info = pdf.getDocumentInfo()
info.title  # reports None
info['/Title']  # works

clach04 · 2022-04-08T23:02:18Z

@MartinThoma is this now resolved? Or can't reproduce as the other project appears to have been removed? I might have the original bug script around if that would be helpful.

MartinThoma · 2022-04-09T12:19:01Z

#13 is the issue you've linked to. We moved PyPDF2 from mstamy2 -> me (MartinThoma) -> py-pdf (a Github Organization) this week.

As PyPDF2 has been inactive for a long time, I need to clean up a lot of things (PRs, issues, code, documentation, ...).

I'm sorry that I didn't add more details. Could you please check if the issue still persists? If so, please create an MCVE (including a PDF): https://stackoverflow.com/help/minimal-reproducible-example . Then I'll re-open :-)

clach04 · 2022-04-09T18:09:53Z

That's great news @MartinThoma! Thanks for the quick status update. How about any future issues get closed with a link to #657 as an explanation?

Back to this specific issue: it has nothing to do with #13; it's specifically about the linked issue in a repo that no longer exists. I have the test code in claird#59, but the test PDF is only in the deleted repo (it was about 5 MB). @mstamy2 is the original owner; I noticed he responded to #657, so I'm pinging him in case the repo is simply private rather than deleted, as that's where I put all my notes. @mstamy2 do you still have access to https://github.com/mstamy2/PyPDF3/issues/13?

I can try and check an old hard drive to see if I still have it but I won't have access to it for a while :-(

clach04 · 2022-04-09T18:31:11Z

New test case created based on original :-)

Overview

I've seen a number of PDF files where the title attribute/property is reported as None, but accessing /Title directly returns content. I've no idea if this is a problem with the PDF(s) or with PyPDF. There is a workaround (which may point to a potential change in PyPDF, but I'm unclear what the correct thing to do here is).
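The workaround can be wrapped in a small helper. This is a minimal sketch, not PyPDF code: `resolve_title`, `FakeIndirect` and `FakeInfo` are hypothetical names, and the sketch assumes only that the info object is dict-like with an optional `title` attribute and that indirect references expose `getObject()`.

```python
def resolve_title(info):
    """Return the document title, or None if there is no /Title entry.

    `info` is assumed to be dict-like with an optional `.title` attribute.
    """
    title = getattr(info, "title", None)
    if title:
        return title
    raw = info.get("/Title")  # may be a string, an indirect reference, or None
    if raw is None:
        return None
    # Indirect references must be resolved to reach the actual string.
    if hasattr(raw, "getObject"):
        raw = raw.getObject()
    return raw


# Stand-ins so the sketch runs without a PDF library (hypothetical classes).
class FakeIndirect:
    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


class FakeInfo(dict):
    title = None  # mimics the behaviour reported in this issue
```

With a real reader, the call would presumably look like `resolve_title(pdf.getDocumentInfo())`. Unlike `info.title or info['/Title']`, this version does not raise a KeyError when the PDF has no /Title entry at all.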

The attached PDF is about 5 MB and is a sample of a document that exhibits this behavior. I did not create it (nor do I know how it was created), so the only information we have is the metadata inside.

Test case, along with workaround below:

Test PDF file

title_bug.pdf

Test case

EDIT inline version dumps PyPDF version (attached version does not).

inline and attached (rename to .py)
pdf_title_bug.txt

#!/usr/bin/env python
# -*- coding: windows-1252 -*-
# vim:ts=4:sw=4:softtabstop=4:smarttab:expandtab
#

import os
import sys

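# Select which library to test: 2 = PyPDF2, 3 = PyPDF3, 4 = PyPDF4
ver_to_test = 2

# Import the reader class and the module itself (as pypdf_lib) in every branch,
# so the version print below works whichever library is selected.
if ver_to_test == 4:
    from pypdf import PdfFileReader  # https://github.com/claird/PyPDF4
    import pypdf as pypdf_lib
elif ver_to_test == 3:
    from PyPDF3 import PdfFileReader  # https://github.com/mstamy2/PyPDF3
    import PyPDF3 as pypdf_lib
else:
    from PyPDF2 import PdfFileReader  # https://github.com/py-pdf/PyPDF2 - nee https://github.com/mstamy2/PyPDF2 / https://pythonhosted.org/PyPDF2/
    import PyPDF2 as pypdf_lib


print('Python %s on %s' % (sys.version, sys.platform))
# Not every fork is guaranteed to expose __version__, so fall back gracefully.
print(getattr(pypdf_lib, '__version__', 'unknown'))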

filename = 'title_bug.pdf'
f = open(filename, 'rb')
pdf = PdfFileReader(f)
info = pdf.documentInfo
#print(info)
print('title attribute %r' % info.title)  # reports None
print('title getText() %r' % info.getText("/Title"))  # this is what .title property calls
print('title get() %r' % info.get("/Title"))  # this is part of what dict[] does
print('title get().getObject() %r' % info.get("/Title").getObject())  # this is what dict[] does
print('/Title dict entry %r' % info['/Title'])  # with test pdf works
print('title attribute %r' % info.title)  # Sanity check it is still None
print('title Workaround %r' % (info.title or info['/Title']))  # Workaround


f.close()

output

Python 2

Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)] on win32
1.27.2
title attribute None
title getText() None
title get() IndirectObject(305, 0)
title get().getObject() u'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
/Title dict entry u'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
title attribute None
title Workaround u'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'

Python 3

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32
1.27.2
title attribute None
title getText() None
title get() IndirectObject(305, 0)
title get().getObject() 'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
/Title dict entry 'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
title attribute None
title Workaround 'DIY   ENTERTAINMENT Retro Gaming on Raspberry Pi- Understanding ROMs, RetroPie, Recalbox, and More'
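
One reading of the output above: `get("/Title")` hands back an unresolved IndirectObject, so a lookup that only accepts a direct text string sees the wrong type and returns None, while `dict[]` access resolves the reference first. The toy model below illustrates that reading. It is not PyPDF2's code: every class here is a simplified stand-in, and the type check inside `getText` is an assumption.

```python
class TextString(str):
    """Stand-in for a PDF text string object."""

    def getObject(self):
        return self


class IndirectRef:
    """Stand-in for an indirect reference such as IndirectObject(305, 0)."""

    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


class Info(dict):
    """Toy document-info dictionary (assumed behaviour, not the real class)."""

    def __getitem__(self, key):
        # Item access resolves indirect references.
        return dict.__getitem__(self, key).getObject()

    def getText(self, key):
        # Assumed: only a direct text string is accepted; nothing is resolved.
        value = self.get(key)
        return value if isinstance(value, TextString) else None

    @property
    def title(self):
        return self.getText("/Title")


info = Info({"/Title": IndirectRef(TextString("Retro Gaming on Raspberry Pi"))})
print(repr(info.title))  # None
print(info["/Title"])    # Retro Gaming on Raspberry Pi
```

If this reading is right, resolving the reference before the type check would make `title` and `['/Title']` agree.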

clach04 · 2022-04-09T18:39:20Z

@MartinThoma hopefully this helps. I really know nothing about PDF internals which is why I've not attempted a fix :-( I have a workaround that seems to be effective but not sure if it is reasonable.

If you need anything else from me on this please ping me (I recall I had other PDFs with similar behaviors, this was one of the smaller ones).

Thanks for picking up the torch on this and trying to organize collaboration

MartinThoma · 2022-04-09T18:41:45Z

this issue has nothing to do with #13, it's specifically about the linked issue to a repo that no longer exists

The repo was moved; issue 13 from the linked repo is #13 here.

MartinThoma · 2022-04-09T18:43:20Z

Thank you very much! I now hope that somebody will pick it up and dig into it :-)

clach04 · 2022-04-09T18:44:33Z

updated test case with PyPDF2 version

clach04 · 2022-04-14T03:40:59Z

Located another PDF, appears to be created with the same PDF generator.

Adobe Reader reports: PDF Version 1.3 (Acrobat 4.x). InfoKey: Producer, InfoValue: macOS Version 10.10 Quartz PDFContext

https://github.com/ValveSoftware/steamlink-sdk/blob/87666bbd3c512fe0aef19cd024fd0f0bf0765fb4/external/qt-everywhere-opensource-src-5.9.1/qtwebengine/src/3rdparty/chromium/third_party/grpc/examples/objective-c/route_guide/Misc/Images.xcassets/second.imageset/second.pdf

This file is much smaller than the attached test PDF.

I also found some Microsoft Word-generated PDFs, but when I attempted to export/create PDFs from a recent version of Word, the title worked fine.

clach04 · 2022-04-14T03:43:42Z

@MartinThoma

this issue has nothing to do with #13, it's specifically about the linked issue to a repo that no longer exists

The repo was moved; issue 13 from the linked repo is #13 here.

This is not correct. I've copy/pasted the one email I have where someone posted to the issue I created (note: an issue, not a PR, in a completely different repo):

From: johns1c notifications@github.com
Sent: Wednesday, May 6, 2020 5:45 PM
To: mstamy2/PyPDF3 PyPDF3@noreply.github.com
Cc: Chris Clark ; Author author@noreply.github.com
Subject: Re: [mstamy2/PyPDF3] pdf.getDocumentInfo().title sometimes None (#13)

Are you running Python 3 by any chance

MartinThoma · 2022-04-15T11:08:09Z

There is something really weird about that PDF:

When you upload it to http://pdf-analyser.edpsciences.org/result/1e54b64d it also gives no title.
mutool clean -d title_bug.pdf title_bug.txt (from mupdf-tools) seems to get caught in an infinite loop.

MartinThoma closed this as completed Apr 8, 2022

MartinThoma reopened this Apr 9, 2022

MartinThoma added the is-bug and Has MCVE labels Apr 9, 2022

clach04 added a commit to clach04/PyPDF2 that referenced this issue Apr 14, 2022

Fix issues py-pdf#511 - title sometimes none

6399908

MartinThoma linked a pull request Apr 14, 2022 that will close this issue

Fix title sometimes None #744

Merged

MartinThoma mentioned this issue Apr 14, 2022

Fix title sometimes None #744

Merged

clach04 added a commit to clach04/PyPDF2 that referenced this issue Apr 15, 2022

Fix py-pdf#511 title sometimes None

77e83a2

Handle case when title really is None

MartinThoma added the is-robustness-issue label and removed the is-bug label Apr 15, 2022

MartinThoma closed this as completed in #744 Apr 15, 2022

MartinThoma pushed a commit that referenced this issue Apr 15, 2022

ROBUST: title sometimes None (#744)

29194cd

Closes #511
