Fully embedded font is extracted only partially if it occupies more than one objects #2110

sedimentation-fault · 2022-12-08T19:23:45Z

Description

I have a PDF in which, according to my PDF reader, some fonts are fully embedded, so I decided to test my freshly installed PyMuPDF 1.21.0-rc2 on it. It reported that it "saved 16 fonts to" my working directory, but I found only 11 files there. The font list from pdffonts made it clear why: some of the "fully" embedded fonts use more than one object, each storing a piece of the font. A check with mutool confirmed that PyMuPDF kept only the last object in font-name.cff, probably because the output file (font-name.cff) is the same for all pieces/objects of the font named font-name.

How To Reproduce

You must have one of those PDFs that embed a font fully by storing pieces of it in multiple objects. Let's say x.pdf is one of them. pdffonts lists the fonts of x.pdf as follows:

pdffonts x.pdf

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Times-Roman                          Type 1C           MacRoman         yes no  no     567  0
CAJCNM+intiri                        Type 1C           Custom           yes yes yes    593  0
CAJCMM+intirr                        Type 1C           Custom           yes yes yes    595  0
CAJCON+intirsc                       Type 1C           Custom           yes yes yes    594  0
intirsc                              Type 1C           WinAnsi          yes no  no     596  0
DFMPPD+Springnew-Regular             Type 1C           Custom           yes yes yes    576  0
Times-Roman                          Type 1C           Custom           yes no  no     569  0
Times-Italic                         Type 1C           Custom           yes no  no     568  0
Times-Bold                           Type 1C           WinAnsi          yes no  no     321  0
Times-Roman                          Type 1C           Custom           yes no  no     306  0
Times-Italic                         Type 1C           Custom           yes no  no     315  0
Times-BoldItalic                     Type 1C           WinAnsi          yes no  no     350  0
MT2SYS                               Type 1C           Custom           yes no  yes    379  0
MT2MIT                               Type 1C           MacRoman         yes no  no     392  0
MT2MIS                               Type 1C           WinAnsi          yes no  no     332  0
Times-Bold                           Type 1C           WinAnsi          yes no  no     311  0

You can see that, for example, Times-Roman has 'no' in the "sub" (subsetted) column, meaning it is NOT subsetted and is therefore fully embedded. You can also see that it occupies the objects numbered 567, 569 and 306.

mutool shows practically the same with its 'info' command:

mutool info -F x.pdf

Fonts (16):
        2       (820 0 R):      Type1 'Times-Roman' MacRomanEncoding (567 0 R)
        3       (819 0 R):      Type1 'CAJCNM+intiri' (593 0 R)
        3       (819 0 R):      Type1 'CAJCMM+intirr' (595 0 R)
        3       (819 0 R):      Type1 'CAJCON+intirsc' (594 0 R)
        3       (819 0 R):      Type1 'intirsc' WinAnsiEncoding (596 0 R)
        4       (301 0 R):      Type1 'DFMPPD+Springnew-Regular' (576 0 R)
        5       (298 0 R):      Type1 'Times-Roman' WinAnsiEncoding (569 0 R)
        5       (298 0 R):      Type1 'Times-Italic' WinAnsiEncoding (568 0 R)
        6       (295 0 R):      Type1 'Times-Bold' WinAnsiEncoding (321 0 R)
        6       (295 0 R):      Type1 'Times-Roman' WinAnsiEncoding (306 0 R)
        6       (295 0 R):      Type1 'Times-Italic' WinAnsiEncoding (315 0 R)
        10      (292 0 R):      Type1 'Times-BoldItalic' WinAnsiEncoding (350 0 R)
        22      (47 0 R):       Type1 'MT2SYS' (379 0 R)
        30      (798 0 R):      Type1 'MT2MIT' MacRomanEncoding (392 0 R)
        207     (620 0 R):      Type1 'MT2MIS' WinAnsiEncoding (332 0 R)
        220     (134 0 R):      Type1 'Times-Bold' WinAnsiEncoding (311 0 R)

We could use mutool extract x.pdf, but that is not user-friendly: a) it extracts all fonts and all images together, and b) it writes fonts as font-XXXX.cff (or, possibly, font-XXXX.ttf), where XXXX has no relation to the font object number (contrary to what its documentation claims). In practice you cannot tell which file is which font unless you open each one, or at least read its metadata somehow.

Enter PyMuPDF, which promises to a) extract all fonts and b) give the extracted files sensible names. Alas, trying it on our x.pdf results in just 11 font files, contrary to the claimed 16:

python -m fitz extract -fonts x.pdf

saved 16 fonts to ...

ls -l | awk -e '{print $5,$9}'

24759 CAJCMM+intirr.cff
25860 CAJCNM+intiri.cff
23898 CAJCON+intirsc.cff
1295 DFMPPD+Springnew-Regular.cff
217 MT2MIS.cff
506 MT2MIT.cff
286 MT2SYS.cff
17076 Times-Bold.cff
18332 Times-BoldItalic.cff
18302 Times-Italic.cff
24847 Times-Roman.cff

(first column is file size)

What has happened? Comparing this to the output of mutool extract

mutool extract x.pdf
...
ls -l font-* | awk -e '{print $5, $9}'

217 font-0330.cff
506 font-0522.cff
286 font-0534.cff
18332 font-0549.cff
18302 font-0557.cff
17076 font-0558.cff
25078 font-0561.cff
26087 font-0564.cff
1295 font-0574.cff
23496 font-0579.cff
24759 font-0583.cff
23898 font-0587.cff
25860 font-0591.cff
24847 font-0599.cff

and looking carefully at the file sizes (first column), we see that Times-Roman.cff (the Times-Roman font as extracted by PyMuPDF) is exactly font-0599.cff, a font extracted by mutool. Its object number is NOT 599, since there is no font object with that number in x.pdf. But this is only one of the three pieces (objects) that store parts of Times-Roman!
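The size comparison above can be done mechanically. The following is a minimal sketch, assuming that equal file size means equal content (a heuristic only; comparing hashes of the files would be stronger). `match_by_size` is a hypothetical helper, and the two dicts simply restate the two listings from this post.

```python
from collections import defaultdict

def match_by_size(named, numbered):
    """Pair files from two listings by size; return (matches, unmatched)."""
    by_size = defaultdict(list)
    for fname, size in numbered.items():
        by_size[size].append(fname)
    matches = {fname: by_size.get(size, []) for fname, size in named.items()}
    matched = {f for files in matches.values() for f in files}
    unmatched = sorted(f for f in numbered if f not in matched)
    return matches, unmatched

# Sizes copied from the PyMuPDF listing above.
pymupdf_files = {
    "CAJCMM+intirr.cff": 24759, "CAJCNM+intiri.cff": 25860,
    "CAJCON+intirsc.cff": 23898, "DFMPPD+Springnew-Regular.cff": 1295,
    "MT2MIS.cff": 217, "MT2MIT.cff": 506, "MT2SYS.cff": 286,
    "Times-Bold.cff": 17076, "Times-BoldItalic.cff": 18332,
    "Times-Italic.cff": 18302, "Times-Roman.cff": 24847,
}

# Sizes copied from the mutool listing above.
mutool_files = {
    "font-0330.cff": 217, "font-0522.cff": 506, "font-0534.cff": 286,
    "font-0549.cff": 18332, "font-0557.cff": 18302, "font-0558.cff": 17076,
    "font-0561.cff": 25078, "font-0564.cff": 26087, "font-0574.cff": 1295,
    "font-0579.cff": 23496, "font-0583.cff": 24759, "font-0587.cff": 23898,
    "font-0591.cff": 25860, "font-0599.cff": 24847,
}

matches, unmatched = match_by_size(pymupdf_files, mutool_files)
print(matches["Times-Roman.cff"])  # mutool file(s) with the same size
print(unmatched)                   # mutool files with no same-sized PyMuPDF file
```

The `unmatched` list is the interesting part: it names the mutool output files that have no same-sized counterpart among the PyMuPDF files.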

Your configuration (mandatory)

Operating system: Gentoo
Python version: 3.10
PyMuPDF version: 1.21.0-rc2, installation method: generated from source, using installed mupdf 1.21.0.

More precisely:

python -c 'import sys; import fitz; print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)'

3.10.0 (default, Feb 11 2022, 00:50:04) [GCC 11.2.0] 
 linux 
 
PyMuPDF 1.21.0rc2: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-07 00:00:01.
Built for Python 3.10 on linux (64-bit).


JorjMcKie · 2022-12-08T19:41:02Z

This seems to be a duplicate of #2109 - however I am not really sure.
As I wrote there:
Please provide a reproducer file and a clear description of result vs. your expectation.

sedimentation-fault · 2022-12-08T20:06:14Z

Please delete 2109 - I tried to correct it while it was on its way to the server and didn't realize it had already been created. Sorry about that.

I had also written my expectation, but it got deleted somehow while I was writing.

My expectation is that PyMuPDF assembles the three pieces into one. If that's too difficult (or not possible due to limitations of mupdf/mutool), at the very minimum use suffixes to write the various pieces in their own files, e.g. Times-Roman-1.cff, Times-Roman-2.cff, Times-Roman-3.cff.

I cannot provide a clearer description of the result than the one I have already given.
To put it in other words: the Times-Roman font consists of three objects. PyMuPDF does this:

extract Times Roman object 1 to Times-Roman.cff
extract Times Roman object 2 to Times-Roman.cff - at this point Times-Roman.cff is overwritten by the contents of object 2
extract Times Roman object 3 to Times-Roman.cff - at this point Times-Roman.cff is overwritten by the contents of object 3

Then it tells the user that "all 16" fonts have been extracted, but the user sees only, say, 11 - because some .cff files were overwritten in the extraction process.
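The overwriting described above can be reproduced without any PDF. The following is a minimal sketch, not PyMuPDF's actual code: a dict stands in for the output directory, `plain_name` and `name_with_xref` are hypothetical naming helpers, and the (name, object number) pairs are the Times entries from the pdffonts listing, in listing order.

```python
# Hypothetical illustration: a dict plays the role of the output directory.
FONTS = [
    ("Times-Roman", 567), ("Times-Roman", 569), ("Times-Italic", 568),
    ("Times-Bold", 321), ("Times-Roman", 306), ("Times-Italic", 315),
    ("Times-BoldItalic", 350), ("Times-Bold", 311),
]

def plain_name(name, xref):
    """File name built from the font name only; repeated names collide."""
    return f"{name}.cff"

def name_with_xref(name, xref):
    """File name that also carries the object number; one file per object."""
    return f"{name}-{xref}.cff"

def simulate(fonts, make_name):
    """Write every font to the fake directory; later writes replace earlier ones."""
    directory = {}
    for name, xref in fonts:
        directory[make_name(name, xref)] = xref
    return directory

collided = simulate(FONTS, plain_name)
distinct = simulate(FONTS, name_with_xref)

print(len(FONTS), len(collided), len(distinct))
print(collided["Times-Roman.cff"])  # object number of the last write that won
```

With name-only file names, the number of files equals the number of distinct names, while the reported count is the number of objects processed. Adding the object number to the name keeps one file per object.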

I will try to send you the PDF to your outlook account, as I cannot put it publicly here.

sedimentation-fault · 2022-12-08T20:10:25Z

BTW, Times Roman object 1/2/3 are not fully embedded Times-Roman fonts. They are parts of the font that together form a fully embedded font. So overwriting Times-Roman.cff each time with the next object that happens to say "I keep data for 'Times-Roman' font" destroys parts of the font that we want to extract.

JorjMcKie · 2022-12-08T20:32:08Z

> My expectation is that PyMuPDF assembles the three pieces into one. If that's too difficult (or not possible due to limitations of mupdf/mutool), at the very minimum use suffixes to write the various pieces in their own files, e.g. Times-Roman-1.cff, Times-Roman-2.cff, Times-Roman-3.cff.

This is either not possible or clearly beyond the intended scope of PyMuPDF. Features like this one should be looked for in dedicated font packages like fontTools.
But in the general case, your expectation goes beyond what is possible - it is not a "restriction" of whatever tool.

What I suspect is really your problem instead: You extract font names without their subset identifier ABCDEF+. So your script treats different subsets of the same font as one and thus overwrites a previously extracted other subset font.
So why don't you do fitz.TOOLS.set_subset_fontnames(True) before the first text extraction - if this is what you were doing: unfortunately you didn't mention that, so I am forced to guess.
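To illustrate the subset-identifier point, here is a minimal sketch that does not call PyMuPDF at all. It assumes the usual PDF convention that a subset tag is six uppercase letters followed by "+"; `strip_subset_tag` is a hypothetical helper, and the names are taken from the pdffonts listing in the original post.

```python
import re

# Assumed convention: a subset tag is six uppercase letters followed by "+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def strip_subset_tag(basefont: str) -> str:
    """Return the font name without a leading ABCDEF+ subset tag."""
    return _SUBSET_TAG.sub("", basefont)

names = ["CAJCNM+intiri", "CAJCMM+intirr", "CAJCON+intirsc", "intirsc"]
stripped = [strip_subset_tag(n) for n in names]

for original, short in zip(names, stripped):
    print(f"{original:16} -> {short}")

# Names that are distinct with the tag can become identical without it.
print(len(set(names)), len(set(stripped)))
```

Once the tag is dropped, "CAJCON+intirsc" and "intirsc" share one name, so any file naming based on the stripped name would map both to the same file.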

For the time being, I will convert this post from an issue to a "Discussions" item.

Fix #2110 (Discussion item #2111): File `__main__.py` - also include the font's xref in the generated file name. Fix #2094: File `helper-device.i` - also ensure equality of x coordinates of relevant corners before assuming a rectangle. Fix #2087: File `fitz.i` - if JPX image format is already known, make sure to read the decoded image stream, instead of raw stream in the other cases.
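For readers who want the same effect in their own script, the following is a minimal sketch of extraction with the xref in the file name. It is not the code of the actual fix, whose exact file name format is not shown in this thread. `font_filename` and `extract_fonts` are hypothetical names; the sketch assumes the PyMuPDF methods `Document.get_page_fonts()` and `Document.extract_font()`.

```python
import os

def font_filename(basename: str, xref: int, ext: str) -> str:
    """Build a file name that stays unique per font object (hypothetical format)."""
    return f"{basename}-{xref}.{ext}"

def extract_fonts(pdf_path: str, out_dir: str = ".") -> int:
    """Write each embedded font object to its own file; return the files written."""
    import fitz  # PyMuPDF; imported here so font_filename works without it

    doc = fitz.open(pdf_path)
    seen = set()   # xrefs already handled (fonts are often shared by pages)
    written = 0
    for pno in range(doc.page_count):
        for item in doc.get_page_fonts(pno):
            xref = item[0]
            if xref in seen:
                continue
            seen.add(xref)
            basename, ext, _type, content = doc.extract_font(xref)
            if ext == "n/a" or not content:
                continue  # nothing extractable for this font
            path = os.path.join(out_dir, font_filename(basename, xref, ext))
            with open(path, "wb") as f:
                f.write(content)
            written += 1
    doc.close()
    return written

print(font_filename("Times-Roman", 567, "cff"))
print(font_filename("Times-Roman", 569, "cff"))
```

Returning the number of files actually written, rather than the number of objects visited, also keeps the reported count consistent with what ends up in the directory.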


JorjMcKie added the "not a bug" (not a bug / user error / unable to reproduce) label Dec 8, 2022

pymupdf locked and limited conversation to collaborators Dec 8, 2022

JorjMcKie converted this issue into discussion #2111 Dec 8, 2022

This issue was moved to a discussion.