Page::insertFont() - possible to only insert font data for characters in use? #855

cuteufo · 2021-01-22T04:01:24Z

cuteufo
Jan 22, 2021

ATTENTION:

This thread is no longer relevant now that PyMuPDF supports font subsetting.

You are free to insert fonts in your PDF as before.
Other features and functions have been implemented since this post was created, some of which may embed fonts automatically, without you even noticing. This can happen when using Page.insert_htmlbox() or the Story class.

In all of these cases, just make sure to call doc.subset_fonts() immediately before you save your work.
You must also use garbage collection and compression when you save, e.g. via doc.ez_save(). Only then will the ghosts of the old fonts be removed.
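
A minimal sketch of the advice above. The helper name save_with_subset_fonts is hypothetical (it is not part of PyMuPDF); it only wraps the two calls named in this note, and it assumes that doc is a PyMuPDF Document.

```python
def save_with_subset_fonts(doc, path):
    """Subset embedded fonts, then save with garbage collection and compression.

    `doc` is expected to be a PyMuPDF Document. The helper name is
    hypothetical; it only wraps the two calls named in the note above.
    """
    doc.subset_fonts()  # replace embedded fonts by subsets where possible
    doc.ez_save(path)   # save with garbage collection and compression


# Typical use (assumes PyMuPDF is installed and importable as `fitz`):
#
#   import fitz
#   doc = fitz.open()
#   page = doc.new_page()
#   ... insert text with your own font ...
#   save_with_subset_fonts(doc, "output.pdf")
```

The order matters: subsetting first, then a save that cleans up, so that the original full-size font objects do not remain in the file.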

Is your feature request related to a problem? Please describe.
pymupdf version: 1.18.6
Python: 3.8.2

I'm trying to use a font file, say 'Dengxian-light.ttf', to insert a Unicode string containing both Chinese and English characters. My code follows:

fontfile= "dengxian-light.ttf"
page.insertFont(fontname="EXT_0", fontfile=fontfile)
text = "姓名 name"
page.insertText((20, 20), text, fontname="EXT_0")
doc.save(output_pdf_path)

The saved PDF file looks fine in a viewer, but the file size is hugely inflated, I guess because the entire font file has been embedded in the PDF instead of only the font data for the characters in use.

Describe the solution you'd like
Is it possible to embed just the font data for the characters in use? Or, if I missed something in the documentation, please point me to it. Thanks.


cuteufo · 2021-01-22T04:11:22Z

cuteufo
Jan 22, 2021
Author

Sorry, I left out a few words. I have also tried the following code:

page.insertText((20, 20), text, fontfile="dengxian-light.ttf")

This produces a PDF that shows the English characters correctly, while the Chinese characters are shown as dots.

If I change it to:

page.insertText((20, 20), text, fontfile="dengxian-light.ttf", fontname="EXT_1")

it behaves the same as in the original post: the file is hugely inflated.

Thanks.

0 replies

JorjMcKie · 2021-01-22T08:49:00Z

JorjMcKie
Jan 22, 2021
Maintainer

An interesting question!
Your observation is correct: when text is inserted, the complete font file is included in the PDF, which can be big. The reason is that PyMuPDF does not know which characters you intend to use.
There are ways to build font subsets, and there are also Python packages that let you do this.

If you use office software like LibreOffice or Word, it does font subsetting internally when you export a document to PDF. So the resulting file will be relatively small, and its size depends on the total set of characters you ever used in the document.
And here comes the difference to using PyMuPDF: it does not and cannot know this!

I have been experimenting:

Before actually inserting text via PyMuPDF, I analyzed that text and built the corresponding font subset (using the fontTools package), then inserted the text using that subset font, and eureka: the resulting file was small!
The problem is that this sequence of steps cannot be reversed: you cannot first insert text and then shrink the font (for technical reasons I don't want to explain right now).
There seems to be no way to put all this into some type of handy recipe.
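
For illustration only, the "analyze first, subset, then insert" experiment could look roughly like the sketch below. It is not the maintainer's actual code: the function names are hypothetical, and it assumes the third-party fontTools package with its subset.Subsetter interface and default options.

```python
def codepoints_of(text):
    """Return the sorted, de-duplicated Unicode code points used in `text`."""
    return sorted({ord(ch) for ch in text})


def build_subset_font(fontfile, text, outfile):
    """Write a copy of `fontfile` reduced to the glyphs needed for `text`.

    Requires the third-party fontTools package. It is imported lazily so
    that `codepoints_of` stays usable without it.
    """
    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont(fontfile)                            # load the complete font
    subsetter = subset.Subsetter(subset.Options())     # default subsetting options
    subsetter.populate(unicodes=codepoints_of(text))   # characters to keep
    subsetter.subset(font)                             # drop everything else
    font.save(outfile)
    return outfile
```

The file written to outfile would then be used as the fontfile when inserting the text, instead of the full font. This only works when every piece of text is known before it is inserted, which is exactly the limitation described above.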

But maybe you want to consider an alternative:
I have created scripts that support replacing fonts in existing PDFs here. That folder also contains a detailed HOWTO. The basic steps are:

create your PDF
analyze the used fonts with those scripts
modify the output JSON file such that each font is "replaced" with itself
execute the font replacement.

The result should be a much smaller PDF - which looks exactly like the original.

0 replies

cuteufo · 2021-01-24T07:16:38Z

cuteufo
Jan 24, 2021
Author

Thanks for your wonderful comments @JorjMcKie , I will definitely take a look into the font replacing scripts.

1 reply

JorjMcKie Jan 24, 2021
Maintainer

let me give you a start - see next post

JorjMcKie · 2021-01-24T09:40:46Z

JorjMcKie
Jan 24, 2021
Maintainer

Here is a Python script that produces a PDF with one page of text with a mix of Latin (German) and Chinese characters.
It uses the universal "Droid Sans Fallback Regular" font, which supports this - and many more languages.
Please install the pymupdf-fonts package for it via $ python -m pip install pymupdf-fonts.

This script peking.py produces peking.pdf which has a size of 1.7 MB (because of the large font).

Then execute $ python repl-fontnames.py peking.pdf. This will produce JSON file peking.pdf-fontnames.json which you must modify: replace string "keep" by "china-s" - the fontname used in script peking.py.

Then execute $ python repl-font.py peking.pdf. This will use the modified JSON and create a new pdf peking-new.pdf - which has a file size of only 22 KB, less than 2% of the original! This is the run log:

$ python repl-font.py peking.pdf

Processing PDF 'peking.pdf' with 1 page.

Phase 1: Create sets of used unicodes per new font.
End of phase 1, 0.03 seconds.


Font replacement overview:
 Droid Sans Fallback Regular replaced by: Droid Sans Fallback Regular.

Building font subsets:
Used 106 glyphs of font 'Droid Sans Fallback Regular'. 3462 KB saved.
Font subsets built, 0.5 seconds.

Phase 2: rebuild document.
End of phase 2, 0.01 seconds
Total duration 0.55 seconds

peking.zip

1 reply

cuteufo Jan 24, 2021
Author

Thank you @JorjMcKie . I walked through the steps, and all was good until the last step, which produced the following message:

$ python repl-font.py peking.pdf
Processing PDF 'peking.pdf' with 1 page.

Phase 1: Create sets of used unicodes per new font.
End of phase 1, 0.03 seconds.


Font replacement overview:
 Droid Sans Fallback Regular replaced by: Droid Sans Fallback Regular.

Building font subsets:
Used 106 glyphs of font 'Droid Sans Fallback Regular'. 3462 KB saved.
Font subsets built, 0.41 seconds.

Phase 2: rebuild document.
page 0 exception: Peking (bzw. nach Maßgabe der chinesischen Regierung amtlich Beijing, chinesisch 北京,
page 0 exception: Pinyin Běijīng, W.-G. Pei-ching – „Nördliche Hauptstadt“, ist die Hauptstadt der
page 0 exception: Volksrepublik China. Peking hat eine über dreitausendjährige Geschichte und ist heute
page 0 exception: eine regierungsunmittelbare Stadt, d. h. sie ist direkt der Zentralregierung unterstellt und
page 0 exception: damit Provinzen, autonomen Gebieten und Sonderverwaltungszonen gleichgestellt.
page 0 exception: Das gesamte 16.807 Quadratkilometer große Verwaltungsgebiet Pekings hat 21,5
page 0 exception: Millionen Einwohner (Stand: März 2016).[2] Es stellt kein zusammenhängendes Stadtgebiet
page 0 exception: dar, mit seiner dominierenden ländlichen Siedlungsstruktur ist es eher mit einer Provinz
page 0 exception: vergleichbar.[3] Von der Gesamtbevölkerung sind 11,8 Millionen registrierte Bewohner mit
page 0 exception: ständigem Wohnsitz (戶口 / 户口, hùkǒu) und 7,7 Millionen temporäre Einwohner
page 0 exception: (流動人口 / 流动人口, liúdòng rénkǒu) mit befristeter Aufenthaltsgenehmigung (暫住證 /
page 0 exception: 暂住证, zànzhùzhèng).[4] Wird die Kernstadt (hohe Bebauungsdichte und geschlossene
page 0 exception: Ortsform) als Grundlage genommen, leben in Peking 7,7 Millionen Menschen mit
page 0 exception: Hauptwohnsitz (Stand 2007).[5] Der Ballungsraum (einschließlich Vororte) hat 11,8
page 0 exception: Millionen Einwohner (Stand 2007).[6] Ab 2018 soll die Metropole Kern einer Megalopolis
page 0 exception: von 130 Millionen Einwohnern namens Jing-Jin-Ji werden.
page 0 exception: Peking ist als Hauptstadt das politische Zentrum Chinas. Aufgrund der langen Geschichte
page 0 exception: beherbergt Peking ein bedeutendes Kulturerbe. Dies umfasst die traditionellen
page 0 exception: Wohnviertel mit Hutongs, den Tian’anmen-Platz (天安門廣場 / 天安门广场 – „wörtl. Platz
page 0 exception: am Tor des Himmlischen Friedens“), die 1987 von der UNESCO zum Weltkulturerbe
page 0 exception: erklärte Verbotene Stadt, den neuen und alten Sommerpalast und verschiedene Tempel,
page 0 exception: wie z. B. 2012 den Himmelstempel, den Lamatempel und den Konfuziustempel. 
Traceback (most recent call last):
  File "repl-font.py", line 560, in <module>
    clean_fontnames(page)
  File "repl-font.py", line 347, in clean_fontnames
    page._setContents(xref)  # tell PDF: this is the only /Contents object
AttributeError: 'Page' object has no attribute '_setContents'

If I just change line 347 to:

    page.set_contents(xref)

it produces the following message:

$ python repl-font.py peking.pdf
Processing PDF 'peking.pdf' with 1 page.

Phase 1: Create sets of used unicodes per new font.
End of phase 1, 0.03 seconds.


Font replacement overview:
 Droid Sans Fallback Regular replaced by: Droid Sans Fallback Regular.

Building font subsets:
Used 106 glyphs of font 'Droid Sans Fallback Regular'. 3462 KB saved.
Font subsets built, 0.38 seconds.

Phase 2: rebuild document.
page 0 exception: Peking (bzw. nach Maßgabe der chinesischen Regierung amtlich Beijing, chinesisch 北京,
page 0 exception: Pinyin Běijīng, W.-G. Pei-ching – „Nördliche Hauptstadt“, ist die Hauptstadt der
page 0 exception: Volksrepublik China. Peking hat eine über dreitausendjährige Geschichte und ist heute
page 0 exception: eine regierungsunmittelbare Stadt, d. h. sie ist direkt der Zentralregierung unterstellt und
page 0 exception: damit Provinzen, autonomen Gebieten und Sonderverwaltungszonen gleichgestellt.
page 0 exception: Das gesamte 16.807 Quadratkilometer große Verwaltungsgebiet Pekings hat 21,5
page 0 exception: Millionen Einwohner (Stand: März 2016).[2] Es stellt kein zusammenhängendes Stadtgebiet
page 0 exception: dar, mit seiner dominierenden ländlichen Siedlungsstruktur ist es eher mit einer Provinz
page 0 exception: vergleichbar.[3] Von der Gesamtbevölkerung sind 11,8 Millionen registrierte Bewohner mit
page 0 exception: ständigem Wohnsitz (戶口 / 户口, hùkǒu) und 7,7 Millionen temporäre Einwohner
page 0 exception: (流動人口 / 流动人口, liúdòng rénkǒu) mit befristeter Aufenthaltsgenehmigung (暫住證 /
page 0 exception: 暂住证, zànzhùzhèng).[4] Wird die Kernstadt (hohe Bebauungsdichte und geschlossene
page 0 exception: Ortsform) als Grundlage genommen, leben in Peking 7,7 Millionen Menschen mit
page 0 exception: Hauptwohnsitz (Stand 2007).[5] Der Ballungsraum (einschließlich Vororte) hat 11,8
page 0 exception: Millionen Einwohner (Stand 2007).[6] Ab 2018 soll die Metropole Kern einer Megalopolis
page 0 exception: von 130 Millionen Einwohnern namens Jing-Jin-Ji werden.
page 0 exception: Peking ist als Hauptstadt das politische Zentrum Chinas. Aufgrund der langen Geschichte
page 0 exception: beherbergt Peking ein bedeutendes Kulturerbe. Dies umfasst die traditionellen
page 0 exception: Wohnviertel mit Hutongs, den Tian’anmen-Platz (天安門廣場 / 天安门广场 – „wörtl. Platz
page 0 exception: am Tor des Himmlischen Friedens“), die 1987 von der UNESCO zum Weltkulturerbe
page 0 exception: erklärte Verbotene Stadt, den neuen und alten Sommerpalast und verschiedene Tempel,
page 0 exception: wie z. B. 2012 den Himmelstempel, den Lamatempel und den Konfuziustempel. 
End of phase 2, 0.01 seconds
Total duration 0.43 seconds

It did generate peking-new.pdf, but it is empty, with no content at all.

I am using PyMuPDF version 1.18.6 on macOS 11.1 with Python 3.8.2.

JorjMcKie · 2021-01-24T18:24:25Z

JorjMcKie
Jan 24, 2021
Maintainer

Weird. Please insert a re-raise of the exception (a bare raise) after the statement print("page %i exception:" % page.number, text) (around line 560), so we know what is happening there.

0 replies

JorjMcKie · 2021-01-24T18:29:24Z

JorjMcKie
Jan 24, 2021
Maintainer

You have installed pymupdf-fonts, haven't you?

1 reply

cuteufo Jan 25, 2021
Author

Yes, I installed pymupdf-fonts.

I checked and made some changes to the code:

# repl-font.py, lines 541-553
try:
    tw.append(
        span["origin"],
        text,
        font=font,
        fontsize=resize(span, font),  # use adjusted fontsize
        wmode=wmode,
        markup_dir=markup_dir,
        bidi_level=bidi_level,
    )
except Exception as err:  # <--- was 'except'
    print("page %i exception:" % page.number, text)
    print(f"{err}")  # <--- print the err message

and the error message says: append() got an unexpected keyword argument 'wmode'

I recall that I also made a change at line 347:

page.set_contents(xref)    # <-- was: page._setContents(xref), but Python says Page has no _setContents

This script was downloaded from here. Am I having the same version of script as yours?

Thanks again. @JorjMcKie

JorjMcKie · 2021-01-25T02:36:41Z

JorjMcKie
Jan 25, 2021
Maintainer

Am I having the same version of script as yours?

No, take mine please. I need to update the other one.

5 replies

cuteufo Jan 26, 2021
Author

Thanks. That works. I studied the script. It basically:

pick out the code points used for each font in the doc
use fontTools to subset each font down to these code points
rewrite the doc content, so that the font subset is embedded, instead of the entire font buffer
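
The first of these steps can be sketched in plain Python. This is not taken from repl-font.py; the function name is hypothetical, and the input is assumed to be shaped like the dictionaries returned by PyMuPDF's page.get_text("dict") (blocks, then lines, then spans, each span with a "font" and a "text" entry).

```python
from collections import defaultdict


def used_codepoints_per_font(page_dicts):
    """Map each font name to the set of code points written with it.

    `page_dicts` is an iterable of dictionaries shaped like the output of
    PyMuPDF's page.get_text("dict"): blocks -> lines -> spans, where every
    span carries a "font" name and its "text".
    """
    used = defaultdict(set)
    for page_dict in page_dicts:
        for block in page_dict.get("blocks", []):
            # image blocks have no "lines" entry and contribute nothing
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    used[span["font"]].update(ord(ch) for ch in span["text"])
    return dict(used)
```

The resulting sets are what a subsetting tool needs as input: one set of code points per font, collected over all pages of the document.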

This is really smart. I will check how to get it to work together with the code in the OP.

JorjMcKie Jan 26, 2021
Maintainer

Great you like it.
I guess you can now better see what I meant in my lengthy first reply:
if all the text pieces inserted by peking.py were known before they get inserted, then font subsets could be built right away, and peking.py could directly use those instead of the original fonts.

cuteufo Jan 27, 2021
Author

Thanks @JorjMcKie . I tried to make a script to accomplish this based upon your code, but the generated PDF is not showing the text. However, if I check the content of the PDF, I can see the text has been written in the PDF.

I have attached the script, the run log, and the generated PDF in a zip file. Could you advise whether I have missed something in the code? Thanks a lot!
wp2.zip

JorjMcKie Jan 27, 2021
Maintainer

Oops, sorry, I responded to you with something that was meant for someone else.

Your script is OK; it works for the "Droid Sans Fallback" font. So the problem is the font you chose.
I saw a lot of error messages from fontTools when building the subset of the other font you used.

cuteufo Jan 27, 2021
Author

Thanks @JorjMcKie . I have also figured out that the problem is the font file: I treated an OpenType PostScript file as a TrueType file. When I use a real TTF file, it works perfectly: the text is shown and the PDF file is very small. You have inspired me in resolving this issue.
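
A file extension can be misleading here, so a quick look at the first four bytes of the font file helps to tell the two kinds apart. The sketch below is not part of PyMuPDF or fontTools; the function names are hypothetical, and the check only inspects the signature at the start of the file.

```python
def sfnt_flavor(data):
    """Classify font data by its first four bytes (the sfnt version tag)."""
    tag = bytes(data[:4])
    if tag in (b"\x00\x01\x00\x00", b"true"):
        return "truetype"      # TrueType outlines
    if tag == b"OTTO":
        return "opentype-cff"  # OpenType with PostScript (CFF) outlines
    if tag == b"ttcf":
        return "collection"    # font collection (.ttc) holding several fonts
    return "unknown"


def flavor_of_file(path):
    """Read only the signature bytes of the font file at `path`."""
    with open(path, "rb") as fh:
        return sfnt_flavor(fh.read(4))
```

A file named *.ttf that reports "opentype-cff" is the situation described above: PostScript outlines behind a TrueType-looking name.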

JorjMcKie · 2021-01-27T14:15:16Z

JorjMcKie
Jan 27, 2021
Maintainer

So the most practical thing to do is to create your PDF as you did before.
Then run the font replacement scripts.
A possibly more elegant way could be to generate that intermediate JSON file directly from your PDF creator script - you actually have all the information for this.
This would save you a manual intervention ...

1 reply

JorjMcKie Jan 27, 2021
Maintainer

So create the JSON file at end of your script and execute repl-font.py passing that dynamically created JSON and the PDF to shrink ...

JorjMcKie · 2021-01-27T14:29:38Z

JorjMcKie
Jan 27, 2021
Maintainer

Contemplating a bit more about this idea:

We could make a version of repl-font.py, which can be imported. So the whole logic would be like this snippet:

import fitz
import font_replace
doc = fitz.open(...)  # new or existing PDF
# create your text pages, ...
# make changes to existing text pages,
# etc.
# when everything is done:
font_replace.replace(doc,  # the document
    font_list,  # a list of all fonts used to write text
    )
doc.save(...)

1 reply

cuteufo Jan 27, 2021
Author

can't agree more :)

cuteufo · 2021-01-28T08:01:29Z

cuteufo
Jan 28, 2021
Author

I couldn't wait for the importable version, so I made one, named subsetfonts.py, by combining your two scripts, repl-font.py and repl-fontnames.py, with a few minor modifications. It is attached, along with a main script that calls it. My modifications are all marked with cuteufo in comments. Would you mind taking a look and letting me know if you have any questions? Many thanks.
subsetfont.zip

0 replies

JorjMcKie · 2021-01-28T10:04:31Z

JorjMcKie
Jan 28, 2021
Maintainer

Not at my computer right now ... so have to postpone my feedback. But I love your initiative!!! A big thank you in advance!

0 replies

JorjMcKie · 2021-01-28T12:33:51Z

JorjMcKie
Jan 28, 2021
Maintainer

@cuteufo - excellent start!
I reviewed it and have the following comments:

1. This does not replace fonts, but tries to build a subset for every font used by some page. This is fine and the intention. But:
2. We should exclude fonts that already are subsets. They are recognizable by a "+" at position 6 of the font name in doc.getPageFontList(), like "ABCDEF+...". Ignore those fonts.
3. We should also ignore any font that fontTools has problems with. This is recognizable by None being returned instead of a bytes object (fontbuffer).
4. Non-embedded fonts (extension = "n/a") deserve another special treatment: in cases like "Helvetica" (Base-14 fonts), there exist embeddable versions with the same name. Other cases have no such easy way of back-referencing to an embedded version. Because in any case we would increase the PDF size, let's decide to also exclude them right away, like the fonts with a "+".
5. All fonts that are out of the game in one way or another should be treated as not present: text using them should not be rewritten, for example.
6. And then of course we do not need all that timing information.
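
The exclusion rules for already-subsetted and non-embedded fonts could be expressed as a small filter like the one below. This is a sketch under assumptions, not the code that went into PyMuPDF: the function name is hypothetical, and it takes the base font name and the extension as plain strings.

```python
def is_subset_candidate(basefont, ext):
    """Decide whether a font should be subsetted, following the comments above.

    basefont: font name as listed for the page, e.g. "ABCDEF+DroidSans"
    ext:      font file extension as reported, "n/a" if the font is not embedded
    """
    if len(basefont) > 6 and basefont[6] == "+":
        return False  # already a subset font ("ABCDEF+..." prefix)
    if ext == "n/a":
        return False  # not embedded, e.g. Base-14 fonts like "Helvetica"
    return True
```

Assuming each entry of the page font list carries the extension at index 1 and the base font name at index 3, the filter would be applied as is_subset_candidate(item[3], item[1]); fonts that fail it are simply left untouched.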

I am looking forward to testing your next version. Once we are done to our mutual satisfaction, 😉, we may want to include it in the official PyMuPDF package as an optional Document method, i.e. usable as doc.subset_fonts().

Optional means, we would check whether fontTools exists, when importing fitz - in __init__.py. Whether installed or not would be handled like this:

try:
    import fontTools
    fitz.Document.subset_fonts = fitz.utils.subset_fonts  # the function will reside in utils.py
    del fontTools
except ImportError:
    fitz.Document.subset_fonts = lambda x: print("fontTools not installed")

0 replies

cuteufo · 2021-01-28T16:11:41Z

cuteufo
Jan 28, 2021
Author

Thanks for your great comments. Look forward to the new feature in your official package.

I updated the code and, to make it easier to review, uploaded both the old and the updated code to GitHub.

In the updated version, I have tried to fix the problems in your comments 2, 3, and 6. For comments 4 and 5, honestly, I didn't understand them because of my limited knowledge of the PDF specification. Would you please check the code again?

I am doing this because my project at work requires writing text in particular fonts into an existing PDF. The code is working for the project, and I will have to move on to other tasks. But I will try my best to make time for future updates of subset_fonts.

5 replies

cuteufo Jan 29, 2021
Author

Just to let you know: I have made some updates to the code on GitHub. If you have time, you can check it and let me know if there are any problems. Thanks.

JorjMcKie Jan 30, 2021
Maintainer

Thanks - I already had this implemented in my copy of your version. I even extended that logic to exclude all fonts that do not have one of the extensions "otf" or "ttf", because fontTools is only able to subset these types.

cuteufo Feb 1, 2021
Author

This is wonderful. In which version will we be able to use it?

JorjMcKie Feb 1, 2021
Maintainer

I hope I can make it for the next version, 1.18.7.
I still need to optimize the code a little: there are unnecessary pieces, given that font subsetting was developed as a special case of font replacement ...

JorjMcKie Feb 1, 2021
Maintainer

The new version 1.18.7 should be ready in the course of this week.
