pdfrw vs pypdf page extraction & merge #7

Open · abubelinha opened this issue Jul 1, 2023 · 7 comments
Labels: enhancement (New feature or request)

Comments

@abubelinha commented Jul 1, 2023

Test run on Python 3.8, Windows 7:

  • I took 4 arbitrary page numbers (pages 4, 6, 8, 9).
  • For each of the PDF files listed in the benchmark, I extracted those pages from it (when available).
  • Then I created a new PDF from the extracted pages, repeating them between 1 and 5 times (to check how well pdfrw / pypdf optimize the size of created PDFs containing repetitive information). So output PDFs will have up to 4x5 = 20 pages.
  • I measured the time taken and the output sizes.

I recall my initial code also deleted the original bookmarks/annotations from the PDFs, but I removed that part for simplicity and left a comment pointing to where I had read about that.
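For reference, a minimal sketch of that removed step (based on the pdfrw issue linked in the code comments below); strip_annotations is a hypothetical helper name, and this assumes pdfrw pages behave as the PdfDict dict subclass they are:

from pdfrw import PdfName

def strip_annotations(page):
	# A pdfrw page is a PdfDict (a dict subclass); its annotations,
	# when present, live under the /Annots key.
	page.pop(PdfName.Annots, None)  # no-op if the page has no annotations
	return page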

Code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

def fsize(filepath):
	""" Return [size in bytes, size in KB as a "%.2f" string] for filepath. """
	import os
	nbytes = os.stat(filepath).st_size
	KB = "%.2f" % (nbytes / 1024)
	return [nbytes, KB]

#@profile
def createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=None, pageslist=None, destpdf=None, debug=False):
	""" <https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py>
		"""
	from pdfrw import PdfWriter,PdfReader
	#import pdfrw_bookmarks # code from https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546
	pages = PdfReader(sourcepdf).pages
	totalpages = len(pages)
	outdata = PdfWriter(destpdf)
	for p in pageslist:
		if p <= totalpages:  # pageslist holds 1-based page numbers
			if debug: print("pdfrw ",p)
			#pdfrw_pageannots(pages[p-1])
			outdata.addpage(pages[p-1])
	outdata.write()

#@profile
def createpdf_from_sourcepdf_pages_pypdf(sourcepdf=None, pageslist=None, destpdf=None, debug=False, compress=False):
	""" Generate destpdf with list of certain pages taken from sourcepdf. 
		- <https://pypdf2.readthedocs.io/en/stable/user/merging-pdfs.html>
		- SO [Extract specific pages of PDF and save it with Python](https://stackoverflow.com/a/51885963/710788)
		"""
	from pypdf import PdfWriter,PdfReader
	fsource = open(sourcepdf, "rb")
	merger = PdfWriter()
	totalpages = len(PdfReader(fsource).pages)
	for p in pageslist:
		if p <= totalpages:  # pageslist holds 1-based page numbers
			if debug: print("pypdf ",p)
			# append 1-based page p as the 0-based slice (p-1, p):
			merger.append(fileobj=fsource, pages=(p-1,p))
	if compress: # Compress the data
		for page in merger.pages:
			page.compress_content_streams()  # This is CPU intensive!
	# Write to an output PDF document
	output = open(destpdf, "wb")
	merger.write(output)
	# Close file descriptors
	merger.close()
	output.close()
	fsource.close()

#from memory_profiler import profile
#@profile
def pypdf_vs_pdfrw():
	""" [performance comparative](https://github.com/pmaupin/pdfrw/issues/232#issuecomment-1436153435) between two packages:
		- pdfrw
		- pypdf
		"""
	print(datetime.now() - startTime, " before comparing")
	pdfurls = [
		"https://arxiv.org/pdf/2201.00151.pdf",
		"https://arxiv.org/pdf/1707.09725.pdf",
		"https://arxiv.org/pdf/2201.00021.pdf",
		"https://arxiv.org/pdf/2201.00037.pdf",
		"https://arxiv.org/pdf/2201.00069.pdf",
		"https://arxiv.org/pdf/2201.00178.pdf",
		"https://arxiv.org/pdf/2201.00201.pdf",
		"https://arxiv.org/pdf/1602.06541.pdf",
		"https://arxiv.org/pdf/2201.00200.pdf",
		"https://arxiv.org/pdf/2201.00022.pdf",
		"https://arxiv.org/pdf/2201.00029.pdf",
		"https://arxiv.org/pdf/1601.03642.pdf",
		]
	import requests,os
	pdfrw_Tsize = 0
	pdfrw_Ttime = 0
	pypdf_Tsize = 0
	pypdf_Ttime = 0
	for pdfurl in pdfurls:
		sourcepdf = pdfurl.split("/")[-1]
		if not os.path.exists(sourcepdf):
			response = requests.get(pdfurl, headers=None, params=None)
			if response.status_code == 200:
				with open(sourcepdf, 'wb') as f:
					f.write(response.content)
			else:
				print(response.status_code)
				print("COULDN'T DOWNLOAD '{}'\n".format(pdfurl))
		if not os.path.exists(sourcepdf):
			print("\n","-_"*40,"\n\nSKIPPING '{}' FILE:\n".format(sourcepdf))
		else:
			print("\n","-_"*40,"\n\nTESTING WITH '{}' FILE:\n".format(sourcepdf))
			for i in range(1,6):
				pageslist=[4,6,8,9]*i #*5 eats all my memory when using pypdf with large pdf files
				print("-"*50,"\npageslist:",pageslist)
				start=datetime.now()
				destpdf=sourcepdf+"_pdfrw-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
				createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
				pdfrw_t = round((datetime.now() - start).total_seconds(),3)
				pdfrw_s = fsize(destpdf)
				pdfrw_Ttime += pdfrw_t
				pdfrw_Tsize += pdfrw_s[0]
				print("pdfrw: {} KB output size, took {} seconds".format(pdfrw_s[1],pdfrw_t))
				start=datetime.now()
				destpdf=sourcepdf+"_pypdf-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
				createpdf_from_sourcepdf_pages_pypdf(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
				pypdf_t = round((datetime.now() - start).total_seconds(),3)
				pypdf_s = fsize(destpdf)
				pypdf_Ttime += pypdf_t
				pypdf_Tsize += pypdf_s[0]
				print("pypdf: {} KB output size, took {} seconds".format(pypdf_s[1],pypdf_t))
				print("pypdf_time / pdfrw_time = {} ratio".format(round(pypdf_t/pdfrw_t, 2)))
				print("pypdf_size / pdfrw_size = {} ratio".format(round(pypdf_s[0]/pdfrw_s[0], 2)))
			
	import pdfrw,pypdf
	print("-_"*40)
	print("\n pdfrw.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
		pdfrw.__version__, pdfrw_Tsize/1024/1024, pdfrw_Ttime))
	print("\n pypdf.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
		pypdf.__version__, pypdf_Tsize/1024/1024, pypdf_Ttime))

if __name__ == "__main__":
	import sys
	from datetime import datetime
	startTime = datetime.now()
	print("START: ",startTime)
	pypdf_vs_pdfrw()
	endTime = datetime.now()
	print("\nEND: ",endTime)
	print("\nTOTAL TIME: ",endTime-startTime)

OUTPUT:

START:  2023-07-01 22:06:17.718288
0:00:00  before comparing

 -_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

TESTING WITH '2201.00151.pdf' FILE:

--------------------------------------------------
pageslist: [4, 6, 8, 9]
pdfrw: 591.29 KB output size, took 0.109 seconds
pypdf: 660.78 KB output size, took 0.499 seconds
pypdf_time / pdfrw_time = 4.58 ratio
pypdf_size / pdfrw_size = 1.12 ratio
--------------------------------------------------
(... LINES DELETED TO AVOID TOO LONG OUTPUT ...)
--------------------------------------------------
pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 130.20 KB output size, took 0.047 seconds
pypdf: 836.60 KB output size, took 1.031 seconds
pypdf_time / pdfrw_time = 21.94 ratio
pypdf_size / pdfrw_size = 6.43 ratio
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

 pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.47 seconds

 pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 108.14 seconds

END:  2023-07-01 22:08:11.767827

TOTAL TIME:  0:01:54.049539
@MartinThoma (Member) commented

Thank you for sharing!

I just updated my benchmark.py code + made a run with the latest libraries ... and I realized that it's super hard to understand 🙈 I need to simplify it to make it easier to extend.

@MartinThoma (Member) commented

I've just noticed that you should definitely apply compression when merging stuff with pypdf. Look at the diff of the README: a78f609
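As I read pypdf's documented API, the suggested pattern is roughly the following sketch ("example.pdf" / "out.pdf" are placeholder names):

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()
for page in reader.pages:
	writer.add_page(page)
for page in writer.pages:
	page.compress_content_streams()  # This is CPU intensive!
with open("out.pdf", "wb") as f:
	writer.write(f)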

@MartinThoma (Member) commented

@abubelinha (Author) commented Jul 2, 2023

EDIT: sorry, I misunderstood (thought you were asking about pypdf)

I asked this question and then used this installation:
pip install pdfrw2

So my current output says it's sarnold's pdfrw, version 0.5.0.
But my original test was done on February 20th, 2023, so I guess that used pmaupin's pdfrw, probably version 0.4.0 (that script didn't output versions; I incorporated that code yesterday).

Anyway, the differences vs pypdf were pretty much the same.

@MartinThoma (Member) commented

Thank you 🙏

I just added pdfrw to my comparison of watermarking speed / file size: commit / rendered output

It's crazy. I expected it to be faster, but not by that much. And I didn't expect a difference in file size.

pdfrw did an awesome job there. Thank you for pointing that out!

@abubelinha (Author) commented Jul 2, 2023

What about my pypdf implementation? Do you see anything wrong?

I've just noticed that you should definitely apply compression when merging stuff with pypdf. Look at the diff of the README: a78f609

I edited my above code:

  1. I now avoid counting pages multiple times (I moved "len(PdfReader(fsource).pages)" out of the for loop).
  2. I added a boolean compress parameter to my pypdf-based function and copied your compression loop like this:
	if compress: # Compress the data
		for page in merger.pages:
			page.compress_content_streams()  # This is CPU intensive!

Do you see anything else to change in my pypdf usage?

Change 1 reduced the total pypdf time by a few seconds (still not using compression: compress=False):

 pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.59 seconds

 pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 93.47 seconds
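(A further variant I haven't benchmarked: parse the source once with a single PdfReader and use pypdf's add_page per page, instead of calling append once per requested page. A minimal sketch, with extract_pages as a hypothetical name:)

from pypdf import PdfReader, PdfWriter

def extract_pages(sourcepdf, pageslist, destpdf):
	# Parse the source once, then copy each requested (1-based) page.
	reader = PdfReader(sourcepdf)
	writer = PdfWriter()
	for p in pageslist:
		if p <= len(reader.pages):
			writer.add_page(reader.pages[p - 1])
	with open(destpdf, "wb") as f:
		writer.write(f)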

Change 2 didn't work as expected.
Oddly, with compress=True my results were even worse: not only was it much slower (which I expected, since it now runs the compression loop), it also produced a bigger output size:

 pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.52 seconds

 pypdf.__version__ 3.2.0
Accumulated output file size: 203.83 MB
Total time: 237.30 seconds

So I'd say there must be something wrong either in my pypdf code or in the pypdf compression algorithm.
Perhaps compression is only useful for some kinds of PDF files (i.e. those including uncompressed images or something similar)?
But I used the same PDF files proposed for your benchmark tests, so I am surprised that the "compressed" outputs are bigger than the "uncompressed" ones.
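(One way to check whether there is anything left to compress, assuming the pages use the standard /Contents layout; an untested sketch that prints each page's content-stream filters:)

from pypdf import PdfReader
from pypdf.generic import ArrayObject

reader = PdfReader("2201.00151.pdf")
for i, page in enumerate(reader.pages, start=1):
	contents = page["/Contents"].get_object()
	# /Contents is either one stream or an array of streams
	streams = contents if isinstance(contents, ArrayObject) else [contents]
	for s in streams:
		# arXiv PDFs are typically already /FlateDecode-compressed,
		# so recompressing them cannot shrink the output much.
		print(i, s.get_object().get("/Filter"))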

Indeed, total times vary a little between identical script runs
(I guess that depends on other non-Python processes running on my computer).
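(A standard way to damp that noise with the stdlib timeit module; best_of is a hypothetical helper that keeps the fastest of several runs, since the minimum is least affected by background processes:)

import timeit

def best_of(fn, repeats=5):
	# One call per measurement; return the fastest wall-clock time in seconds.
	return min(timeit.repeat(fn, number=1, repeat=repeats))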

@MartinThoma (Member) commented

I've just noticed that you used the same files as I did in my benchmark. Nice!
