Converting JPEG2000 from bytestream to numpy array not functioning as expected #3152
Comments
Further experimentation indicates that I can convert the bytes stream to an image when the pdf file the image is extracted from is closed. Does anyone have any thoughts as to why this might be?
When you say that ‘Further experimentation indicates that I can convert the bytes stream to an image when the pdf file the image is extracted from is closed’ - so if you save the image from pdfminer.six to a file, and run Pillow and numpy over it in a separate script, there is no problem? Could you provide a self-contained script that demonstrates the error?
Sorry for the delay in my reply, I've been quite busy. In order to ensure reproducibility I've turned it into a docker image. The Python code to get the error is
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 6 12:30:15 2018
@author: stuart
"""
#demonstrating the jpeg2K error
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO
import magic
import numpy as np
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
filename = './with_jpeg_2000s.pdf'
image_folder = './'
fp = open(filename, 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
file_typer = magic.Magic()
pages = []
for pagei, page in enumerate(PDFPage.create_pages(document)):
    pages.append(page)
for pagei, page in enumerate(pages):
    page_jpeg2000s = {}
    resources = page.resources
    file_typer = magic.Magic()
    try:
        image_resources = resources["XObject"]
    except:
        image_resources = {}
    #if its not a dict we have another layer of wrapping to go through
    if not isinstance(image_resources, dict):
        image_resources = image_resources.resolve()
    #lt.log_debug(logger, image_resources)
    save_images = []
    for image_i, image_res in enumerate(image_resources.keys()):
        im_filename = image_folder + "/image-%d.jpeg"%image_i
        im_ref = image_resources[image_res]
        bts = im_ref.resolve().rawdata
        #raw = bts.rawdata
        print(pagei)
        print(image_i)
        Im = Image.open(BytesIO(bts))
        #load the file into memory
        Im.load()
        ar = np.array(Im)
        im2 = Image.fromarray(ar, "CMYK").convert("RGB")
        print(file_typer.from_buffer(bts))
        plt.figure()
        plt.imshow(im2)
        plt.show()
The requirements file is
and the Dockerfile is
The file I'm testing on is here: https://hartley-botanic.co.uk/wp-content/uploads/2017/07/Hartley-guide-greenhouse-gardening.pdf
Thanks. So yes, as with #1510, the error you are receiving is because the images are truncated. Adding ImageFile.LOAD_TRUNCATED_IMAGES = True gets past that error. Here is a variation of your script. Running this over your attached PDF, 102 images are now processed before it hits a different error:
from PIL import Image, ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
from io import BytesIO
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
i = 0
with open('with_jpeg_2000s.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    for page in PDFPage.create_pages(document):
        image_resources = page.resources.get("XObject", {})
        for im_ref in image_resources.values():
            i += 1
            bts = im_ref.resolve().rawdata
            try:
                im = Image.open(BytesIO(bts))
                im.load()
            except IOError:
                print(str(i)+" images processed")
                raise
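For anyone who wants to see the effect of the flag in isolation, here is a minimal sketch that does not need the PDF at all. It is an illustration only: it uses an ordinary JPEG built in memory rather than a JPEG 2000 (so it does not depend on OpenJPEG support), and try_load is a hypothetical helper name, not part of Pillow.

```python
from io import BytesIO

from PIL import Image, ImageFile

# Build a synthetic grayscale image and encode it as JPEG in memory.
pixels = bytes((x * y) % 256 for y in range(256) for x in range(256))
buf = BytesIO()
Image.frombytes("L", (256, 256), pixels).save(buf, format="JPEG")
full = buf.getvalue()
truncated = full[: len(full) // 2]  # drop the second half of the file


def try_load(data, allow_truncated):
    """Hypothetical helper: True if Pillow fully loads `data`, False on OSError."""
    previous = ImageFile.LOAD_TRUNCATED_IMAGES
    ImageFile.LOAD_TRUNCATED_IMAGES = allow_truncated
    try:
        im = Image.open(BytesIO(data))
        im.load()  # decoding happens here, not in Image.open
        return True
    except OSError:
        return False
    finally:
        # Restore the module-level flag so other code is unaffected.
        ImageFile.LOAD_TRUNCATED_IMAGES = previous


print(try_load(truncated, allow_truncated=False))
print(try_load(truncated, allow_truncated=True))
```

The first call is expected to fail and the second to succeed. Note that the flag does not recover the missing data; it only tells Pillow to keep whatever was decoded before the data ran out.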
Any progress with this?
I would also be interested to know the status of this, or if anyone has an idea of the cause or if there is a workaround. Thank you.
The largest failing image from the PDF is 829 bytes - I'm not convinced that valid image data is being passed to Pillow? I think the original problem with numpy will be helped by #5379
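One quick way to sanity-check what is being handed to Pillow is to look at the leading bytes before calling Image.open. The sketch below is an assumption about how one might do that, not something taken from this thread: sniff_jpeg2000 is a hypothetical helper, and the two signatures are the JP2 container signature box and the start of a raw JPEG 2000 codestream.

```python
# Leading bytes of the two common JPEG 2000 forms.
JP2_SIGNATURE_BOX = b"\x00\x00\x00\x0cjP  \r\n\x87\n"  # .jp2 container
J2K_CODESTREAM = b"\xff\x4f\xff\x51"                   # raw codestream (SOC + SIZ)


def sniff_jpeg2000(data: bytes) -> str:
    """Hypothetical helper: return 'jp2', 'j2k' or 'unknown' from the leading bytes."""
    if data.startswith(JP2_SIGNATURE_BOX):
        return "jp2"
    if data.startswith(J2K_CODESTREAM):
        return "j2k"
    return "unknown"


# A JPEG (not JPEG 2000) header is reported as unknown.
print(sniff_jpeg2000(b"\xff\xd8\xff\xe0"))
```

A matching signature only shows that the data starts like a JPEG 2000; it says nothing about whether the rest of the stream is complete, so an 829-byte stream could pass this check and still be truncated.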
Closing, unless someone can demonstrate there is a valid image that Pillow is failing to read.
What did you do?
I extracted a JPEG 2000 from a pdf as bytes. I then loaded the result into Pillow using
Next I attempted to convert to a numpy array in order to manipulate the data. Using
This resulted in the array
When I attempted to force numpy to convert this array to a sequence of numbers the result was the error:-
This thread had a similar issue with loading jpeg2000 files but I don't understand their resolution #1510
What did you expect to happen?
I expected the image to be converted into a numpy array.
What actually happened?
The system would either error or create an array containing the image object
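For reference, here is a rough sketch of what a successful conversion looks like and how the "array containing the image object" case can be detected. It uses a small CMYK image created in memory as a stand-in for the extracted one, and image_to_array is a hypothetical helper name, not a Pillow or numpy function.

```python
import numpy as np
from PIL import Image


def image_to_array(im):
    """Hypothetical helper: decode `im` fully and return its pixels as a numeric array."""
    im.load()  # force decoding now, so decode errors surface here
    arr = np.asarray(im)
    if arr.dtype == object:
        # numpy wrapped the Image object itself instead of reading its pixels.
        raise TypeError("image was not converted to pixel data")
    return arr


# Stand-in for an extracted image: a small CMYK image built in memory.
cmyk = Image.new("CMYK", (4, 3))
rgb = image_to_array(cmyk.convert("RGB"))
print(rgb.shape, rgb.dtype)
```

Calling load() explicitly before the conversion means a decoding problem is raised as an exception at that point rather than showing up later as an object array.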
What versions of Pillow and Python are you using?
python 3.5/3.6 (3.6 when running inside a docker container)
Pillow==5.1.0
numpy==1.14.1
I am using pdfminer.six to extract the image on the first page of this document as a test:-
https://hartley-botanic.co.uk/wp-content/uploads/2017/07/Hartley-guide-greenhouse-gardening.pdf
The really odd issue is that when I try to convert in the console using the exact same command,
I get a numpy array as expected
Can anyone think of a reason for this discrepancy that would allow me to perform the conversion while running my code?
Thanks