Converting JPEG2000 from bytestream to numpy array not functioning as expected #3152

Closed
stuartspotlight opened this issue Jun 1, 2018 · 8 comments


stuartspotlight commented Jun 1, 2018

What did you do?

I extracted a JPEG 2000 image from a PDF as bytes, then loaded the result into Pillow using:

im = Image.open(BytesIO(raw))

Next, I attempted to convert it to a numpy array in order to manipulate the data, using:

A = np.array(im)

This resulted in the array

array(<PIL.Jpeg2KImagePlugin.Jpeg2KImageFile image mode=RGBA size=1598x1598 at 0x7FDEAF719A20>,
      dtype=object)
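
An object-dtype array like the one above generally means numpy could not read pixel data from the image and wrapped the Python object instead. A minimal sketch of a conversion that surfaces the underlying problem earlier, assuming Pillow and numpy are installed: `image_bytes_to_array` is a hypothetical helper name, and the demo uses a small in-memory PNG rather than a JPEG 2000 so that it does not depend on OpenJPEG support.

```python
from io import BytesIO

import numpy as np
from PIL import Image


def image_bytes_to_array(raw):
    """Decode image bytes and return a numeric numpy array (hypothetical helper)."""
    im = Image.open(BytesIO(raw))
    # Force the decode now, so a broken or truncated stream raises here
    # rather than during the numpy conversion.
    im.load()
    arr = np.asarray(im)
    # Guard against numpy silently wrapping the image object itself.
    if arr.dtype == object:
        raise ValueError("numpy wrapped the image object instead of its pixels")
    return arr


# Demo: build a tiny RGBA image in memory and round-trip it through PNG.
buf = BytesIO()
Image.new("RGBA", (4, 3), (140, 118, 219, 82)).save(buf, format="PNG")
arr = image_bytes_to_array(buf.getvalue())
print(arr.shape, arr.dtype)  # (3, 4, 4) uint8
```

Calling `load()` explicitly is a design choice here: it separates "the stream cannot be decoded" from "the conversion to numpy went wrong", which are easy to confuse when only the final array is inspected.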

When I attempted to force numpy to convert the image to a sequence of numbers (using np.array(im.getdata()), as the traceback shows), the result was this error:

Traceback (most recent call last):

  File "<ipython-input-39-042ef64a6b36>", line 1, in <module>
    runfile('/home/stuart/Documents/Python_programs/embedded_image_processing/classify_embeded_images/test_embeded_image_extractor_locally.py', wdir='/home/stuart/Documents/Python_programs/embedded_image_processing/classify_embeded_images')

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "/home/stuart/Documents/Python_programs/embedded_image_processing/classify_embeded_images/test_embeded_image_extractor_locally.py", line 402, in <module>
    A = np.array(im.getdata())

  File "/usr/local/lib/python3.5/dist-packages/PIL/Image.py", line 1220, in getdata
    self.load()

  File "/usr/local/lib/python3.5/dist-packages/PIL/Jpeg2KImagePlugin.py", line 210, in load
    return ImageFile.ImageFile.load(self)

  File "/usr/local/lib/python3.5/dist-packages/PIL/ImageFile.py", line 250, in load
    raise_ioerror(err_code)

  File "/usr/local/lib/python3.5/dist-packages/PIL/ImageFile.py", line 59, in raise_ioerror
    raise IOError(message + " when reading image file")

OSError: broken data stream when reading image file

A similar issue with loading JPEG 2000 files was reported in #1510, but I don't understand its resolution.

What did you expect to happen?

I expected the image to be converted into a numpy array.

What actually happened?

The code would either raise an error or create an array containing the image object.

What versions of Pillow and Python are you using?

Python 3.5/3.6 (3.6 when running inside a Docker container)
Pillow==5.1.0
numpy==1.14.1

I am using pdfminer.six to extract the image on the first page of this document as a test:

https://hartley-botanic.co.uk/wp-content/uploads/2017/07/Hartley-guide-greenhouse-gardening.pdf

The really odd thing is that when I run the exact same command in the console,

a = np.array(im)

I get a numpy array as expected:

array([[[140, 118, 219,  82],
        [145, 114, 210,  84],
        [147, 111, 195,  86],
        ...,
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0]],

       [[149, 112, 213,  84],
        [143, 115, 206,  84],
        [133, 119, 193,  83],
        ...,
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0]],

       [[155, 106, 203,  86],
        [138, 116, 198,  83],
        [119, 129, 188,  77],
        ...,
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0]],

       ...,

       [[136,  94,  86,   7],
        [135,  94,  85,   7],
        [135,  94,  85,   7],
        ...,
        [158, 124, 100,  28],
        [158, 124, 100,  28],
        [158, 124, 100,  28]],

       [[136,  94,  86,   7],
        [135,  94,  85,   7],
        [135,  94,  85,   7],
        ...,
        [158, 124, 100,  28],
        [158, 124, 100,  28],
        [158, 124, 100,  28]],

       [[136,  94,  86,   7],
        [135,  94,  85,   7],
        [135,  94,  85,   7],
        ...,
        [158, 124, 100,  28],
        [158, 124, 100,  28],
        [158, 124, 100,  28]]], dtype=uint8)

Can anyone think of a reason for this discrepancy that would allow me to perform the conversion while running my code?

Thanks

@stuartspotlight
Author

Further experimentation indicates that I can convert the byte stream to an image once the PDF file the image was extracted from is closed. Does anyone have any thoughts as to why this might be?

@radarhere
Member

When you say that "Further experimentation indicates that I can convert the bytes stream to an image when the pdf file the image is extracted from is closed", do you mean that if you save the image from pdfminer.six to a file and run Pillow and numpy over it in a separate script, there is no problem?

Could you provide a self-contained script that demonstrates the error?

@stuartspotlight
Author

stuartspotlight commented Sep 10, 2018

Sorry for the delay in my reply, I've been quite busy. In order to ensure reproducibility I've turned it into a Docker image. The Python code that produces the "broken data stream when reading image file" error is:

# -*- coding: utf-8 -*-
"""
Created on Thu Sep  6 12:30:15 2018

@author: stuart
"""

#demonstrating the jpeg2K error

import matplotlib.pyplot as plt

from PIL import Image

from io import BytesIO
import magic

import numpy as np

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage


filename = './with_jpeg_2000s.pdf'

image_folder = './'

fp = open(filename, 'rb')

parser = PDFParser(fp)

document = PDFDocument(parser)


file_typer = magic.Magic()

pages = []
for pagei, page in enumerate(PDFPage.create_pages(document)):
    
    pages.append(page)
    
    
for pagei, page in enumerate(pages):
    
    page_jpeg2000s = {}
    
    resources = page.resources
    
    file_typer = magic.Magic()
    try:
        image_resources = resources["XObject"]
    except (KeyError, TypeError):
        # no XObject entry (or no usable resources) on this page
        image_resources = {}
        
    # if it's not a dict, we have another layer of wrapping to go through
    if not isinstance(image_resources, dict):
        image_resources = image_resources.resolve()
    
    #lt.log_debug(logger, image_resources)
    save_images = []
    for image_i, image_res in enumerate(image_resources.keys()):
             
        im_filename = image_folder + "/image-%d.jpeg"%image_i      
        
        im_ref = image_resources[image_res]

    
        bts = im_ref.resolve().rawdata
        
        #raw = bts.rawdata
        print(pagei)
        print(image_i)
        
        Im = Image.open(BytesIO(bts))

        #load the file into memory
        Im.load()
        
        ar = np.array(Im)
        
        im2 = Image.fromarray(ar, "CMYK").convert("RGB")
        
        print(file_typer.from_buffer(bts))
        
        plt.figure()
        plt.imshow(im2)
        plt.show()

The requirements file is

numpy==1.14.1
pdfminer.six==20170720
Pillow==5.1.0
matplotlib==2.1.2
python-magic==0.4.13

and the Dockerfile is

FROM python:3.6

ADD requirements.txt /

RUN pip install -r requirements.txt

ADD with_jpeg_2000s.pdf /

ADD error_demo.py /

CMD ["python", "error_demo.py"]

The file I'm testing on is here:

https://hartley-botanic.co.uk/wp-content/uploads/2017/07/Hartley-guide-greenhouse-gardening.pdf

@radarhere
Member

radarhere commented Sep 11, 2018

Thanks. So yes, as with #1510, the error you are receiving is because the images are truncated. Adding ImageFile.LOAD_TRUNCATED_IMAGES = True allows the images to load.
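
The effect of that flag can be seen in isolation. A minimal sketch, assuming Pillow is installed: it uses an ordinary JPEG rather than a JPEG 2000 so that it does not depend on OpenJPEG support, so the exact error message may differ from the "broken data stream" one in this thread. The noise image and the three-quarters cut-off point are arbitrary choices for the demonstration.

```python
import random
from io import BytesIO

from PIL import Image, ImageFile

# Build a noisy 128x128 JPEG in memory so the file is large enough that
# cutting it short lands inside the compressed scan data, not the headers.
random.seed(0)
noise = bytes(random.getrandbits(8) for _ in range(128 * 128 * 3))
buf = BytesIO()
Image.frombytes("RGB", (128, 128), noise).save(buf, format="JPEG")
data = buf.getvalue()

# Simulate a truncated stream by dropping the last quarter of the bytes.
truncated = data[: len(data) * 3 // 4]

# Default behaviour: decoding a truncated file raises OSError on load().
ImageFile.LOAD_TRUNCATED_IMAGES = False
try:
    Image.open(BytesIO(truncated)).load()
    strict_ok = True
except OSError as exc:
    strict_ok = False
    print("strict load failed:", exc)

# With the flag set, Pillow keeps whatever it managed to decode.
ImageFile.LOAD_TRUNCATED_IMAGES = True
im = Image.open(BytesIO(truncated))
im.load()
lenient_size = im.size
print("lenient load succeeded, size:", lenient_size)

# The flag is a module-level global, so restore it for any later code.
ImageFile.LOAD_TRUNCATED_IMAGES = False
```

Because the flag applies to every image loaded afterwards in the same process, it may be worth setting it only around the loads that need it.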

Here is a variation of your script. Running this over your attached PDF, 102 images are now processed before it hits a different error - IOError: cannot identify image file. All of the images with that error are encoded with Flate compression.

from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True

from io import BytesIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

i = 0
with open('with_jpeg_2000s.pdf', 'rb') as fp:
	parser = PDFParser(fp)
	document = PDFDocument(parser)
	for page in PDFPage.create_pages(document):
		image_resources = page.resources.get("XObject", {})
		for im_ref in image_resources.values():
			i += 1
			
			bts = im_ref.resolve().rawdata
			
			try:
				im = Image.open(BytesIO(bts))
				im.load()
			except IOError:
				print(str(i)+" images processed")
				raise
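
On the Flate-compressed images: in a PDF, a Flate-encoded image stream normally holds zlib-compressed raw pixel samples with no image file header, and the dimensions and colour space live in the PDF stream dictionary. That would be consistent with Pillow reporting "cannot identify image file". A minimal sketch of the difference, assuming Pillow is installed, 8-bit RGB samples, and no PDF predictor; real streams may use other colour spaces, bit depths, or predictors that this does not handle.

```python
import zlib
from io import BytesIO

from PIL import Image

# Stand-in for a Flate-encoded PDF image stream: raw RGB samples, zlib-compressed.
# In a real PDF the width and height would come from the stream dictionary.
width, height = 4, 2
samples = bytes([27, 45, 62]) * (width * height)
stream = zlib.compress(samples)

# The compressed bytes are not an image file, so Pillow cannot identify them.
try:
    Image.open(BytesIO(stream))
    identified = True
except OSError:
    identified = False
print("identified as an image file:", identified)

# Decompressing and supplying mode and size explicitly does work.
im = Image.frombytes("RGB", (width, height), zlib.decompress(stream))
print(im.size, im.getpixel((0, 0)))
```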

@radarhere radarhere reopened this Apr 13, 2019
@aclark4life aclark4life added this to Backlog in Pillow May 11, 2019
@aclark4life aclark4life moved this from Backlog to In progress in Pillow May 11, 2019
@zoj613

zoj613 commented May 7, 2020

Any progress with this?

@jacksonofalltrades

I would also be interested to know the status of this, or if anyone has an idea of the cause or if there is a workaround. Thank you.

@radarhere
Member

The largest failing image from the PDF is 829 bytes, so I'm not convinced that valid image data is being passed to Pillow.
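
One quick way to check whether a stream looks like an image file at all is to inspect its leading bytes before handing it to Pillow. A minimal sketch using only the standard library: `sniff_stream` is a hypothetical helper, it recognises only the handful of signatures listed, and a matching signature says nothing about whether the rest of the data is complete.

```python
import zlib

# Leading-byte signatures for the formats discussed in this thread.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")  # JP2 container signature box
J2K_CODESTREAM = bytes.fromhex("ff4fff51")  # raw JPEG 2000 codestream (SOC + SIZ markers)
JPEG_SOI = bytes.fromhex("ffd8ff")  # baseline JPEG start-of-image


def sniff_stream(data):
    """Return a rough label for a byte stream based on its first bytes."""
    if data.startswith(JP2_SIGNATURE):
        return "jp2"
    if data.startswith(J2K_CODESTREAM):
        return "j2k"
    if data.startswith(JPEG_SOI):
        return "jpeg"
    # zlib header: compression method 8 (deflate) in the low nibble of the
    # first byte, and the first two bytes form a multiple of 31.
    if (
        len(data) >= 2
        and data[0] & 0x0F == 8
        and int.from_bytes(data[:2], "big") % 31 == 0
    ):
        return "zlib"
    return "unknown"


print(sniff_stream(JP2_SIGNATURE + b"..."))         # jp2
print(sniff_stream(zlib.compress(b"\x00" * 100)))   # zlib
print(sniff_stream(b""))                            # unknown
```

Logging this label together with `len(data)` for each extracted stream would show at a glance which ones are plausible image files and which are something else.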

I think the original problem with numpy will be helped by #5379

@radarhere
Member

Closing, unless someone can demonstrate there is a valid image that Pillow is failing to read.

Pillow automation moved this from In progress to Closed May 19, 2021