Converting JPEG2000 from bytestream to numpy array not functioning as expected #3152

stuartspotlight · 2018-06-01T14:29:41Z

What did you do?

I extracted a JPEG 2000 from a pdf as bytes. I then loaded the result into Pillow using

im = Image.open(BytesIO(raw))

Next I attempted to convert it to a numpy array in order to manipulate the data, using

A = np.array(im)

This resulted in the array

array(<PIL.Jpeg2KImagePlugin.Jpeg2KImageFile image mode=RGBA size=1598x1598 at 0x7FDEAF719A20>,
      dtype=object)

When I attempted to force numpy to convert this array to a sequence of numbers, the result was this error:

Traceback (most recent call last):

  File "<ipython-input-39-042ef64a6b36>", line 1, in <module>
    runfile('/home/stuart/Documents/Python_programs/embedded_image_processing/classify_embeded_images/test_embeded_image_extractor_locally.py', wdir='/home/stuart/Documents/Python_programs/embedded_image_processing/classify_embeded_images')

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "/home/stuart/Documents/Python_programs/embedded_image_processing/classify_embeded_images/test_embeded_image_extractor_locally.py", line 402, in <module>
    A = np.array(im.getdata())

  File "/usr/local/lib/python3.5/dist-packages/PIL/Image.py", line 1220, in getdata
    self.load()

  File "/usr/local/lib/python3.5/dist-packages/PIL/Jpeg2KImagePlugin.py", line 210, in load
    return ImageFile.ImageFile.load(self)

  File "/usr/local/lib/python3.5/dist-packages/PIL/ImageFile.py", line 250, in load
    raise_ioerror(err_code)

  File "/usr/local/lib/python3.5/dist-packages/PIL/ImageFile.py", line 59, in raise_ioerror
    raise IOError(message + " when reading image file")

OSError: broken data stream when reading image file

Issue #1510 describes a similar problem with loading JPEG 2000 files, but I don't understand its resolution.

What did you expect to happen?

I expected the image to be converted into a numpy array.

What actually happened?

The system would either raise an error or create an array containing the image object.

What versions of Pillow and Python are you using?

python 3.5/3.6 (3.6 when running inside a docker container)
Pillow==5.1.0
numpy==1.14.1

I am using pdfminer.six to extract the image on the first page of this document as a test:

https://hartley-botanic.co.uk/wp-content/uploads/2017/07/Hartley-guide-greenhouse-gardening.pdf

The really odd thing is that when I try to convert in the console using the exact same command,

a = np.array(im)

I get a numpy array as expected

array([[[140, 118, 219,  82],
        [145, 114, 210,  84],
        [147, 111, 195,  86],
        ...,
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0]],

       [[149, 112, 213,  84],
        [143, 115, 206,  84],
        [133, 119, 193,  83],
        ...,
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0]],

       [[155, 106, 203,  86],
        [138, 116, 198,  83],
        [119, 129, 188,  77],
        ...,
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0],
        [ 27,  45,  62,   0]],

       ...,

       [[136,  94,  86,   7],
        [135,  94,  85,   7],
        [135,  94,  85,   7],
        ...,
        [158, 124, 100,  28],
        [158, 124, 100,  28],
        [158, 124, 100,  28]],

       [[136,  94,  86,   7],
        [135,  94,  85,   7],
        [135,  94,  85,   7],
        ...,
        [158, 124, 100,  28],
        [158, 124, 100,  28],
        [158, 124, 100,  28]],

       [[136,  94,  86,   7],
        [135,  94,  85,   7],
        [135,  94,  85,   7],
        ...,
        [158, 124, 100,  28],
        [158, 124, 100,  28],
        [158, 124, 100,  28]]], dtype=uint8)

Can anyone think of a reason for this discrepancy that would allow me to perform the conversion while running my code?

Thanks


stuartspotlight · 2018-06-01T15:20:55Z

Further experimentation indicates that I can convert the bytes stream to an image when the pdf file the image is extracted from is closed. Does anyone have any thoughts as to why this might be?

radarhere · 2018-09-01T10:28:14Z

When you say that ‘Further experimentation indicates that I can convert the bytes stream to an image when the pdf file the image is extracted from is closed’ - so if you save the image from pdfminer.six to a file, and run Pillow and numpy over it in a separate script, there is no problem?

Could you provide a self-contained script that demonstrates the error?

stuartspotlight · 2018-09-10T11:19:49Z

Sorry for the delay in my reply, I've been quite busy. In order to ensure reproducibility I've turned it into a docker image. The python code to get the

broken data stream when reading image file

error is

# -*- coding: utf-8 -*-
"""
Created on Thu Sep  6 12:30:15 2018

@author: stuart
"""

#demonstrating the jpeg2K error

import matplotlib.pyplot as plt

from PIL import Image

from io import BytesIO
import magic

import numpy as np

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage


filename = './with_jpeg_2000s.pdf'

image_folder = './'

fp = open(filename, 'rb')

parser = PDFParser(fp)

document = PDFDocument(parser)


file_typer = magic.Magic()

pages = []
for pagei, page in enumerate(PDFPage.create_pages(document)):
    
    pages.append(page)
    
    
for pagei, page in enumerate(pages):
    
    page_jpeg2000s = {}
    
    resources = page.resources
    
    file_typer = magic.Magic()
    try:
        image_resources = resources["XObject"]
    except:
        image_resources = {}
        
    #if its not a dict we have another layer of wrapping to go through
    if not isinstance(image_resources, dict):
        image_resources = image_resources.resolve()
    
    #lt.log_debug(logger, image_resources)
    save_images = []
    for image_i, image_res in enumerate(image_resources.keys()):
             
        im_filename = image_folder + "/image-%d.jpeg"%image_i      
        
        im_ref = image_resources[image_res]

    
        bts = im_ref.resolve().rawdata
        
        #raw = bts.rawdata
        print(pagei)
        print(image_i)
        
        Im = Image.open(BytesIO(bts))

        #load the file into memory
        Im.load()
        
        ar = np.array(Im)
        
        im2 = Image.fromarray(ar, "CMYK").convert("RGB")
        
        print(file_typer.from_buffer(bts))
        
        plt.figure()
        plt.imshow(im2)
        plt.show()

The requirements file is

numpy==1.14.1
pdfminer.six==20170720
Pillow==5.1.0
matplotlib==2.1.2
python-magic==0.4.13

and the Dockerfile is

FROM python:3.6

ADD requirements.txt /

RUN pip install -r requirements.txt

ADD with_jpeg_2000s.pdf /

ADD error_demo.py /

CMD ["python", "error_demo.py"]

The file I'm testing on is here:

https://hartley-botanic.co.uk/wp-content/uploads/2017/07/Hartley-guide-greenhouse-gardening.pdf

radarhere · 2018-09-11T11:19:57Z

Thanks. So yes, as with #1510, the error you are receiving is because the images are truncated. Adding ImageFile.LOAD_TRUNCATED_IMAGES = True allows the images to load.

Here is a variation of your script. Running this over your attached PDF, 102 images are now processed before it hits a different error - IOError: cannot identify image file. All of the images with that error are encoded with Flate compression.

from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True

from io import BytesIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

i = 0
with open('with_jpeg_2000s.pdf', 'rb') as fp:
	parser = PDFParser(fp)
	document = PDFDocument(parser)
	for page in PDFPage.create_pages(document):
		image_resources = page.resources.get("XObject", {})
		for im_ref in image_resources.values():
			i += 1
			
			bts = im_ref.resolve().rawdata
			
			try:
				im = Image.open(BytesIO(bts))
				im.load()
			except IOError:
				print(str(i)+" images processed")
				raise
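As a rough illustration of why the Flate-compressed streams mentioned above fail with "cannot identify image file": in a PDF, a /FlateDecode image stream generally holds zlib-compressed raw pixel samples rather than an image file with its own header, so there is nothing for Image.open to recognise. The sketch below uses only the standard library and a made-up 2x2 RGB image; the names looks_like_zlib, raw_pixels and flate_stream are hypothetical, and it assumes no PNG/TIFF predictor is set in the stream's /DecodeParms.

```python
import zlib

# Hypothetical 2x2 RGB image: raw samples only, 3 bytes per pixel, no file header.
width, height, components = 2, 2, 3
raw_pixels = bytes(range(width * height * components))

# Roughly what a /FlateDecode image stream carries: zlib-compressed samples.
flate_stream = zlib.compress(raw_pixels)


def looks_like_zlib(data: bytes) -> bool:
    """Heuristic check for a zlib header (an assumption-level test, not proof).

    The low nibble of the first byte is 8 (deflate), and the first two bytes,
    read as a big-endian integer, are a multiple of 31.
    """
    if len(data) < 2:
        return False
    return (data[0] & 0x0F) == 8 and ((data[0] << 8) | data[1]) % 31 == 0


# Decompressing gives back the raw samples; width, height and colour space
# have to come from the PDF stream dictionary, not from the data itself.
decoded = zlib.decompress(flate_stream)
print(looks_like_zlib(flate_stream), len(decoded))
```

With Pillow, samples like these could then be handed to Image.frombytes with a mode and size taken from the stream's /ColorSpace, /Width and /Height entries. In pdfminer.six the decoded bytes should be available from the stream's get_data() rather than rawdata, though that has not been verified against this particular PDF.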

zoj613 · 2020-05-07T20:16:11Z

Any progress with this?

jacksonofalltrades · 2020-05-13T04:23:35Z

I would also be interested to know the status of this, or if anyone has an idea of the cause or if there is a workaround. Thank you.

radarhere · 2021-05-01T03:55:01Z

The largest failing image from the PDF is 829 bytes - I'm not convinced that valid image data is being passed to Pillow?

I think the original problem with numpy will be helped by #5379
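One way to test the doubt above about whether valid image data is reaching Pillow is to look at the first and last bytes of each extracted stream before opening it. This is a minimal sketch, not what Pillow does internally in full: it checks for the two usual JPEG 2000 signatures (the JP2 signature box and the raw codestream SOC+SIZ markers) and for the end-of-codestream marker FF D9. The function names are hypothetical, and a missing end marker is only a hint of truncation, since a JP2 file is not strictly required to end with the codestream.

```python
# Signature box that opens a JP2 file, and the SOC + SIZ markers that open
# a raw JPEG 2000 codestream.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"
J2K_CODESTREAM = b"\xff\x4f\xff\x51"
END_OF_CODESTREAM = b"\xff\xd9"


def describe_jpeg2000(data: bytes) -> str:
    """Return 'jp2', 'codestream', or 'unknown' based on the leading bytes."""
    if data.startswith(JP2_SIGNATURE):
        return "jp2"
    if data.startswith(J2K_CODESTREAM):
        return "codestream"
    return "unknown"


def ends_with_eoc(data: bytes) -> bool:
    """True if the data ends with the end-of-codestream marker."""
    return data.endswith(END_OF_CODESTREAM)


# Made-up sample: a codestream header followed by padding and no end marker.
sample = J2K_CODESTREAM + b"\x00" * 16
print(describe_jpeg2000(sample), ends_with_eoc(sample), len(sample))
```

Printing the kind, the end-marker flag and the length for each stream in the extraction loop would show whether the small failing streams are JPEG 2000 data at all.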

radarhere · 2021-05-19T13:51:56Z

Closing, unless someone can demonstrate there is a valid image that Pillow is failing to read.

stuartspotlight mentioned this issue Jun 4, 2018

Cannot read jpeg 2000s pdfminer/pdfminer.six#150

Closed

aclark4life added Question JPEG NumPy and removed NumPy labels Jun 30, 2018

radarhere closed this as completed Apr 13, 2019

radarhere reopened this Apr 13, 2019

aclark4life added Conversion Memory labels May 11, 2019

aclark4life added this to Backlog in Pillow May 11, 2019

aclark4life moved this from Backlog to In progress in Pillow May 11, 2019

jacksonofalltrades mentioned this issue May 13, 2020

Problem with reading JPEG image pydicom/pydicom#776

Closed

radarhere closed this as completed May 19, 2021

Pillow automation moved this from In progress to Closed May 19, 2021
