Cannot parse inline stream images, because open() always seeks 0 #7096

tatarize · 2023-04-16T12:34:27Z

I am parsing a file format with an inline bitmap. I reached the point at the start of the bitmap and passed the stream to Image.open(). It does a seek(0), which is a default stream.seek(0,0) which went to the start of the stream which is not where the bitmap is located.

    try:
        fp.seek(0)
    except (AttributeError, io.UnsupportedOperation):
        fp = io.BytesIO(fp.read())
        exclusive_fp = True

I expected that I could pass PIL the opened stream at the start of the file and the stream would be located at the very end of the file. To do this I suspect that:

tell = pf.tell()
fp.seek(tell)

Would have been used, and the size of the file would be determined by Pillow knowing the file types.

Worse yet, attempting to use this as an inline reader had the stream seek to the beginning of my reader, issued a false positive error on the read, and said it couldn't parse the image type. After some work on the image it became a bit obvious that it absolutely could parse a super-common 24-bit bitmap, so that message was in error and debugging found this issue.

Without inline stream reading, I needed to read the prefix myself, figure out the size, get that chunk of bytes, wrap those bytes in another BytesIO() and submit that. My workaround seems pretty clumsy looking. And if this was an arbitrary inline image, I might need a lot of different methods to determine the size.

        image_bytes = bytearray(file.read(2))  # BM
        image_length = file.read(4)  # int32le
        (size,) = struct.unpack("<I", image_length)
        image_bytes += image_length
        image_bytes += file.read(size - 6)

        from PIL import Image

        image = Image.open(BytesIO(image_bytes))

At a minimum, document the fact that streams passed as the fp object must have the image located at the beginning of the stream (regardless where the image is located within the stream).

The text was updated successfully, but these errors were encountered:

radarhere · 2023-04-17T06:12:05Z

open() has been seeking to the start since PIL, so I can't comment on why it was added.

I'd be reluctant to change it, in case there are users relying on being able to read part of an image, and then have Pillow fail to load that same image without the user first seeking backwards.

I imagine that you considered reading the rest of the stream, but didn't like it for performance reasons?

im = Image.open(BytesIO(file.read()))

I've created PR #7097 to update the documentation.

Yay295 · 2023-04-17T14:17:24Z

Not every file seeks to 0 first though:

Pillow/src/PIL/Hdf5StubImagePlugin.py

Lines 31 to 46 in 70aaa20

    
           def _accept(prefix): 
        
               return prefix[:8] == b"\x89HDF\r\n\x1a\n" 
        
           class HDF5StubImageFile(ImageFile.StubImageFile): 
        
               format = "HDF5" 
        
               format_description = "HDF5" 
        
               def _open(self): 
        
                   offset = self.fp.tell() 
        
                   if not _accept(self.fp.read(8)): 
        
                       msg = "Not an HDF file" 
        
                       raise SyntaxError(msg) 
        
                   self.fp.seek(offset)

So it might actually be good to make that consistent. Here's some of the locations I see that seek to a specific offset:

Pillow/src/PIL/BlpImagePlugin.py

Lines 285 to 286 in 1321b6e

    
           def _read_blp_header(self): 
        
               self.fd.seek(4)

Pillow/src/PIL/EpsImagePlugin.py

Lines 97 to 103 in 1321b6e

    
           with open(infile_temp, "wb") as f: 
        
               # fetch length of fp 
        
               fp.seek(0, io.SEEK_END) 
        
               fsize = fp.tell() 
        
               # ensure start position 
        
               # go back 
        
               fp.seek(0)

Pillow/src/PIL/Image.py

Lines 3202 to 3206 in 1321b6e

    
           try: 
        
               fp.seek(0) 
        
           except (AttributeError, io.UnsupportedOperation): 
        
               fp = io.BytesIO(fp.read()) 
        
               exclusive_fp = True

Pillow/src/PIL/Image.py

Line 3225 in 1321b6e

fp.seek(0)

Pillow/src/PIL/ImImagePlugin.py

Lines 122 to 129 in 1321b6e

    
           def _open(self): 
        
               # Quick rejection: if there's not an LF among the first 
        
               # 100 bytes, this is (probably) not a text header. 
        
               if b"\n" not in self.fp.read(100): 
        
                   msg = "not an IM file" 
        
                   raise SyntaxError(msg) 
        
               self.fp.seek(0)

This is when saving a file instead of reading one, but it's basically the same situation. mpf_offset is 28.

Pillow/src/PIL/MpoImagePlugin.py

Line 94 in 1321b6e

fp.seek(mpf_offset)

Pillow/src/PIL/MpoImagePlugin.py

Lines 108 to 109 in 1321b6e

    
           def _open(self): 
        
               self.fp.seek(0)  # prep the fp in order to pass the JPEG test

Pillow/src/PIL/MspImagePlugin.py

Line 115 in 1321b6e

self.fd.seek(32)

Pillow/src/PIL/PcdImagePlugin.py

Lines 30 to 33 in 1321b6e

    
           def _open(self): 
        
               # rough 
        
               self.fp.seek(2048) 
        
               s = self.fp.read(2048)

I also noticed this one, which is seeking to an offset from the start of the image, but if the image isn't at the start of the file it won't be correct. I think there might be others like this.

Pillow/src/PIL/PcfFontFile.py

Line 110 in 1321b6e

fp.seek(offset)

radarhere · 2023-04-17T21:41:20Z

It isn't the individual plugins that are seeking to zero, it is Image.open() itself.

Pillow/src/PIL/Image.py

Lines 3202 to 3225 in aec7a8d

    
           try: 
        
               fp.seek(0) 
        
           except (AttributeError, io.UnsupportedOperation): 
        
               fp = io.BytesIO(fp.read()) 
        
               exclusive_fp = True 
        
           prefix = fp.read(16) 
        
           preinit() 
        
           accept_warnings = [] 
        
           def _open_core(fp, filename, prefix, formats): 
        
               for i in formats: 
        
                   i = i.upper() 
        
                   if i not in OPEN: 
        
                       init() 
        
                   try: 
        
                       factory, accept = OPEN[i] 
        
                       result = not accept or accept(prefix) 
        
                       if type(result) in [str, bytes]: 
        
                           accept_warnings.append(result) 
        
                       elif result: 
        
                           fp.seek(0)

Yay295 · 2023-04-17T22:03:10Z

Oh yeah, I guess that does happen first.

tatarize · 2023-04-18T18:43:48Z

You could maybe consider adding a property to open() to give the seek position. I wouldn't much care passing that a seek=stream.tell() keyword (Or seek=-1 (for maintain current position, while making sure the stream is the correct type). I don't know if you'd run into a case where you want to open a filename and then offsetting by an optional seek value, but I'd assume that'd be the behavior for other open calls if seek=0 was set there.

It would also require making sure the code doesn't do anything like fp.read() since that would read all the remaining bytes from my stream which would also be a problem (unless you figured out where the end was and did a seek() to that position at the end image (which may not be at the end of the file)). --- I've only read the spec for a few different graphic formats but I highly expected to just feed the stream to Pillow and get the image there with the stream nicely located at the end of my graphics file when it finished.

tatarize · 2023-04-18T18:55:08Z

Though that would require as Yay noted consistency among the readers to actually to also not seek to zero but rather seek to the initial fp.tell() location which could be a lot more work in a more mature library.

radarhere · 2023-04-19T04:21:10Z

I've only read the spec for a few different graphic formats but I highly expected to just feed the stream to Pillow and get the image there with the stream nicely located at the end of my graphics file when it finished.

That's at least not how Pillow behaves for images with multiple frames, where for the sake of performance, we don't read all of the file data in on load(), and instead seek around the file as necessary (and indeed, we call the method to move between frames seek()).

radarhere · 2023-04-19T22:21:54Z

I understand that you could request that Pillow seeks back to the end of all of the frames within the file object after each seek(), but that would seem to be an unnecessary operation for the majority of use cases.

Furthermore, some images in the wild end unexpectedly early - they are truncated. Detecting that end relies on there being no more data at the end of the file.

While it is conceivable, if complicated, for Pillow to be provided an offset to the start of an image within a file object, your expectations that

Pillow reads image data and then seeks to the end
all images are well-formed

aren't in line with the way that Pillow operates.

radarhere · 2023-04-20T21:59:13Z

You could maybe consider adding a property to open() to give the seek position.

I'm guessing that this isn't a useful feature without those other features.

tatarize · 2023-04-21T01:52:11Z

Yes. I would assume any embedded image is well-formed. I wouldn't expect this to be a sort of wild-form but a complete image inside another file format. But, I don't think well-formed is necessarily a requirement here. Just that no file format does a seek that isn't based off the initial position within the stream, and that we can avoid the initial seek(0) command. The former being the harder of those, it would require restricting all loader code to stream().seek(<relative>, 1) commands, or absolute seek locations based on an initial stream.tell().

Alternatively, it may need some code to maybe figure out the type and the size of the image at the seek location. Much like it already figures out what type of image, it could figure out the size and maybe just seek and load that amount of bytes and wrap it in a stream (if stream.tell() != 0).

But, it seems like a routine to detect the size of the file based on the type of the file (much like Pillow does with the filetype itself) could maybe settle the issue usefully.

Though documenting it does settle the issue at a minimum, and that's now done.

homm · 2023-06-05T15:10:47Z

Adding offset to the API will lead to significant overcomplicating on many layers.

Have you tried wrappers around files objects like this one?

https://stackoverflow.com/questions/11313599/how-can-i-treat-a-section-of-a-file-as-though-its-a-file-itself

radarhere · 2023-06-06T10:55:20Z

I would assume any embedded image is well-formed.

I'm saying that's not an assumption that Pillow can make. Some users seek support for images, even though they are malformed. See #4370 for instance.

Alternatively, it may need some code to maybe figure out the type and the size of the image at the seek location. Much like it already figures out what type of image, it could figure out the size

Pillow supports, to some extent, reading truncated images. If an image is truncated, then that means that it ends unexpectedly early, and so any information we might use to infer when the image ends would be wrong. Alternatively, it is conceivable that images might just report their size incorrectly, like in #5164.

I'm reluctant to add a new method to guess the file size because it may be wrong, and because for most use cases the image size in bytes seems obvious, as it is the file size.

If we were to start reading images relative to the current position, I don't see how that would be useful to your situation without either the ability to predict the seek position relative to the image when reading is done (which isn't always at the end of the image), or to know the image size in bytes.

Though documenting it does settle the issue at a minimum, and that's now done.

Because of this, and because I don't think any changes to Pillow's behaviour are viable for reasons listed here and previously, if #7097 is merged to update the documentation, this will be closed.

If you still have further thoughts, feel free to post them.

tatarize · 2023-06-06T21:48:43Z

homm's suggestion is actually really clever.

While I solved my code and wrapped it in another buffer moving the bytes over by solving how large the .bmp image needs to be. There's actually a somewhat reasonable chance that wrapping it into an offset-position fileobject where seek(0) would not move my position would actually have worked.

There was always a chance that an embedded image file could have had a non-obvious size and I would have previously been out of luck. It's certainly not 100% but I assume embedded images should be well-formed and complete. And there's some reasonable expectation that for some embedded images (including actually the one that lead to this issue) could be solved by simply duck-casting an offset file-stream. Though I malformed files would always fail and I'm not sure there's any guarantee that the files end at the end of the stream (since the data can be interspersed).

radarhere changed the title ~~Cannot parse inline stream images, because open() always seeks 0.~~ Cannot parse inline stream images, because open() always seeks 0 Apr 16, 2023

radarhere mentioned this issue Apr 17, 2023

Image.open() seeks to the start of file objects #7097

Merged

hugovk closed this as completed in #7097 Jun 29, 2023

cgohlke mentioned this issue Aug 6, 2023

Seek(0) needs to be called only for tiff images scikit-image/scikit-image#7086

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot parse inline stream images, because open() always seeks 0 #7096

Cannot parse inline stream images, because open() always seeks 0 #7096

tatarize commented Apr 16, 2023

radarhere commented Apr 17, 2023 •

edited

Yay295 commented Apr 17, 2023

radarhere commented Apr 17, 2023

Yay295 commented Apr 17, 2023

tatarize commented Apr 18, 2023 •

edited by radarhere

tatarize commented Apr 18, 2023

radarhere commented Apr 19, 2023

radarhere commented Apr 19, 2023

radarhere commented Apr 20, 2023

tatarize commented Apr 21, 2023

homm commented Jun 5, 2023

radarhere commented Jun 6, 2023 •

edited

tatarize commented Jun 6, 2023

Cannot parse inline stream images, because open() always seeks 0 #7096

Cannot parse inline stream images, because open() always seeks 0 #7096

Comments

tatarize commented Apr 16, 2023

radarhere commented Apr 17, 2023 • edited

Yay295 commented Apr 17, 2023

radarhere commented Apr 17, 2023

Yay295 commented Apr 17, 2023

tatarize commented Apr 18, 2023 • edited by radarhere

tatarize commented Apr 18, 2023

radarhere commented Apr 19, 2023

radarhere commented Apr 19, 2023

radarhere commented Apr 20, 2023

tatarize commented Apr 21, 2023

homm commented Jun 5, 2023

radarhere commented Jun 6, 2023 • edited

tatarize commented Jun 6, 2023

radarhere commented Apr 17, 2023 •

edited

tatarize commented Apr 18, 2023 •

edited by radarhere

radarhere commented Jun 6, 2023 •

edited