Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot parse inline stream images, because open() always seeks 0 #7096

Closed
tatarize opened this issue Apr 16, 2023 · 13 comments · Fixed by #7097
Closed

Cannot parse inline stream images, because open() always seeks 0 #7096

tatarize opened this issue Apr 16, 2023 · 13 comments · Fixed by #7097

Comments

@tatarize
Copy link

I am parsing a file format with an inline bitmap. I reached the point at the start of the bitmap and passed the stream to Image.open(). It does a seek(0), which is a default stream.seek(0,0) which went to the start of the stream which is not where the bitmap is located.

    try:
        fp.seek(0)
    except (AttributeError, io.UnsupportedOperation):
        fp = io.BytesIO(fp.read())
        exclusive_fp = True

I expected that I could pass PIL the opened stream at the start of the file and the stream would be located at the very end of the file. To do this I suspect that:

tell = pf.tell()
fp.seek(tell)

Would have been used, and the size of the file would be determined by Pillow knowing the file types.

Worse yet, attempting to use this as an inline reader had the stream seek to the beginning of my reader, issued a false positive error on the read, and said it couldn't parse the image type. After some work on the image it became a bit obvious that it absolutely could parse a super-common 24-bit bitmap, so that message was in error and debugging found this issue.

Without inline stream reading, I needed to read the prefix myself, figure out the size, get that chunk of bytes, wrap those bytes in another BytesIO() and submit that. My workaround seems pretty clumsy looking. And if this was an arbitrary inline image, I might need a lot of different methods to determine the size.

        image_bytes = bytearray(file.read(2))  # BM
        image_length = file.read(4)  # int32le
        (size,) = struct.unpack("<I", image_length)
        image_bytes += image_length
        image_bytes += file.read(size - 6)

        from PIL import Image

        image = Image.open(BytesIO(image_bytes))

At a minimum, document the fact that streams passed as the fp object must have the image located at the beginning of the stream (regardless where the image is located within the stream).

@radarhere radarhere changed the title Cannot parse inline stream images, because open() always seeks 0. Cannot parse inline stream images, because open() always seeks 0 Apr 16, 2023
@radarhere
Copy link
Member

radarhere commented Apr 17, 2023

open() has been seeking to the start since PIL, so I can't comment on why it was added.

I'd be reluctant to change it, in case there are users relying on being able to read part of an image, and then have Pillow fail to load that same image without the user first seeking backwards.

I imagine that you considered reading the rest of the stream, but didn't like it for performance reasons?

im = Image.open(BytesIO(file.read()))

I've created PR #7097 to update the documentation.

@Yay295
Copy link
Contributor

Yay295 commented Apr 17, 2023

Not every file seeks to 0 first though:

def _accept(prefix):
return prefix[:8] == b"\x89HDF\r\n\x1a\n"
class HDF5StubImageFile(ImageFile.StubImageFile):
format = "HDF5"
format_description = "HDF5"
def _open(self):
offset = self.fp.tell()
if not _accept(self.fp.read(8)):
msg = "Not an HDF file"
raise SyntaxError(msg)
self.fp.seek(offset)

So it might actually be good to make that consistent. Here's some of the locations I see that seek to a specific offset:

def _read_blp_header(self):
self.fd.seek(4)

with open(infile_temp, "wb") as f:
# fetch length of fp
fp.seek(0, io.SEEK_END)
fsize = fp.tell()
# ensure start position
# go back
fp.seek(0)

Pillow/src/PIL/Image.py

Lines 3202 to 3206 in 1321b6e

try:
fp.seek(0)
except (AttributeError, io.UnsupportedOperation):
fp = io.BytesIO(fp.read())
exclusive_fp = True

fp.seek(0)

def _open(self):
# Quick rejection: if there's not an LF among the first
# 100 bytes, this is (probably) not a text header.
if b"\n" not in self.fp.read(100):
msg = "not an IM file"
raise SyntaxError(msg)
self.fp.seek(0)

This is when saving a file instead of reading one, but it's basically the same situation. mpf_offset is 28.

fp.seek(mpf_offset)

def _open(self):
self.fp.seek(0) # prep the fp in order to pass the JPEG test

self.fd.seek(32)

def _open(self):
# rough
self.fp.seek(2048)
s = self.fp.read(2048)

I also noticed this one, which is seeking to an offset from the start of the image, but if the image isn't at the start of the file it won't be correct. I think there might be others like this.

fp.seek(offset)

@radarhere
Copy link
Member

It isn't the individual plugins that are seeking to zero, it is Image.open() itself.

Pillow/src/PIL/Image.py

Lines 3202 to 3225 in aec7a8d

try:
fp.seek(0)
except (AttributeError, io.UnsupportedOperation):
fp = io.BytesIO(fp.read())
exclusive_fp = True
prefix = fp.read(16)
preinit()
accept_warnings = []
def _open_core(fp, filename, prefix, formats):
for i in formats:
i = i.upper()
if i not in OPEN:
init()
try:
factory, accept = OPEN[i]
result = not accept or accept(prefix)
if type(result) in [str, bytes]:
accept_warnings.append(result)
elif result:
fp.seek(0)

@Yay295
Copy link
Contributor

Yay295 commented Apr 17, 2023

Oh yeah, I guess that does happen first.

@tatarize
Copy link
Author

tatarize commented Apr 18, 2023

You could maybe consider adding a property to open() to give the seek position. I wouldn't much care passing that a seek=stream.tell() keyword (Or seek=-1 (for maintain current position, while making sure the stream is the correct type). I don't know if you'd run into a case where you want to open a filename and then offsetting by an optional seek value, but I'd assume that'd be the behavior for other open calls if seek=0 was set there.

It would also require making sure the code doesn't do anything like fp.read() since that would read all the remaining bytes from my stream which would also be a problem (unless you figured out where the end was and did a seek() to that position at the end image (which may not be at the end of the file)). --- I've only read the spec for a few different graphic formats but I highly expected to just feed the stream to Pillow and get the image there with the stream nicely located at the end of my graphics file when it finished.

@tatarize
Copy link
Author

Though that would require as Yay noted consistency among the readers to actually to also not seek to zero but rather seek to the initial fp.tell() location which could be a lot more work in a more mature library.

@radarhere
Copy link
Member

I've only read the spec for a few different graphic formats but I highly expected to just feed the stream to Pillow and get the image there with the stream nicely located at the end of my graphics file when it finished.

That's at least not how Pillow behaves for images with multiple frames, where for the sake of performance, we don't read all of the file data in on load(), and instead seek around the file as necessary (and indeed, we call the method to move between frames seek()).

@radarhere
Copy link
Member

I understand that you could request that Pillow seeks back to the end of all of the frames within the file object after each seek(), but that would seem to be an unnecessary operation for the majority of use cases.

Furthermore, some images in the wild end unexpectedly early - they are truncated. Detecting that end relies on there being no more data at the end of the file.

While it is conceivable, if complicated, for Pillow to be provided an offset to the start of an image within a file object, your expectations that

  • Pillow reads image data and then seeks to the end
  • all images are well-formed

aren't in line with the way that Pillow operates.

@radarhere
Copy link
Member

You could maybe consider adding a property to open() to give the seek position.

I'm guessing that this isn't a useful feature without those other features.

@tatarize
Copy link
Author

Yes. I would assume any embedded image is well-formed. I wouldn't expect this to be a sort of wild-form but a complete image inside another file format. But, I don't think well-formed is necessarily a requirement here. Just that no file format does a seek that isn't based off the initial position within the stream, and that we can avoid the initial seek(0) command. The former being the harder of those, it would require restricting all loader code to stream().seek(<relative>, 1) commands, or absolute seek locations based on an initial stream.tell().

Alternatively, it may need some code to maybe figure out the type and the size of the image at the seek location. Much like it already figures out what type of image, it could figure out the size and maybe just seek and load that amount of bytes and wrap it in a stream (if stream.tell() != 0).

But, it seems like a routine to detect the size of the file based on the type of the file (much like Pillow does with the filetype itself) could maybe settle the issue usefully.

Though documenting it does settle the issue at a minimum, and that's now done.

@homm
Copy link
Member

homm commented Jun 5, 2023

Adding offset to the API will lead to significant overcomplicating on many layers.

Have you tried wrappers around files objects like this one?

https://stackoverflow.com/questions/11313599/how-can-i-treat-a-section-of-a-file-as-though-its-a-file-itself

@radarhere
Copy link
Member

radarhere commented Jun 6, 2023

I would assume any embedded image is well-formed.

I'm saying that's not an assumption that Pillow can make. Some users seek support for images, even though they are malformed. See #4370 for instance.

Alternatively, it may need some code to maybe figure out the type and the size of the image at the seek location. Much like it already figures out what type of image, it could figure out the size

Pillow supports, to some extent, reading truncated images. If an image is truncated, then that means that it ends unexpectedly early, and so any information we might use to infer when the image ends would be wrong. Alternatively, it is conceivable that images might just report their size incorrectly, like in #5164.

I'm reluctant to add a new method to guess the file size because it may be wrong, and because for most use cases the image size in bytes seems obvious, as it is the file size.

If we were to start reading images relative to the current position, I don't see how that would be useful to your situation without either the ability to predict the seek position relative to the image when reading is done (which isn't always at the end of the image), or to know the image size in bytes.

Though documenting it does settle the issue at a minimum, and that's now done.

Because of this, and because I don't think any changes to Pillow's behaviour are viable for reasons listed here and previously, if #7097 is merged to update the documentation, this will be closed.

If you still have further thoughts, feel free to post them.

@tatarize
Copy link
Author

tatarize commented Jun 6, 2023

homm's suggestion is actually really clever.

While I solved my code and wrapped it in another buffer moving the bytes over by solving how large the .bmp image needs to be. There's actually a somewhat reasonable chance that wrapping it into an offset-position fileobject where seek(0) would not move my position would actually have worked.

There was always a chance that an embedded image file could have had a non-obvious size and I would have previously been out of luck. It's certainly not 100% but I assume embedded images should be well-formed and complete. And there's some reasonable expectation that for some embedded images (including actually the one that lead to this issue) could be solved by simply duck-casting an offset file-stream. Though I malformed files would always fail and I'm not sure there's any guarantee that the files end at the end of the stream (since the data can be interspersed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants