Image with Filter "[/FlateDecode/JPXDecode]" not extracted #2087

Closed
ghost opened this issue Nov 28, 2022 · 9 comments
ghost commented Nov 28, 2022

Describe the bug (mandatory)

I'm trying to iterate over all images in a document, but extraction fails for one filter combination.
PyMuPDF works correctly for all filter combinations except Filter [/FlateDecode/JPXDecode]. PDFs with such image filters are read correctly by PDF readers and other Python PDF libraries, but PyMuPDF fails to extract the image, and xref_get_key(xref, "/Filters") does not return the correct filters.

These:
<</ColorSpace/DeviceRGB/BitsPerComponent 8/Width 1672/Length 1964389/Height 1124/Name/im1/Subtype/Image/Type/XObject/Filter/JPXDecode>>

and

<</ID 26 0 R/Type/XObject/Length 476/Filter[/FlateDecode/DCTDecode]/Subtype/Image/BitsPerComponent 8/Width 126/Height 81/ColorSpace/DeviceRGB>>

are OK.

This:
<</ColorSpace/DeviceGray/BitsPerComponent 8/Width 67/Length 1958/Height 68/Name/im2/Subtype/Image/Type/XObject/Filter[/FlateDecode/JPXDecode]>>

fails.

However, document.xref_stream(xref) decompresses the stream correctly, and the output is a valid JPEG 2000 stream.
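
One way to check that claim is to look at the first bytes of the decoded stream. The following is a minimal sketch, assuming the usual JPEG 2000 signatures (the 12-byte JP2 signature box, or the SOC and SIZ markers that open a raw codestream). `looks_like_jpeg2000` is a hypothetical helper name and not part of PyMuPDF.

```python
# Leading bytes of a JP2 file (signature box) and of a raw JPEG 2000 codestream.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
J2K_CODESTREAM = bytes.fromhex("ff4fff51")  # SOC marker followed by SIZ marker


def looks_like_jpeg2000(data: bytes) -> bool:
    """Return True if data starts like a JP2 file or a raw J2K codestream."""
    return data.startswith(JP2_SIGNATURE) or data.startswith(J2K_CODESTREAM)
```

It could be called as `looks_like_jpeg2000(document.xref_stream(xref))` for an xref that is known to be a stream object. This only inspects the header and does not validate the rest of the image data.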

To Reproduce (mandatory)

    import fitz  # PyMuPDF

    document = fitz.Document(srcFileName)
    allXrefsLength = document.xref_length()
    for xref in range(1, allXrefsLength):
        if document.xref_get_key(xref, "Subtype")[1] != "/Image":
            continue

        imgDict = document.extract_image(xref)
        if not imgDict:
            tmpFilters = document.xref_get_key(xref, "/Filters")
            print("subtype of xref {0} is /Image, but pymupdf can not extract it as image. filters: {1}".format(xref, tmpFilters))

Output for images with such filters:
subtype of xref 76 is /Image, but pymupdf can not extract it as image. filters: ('null', 'null')

Your configuration (mandatory)

Windows 10 x64, Python 3.10, PyMuPDF 1.21, installed with pip install pymupdf

@JorjMcKie
Collaborator

JorjMcKie commented Nov 28, 2022

Duplicate of #1995: the maximum supported values for image width and height currently are 2**16 = 65,536.
See that issue for details.

For the other part of the issue, extraction of the /Filter value, please allow me an example.

@ghost
Author

ghost commented Nov 28, 2022

> Duplicate of #1995: the maximum supported values for image width and height currently are 2**16 = 65,536. See that issue for details.

It happens with all image sizes; in my example it is 67 x 68 px. Only the filter combination matters.

@JorjMcKie
Collaborator

Does doc.xref_get_key(xref, "Filter") work correctly?
Only image extraction is the problem?

@JorjMcKie
Collaborator

You have a typo in extracting the filter names: /Filters does not exist, only /Filter.
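
For readers who hit the same `('null', 'null')` result: `xref_get_key` returns a (type, value) pair of strings, and a key that is absent from the object dictionary, such as the misspelled `/Filters`, comes back as `('null', 'null')`. Below is a small sketch of turning the `Filter` result into a list of filter names. `filter_names` is a hypothetical helper, and the parsing assumes the value is either a single name or a flat array of names, as in the dictionaries quoted in the report.

```python
def filter_names(key_result):
    """Turn an xref_get_key(xref, "Filter") result into a list of names.

    key_result is the (type, value) tuple, for example
    ("name", "/JPXDecode"), ("array", "[/FlateDecode/JPXDecode]")
    or ("null", "null") when the key is missing.
    """
    kind, value = key_result
    if kind == "name":
        return [value.lstrip("/")]
    if kind == "array":
        # Drop the brackets, then split on "/" and whitespace.
        inner = value.strip().strip("[]")
        return [part for part in inner.replace("/", " ").split() if part]
    return []  # missing key ("null") or an unexpected type
```

With the corrected key name this would be used as `filter_names(document.xref_get_key(xref, "Filter"))`.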

@JorjMcKie
Collaborator

Please provide a reproducer file.

@ghost
Author

ghost commented Nov 28, 2022

> You have a typo in extracting the filter names: /Filters does not exist, only /Filter.

My bad :( "Filter" works correctly, so image extraction is the only problem.
Sample file attached.

massage.pdf

JorjMcKie changed the title from `"Filter [/FlateDecode/JPXDecode]" is extracted as "('null', 'null')" and document.extract_image(xref) fails for such images` to `Image with Filter "[/FlateDecode/JPXDecode]" not extracted` on Nov 28, 2022
@JorjMcKie
Collaborator

Thanks for the file.
Fixed it.
Will be published with next version.

@ghost
Author

ghost commented Nov 28, 2022

Thank you!

JorjMcKie added a commit that referenced this issue Nov 30, 2022
Issue 2087:
`fitz.i (extract_image)`: the type of JPX images with more than one `/Filter` is not correctly recognized when inspecting the raw stream.
Fixing this by extracting the decoded stream: we already know the type from the PDF dict.

Issue 2094:
Rectangle recognition (`helper-devices.i`, `jm_checkrect()`) was wrong in not confirming that the x-coordinates are also the same in the respective corners.
Also simplified rectangle orientation detection.
JorjMcKie added a commit that referenced this issue Nov 30, 2022
This reverts commit 899ac3e.
JorjMcKie added a commit that referenced this issue Nov 30, 2022
This reverts commit 899ac3e.
JorjMcKie added a commit that referenced this issue Nov 30, 2022
JorjMcKie added a commit that referenced this issue Dec 9, 2022
Fix #2110 (Discussion item #2111):
File `__main__.py` - also include the font's xref in the generated file name.

Fix #2094:
File `helper-device.i` - also ensure equality of x coordinates of relevant corners before assuming a rectangle.

Fix #2087:
File `fitz.i` - if the JPX image format is already known, make sure to read the decoded image stream instead of the raw stream used in the other cases.
julian-smith-artifex-com pushed a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Dec 12, 2022
julian-smith-artifex-com pushed a commit that referenced this issue Dec 12, 2022
@julian-smith-artifex-com
Collaborator

Fixed in PyMuPDF-1.21.1.
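
For anyone who cannot upgrade yet, the idea behind the fix (trust the `/Filter` entry in the PDF dictionary and read the decoded stream) can be approximated in user code. This is only a sketch under assumptions, not the code from the actual commit: `extract_jpx_image` is a hypothetical helper, and it relies on the observation from the report that `xref_stream` yields a valid JPEG 2000 stream for `[/FlateDecode/JPXDecode]`.

```python
def extract_jpx_image(doc, xref):
    """Return JPEG 2000 bytes for an image xref, or None if it is not JPX.

    doc is expected to be a fitz.Document; only xref_get_key and
    xref_stream are used, so any object with those methods works.
    """
    kind, value = doc.xref_get_key(xref, "Filter")
    if kind not in ("name", "array"):
        return None  # no /Filter entry, or an unexpected value type
    # Accept both "/JPXDecode" and "[/FlateDecode/JPXDecode]".
    names = value.strip("[] ").replace("/", " ").split()
    if not names or names[-1] != "JPXDecode":
        return None
    # Per the report in this thread, the stream returned here is
    # already a valid JPEG 2000 stream.
    return doc.xref_stream(xref)
```

The returned bytes could be written to a `.jp2` or `.jpx` file as a fallback when `extract_image(xref)` returns nothing on an affected version.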
