Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIKA-2630: Wrong height and width metadata for JPEG images #255

Merged
merged 3 commits into from Dec 2, 2019
Merged

TIKA-2630: Wrong height and width metadata for JPEG images #255

merged 3 commits into from Dec 2, 2019

Conversation

dameikle
Copy link
Member

  • Added extraction of image height/width from ExifSubIFDDirectory for compressed images
  • Include directory name as key qualifier for Exif directories to avoid clashes

- Added extraction of image height/width from ExifSubIFDDirectory for compressed images
- Include directory name as key qualifier for Exif directories to avoid clashes
@dameikle
Copy link
Member Author

Hey @tballison - given the key name clashes for the Exif metadata I am proposing to add the directory name as the qualifier, hence the request for a review.

I was tempted to do this for all directories to make it clean but worry about downstream code that rely on the current values in 1.x stream.

I also thought about not doing it, but without doing it for at least Exif, we will continue to give the wrong value here without some logic to have a key hierarchy in CopyUnknownFieldsHandler.

@tballison
Copy link
Contributor

I don't know enough about exif to be useful on this. If there are only a few standard directories (say, ExifSubIFDDirectory or ExifIFD0Descriptor), could we make those static property prefixes or static properties? Given that I don't know what we'll find in exif, I'm very hesitant about using the literal directory name. If we do go with the literal directory name in some cases, we should prefix that. WDYT?

@dameikle
Copy link
Member Author

@tballison Thanks for looking over it - my main worry was changing the behaviour. There is a fixed structure that one should expect with data included in them only if written, so we could make them static property prefixes. I think the directory name is safe in this instance given how the format works and agree the prefix route - its easy enough to make this fixed against those directories. I'll go with that.

@tballison
Copy link
Contributor

tballison commented Oct 30, 2018

Sorry, @dameikle, I should have addressed the change in behavior...the actual reason you asked for a second pair of eyes. 😆 I agree that changes in behavior are bad. IIUC, though, we'd be fixing what we're currently doing, which is over-writing info, right?

If we did something like this in branch_1x:

    if (directory instanceof ExifDirectoryBase) {
        metadata.set(directory.getName() + ":" + tag.getTagName(), tag.getDescription());
        metadata.set(tag.getTagName(), tag.getDescription());
    } else {
        metadata.set(EXIF_ROOT + ":" + tag.getTagName(), tag.getDescription());
        metadata.set(tag.getTagName(), tag.getDescription());
    }

That would maintain the same current (wrong) over-writing behavior, and introduce new tag names. I have no idea what an appropriate prefix would be for EXIF_ROOT...something static and documented and appropriate.

@dameikle
Copy link
Member Author

dameikle commented Nov 9, 2018

I was thinking something similar but stopped as it is not always going to Exif metadata that is processed by the Extractor. I am beginning to think it might be the right thing to actually fix the overwriting, unless you completely disagree.

@tballison
Copy link
Contributor

I'm ok with correcting bad behavior. :D

bq. going to Exif metadata that is processed by the Extractor
What do you mean by this exactly? In that portion of the code, it might not be Exif metadata?

@dameikle
Copy link
Member Author

Yes, it could be one of different types of metadata directories - Exif, IPTC, XMP, ICC, etc. The challenge is really that Exif uses the same tag name in different directories, so unless keyed it overwrites.

@saitho
Copy link

saitho commented May 24, 2019

What's the status on this? :)

@tballison
Copy link
Contributor

Unless there are objections, let's put this in 1.23?

@dameikle
Copy link
Member Author

@tballison - I agree, let's go with this one

@tballison tballison merged commit 90880e1 into apache:branch_1x Dec 2, 2019
tballison pushed a commit that referenced this pull request Dec 2, 2019
* TIKA-2630:
- Added extraction of image height/width from ExifSubIFDDirectory for compressed images
- Include directory name as key qualifier for Exif directories to avoid clashes

* TIKA-2630: Tidied up code

# Conflicts:
#	tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants