TIKA-2630: Wrong height and width metadata for JPEG images #255
dameikle commented Oct 30, 2018
- Added extraction of image height/width from ExifSubIFDDirectory for compressed images
- Include directory name as key qualifier for Exif directories to avoid clashes
Hey @tballison - given the key name clashes for the Exif metadata, I am proposing to add the directory name as the qualifier, hence the request for a review. I was tempted to do this for all directories to keep it clean, but I worry about downstream code that relies on the current values in the 1.x stream. I also thought about not doing it, but unless we do it for at least Exif, we will continue to give the wrong value here, short of adding some logic for a key hierarchy in CopyUnknownFieldsHandler.
I don't know enough about exif to be useful on this. If there are only a few standard directories (say, ExifSubIFDDirectory or ExifIFD0Descriptor), could we make those static property prefixes or static properties? Given that I don't know what we'll find in exif, I'm very hesitant about using the literal directory name. If we do go with the literal directory name in some cases, we should prefix that. WDYT?
@tballison Thanks for looking over it - my main worry was changing the behaviour. There is a fixed structure that one should expect, with data included in the directories only if written, so we could make them static property prefixes. I think the directory name is safe in this instance given how the format works, and I agree with the prefix route - it's easy enough to make this fixed against those directories. I'll go with that.
Sorry, @dameikle, I should have addressed the change in behavior...the actual reason you asked for a second pair of eyes. 😆 I agree that changes in behavior are bad. IIUC, though, we'd be fixing what we're currently doing, which is over-writing info, right? If we did something like this in branch_1x:
That would maintain the same current (wrong) over-writing behavior, and introduce new tag names. I have no idea what an appropriate prefix would be for EXIF_ROOT...something static and documented and appropriate.
I was thinking something similar but stopped, as it is not always going to be Exif metadata that is processed by the Extractor. I am beginning to think it might be the right thing to actually fix the overwriting, unless you completely disagree.
I'm ok with correcting bad behavior. :D bq. going to Exif metadata that is processed by the Extractor
Yes, it could be one of different types of metadata directories - Exif, IPTC, XMP, ICC, etc. The challenge is really that Exif uses the same tag name in different directories, so unless keyed it overwrites.
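For anyone following along, the clash described above can be reproduced without any Tika or metadata-extractor code. This is only a minimal sketch using plain Java collections: the `Tag` class, the directory and tag names, the values, and the `:` separator are all made up for illustration and are not what the patch or the library actually uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExifKeyClashSketch {

    /** Hypothetical holder for one extracted tag: source directory, tag name, value. */
    static final class Tag {
        final String directory;
        final String name;
        final String value;

        Tag(String directory, String name, String value) {
            this.directory = directory;
            this.name = name;
            this.value = value;
        }
    }

    /** Keys by tag name only, so a later directory silently overwrites an earlier one. */
    static Map<String, String> flatKeys(List<Tag> tags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Tag t : tags) {
            out.put(t.name, t.value);
        }
        return out;
    }

    /** Keys by "directory:name" (separator is an assumption), so both values survive. */
    static Map<String, String> qualifiedKeys(List<Tag> tags) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Tag t : tags) {
            out.put(t.directory + ":" + t.name, t.value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative data only: two directories reporting the same tag name.
        List<Tag> tags = Arrays.asList(
                new Tag("Exif IFD0", "Image Width", "160"),
                new Tag("Exif SubIFD", "Image Width", "4000"));

        System.out.println("flat:      " + flatKeys(tags));
        System.out.println("qualified: " + qualifiedKeys(tags));
    }
}
```

With flat keys, whichever directory is processed last wins, which is how a wrong width or height can end up in the output. Qualifying the key keeps every value, at the cost of new key names for downstream consumers.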
What's the status on this? :)
Unless there are objections, let's put this in 1.23?
@tballison - I agree, let's go with this one |
* TIKA-2630:
  - Added extraction of image height/width from ExifSubIFDDirectory for compressed images
  - Include directory name as key qualifier for Exif directories to avoid clashes
* TIKA-2630: Tidied up code

# Conflicts:
# tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java