Does not detect old Office formats #46

Open
erikian opened this issue Sep 3, 2021 · 2 comments

erikian commented Sep 3, 2021

Not sure if this is related to #45, but my program panics with thread 'main' panicked at 'assertion failed: sector_id < self.num_sectors', C:\Users\ian\.cargo\registry\src\github.com-1ecc6299db9ec823\cfb-0.4.0\src\internal\sector.rs:65:9 when I use infer::get_from_path to read .doc, .ppt and .xls files. I created test.doc, test.xls and test.ppt to reproduce this. All of them except test.ppt trigger the same panic, so I've also included a different .ppt file that does crash.

I'm using v0.5.0 on Windows 10. The code is based on example/file.rs. Here are the files I used: infer.zip

use std::process::exit;

fn main() {
    let path = "path\\to\\test.doc";

    match infer::get_from_path(path) {
        Ok(Some(info)) => {
            println!("Through the arcane magic of this crate we determined the file type to be");
            println!("mime type: {}", info.mime_type());
            println!("extension: {}", info.extension());
        }
        Ok(None) => {
            eprintln!("Unknown file type 😞");
            eprintln!("If you think infer should be able to recognize this file type open an issue on GitHub!");
            exit(1);
        }
        Err(e) => {
            eprintln!("Looks like something went wrong 😔");
            eprintln!("{}", e);
            exit(1);
        }
    }
}

yveszoundi commented Dec 24, 2022

I ran into the same issue, not only with legacy Microsoft Office formats but also with OpenDocument and Microsoft Open XML formats.

Problems observed

  • MIME type detection appeared unreliable at times with newly created files:
    • Newer Office files (OpenDocument or Microsoft Open XML) detected as plain ZIP
    • Legacy MS Office files detected as application/octet-stream (if I recall correctly)

Potential root cause

Is this because the "matching logic" is not hierarchical in some cases where it should be? I do see some MatcherType::Archive subsets, but shouldn't those be nested options under a given container type (zip, etc.)?

Potential solution

This is not based on any study or deep understanding of the current code. While the "matchers" ordering implies that it should reliably work, this is not really the case.

If matchers were traversed as a tree-like structure rather than by pure linear iteration, I think the problem might go away.

  • zipMatcher.addSubMatcherFunction(matchForOpenDocument);
  • zipMatcher.addSubMatcherFunction(matchForOpenXML);
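
To make the tree idea concrete, here is a minimal Rust sketch. It is not infer's actual API: `Matcher`, `add_sub_matcher`, `detect`, and the byte-scanning checks are hypothetical names and deliberately naive heuristics, and the two sub-matchers only cover one OpenDocument and one Open XML type for illustration. A node reports its own type only when none of its sub-matchers claims the buffer.

```rust
/// Hypothetical tree node: a matcher plus more specific sub-matchers.
struct Matcher {
    mime: &'static str,
    matches: fn(&[u8]) -> bool,
    children: Vec<Matcher>,
}

impl Matcher {
    fn new(mime: &'static str, matches: fn(&[u8]) -> bool) -> Self {
        Matcher { mime, matches, children: Vec::new() }
    }

    /// Registers a more specific matcher that is only tried when `self` matches.
    fn add_sub_matcher(mut self, child: Matcher) -> Self {
        self.children.push(child);
        self
    }

    /// Returns the most specific MIME type, or None if this node does not match.
    fn detect(&self, buf: &[u8]) -> Option<&'static str> {
        if !(self.matches)(buf) {
            return None;
        }
        // Prefer the first matching child; fall back to this node's own type.
        self.children
            .iter()
            .find_map(|child| child.detect(buf))
            .or(Some(self.mime))
    }
}

/// Naive substring search; enough for a sketch, not for production use.
fn contains(buf: &[u8], needle: &[u8]) -> bool {
    buf.windows(needle.len()).any(|w| w == needle)
}

fn is_zip(buf: &[u8]) -> bool {
    buf.starts_with(b"PK\x03\x04")
}

// Assumption: an ODF text document stores an uncompressed "mimetype" entry.
fn is_odt(buf: &[u8]) -> bool {
    contains(buf, b"mimetypeapplication/vnd.oasis.opendocument.text")
}

// Assumption: a .docx archive contains the conventional part name below.
fn is_docx(buf: &[u8]) -> bool {
    contains(buf, b"word/document.xml")
}

fn zip_tree() -> Matcher {
    Matcher::new("application/zip", is_zip)
        .add_sub_matcher(Matcher::new("application/vnd.oasis.opendocument.text", is_odt))
        .add_sub_matcher(Matcher::new(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            is_docx,
        ))
}
```

With this shape, `zip_tree().detect(bytes)` can never return an Office type for something that is not a ZIP in the first place, and a ZIP that no sub-matcher recognizes still comes back as `application/zip`.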

Workaround

The "workaround" sadly led to implementing the relevant logic manually. In my particular application, I'm only concerned with a few known/supported file types, and I'm not necessarily doing a "super robust" job either (validations, etc.).

If the file has a Zip format
  If the file has a Microsoft Open XML format based on Zip contents
    Then return the supported detected Microsoft Office format
  If the file has an OpenDocument format based on Zip contents
    Then return the supported detected OpenDocument format
  Otherwise just return the zip mime type

Else if the file has a Microsoft Compound File Binary format
  Get the value of the root directory entry for its object class GUID (CLSID)
  Match the value against file types that we care about
  Return the detected mime type or `application/octet-stream`
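
The pseudocode above could be sketched in Rust roughly as follows, using only the standard library. This is an illustration under stated assumptions, not the code from my application: all function names are hypothetical, the Open XML check just scans for conventional part names, the OpenDocument check assumes the `mimetype` entry is the first, uncompressed ZIP entry with no extra field, and the CFB part reads the root entry's CLSID at fixed offsets from the header without any further validation.

```rust
const ZIP_MAGIC: &[u8] = b"PK\x03\x04";
const CFB_MAGIC: &[u8] = &[0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
const OCTET_STREAM: &str = "application/octet-stream";

/// Naive substring search over the whole buffer.
fn contains(buf: &[u8], needle: &[u8]) -> bool {
    buf.windows(needle.len()).any(|w| w == needle)
}

/// Heuristic: look for conventional Open XML part names (an assumption).
fn ooxml_mime(buf: &[u8]) -> Option<&'static str> {
    const PARTS: [(&str, &str); 3] = [
        ("word/document.xml", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        ("xl/workbook.xml", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        ("ppt/presentation.xml", "application/vnd.openxmlformats-officedocument.presentationml.presentation"),
    ];
    for (part, mime) in PARTS {
        if contains(buf, part.as_bytes()) {
            return Some(mime);
        }
    }
    None
}

/// Heuristic: "mimetype" as first entry name (offset 30), content right after it.
fn odf_mime(buf: &[u8]) -> Option<&'static str> {
    const TYPES: [&str; 3] = [
        "application/vnd.oasis.opendocument.text",
        "application/vnd.oasis.opendocument.spreadsheet",
        "application/vnd.oasis.opendocument.presentation",
    ];
    if buf.get(30..38)? != &b"mimetype"[..] {
        return None;
    }
    let body = buf.get(38..)?;
    // Prefix match, so template variants map to their base type here.
    TYPES.iter().copied().find(|t| body.starts_with(t.as_bytes()))
}

/// Reads the root directory entry's CLSID and formats it as a lowercase GUID.
fn cfb_root_clsid(buf: &[u8]) -> Option<String> {
    let s = buf.get(30..32)?; // sector shift
    let d = buf.get(48..52)?; // first directory sector
    let sector_shift = u16::from_le_bytes([s[0], s[1]]);
    if sector_shift != 9 && sector_shift != 12 {
        return None;
    }
    let first_dir_sector = u32::from_le_bytes([d[0], d[1], d[2], d[3]]) as usize;
    // Sector N starts at (N + 1) * sector_size; the CLSID sits at byte 80 of the entry.
    let entry = first_dir_sector.checked_add(1)?.checked_mul(1usize << sector_shift)?;
    let c = buf.get(entry.checked_add(80)?..entry.checked_add(96)?)?;
    Some(format!(
        "{:08x}-{:04x}-{:04x}-{:02x}{:02x}-{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}",
        u32::from_le_bytes([c[0], c[1], c[2], c[3]]),
        u16::from_le_bytes([c[4], c[5]]),
        u16::from_le_bytes([c[6], c[7]]),
        c[8], c[9], c[10], c[11], c[12], c[13], c[14], c[15]
    ))
}

fn detect(buf: &[u8]) -> &'static str {
    if buf.starts_with(ZIP_MAGIC) {
        ooxml_mime(buf).or_else(|| odf_mime(buf)).unwrap_or("application/zip")
    } else if buf.starts_with(CFB_MAGIC) {
        match cfb_root_clsid(buf).as_deref() {
            Some("00020906-0000-0000-c000-000000000046") => "application/msword",
            Some("00020810-0000-0000-c000-000000000046")
            | Some("00020820-0000-0000-c000-000000000046") => "application/vnd.ms-excel",
            Some("64818d10-4f9b-11cf-86ea-00aa00b929e8") => "application/vnd.ms-powerpoint",
            _ => OCTET_STREAM,
        }
    } else {
        OCTET_STREAM
    }
}
```

Because every read goes through `get` and checked arithmetic, a truncated or malformed compound file falls through to `application/octet-stream` instead of panicking.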

@yveszoundi

Here is my poor man's MIME detection logic (not robust): https://github.com/rimerosolutions/entrusted/blob/main/app/entrusted_container/src/mimetypes.rs

I'd much prefer switching back to this crate when this issue is addressed.
