Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How far should Infer support file extensions ? #24

Open
ririsoft opened this issue Jul 31, 2020 · 9 comments
Open

How far should Infer support file extensions ? #24

ririsoft opened this issue Jul 31, 2020 · 9 comments

Comments

@ririsoft
Copy link
Contributor

ririsoft commented Jul 31, 2020

This issue is following discussion started in #21 code review.

Infer has limited support for file extensions. Other crates address file extension conversion from/to mime type, but I feel frustrated with all of them. Let's clarify wether my concerns could/should be addressed by Infer or in another crate.

User story

I am building web applications which serves some files on the file system or proxied from other network locations. I want to set the Content-Type header which drives the web browser behavior in some contexts.

I am also using the mime-type to build URL's with file extensions. I am using mime_guess for this task.

Video and image mime types can be detected by Infer but many text types cannot: Javascript, CSS in particular. My algorithm is to detect file content with Infer first, fallback to mime_guess when I know the file extension second, and default to "application/octet-stream" when no match found.

Concerns

  1. Data source consistency: Infer uses its own hard coded data source. mime-db and mime_guess share the same data source from jshttp/mime-db, but there is no versioning of the DB: their might be some different behaviors across compilations. Also for a given mime type Infer results may be different from mime_guess results. The official source of media types is IANA
  2. API consistency: Infer and mime_db works with &'static str while mime_guess works with MIME from the mime crate as well as a custom MimeGuess type. Using Infer with mime_guess is thus not much convenient: you need boilerplate to convert MIME to &str and back.
  3. Type aliases: neither Infer, mime_guess or mime_db support type aliases. For instance text/javascript is a type alias of application/javascript: Infer results to is_mime_supported should be true in both cases.
  4. Extension aliases: Infer does not support extension aliases, while mime_db and mime_guess do.
  5. DB Overloading: Infer allows to have custom types, and even overloading the default ones (since match custom matchers before default matchers #13), while mime-db and mime_guess does not.
  6. no_std, no_alloc: Infer allows for no_std, no_alloc since Add no_std, no_alloc support #18, while mime-db and mime_guess does not.
  7. maintenance status: Infer is maintained, so is mime-db, but mime_guess does not seem to be maintained anymore.
  8. build system: mime_db has a build script which has network dependency, while Infer and mime_guess can be built offline.

All this make it difficult for Infer to work hand in hand with either mime-db or mime_guess.

Options

  1. Improve mime_guess but the maintainer does not answer to pull requests. I am abandoning this option for now.
  2. Improve mime-db. I will eventually study this option once we clarified on 1. and 3.
  3. Address all points above within Infer, which is what I want to discuss here.
  4. Write another crate wrapping Infer and mime-db or be a complete rewrite of all. I will consider this if 1, 2 or 3 fail.

Option 2 - Improving mime-db

We could improve mime-db at a point where Infer would remove its current limited support for file extensions and advert to use mime-db instead. We could adjust both crates API so that they work hand-in-hand.

Option 3 - Merging mime_guess and mime_db into Infer.

As per my user story above having a single crate to guess a mime type from either content or file extension makes sens. Such ideal crate would have the following features:

  • no_alloc, no_std.
  • A static DB built from IANA and versioned outside of the build script.
  • Allow overloading the default DB at runtime (and eventually at compile time too).
  • Handle type and extension aliases.

@abonander, @viz-rs you are welcome to join the discussion!

@ririsoft ririsoft mentioned this issue Jul 31, 2020
@paolobarbolini
Copy link
Collaborator

paolobarbolini commented Aug 2, 2020

  1. I think it should be documented what version they are using, and we should decide how updating existing mime types, like in fcbd3c5, or even adding new ones should be handled, since that could change the output of infer::get.
  2. mime_guess could probably also use static lifetimes? Anyway, with it being an unmaintained crate there's not much we can do to fix it.
  3. We could have infer::mime_type be a slice of possible mime types, with the first one being the most recent one and so on?
  4. Same for point 3?
  5. We could open a PR to implement it for mime-db?
  6. mime-db should also be able to be no_std, no_alloc
  7. .
  8. At the moment mime-db is using a particular version of jshttp/mime-db, but I feel like this doesn't make sense. Even when they switch back to master, they could just have a script that pings them when a new jshttp/mime-db version is released and have the mime database be embedded in the crate instead of having users witness it change constantly under their feet.

Since the idea is to leave infer and mime-db as separate crates anyway, I think we could have infer::Type be a separate crate, like mime_guess does with the mime::Mime, and have both infer and mime-db return it. That way we could have something like

// infer::get and mime_db::get return Type
// Type implements Default with the the default mime type application/octet-stream and extension ""
let kind = infer::get(&buf).or_else(|| mime_db::get(&extension)).unwrap_or_default();

We would remove file extensions limited support and put something like this in all examples so that users could just use it for their crates.

@paolobarbolini
Copy link
Collaborator

mime_db has a build script which has network dependency, while Infer and mime_guess can be built offline.

Looking more into it they don't require the network to build, since the build script is disabled. Maybe it's an odd way of doing it, but they have to change that line to make it run. Else the vendored extensions.rs and types.rs are used

@paolobarbolini
Copy link
Collaborator

I made an initial PR to improve mime-db and make it no_std viz-rs/mime-db#3

@ririsoft
Copy link
Contributor Author

ririsoft commented Aug 7, 2020

Tank you for your time and comments here!

3. We could have `infer::mime_type` be a slice of possible mime types, with the first one being the most recent one and so on?

I would let infer just do one thing and do it well: infer the type from raw data. mime_db would then handle all types, extensions and their relationship. In the case of type aliases infer would return the canonical (default) type defined by IANA, leaving users the choice to use mime_db to query for aliases.

5. We could open a PR to implement it for `mime-db`?

Yes, once we clarified further our intentions with mime-db.

At the moment mime-db is using a particular version of jshttp/mime-db, but I feel like this doesn't make sense. Even when they switch back to master, they could just have a script that pings them when a new jshttp/mime-db version is released and have the mime database be embedded in the crate instead of having users witness it change constantly under their feet.

I fully agree. Also maybe mime-db could be built from IANA data directly.

Since the idea is to leave infer and mime-db as separate crates anyway, I think we could have infer::Type be a separate crate, like mime_guess does with the mime::Mime, and have both infer and mime-db return it. That way we could have something like

// infer::get and mime_db::get return Type
// Type implements Default with the the default mime type application/octet-stream and extension ""
let kind = infer::get(&buf).or_else(|| mime_db::get(&extension)).unwrap_or_default();

We would remove file extensions limited support and put something like this in all examples so that users could just use it for their crates.

It sounds great. Then we would have a Type crate (leaving the question open to merge with the Mime crate), a DB crate which allows to query mime types and return some Types and an Infer crate which infer mime types from raw data, returning static Types instance from the DB crate. This makes sense to me.

I made an initial PR to improve mime-db and make it no_std viz-rs/mime-db#3

Let's see if the maintainer is opened for changes and hopefully move forward on this issue with him. Otherwise we could progress on our own (it would be sad, but this would be a classic open source story).

@paolobarbolini
Copy link
Collaborator

paolobarbolini commented Sep 8, 2020

So, now that mime-db v1 has been out for a month and no more progress has been made on this, I think we should give another look at the current situation.

Now something like this is possible:

let buf = // some file header
let ext = // some file extension

let mime_type = infer::get(&buf).map_or_else(
    || mime_db::lookup(ext).unwrap_or("application/octet-stream"),
    |t| t.mime_type(),
); // this returns a &'static str
let extensions = mime_db::extensions(mime_type); // this returns an Option<impl Iterator<Item = &'static str>>

It's not perfect, for example mime_db::extensions unfortunately returns an Option<impl Iterator<Item = &'static str>> instead of just impl Iterator<Item = &'static str>. Also in some cases it might have been useful to have a struct implementing Iterator, instead of impl Iterator.

At least now we have &'static str instead of &str, but I wouldn't have expected the creator to release v1 so quickly 😅 . The latest versions of Rust added some really cool things, like expanding what const functions can do, and I would have liked to give a look those too.

What do you think @ririsoft?

@ririsoft
Copy link
Contributor Author

ririsoft commented Sep 9, 2020

Hi @paolobarbolini,

The situation is clearly improving thanks to your commits.

My understanding is that you are favoring mime-db instead of mime_guess. This is totally fine for me, since mime_guess authors do not show much interest in the discussion.

I agree that mime-db released a v1.0 a little bit early. A v2 will be needed if we want to change the Option<impl Iterator<Item = &'static str>> annoyance. But this is not a big deal.

I believe that Infer should share its mime database with mime-db. Is this something you have in mind too ? Shouldn't mime-db be modified so that Infer could depend on it ?

Last I believe that we should offer users a single API with combination of mime-db and Infer features. Should this be done in an external crate that would depends on Infer and mime-db ? Should this be done at Infer level ? What API are we talking about ? I am leaving this as open questions.

For instance, taking your example above, I would personnally like to have something like:

let buf = // some file header
let ext = // some file extension

let mime_type = infer::get_or_ext(&buf, ext); // this returns a &'static str and default to "application/octet-stream" if no match found.
let extensions = infer::extensions(mime_type); // this returns a struct that implements Iterator<Item = &'static str>

@abonander
Copy link

mime_guess is in passive maintenance mode. It's a tangent of a tangent of a side-project I haven't touched in years.

@ririsoft
Copy link
Contributor Author

ririsoft commented Sep 9, 2020

mime_guess is in passive maintenance mode. It's a tangent of a tangent of a side-project I haven't touched in years.

Thanks a lot for the clarification, it helps us move forward. Cheers.

@paolobarbolini
Copy link
Collaborator

paolobarbolini commented Sep 10, 2020

My understanding is that you are favoring mime-db instead of mime_guess

I wouldn't mind working with mime_guess instead, on the bright side it would have the advantage of not having both mime-db and mime_guess on most web projects (because of crates like reqwest and warp), but first we would have to change the mime crate to be able to get &'static str out of it (or return a Cow, which would be easier but a bit less elegant)

I believe that Infer should share its mime database with mime-db. Is this something you have in mind too ? Shouldn't mime-db be modified so that Infer could depend on it ?

The more I think about this API that I start to think that having a third crate just for "merging" infer's and mime-db's functionality together wouldn't be too bad. But I haven't really made progress on this, so I'll take some more time to decide.

In the meantime I'm going to work on a PR to fix the Option<impl Iterator<Item = &'static str>> thing, just so I'm not constantly reminded by the huge mistake of returning an Option<Iterator> 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants