
How to: Set the content type using the UNIX file utility

Josh McArthur edited this page May 14, 2020 · 1 revision

CarrierWave uses MimeMagic and MiniMime to derive the content type of the file.

Both of these approaches can fail for some content types, especially less common formats. MiniMime relies on being able to test the extension (i.e. it does not analyse the actual content of the file at all), and MimeMagic uses the shared-mime-info database from Freedesktop.org. I know that, at the very least, the following files fail to be properly detected:

  • An empty zip file (i.e. a valid zip file with no files or directories) will be identified as "invalid/invalid" (it should be "application/zip")
  • A PGP-encrypted file will be identified as "invalid/invalid" (it should be "text/pgp")

These are the examples that have come up for my specific use case, and I believe there are others.

UNIX-like systems (Linux, BSD, macOS) ship with a utility called file. file accepts the path to a file, and inspects the file's contents for patterns of bytes that identify the type of file. This utility can return all sorts of handy information about the file, but in particular the --mime-type argument will cause the utility to return the mimetype of the file at that path. The file utility performs several different types of tests (see man file) to guess the type of the file, the most useful being the test against the magic pattern databases shipped by the operating system, typically located in:

  • /usr/share/misc/magic.mgc Default compiled list of magic.
  • /usr/share/misc/magic Directory containing default magic files.

The advantage of using these files is that they are usually auto-updated by system updates, so new file formats or file type indicators can be supported without library updates.
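To get a feel for what the utility returns, here is a minimal sketch that writes an empty zip archive (one of the failing cases above) to a temporary file with a deliberately unhelpful extension and asks file for its mimetype. The helper name mime_type_from_file_utility and the EMPTY_ZIP constant are my own for illustration and are not part of CarrierWave. The result depends on the magic database installed on your system, so the sketch prints whatever comes back rather than assuming a value.

```ruby
require "open3"
require "tempfile"

# Smallest valid zip archive: the 4-byte end-of-central-directory signature
# followed by 18 zero bytes (22 bytes in total, no entries).
EMPTY_ZIP = ("PK\x05\x06".b + ("\x00".b * 18)).freeze

# Hypothetical helper: asks the `file` utility for the mimetype of +path+.
# Returns nil if `file` is not installed, exits unsuccessfully, or prints nothing.
def mime_type_from_file_utility(path)
  output, status = Open3.capture2("file", "--brief", "--mime-type", path.to_s)
  type = output.strip.downcase
  status.success? && !type.empty? ? type : nil
rescue Errno::ENOENT
  # Raised when the `file` executable cannot be found on the PATH.
  nil
end

# The ".bin" extension gives extension-based detection nothing to work with,
# so any useful answer has to come from the file's contents.
Tempfile.create(["empty", ".bin"]) do |tmp|
  tmp.binmode
  tmp.write(EMPTY_ZIP)
  tmp.flush
  puts mime_type_from_file_utility(tmp.path).inspect
end
```

If file is not installed the sketch prints nil, which is the same signal the patch below uses to fall back to CarrierWave's default checks.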

The following file patches CarrierWave::SanitizedFile to use the file utility to extract the content type. If the file utility cannot be found, then CarrierWave will fall back to the MimeMagic and MiniMime checks (the default behaviour). This code can be placed in any autoloaded file, or you can specifically require it. I placed it in lib/carrierwave_ext/sanitized_file.rb:

# lib/carrierwave_ext/sanitized_file.rb
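require "shellwords" # provides String#shellescape
require "active_support/core_ext/object/blank" # provides #blank? and #presence
require "carrierwave/sanitized_file"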

module CarrierWave
  class SanitizedFile
    ##
    # Returns the content type of the file.
    #
    # === Returns
    #
    # [String] the content type of the file
    #
    def content_type
      @content_type ||=
        existing_content_type ||
        file_utility_content_type ||
        mime_magic_content_type ||
        mini_mime_content_type
    end

    ##
    # Uses the UNIX 'file' utility to extract the content type from
    # the file path.
    #
    # === Returns
    #
    # [String] the content type of the file, or nil if the 'file' command is not found or fails
    #          to resolve a mimetype
    def file_utility_content_type
      return unless path
      # `system` returns false (or nil) when `which` cannot find the utility
      return unless system("which", "file", out: File::NULL, err: File::NULL)

      # An invalid path echos a warning to STDERR
      `file --brief --mime-type #{path.shellescape}`
        .strip
        .downcase
        .presence
    end
  end
end

Why isn't this a pull request/part of CarrierWave? Well, it could be, but I felt that shelling out to a system command was probably too much of a potential security issue to be placed into the core of CarrierWave, even though the path is shell-escaped and the correct system methods are being used to try to prevent any security issues. It also relies on the which utility being available to check that file is going to work. I'm not sure how this works on Windows.
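If the shell-out and the dependency on which are a concern, one possible variation is sketched below. It is not what the patch above does, and I have not tested it on Windows. It looks the executable up on the PATH in plain Ruby and passes the arguments to Open3.capture2e as separate strings, so no shell is involved and nothing needs escaping. The names find_executable and MIME_TYPE_PATTERN are hypothetical, and the PATHEXT handling is an assumption about how a Windows build of file would be installed.

```ruby
require "open3"

# Rough shape of a mimetype ("type/subtype"). Anything else that `file`
# prints, such as a "cannot open" message, is treated as a failure.
MIME_TYPE_PATTERN = %r{\A[\w.+-]+/[\w.+-]+\z}

# Hypothetical helper: finds an executable on the PATH without calling `which`.
# PATHEXT is consulted so that names such as "file.exe" can match on Windows.
def find_executable(name)
  extensions = [""] + ENV.fetch("PATHEXT", "").split(";")
  directories = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).reject(&:empty?)

  directories.each do |dir|
    extensions.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

# Standalone version of the lookup that takes the path as an argument.
# Returns the mimetype, or nil if `file` is missing or its output is unusable.
def file_utility_content_type(path)
  executable = find_executable("file")
  return nil unless executable && path

  # Separate arguments bypass the shell entirely; "--" ends option parsing
  # so a path beginning with "-" is not mistaken for a flag.
  output, status = Open3.capture2e(executable, "--brief", "--mime-type", "--", path.to_s)
  type = output.strip.downcase
  status.success? && type.match?(MIME_TYPE_PATTERN) ? type : nil
end

puts file_utility_content_type(__FILE__).inspect
```

Checking the output against a pattern is a design choice of this sketch: it means an error message from the utility is never stored as a content type, and CarrierWave's default checks get a chance to run instead.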

It could also be a gem, but again, that is just concealing the shell command being run. I think that if you need more in-depth content type extraction than MiniMime and/or MimeMagic provide, this code snippet will work well for you, but it's best to place it into your project so it's clear what is going on.
