Summary of a number of differences in mime type reporting before and after Tika #48

@malclocke

Hello 👋

Given the differences showing up in MIME type reporting before and after the switch to Tika, I thought it might be worth getting ahead of the bug reports by running MIME type detection over a large set of example files before and after the change.

I found a source of about 500 files here: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/tree/master/tests/mime-detection. Unfortunately, as far as I can tell, Tika doesn't have a similar set of test files in its source.

I then ran the following test script against this set of files:

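require "marcel"

# Print "<basename> <detected MIME type>" for each file given on the command line.
ARGV.each do |filename|
  basename = File.basename(filename)

  # Pass both the file contents and the name so Marcel can use either for detection.
  File.open(filename) do |file|
    puts "%s %s" % [basename, Marcel::MimeType.for(file, name: basename)]
  end
end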

I ran this script against two versions of Marcel: v0.3.3 and HEAD at the time of writing (a525d5b).

The attached CSV lists every file for which the two versions reported a different MIME type, 286 in total. Most of the affected MIME types are, I would say, fairly niche and could probably be ignored without ever causing anyone a problem, but there are some common ones in there. Conversely, the set of files does not cover every MIME type known to humanity, so other differences will no doubt still show up.
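For anyone wanting to reproduce the comparison, here is a minimal sketch of the diffing step. It is not the script that produced the attached CSV. It assumes the output of the test script above was saved once per Marcel version, and the helper names `parse_report` and `mime_differences` are hypothetical. The column layout of the attached CSV is not shown here, so the sketch just returns rows of filename, old type, new type.

```ruby
# Parse lines of the form "<basename> <mime type>" into a Hash of
# basename => mime type. MIME types contain no spaces, so splitting on
# the last space tolerates filenames that do contain spaces.
def parse_report(lines)
  lines.each_with_object({}) do |line, report|
    name, _separator, mime = line.chomp.rpartition(" ")
    report[name] = mime unless name.empty?
  end
end

# Return [basename, old_type, new_type] rows for every file that appears
# in both reports with a different MIME type.
def mime_differences(old_lines, new_lines)
  old_report = parse_report(old_lines)
  new_report = parse_report(new_lines)

  old_report.filter_map do |name, old_mime|
    new_mime = new_report[name]
    [name, old_mime, new_mime] if new_mime && new_mime != old_mime
  end
end
```

With the two runs saved to files (for example hypothetical `before.txt` and `after.txt`), `mime_differences(File.readlines("before.txt"), File.readlines("after.txt"))` gives the changed rows, and `.size` on the result gives the count.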

Anyway, I figured this list may be useful. Feel free to close this issue if it's not. 🥰

mimetype_for_diff-v0.3.3-a525d5b3.csv
