Hello 👋
In light of the differences showing up in MIME type reporting before and after the switch to Tika, I thought it might be nice to get ahead of the bug reports by collecting a big set of example files and running MIME type detection on them before and after the change.
I found a source of about 500 files here: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/tree/master/tests/mime-detection. Unfortunately, as far as I can tell, Tika doesn't have a similar set of test files in its source.
I then ran the following test script against this set of files:
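    require "marcel"

    # Print "<basename> <detected MIME type>" for each file given on the command line.
    ARGV.each do |filename|
      basename = File.basename(filename)
      File.open(filename) do |file|
        # Detection uses both the file contents and the filename.
        puts "%s %s" % [basename, Marcel::MimeType.for(file, name: basename)]
      end
    end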
I ran this script using two versions of Marcel: v0.3.3 and the HEAD at the time of writing, a525d5b.
The attached CSV shows every instance where the two versions reported a different MIME type, 286 in total. I would say most of the MIME types are fairly niche and could no doubt be ignored without ever causing anyone a problem, but there are some common ones in there. Conversely, the set of files is not a complete list of all MIME types known to humanity, so others will no doubt still show up.
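For anyone who wants to repeat the comparison, here is a rough sketch of how the two outputs could be turned into a CSV of differences. It is not necessarily how the attached file was produced. It assumes the output of the script for each Marcel version was saved to its own text file, one "basename type" pair per line; the helper names (`parse_results`, `mime_type_diff`, `diff_to_csv`) and the column headers are made up for illustration.

```ruby
require "csv"

# Parse "basename mime/type" lines into a Hash of basename => MIME type.
# The type is taken as the last space-separated token, so basenames
# that contain spaces are kept whole.
def parse_results(text)
  text.each_line.with_object({}) do |line, results|
    line = line.chomp
    next if line.empty?
    name, _, type = line.rpartition(" ")
    results[name] = type
  end
end

# Return [basename, old_type, new_type] rows for every file whose
# reported type differs between the two runs.
def mime_type_diff(before, after)
  before.each_with_object([]) do |(name, old_type), rows|
    new_type = after[name]
    rows << [name, old_type, new_type] unless old_type == new_type
  end
end

# Render the rows as CSV text with a header line (hypothetical column names).
def diff_to_csv(rows)
  CSV.generate do |csv|
    csv << %w[filename before after]
    rows.each { |row| csv << row }
  end
end

# Only runs when two result files are passed on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  before = parse_results(File.read(ARGV[0]))
  after = parse_results(File.read(ARGV[1]))
  puts diff_to_csv(mime_type_diff(before, after))
end
```

Saved under a hypothetical name like `mime_diff.rb`, it could be run as `ruby mime_diff.rb results-v0.3.3.txt results-head.txt > diff.csv`, where the two input files are the saved outputs of the test script.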
Anyway, I figured this list may be useful. Feel free to close this issue if it's not. 🥰
mimetype_for_diff-v0.3.3-a525d5b3.csv