
Google Docs/Sheets/PPT detection is being limited to 64kb (offset 30:65536) #59

@nvh0412

Hi team,

Thanks for migrating this gem to Tika and replacing the mimemagic gem. We're using the latest gem version in production and so far so good. Great work, and thank you for your effort!

We just found that certain xlsx and docx files uploaded by our users are being misdetected as application/zip, the same as in issue #35.

But it only happens with some files larger than 64kb.

Summary:

There were 3 xlsx files:

  1. test.xlsx => 5kb => mimetype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  2. test2.xlsx => 30kb => mimetype: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  3. test3.xlsx => 368kb => mimetype: application/zip

The root cause of the 3rd case is that the match for [Content_Types].xml, which is checked at offset 30:65536, fails because Google Docs/Sheets files have the fingerprint items at the end of the file.

Can we implement a negative offset to read from the end of the file for these cases?
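
To illustrate what I mean, here is a rough sketch of the idea in plain Ruby. It is not the gem's actual matching code: the helper names (`marker_in_head?`, `marker_in_tail?`, `ooxml_marker?`) and the 64 KB tail window are my own assumptions. It relies on the fact that a ZIP archive keeps a central directory listing every entry name near the end of the file, so a read from the end should see `[Content_Types].xml` even when the entry itself sits past offset 65536.

```ruby
require "stringio"

# All names below are hypothetical, for illustration only.
MARKER     = "[Content_Types].xml".b
HEAD_START = 30       # start of the offset range quoted in this issue
HEAD_END   = 65_536   # end of the offset range quoted in this issue
TAIL_BYTES = 65_536   # assumed size of a window read from the end

# Current behaviour as I understand it: only look near the start of the file.
def marker_in_head?(io)
  io.seek(HEAD_START)
  # Read far enough that a match starting at HEAD_END is still complete.
  chunk = io.read(HEAD_END - HEAD_START + MARKER.bytesize) || "".b
  chunk.include?(MARKER)
end

# Proposed "negative offset": look at the last TAIL_BYTES of the file,
# where the ZIP central directory lists the entry names.
def marker_in_tail?(io)
  start = [io.size - TAIL_BYTES, 0].max
  io.seek(start)
  chunk = io.read || "".b
  chunk.include?(MARKER)
end

# Try the head first, then fall back to the tail.
def ooxml_marker?(io)
  marker_in_head?(io) || marker_in_tail?(io)
end
```

With an IO opened in binary mode (for example `File.open(path, "rb")`), `ooxml_marker?` would only tell us the archive looks like an OOXML package. Telling xlsx apart from docx would still need the existing sub-matches.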
