fix: Prefer magic bytes to determine extraction logic #1541

Open
norepro wants to merge 8 commits into pop-os:master from norepro:stronger-mime-detect-on-extract


@norepro norepro commented Jan 19, 2026

Contributor

Extracting files uses the filename to determine the MIME type and thus the crates to invoke for extracting the file. This causes an error if the file has an incorrect extension, e.g., .tar instead of .tar.gz.

Fixing the file extension works, but it is a bad user experience, especially compared to other file managers (Dolphin, Nautilus) and utilities (tar) that succeed in the same scenario.

When extracting a file, first try to determine its type from its magic bytes. If that fails for any reason, fall back to the filename logic.

Checking magic bytes is technically slower than simple path logic, but this is only invoked on a single file when the user has confirmed they want to extract it anyway. This logic is not used in any other scenario, e.g. determining icons to show.
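
For readers skimming the thread, here is a minimal sketch of the idea, not the code in this PR. It uses only the standard library and a hand-written signature table; the `ArchiveKind` enum and the `sniff_magic`, `guess_from_name`, and `detect` helpers are hypothetical names invented for illustration. The byte signatures are the well-known ones for each format.

```rust
use std::path::Path;

/// Archive kinds recognised by this sketch (illustrative, not the app's real types).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ArchiveKind {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Zip,
    Tar,
}

/// Identify an archive from its leading bytes, or None if no signature matches.
pub fn sniff_magic(buf: &[u8]) -> Option<ArchiveKind> {
    if buf.starts_with(&[0x1f, 0x8b]) {
        Some(ArchiveKind::Gzip)
    } else if buf.starts_with(b"BZh") {
        Some(ArchiveKind::Bzip2)
    } else if buf.starts_with(&[0xfd, b'7', b'z', b'X', b'Z', 0x00]) {
        Some(ArchiveKind::Xz)
    } else if buf.starts_with(&[0x28, 0xb5, 0x2f, 0xfd]) {
        Some(ArchiveKind::Zstd)
    } else if buf.starts_with(b"PK\x03\x04") {
        Some(ArchiveKind::Zip)
    } else if buf.get(257..262) == Some(&b"ustar"[..]) {
        // POSIX tar keeps its "ustar" marker at offset 257, not at the start.
        Some(ArchiveKind::Tar)
    } else {
        None
    }
}

/// Guess the archive kind from the file name alone (the pre-existing style of logic).
pub fn guess_from_name(path: &Path) -> Option<ArchiveKind> {
    let name = path.file_name()?.to_str()?.to_ascii_lowercase();
    let table = [
        (".gz", ArchiveKind::Gzip),
        (".tgz", ArchiveKind::Gzip),
        (".bz2", ArchiveKind::Bzip2),
        (".xz", ArchiveKind::Xz),
        (".zst", ArchiveKind::Zstd),
        (".zip", ArchiveKind::Zip),
        (".tar", ArchiveKind::Tar),
    ];
    table
        .iter()
        .find(|(ext, _)| name.ends_with(*ext))
        .map(|(_, kind)| *kind)
}

/// Prefer the magic bytes; fall back to the file name only if sniffing fails.
pub fn detect(path: &Path, prefix: &[u8]) -> Option<ArchiveKind> {
    sniff_magic(prefix).or_else(|| guess_from_name(path))
}
```

With this shape, a gzip-compressed tarball that was saved as `backup.tar` is detected as gzip from its first two bytes, which is the scenario from the linked issue.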

Fixes #1522

@jackpot51

Member

We already have mime type detection, can that be reused?

@norepro

norepro commented Jan 20, 2026

Contributor Author

> We already have mime type detection, can that be reused?

With some extra effort, it can! xdg-mime-rs does support detection via magic bytes, but it leaves the caller responsible for loading "enough" bytes. The crate has logic to work out how many bytes that is, but it's unfortunately private. The other crate in use, mime_guess, only looks at filenames, not magic bytes.

One option is to copy that snippet into the app, but it's a bit noisy because it would also require us to populate the file metadata argument. This is the reason I went with the infer crate; it does what you expect with only a few lines of code.

It's not a hill I'll die on, but just for context!

@jackpot51

Member

The reason I used xdg-mime for mime detection is that it can use external and upgradeable mime databases.

@norepro

norepro commented Jan 20, 2026

Contributor Author

> The reason I used xdg-mime for mime detection is that it can use external and upgradeable mime databases.

That's fair. The infer crate is indeed a self-contained database.

I'll update the PR tonight to work with xdg-mime-rs.

@norepro

norepro commented Jan 21, 2026

Contributor Author

Quick update: I have it working with only the xdg-mime crate, but it requires hardcoding an upper bound on the number of bytes to read for detection. This would break if we ever support a new archive type whose magic bytes fall outside that initial buffer.

xdg-mime solves this by calculating the minimum bytes to read from their database, but that is unfortunately also private.
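
To make the trade-off concrete, here is a small standard-library sketch of a bounded prefix read. The `read_prefix` and `has_magic_at` helpers are hypothetical names for illustration and are not part of xdg-mime or this PR. The offsets used in the note below are the well-known ones for POSIX tar (`ustar` at byte 257) and ISO 9660 images (`CD001` at byte 32769); the ISO case is only an example of a signature that sits far from the start of the file, not a format this PR handles.

```rust
use std::io::{self, Read};

/// Read at most `limit` bytes from the start of `reader`.
pub fn read_prefix<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // `take` caps how much `read_to_end` is allowed to pull from the reader.
    reader.take(limit).read_to_end(&mut buf)?;
    Ok(buf)
}

/// True if `magic` is present at `offset` within the bytes that were actually read.
pub fn has_magic_at(prefix: &[u8], offset: usize, magic: &[u8]) -> bool {
    prefix.get(offset..offset + magic.len()) == Some(magic)
}
```

A fixed limit of a few hundred bytes or more is enough to see the tar marker at offset 257, but no small fixed limit can see a signature at offset 32769. That is the failure mode a database-derived minimum would avoid.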

I'll open an issue with them to discuss making magic byte preference easier. We'll see how it goes!

@norepro

norepro commented Feb 1, 2026

Contributor Author

I have updated the code to address the feedback:

  1. Try xdg-mime-rs with the first 1024 bytes of the file. This leverages the shared MIME database, if it exists.
  2. If that fails (e.g. low certainty, or the platform lacks a shared DB), fall back to the infer crate, which has a built-in DB.
  3. If the infer crate fails, fall back to the existing filename logic.
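
The ordering of those three steps can be sketched as follows. This is an illustration of the fallback chain only, not the code in the PR: `detect_mime` is a hypothetical helper, and the three closures stand in for the xdg-mime lookup, the infer lookup, and the existing filename logic, so that no crate API is assumed here.

```rust
/// Try each detector in priority order and return the first MIME type found.
///
/// `shared_db` stands in for the xdg-mime lookup, `builtin_db` for the infer
/// lookup, and `by_filename` for the existing filename-based logic.
pub fn detect_mime<S, B, F>(
    shared_db: S,
    builtin_db: B,
    by_filename: F,
    prefix: &[u8],
    file_name: &str,
) -> Option<String>
where
    S: Fn(&[u8]) -> Option<String>,
    B: Fn(&[u8]) -> Option<String>,
    F: Fn(&str) -> Option<String>,
{
    shared_db(prefix)
        // Only consulted when the shared database gave no answer.
        .or_else(|| builtin_db(prefix))
        // Last resort: trust the file name.
        .or_else(|| by_filename(file_name))
}
```

Because each step is lazy, the later detectors never run when an earlier one succeeds, so the filename is only trusted when both content-based lookups come back empty.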

I opened ebassi/xdg-mime-rs#37 with some suggestions regarding magic byte detection. This is not blocking, however; it would simply reduce the amount of code we have on our side.

@norepro

norepro commented Apr 25, 2026

Contributor Author

Merged with master. It looks like my issue on xdg-mime-rs hasn't had any interaction, so I don't think we will be able to reduce the code in this PR any time soon. It still works and fixes the bug, though!

@norepro norepro changed the title Prefer magic bytes to determine extraction logic fix: Prefer magic bytes to determine extraction logic Apr 25, 2026

Development

Successfully merging this pull request may close these issues.

Unpacking tar fails with "failed to iterate over files", works with tar in terminal

2 participants