Detect archive type by file signature when extension is wrong or missing #120

Merged
davidnewhall merged 2 commits into golift:main from oceanplexian:feat/magic-number-detection
Feb 19, 2026

@oceanplexian oceanplexian commented Feb 19, 2026

Contributor

What's going on

Right now xtractr relies entirely on file extensions to figure out what kind of archive it's dealing with. That works great most of the time, but it falls apart when files have incorrect extensions — which happens more often than you'd think. A common example from the issue: a RAR file with a .gz extension just blows up with gzip: invalid header.

What this does

This adds file signature (magic number) detection as a fallback to the existing extension-based matching. The flow is:

  1. Try matching by extension, same as before
  2. If the extension matches but extraction fails, try reading the file's magic bytes to find the real type
  3. If no extension matches at all, try signature detection before giving up

So existing behavior is completely preserved — if your extensions are correct, nothing changes. The signature detection only kicks in when something goes wrong.

Supported signatures

ZIP, RAR (v4 + v5), 7z, gzip, bzip2, XZ, zstandard, LZ4, LZMA, brotli, AR/DEB, RPM, and ISO9660 (at all three standard offsets). TAR doesn't have a reliable magic number so it stays extension-only, which is fine.
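As a rough idea of what a signature table can look like, here is a minimal sketch covering only the formats above whose fixed leading bytes are well documented. The `signature` type, the `signatures` slice, and `detect` are hypothetical names for illustration; the actual table in magic.go may be structured differently and covers more formats.

```go
package main

import (
	"bytes"
	"fmt"
)

// signature pairs a format name with the bytes expected at offset 0.
type signature struct {
	name  string
	magic []byte
}

// signatures lists well-known prefix magic numbers (illustrative subset).
var signatures = []signature{
	{"rar5", []byte("Rar!\x1a\x07\x01\x00")},
	{"rar4", []byte("Rar!\x1a\x07\x00")},
	{"7z", []byte("7z\xbc\xaf\x27\x1c")},
	{"xz", []byte("\xfd7zXZ\x00")},
	{"zip", []byte("PK\x03\x04")},
	{"zstd", []byte{0x28, 0xb5, 0x2f, 0xfd}},
	{"lz4", []byte{0x04, 0x22, 0x4d, 0x18}},
	{"rpm", []byte{0xed, 0xab, 0xee, 0xdb}},
	{"ar", []byte("!<arch>\n")},
	{"bzip2", []byte("BZh")},
	{"gzip", []byte{0x1f, 0x8b}},
}

// detect returns the name of the first matching signature, or "" if none match.
func detect(header []byte) string {
	for _, s := range signatures {
		if bytes.HasPrefix(header, s.magic) {
			return s.name
		}
	}
	return ""
}

func main() {
	samples := [][]byte{
		[]byte("PK\x03\x04..."),
		[]byte("Rar!\x1a\x07\x00..."),
		[]byte("just some text"),
	}
	for _, s := range samples {
		fmt.Printf("%q -> %q\n", s[:4], detect(s))
	}
}
```

Short signatures such as gzip's two bytes are more prone to accidental matches than longer ones, which is one reason to treat signature detection as a fallback rather than the primary check.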

Example

A RAR file named movie.gz that previously failed:

[ERROR] gzip.NewReader: gzip: invalid header

Now gets detected as RAR by its Rar! signature and extracts correctly.

How I tested it

Added magic_test.go with tests covering:

  • Every supported signature type individually (write magic bytes → verify detection)
  • ISO9660 at all three sector offsets (0x8001, 0x8801, 0x9001)
  • No false positives on random data
  • End-to-end mismatched extension tests: real gzip archive named .rar, real zip named .gz, gzip with completely unknown .foobar extension — all extract successfully
  • Regression test confirming normal extension-based extraction still works
  • Content verification on extracted files

All existing tests continue to pass.

Files changed

  • magic.go — signature table and detection logic
  • files.go — modified ExtractFile() to fall back to signature detection
  • magic_test.go — tests

@oceanplexian

Contributor Author

Physical test results

Ran a standalone program that creates real archive files with wrong extensions and extracts them using the branch. All pass:

=== Physical Test: xtractr magic number detection ===

--- Test 1: Gzip file saved as .rar ---
  PASS: extracted 61 bytes, 1 files
    -> movie.rar
    Content: "This is the content of a file that was gzipped but named .rar"

--- Test 2: Zip file saved as .gz ---
  PASS: extracted 39 bytes, 1 files
    -> hello.txt
    Content: "Hello from a zip that thinks it's gzip!"

--- Test 3: Gzip file saved as .foobar ---
  PASS: extracted 42 bytes, 1 files
    -> data.foobar
    Content: "Content from a gzip with .foobar extension"

--- Test 4: Normal .zip extraction (regression) ---
  PASS: extracted 34 bytes, 1 files
    -> readme.txt
    Content: "Normal zip extraction still works!"

--- Test 5: IsArchiveFileByContent detection ---
  Gzip named .rar detected by content: true
  Plain text detected as archive: false
  .rar extension detected by IsArchiveFile: true
  PASS: content detection works correctly

=== All physical tests passed! ===

Test 1 mirrors the scenario from the issue with the formats swapped: the issue involved a RAR file with a .gz extension, while this test uses a gzip file with a .rar extension. Either way, the file now extracts correctly instead of failing with an error such as gzip: invalid header.

The only lint issue in CI is the pre-existing gomoddirectives warning on the iso9660 replace directive in go.mod — not from this PR.

@oceanplexian oceanplexian force-pushed the feat/magic-number-detection branch from bcf07fe to 87cde95 Compare February 19, 2026 04:11
@oceanplexian oceanplexian force-pushed the feat/magic-number-detection branch from 87cde95 to c41c694 Compare February 19, 2026 04:14
When a file has an incorrect or missing extension, xtractr now reads the
first bytes of the file to identify the archive type by its signature.
Extension-based matching is still tried first; signature detection kicks
in only as a fallback.
@oceanplexian oceanplexian force-pushed the feat/magic-number-detection branch from c41c694 to 5d851bf Compare February 19, 2026 04:16
Introduces a custom error type that stashes both the extension-based
and signature-based extraction errors so consumers can inspect all
failure details via errors.As. Also supports a Warnings slice for
non-fatal info like truncated names or extension mismatches.

@davidnewhall davidnewhall left a comment

Contributor

Beautiful!

@davidnewhall davidnewhall merged commit 52090ae into golift:main Feb 19, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

Suggestion: Read file signature instead of extension when determing compression type
Xtractr should use file content to determine archive type