Detect archive type by file signature when extension is wrong or missing by oceanplexian · Pull Request #120 · golift/xtractr

oceanplexian · 2026-02-19T03:49:34Z

Fixes "Suggestion: Read file signature instead of extension when determining compression type" (Unpackerr/unpackerr#544)
Closes "Xtractr should use file content to determine archive type" (#55)

What's going on

Right now xtractr relies entirely on file extensions to figure out what kind of archive it's dealing with. That works great most of the time, but it falls apart when files have incorrect extensions — which happens more often than you'd think. A common example from the issue: a RAR file with a .gz extension just blows up with gzip: invalid header.

What this does

This adds file signature (magic number) detection as a fallback to the existing extension-based matching. The flow is:

1. Try matching by extension, same as before
2. If the extension matches but extraction fails, try reading the file's magic bytes to find the real type
3. If no extension matches at all, try signature detection before giving up

So existing behavior is completely preserved — if your extensions are correct, nothing changes. The signature detection only kicks in when something goes wrong.

Supported signatures

ZIP, RAR (v4 + v5), 7z, gzip, bzip2, XZ, zstandard, LZ4, LZMA, brotli, AR/DEB, RPM, and ISO9660 (at all three standard offsets). TAR doesn't have a reliable magic number so it stays extension-only, which is fine.
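
For readers unfamiliar with magic numbers, here is a small sketch of a prefix-based signature table covering a subset of the formats listed. The byte values are the well-known published signatures; the signature type, the table layout, and detectByHeader are assumptions for illustration and may not match what magic.go actually does.

```go
package main

import (
	"bytes"
	"fmt"
)

// signature pairs a format name with the bytes expected at offset 0.
type signature struct {
	name  string
	magic []byte
}

// signatures is an illustrative subset, not the PR's full table.
var signatures = []signature{
	{"zip", []byte{'P', 'K', 0x03, 0x04}},
	{"rar", []byte{'R', 'a', 'r', '!', 0x1a, 0x07}}, // prefix shared by RAR v4 and v5
	{"7z", []byte{'7', 'z', 0xbc, 0xaf, 0x27, 0x1c}},
	{"gzip", []byte{0x1f, 0x8b}},
	{"bzip2", []byte{'B', 'Z', 'h'}},
	{"xz", []byte{0xfd, '7', 'z', 'X', 'Z', 0x00}},
	{"zstd", []byte{0x28, 0xb5, 0x2f, 0xfd}},
}

// detectByHeader returns the first format whose magic bytes prefix header,
// or an empty string when nothing matches.
func detectByHeader(header []byte) string {
	for _, sig := range signatures {
		if bytes.HasPrefix(header, sig.magic) {
			return sig.name
		}
	}

	return ""
}

func main() {
	fmt.Println(detectByHeader([]byte("Rar!\x1a\x07\x00 payload"))) // rar
	fmt.Println(detectByHeader([]byte{0x1f, 0x8b, 0x08}))           // gzip
	fmt.Println(detectByHeader([]byte("just some text")) == "")     // true
}
```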

Example

A RAR file named movie.gz that previously failed:

[ERROR] gzip.NewReader: gzip: invalid header

Now gets detected as RAR by its Rar! signature and extracts correctly.

How I tested it

Added magic_test.go with tests covering:

- Every supported signature type individually (write magic bytes → verify detection)
- ISO9660 at all three sector offsets (0x8001, 0x8801, 0x9001)
- No false positives on random data
- End-to-end mismatched extension tests: real gzip archive named .rar, real zip named .gz, gzip with completely unknown .foobar extension; all extract successfully
- Regression test confirming normal extension-based extraction still works
- Content verification on extracted files

All existing tests continue to pass.

Files changed

magic.go — signature table and detection logic
files.go — modified ExtractFile() to fall back to signature detection
magic_test.go — tests

oceanplexian · 2026-02-19T03:57:13Z

Physical test results

Ran a standalone program that creates real archive files with wrong extensions and extracts them using the branch. All pass:

=== Physical Test: xtractr magic number detection ===

--- Test 1: Gzip file saved as .rar ---
  PASS: extracted 61 bytes, 1 files
    -> movie.rar
    Content: "This is the content of a file that was gzipped but named .rar"

--- Test 2: Zip file saved as .gz ---
  PASS: extracted 39 bytes, 1 files
    -> hello.txt
    Content: "Hello from a zip that thinks it's gzip!"

--- Test 3: Gzip file saved as .foobar ---
  PASS: extracted 42 bytes, 1 files
    -> data.foobar
    Content: "Content from a gzip with .foobar extension"

--- Test 4: Normal .zip extraction (regression) ---
  PASS: extracted 34 bytes, 1 files
    -> readme.txt
    Content: "Normal zip extraction still works!"

--- Test 5: IsArchiveFileByContent detection ---
  Gzip named .rar detected by content: true
  Plain text detected as archive: false
  .rar extension detected by IsArchiveFile: true
  PASS: content detection works correctly

=== All physical tests passed! ===

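Test 1 is the mirror image of the scenario from the issue: the issue had a RAR file with a .gz extension, while this test uses a gzip file with a .rar extension. Either way, the mislabeled file now extracts fine instead of failing in the extractor chosen by extension (gzip: invalid header in the issue's case).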

The only lint issue in CI is the pre-existing gomoddirectives warning on the iso9660 replace directive in go.mod — not from this PR.

When a file has an incorrect or missing extension, xtractr now reads the first bytes of the file to identify the archive type by its signature. Extension-based matching is still tried first; signature detection kicks in only as a fallback.

Introduces a custom error type that stashes both the extension-based and signature-based extraction errors so consumers can inspect all failure details via errors.As. Also supports a Warnings slice for non-fatal info like truncated names or extension mismatches.

davidnewhall

Beautiful!

oceanplexian force-pushed the feat/magic-number-detection branch from bcf07fe to 87cde95 Compare February 19, 2026 04:11

davidnewhall reviewed Feb 19, 2026

Comment thread magic.go

davidnewhall reviewed Feb 19, 2026

Comment thread files.go Outdated

oceanplexian force-pushed the feat/magic-number-detection branch from 87cde95 to c41c694 Compare February 19, 2026 04:14

davidnewhall reviewed Feb 19, 2026

Comment thread files.go Outdated

oceanplexian force-pushed the feat/magic-number-detection branch from c41c694 to 5d851bf Compare February 19, 2026 04:16

davidnewhall approved these changes Feb 19, 2026

davidnewhall merged commit 52090ae into golift:main Feb 19, 2026
8 checks passed

davidnewhall merged 2 commits into golift:main from oceanplexian:feat/magic-number-detection
