#146's end-of-document HTML check misses partial reads (e.g. Active Storage's 4 KB identify chunk)

Follow-up to #102 / #149 (fixed by #146 — thanks!). #146 fully fixes *full-document* detection, but the new heuristic has a gap for consumers that identify a type from a **leading byte range** rather than the whole file.

#146 detects HTML by matching either the **start** (`\A\s*<(!DOCTYPE html|html)`) **or** the **end** (`</html>\s*\z`) of the IO. The end-of-document check is what rescues documents whose first bytes aren't a bare tag — i.e. a leading HTML comment, a UTF-8 BOM, or an `<?xml …?>` declaration.

The problem: **Active Storage identifies content types from only the first 4 KB of the blob.** `ActiveStorage::Blob::Identifiable#identify_content_type` does:

```ruby
Marcel::MimeType.for(
  service.download_chunk(key, 0...4.kilobytes),
  name: filename.to_s,
  declared_type: content_type
)
```

A partial reader like this never sees the trailing `</html>`. So for **any HTML file larger than 4 KB whose first bytes aren't a bare tag** (leading comment / BOM / `<?xml>` prolog), *neither* magic matches within the chunk and detection falls through to `application/xml` (or `text/plain` for a BOM).

Passing `name:` / `declared_type: "text/html"` does **not** rescue it, because `most_specific_type` keeps the magic result — `text/html` is not a registered child of `application/xml` / `text/plain`.

### Repro (against `main` @ `7e2cbc1`, i.e. *with* #146)

```ruby
require "marcel"; require "stringio"

# An HTML document > 4 KB that begins with a comment.
# (Same result with a leading UTF-8 BOM or an `<?xml ...?>` declaration.)
html = "\n" \
       "<html><body>\n" + ("<p>x</p>\n" * 600) + "</body></html>"   # ~5.5 KB

chunk = html.byteslice(0, 4096)   # what Active Storage feeds Marcel

Marcel::MimeType.for(StringIO.new(html))                                                  # => "text/html"      ✅ fixed by #146
Marcel::MimeType.for(StringIO.new(chunk), name: "page.html", declared_type: "text/html")  # => "application/xml" ❌
```

Found in the wild with a 218 KB HTML file (a generated UI mockup that prepends a `` description on line 1, and ends with `</html>`): on `main`, `for(full)` → `text/html` but `for(first 4 KB)` → `application/xml`.

### Real-world impact

Active Storage stores the misidentified type. `application/xml` is in `content_types_to_serve_as_binary` and `text/plain` is not in `content_types_allowed_inline`, so such HTML is served with `Content-Disposition: attachment` and **downloads instead of previewing inline**. This is a regression from marcel 1.1.0, which returned `text/html` for all of these inputs. (It also reproduces on the released 1.2.1, which doesn't contain #146 at all.)

### Suggested direction

Add a **start-anchored** magic that tolerates a leading BOM / whitespace / `<?xml …?>` declaration / HTML comments before the first tag — essentially the start-anchored regex proposed in #149, extended to allow an optional XML prolog — so detection doesn't depend on the document's *end* (which partial-read consumers like Active Storage never see). Keeping the end-of-document `</html>` check as an additional signal is fine.

Versions: marcel `main` (#146) and 1.2.1; Ruby 4.0.5.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#146's end-of-document HTML check misses partial reads (e.g. Active Storage's 4 KB identify chunk) #152

Repro (against `main` @ `7e2cbc1`, i.e. with #146)

Real-world impact

Suggested direction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

#146's end-of-document HTML check misses partial reads (e.g. Active Storage's 4 KB identify chunk) #152

Description

Repro (against main @ 7e2cbc1, i.e. with #146)

Real-world impact

Suggested direction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Repro (against `main` @ `7e2cbc1`, i.e. with #146)