Skip to content

#146's end-of-document HTML check misses partial reads (e.g. Active Storage's 4 KB identify chunk) #152

@mechiland

Description

@mechiland

Follow-up to #102 / #149 (fixed by #146 — thanks!). #146 fully fixes full-document detection, but the new heuristic has a gap for consumers that identify a type from a leading byte range rather than the whole file.

#146 detects HTML by matching either the start (\A\s*<(!DOCTYPE html|html)) or the end (</html>\s*\z) of the IO. The end-of-document check is what rescues documents whose first bytes aren't a bare tag — i.e. a leading HTML comment, a UTF-8 BOM, or an <?xml …?> declaration.

The problem: Active Storage identifies content types from only the first 4 KB of the blob. ActiveStorage::Blob::Identifiable#identify_content_type does:

Marcel::MimeType.for(
  service.download_chunk(key, 0...4.kilobytes),
  name: filename.to_s,
  declared_type: content_type
)

A partial reader like this never sees the trailing </html>. So for any HTML file larger than 4 KB whose first bytes aren't a bare tag (leading comment / BOM / <?xml> prolog), neither magic matches within the chunk and detection falls through to application/xml (or text/plain for a BOM).

Passing name: / declared_type: "text/html" does not rescue it, because most_specific_type keeps the magic result — text/html is not a registered child of application/xml / text/plain.

Repro (against main @ 7e2cbc1, i.e. with #146)

require "marcel"; require "stringio"

# An HTML document > 4 KB that begins with a comment.
# (Same result with a leading UTF-8 BOM or an `<?xml ...?>` declaration.)
html = "<!-- leading comment, padded past 128 chars: " + ("a" * 90) + " -->\n" \
       "<html><body>\n" + ("<p>x</p>\n" * 600) + "</body></html>"   # ~5.5 KB

chunk = html.byteslice(0, 4096)   # what Active Storage feeds Marcel

Marcel::MimeType.for(StringIO.new(html))                                                  # => "text/html"      ✅ fixed by #146
Marcel::MimeType.for(StringIO.new(chunk), name: "page.html", declared_type: "text/html")  # => "application/xml" ❌

Found in the wild with a 218 KB HTML file (a generated UI mockup that prepends a <!-- … --> description on line 1, and ends with </html>): on main, for(full)text/html but for(first 4 KB)application/xml.

Real-world impact

Active Storage stores the misidentified type. application/xml is in content_types_to_serve_as_binary and text/plain is not in content_types_allowed_inline, so such HTML is served with Content-Disposition: attachment and downloads instead of previewing inline. This is a regression from marcel 1.1.0, which returned text/html for all of these inputs. (It also reproduces on the released 1.2.1, which doesn't contain #146 at all.)

Suggested direction

Add a start-anchored magic that tolerates a leading BOM / whitespace / <?xml …?> declaration / HTML comments before the first tag — essentially the start-anchored regex proposed in #149, extended to allow an optional XML prolog — so detection doesn't depend on the document's end (which partial-read consumers like Active Storage never see). Keeping the end-of-document </html> check as an additional signal is fine.

Versions: marcel main (#146) and 1.2.1; Ruby 4.0.5.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions