Follow-up to #102 / #149 (fixed by #146 — thanks!). #146 fully fixes full-document detection, but the new heuristic has a gap for consumers that identify a type from a leading byte range rather than the whole file.
#146 detects HTML by matching either the start (\A\s*<(!DOCTYPE html|html)) or the end (</html>\s*\z) of the IO. The end-of-document check is what rescues documents whose first bytes aren't a bare tag — i.e. a leading HTML comment, a UTF-8 BOM, or an <?xml …?> declaration.
The problem: Active Storage identifies content types from only the first 4 KB of the blob. ActiveStorage::Blob::Identifiable#identify_content_type does:
Marcel::MimeType.for(
service.download_chunk(key, 0...4.kilobytes),
name: filename.to_s,
declared_type: content_type
)
A partial reader like this never sees the trailing </html>. So for any HTML file larger than 4 KB whose first bytes aren't a bare tag (leading comment / BOM / <?xml> prolog), neither magic matches within the chunk and detection falls through to application/xml (or text/plain for a BOM).
Passing name: / declared_type: "text/html" does not rescue it, because most_specific_type keeps the magic result — text/html is not a registered child of application/xml / text/plain.
Repro (against main @ 7e2cbc1, i.e. with #146)
require "marcel"; require "stringio"
# An HTML document > 4 KB that begins with a comment.
# (Same result with a leading UTF-8 BOM or an `<?xml ...?>` declaration.)
html = "<!-- leading comment, padded past 128 chars: " + ("a" * 90) + " -->\n" \
"<html><body>\n" + ("<p>x</p>\n" * 600) + "</body></html>" # ~5.5 KB
chunk = html.byteslice(0, 4096) # what Active Storage feeds Marcel
Marcel::MimeType.for(StringIO.new(html)) # => "text/html" ✅ fixed by #146
Marcel::MimeType.for(StringIO.new(chunk), name: "page.html", declared_type: "text/html") # => "application/xml" ❌
Found in the wild with a 218 KB HTML file (a generated UI mockup that prepends a <!-- … --> description on line 1, and ends with </html>): on main, for(full) → text/html but for(first 4 KB) → application/xml.
Real-world impact
Active Storage stores the misidentified type. application/xml is in content_types_to_serve_as_binary and text/plain is not in content_types_allowed_inline, so such HTML is served with Content-Disposition: attachment and downloads instead of previewing inline. This is a regression from marcel 1.1.0, which returned text/html for all of these inputs. (It also reproduces on the released 1.2.1, which doesn't contain #146 at all.)
Suggested direction
Add a start-anchored magic that tolerates a leading BOM / whitespace / <?xml …?> declaration / HTML comments before the first tag — essentially the start-anchored regex proposed in #149, extended to allow an optional XML prolog — so detection doesn't depend on the document's end (which partial-read consumers like Active Storage never see). Keeping the end-of-document </html> check as an additional signal is fine.
Versions: marcel main (#146) and 1.2.1; Ruby 4.0.5.
Follow-up to #102 / #149 (fixed by #146 — thanks!). #146 fully fixes full-document detection, but the new heuristic has a gap for consumers that identify a type from a leading byte range rather than the whole file.
#146 detects HTML by matching either the start (
\A\s*<(!DOCTYPE html|html)) or the end (</html>\s*\z) of the IO. The end-of-document check is what rescues documents whose first bytes aren't a bare tag — i.e. a leading HTML comment, a UTF-8 BOM, or an<?xml …?>declaration.The problem: Active Storage identifies content types from only the first 4 KB of the blob.
ActiveStorage::Blob::Identifiable#identify_content_typedoes:A partial reader like this never sees the trailing
</html>. So for any HTML file larger than 4 KB whose first bytes aren't a bare tag (leading comment / BOM /<?xml>prolog), neither magic matches within the chunk and detection falls through toapplication/xml(ortext/plainfor a BOM).Passing
name:/declared_type: "text/html"does not rescue it, becausemost_specific_typekeeps the magic result —text/htmlis not a registered child ofapplication/xml/text/plain.Repro (against
main@7e2cbc1, i.e. with #146)Found in the wild with a 218 KB HTML file (a generated UI mockup that prepends a
<!-- … -->description on line 1, and ends with</html>): onmain,for(full)→text/htmlbutfor(first 4 KB)→application/xml.Real-world impact
Active Storage stores the misidentified type.
application/xmlis incontent_types_to_serve_as_binaryandtext/plainis not incontent_types_allowed_inline, so such HTML is served withContent-Disposition: attachmentand downloads instead of previewing inline. This is a regression from marcel 1.1.0, which returnedtext/htmlfor all of these inputs. (It also reproduces on the released 1.2.1, which doesn't contain #146 at all.)Suggested direction
Add a start-anchored magic that tolerates a leading BOM / whitespace /
<?xml …?>declaration / HTML comments before the first tag — essentially the start-anchored regex proposed in #149, extended to allow an optional XML prolog — so detection doesn't depend on the document's end (which partial-read consumers like Active Storage never see). Keeping the end-of-document</html>check as an additional signal is fine.Versions: marcel
main(#146) and 1.2.1; Ruby 4.0.5.