Skip to content

FileTypeRouter silently drops MIME types containing "+" (e.g. image/svg+xml) into "unclassified" #11647

@Aarkin7

Description

@Aarkin7

Describe the bug
FileTypeRouter.init compiles every entry in mime_types as a regex without re.escape, even though the docstring promises that MIME types without regex patterns are matched exactly. As a result:

Any IANA MIME type containing + (e.g. image/svg+xml, application/ld+json, every application/+xml and application/+json) fails to match itself and silently falls into the unclassified bucket: no error, no warning.
The . in literals like application/pdf acts as a regex wildcard, so unrelated strings such as applicationXpdf match the wrong bucket (cross-contamination).

Both Path sources (via mimetypes.guess_type) and ByteStream sources (via source.mime_type) are affected.

Error message
None. The component does not raise or log anything, sources just end up in the wrong bucket, and the matching socket gets nothing.

Expected behavior
Strings that look like literal MIME types should match character-for-character, as the docstring already promises. FileTypeRouter(mime_types=["image/svg+xml"]) should route an image/svg+xml stream into the "image/svg+xml" bucket, not "unclassified". Explicit regex patterns like r"audio/.*" should keep working.

Additional context
Downstream effect: any pipeline that fans out to a SVG/JSON-LD/RSS/Atom converter via FileTypeRouter receives an empty socket and the document is silently lost. The bug is in haystack/components/routers/file_type_router.py at the init regex-compile loop and the run match loop.

To Reproduce

from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.dataclasses import ByteStream

# 1. Literal "+" MIME silently dropped
router = FileTypeRouter(mime_types=["image/svg+xml"])
out = router.run(sources=[ByteStream(b"<svg/>", mime_type="image/svg+xml")])
print(out)
# Actual:   {'unclassified': [...]}
# Expected: {'image/svg+xml': [...]}

# 2. Cross-contamination from "." wildcard
router = FileTypeRouter(mime_types=["application/pdf"])
out = router.run(sources=[ByteStream(b"x", mime_type="applicationXpdf")])
print(out)
# Actual:   {'application/pdf': [...]}   # wrong — matched the wildcard
# Expected: {'unclassified': [...]}

FAQ Check

System:

  • OS: macOS 26.4.1
  • Haystack version: main @ 2bc3dc9 (also reproduces on the latest released 2.x)

Metadata

Metadata

Assignees

Labels

P3Low priority, leave it in the backlog

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions