Describe the bug
FileTypeRouter.__init__ compiles every entry in mime_types as a regex without re.escape, even though the docstring promises that MIME types without regex patterns are matched exactly. As a result:
Any IANA MIME type containing + (e.g. image/svg+xml, application/ld+json, every application/*+xml and application/*+json) fails to match itself, because the + is read as a quantifier on the preceding character. The source silently falls into the unclassified bucket, with no error and no warning.
The . in literals like application/vnd.ms-excel acts as a regex wildcard, so unrelated strings such as application/vndXms-excel match the wrong bucket (cross-contamination).
Both Path sources (via mimetypes.guess_type) and ByteStream sources (via source.mime_type) are affected.
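Both failure modes can be reproduced with the standard re module alone, without Haystack. The sketch below assumes the router does a full match of the compiled pattern against the source's MIME string; that is an assumption about the internals, and the variable names are only illustrative.

```python
import re

# Unescaped: "+" is a quantifier ("one or more g"), not a literal plus sign.
svg = re.compile("image/svg+xml")
plus_matches_itself = svg.fullmatch("image/svg+xml") is not None
plus_matches_other = svg.fullmatch("image/svgxml") is not None

# Unescaped: "." matches any single character, not only a literal dot.
excel = re.compile("application/vnd.ms-excel")
dot_matches_other = excel.fullmatch("application/vndXms-excel") is not None

# Escaped: the pattern matches only the exact string.
escaped = re.compile(re.escape("image/svg+xml"))
escaped_matches_itself = escaped.fullmatch("image/svg+xml") is not None
escaped_matches_other = escaped.fullmatch("image/svgxml") is not None

print(plus_matches_itself, plus_matches_other, dot_matches_other)
print(escaped_matches_itself, escaped_matches_other)
```

The first line printed is False True True (the unescaped patterns misbehave) and the second is True False (the escaped pattern matches only itself).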
Error message
None. The component does not raise or log anything. Sources just end up in the wrong bucket, and the matching socket gets nothing.
Expected behavior
Strings that look like literal MIME types should match character-for-character, as the docstring already promises. FileTypeRouter(mime_types=["image/svg+xml"]) should route an image/svg+xml stream into the "image/svg+xml" bucket, not "unclassified". Explicit regex patterns like r"audio/.*" should keep working.
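One possible way to get this behavior, sketched under assumptions: treat an entry as a literal when it fits the RFC 6838 restricted-name grammar for type/subtype, and escape it; compile everything else as a regex. The helper name compile_mime_pattern and the constant _LITERAL_MIME are hypothetical and not part of Haystack, and this is not necessarily how the maintainers would want to fix it.

```python
import re

# RFC 6838 restricted-name: first char is a letter or digit, the rest may
# also include ! # $ & - ^ _ . +  (applied to both type and subtype).
_LITERAL_MIME = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*"
)


def compile_mime_pattern(entry: str) -> re.Pattern:
    """Hypothetical helper: escape literal-looking MIME types, keep regexes."""
    if _LITERAL_MIME.fullmatch(entry):
        # Looks like a plain MIME type: match it character-for-character.
        return re.compile(re.escape(entry))
    # Contains characters outside the MIME grammar (e.g. "*"): treat as regex.
    return re.compile(entry)


literal = compile_mime_pattern("image/svg+xml")
wildcard = compile_mime_pattern(r"audio/.*")
print(literal.fullmatch("image/svg+xml") is not None)
print(wildcard.fullmatch("audio/mpeg") is not None)
```

The heuristic is ambiguous for regexes built only from characters that are also legal in MIME names, for example r"audio/x.+", which this sketch would treat as a literal. An explicit opt-in for regex entries would avoid that ambiguity at the cost of an API change.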
Additional context
Downstream effect: any pipeline that fans out to an SVG/JSON-LD/RSS/Atom converter via FileTypeRouter receives an empty socket and the document is silently lost. The bug is in haystack/components/routers/file_type_router.py, in the regex-compile loop of __init__ and the match loop of run.
To Reproduce
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.dataclasses import ByteStream
# 1. Literal "+" MIME silently dropped
router = FileTypeRouter(mime_types=["image/svg+xml"])
out = router.run(sources=[ByteStream(b"<svg/>", mime_type="image/svg+xml")])
print(out)
# Actual: {'unclassified': [...]}
# Expected: {'image/svg+xml': [...]}
# 2. Cross-contamination from "." wildcard
router = FileTypeRouter(mime_types=["application/vnd.ms-excel"])
out = router.run(sources=[ByteStream(b"x", mime_type="application/vndXms-excel")])
print(out)
# Actual: {'application/vnd.ms-excel': [...]} # wrong: "." matched "X" as a wildcard
# Expected: {'unclassified': [...]}
FAQ Check
System:
- OS: macOS 26.4.1
- Haystack version: main @ 2bc3dc9 (also reproduces on the latest released 2.x)