Describe the bug
When HTMLToDocument receives a ByteStream with empty data (data=b""), it still passes the empty bytes to trafilatura.extract(). The parse attempt makes lxml emit several ERROR/WARNING level log events that surface as noise in production logs, even though the component's existing try/except already skips the source gracefully.
Error message
lxml parsing failed: Document is empty
lxml parser bytestring Document is empty
empty HTML tree: None
discarding data: None
The errors appear in pairs (lxml fires them during multiple internal parse attempts), and the surrounding pipeline continues normally, but the log output looks like a serious failure.
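Until the component handles this case itself, one possible stopgap is to raise the log level of the library that emits these messages. The snippet below is a minimal sketch, not a confirmed fix: it assumes the messages come from loggers named under "trafilatura" (for example "trafilatura.utils"), which I have not verified against the installed version, and the helper name quiet_trafilatura is made up for illustration.

```python
import logging

# Assumption: the noisy messages are emitted by loggers whose names start
# with "trafilatura" (e.g. "trafilatura.utils", "trafilatura.core").
NOISY_LOGGER_NAME = "trafilatura"


def quiet_trafilatura(level: int = logging.CRITICAL) -> None:
    """Hypothetical helper: raise the threshold of the parent logger.

    Child loggers that have no level of their own inherit this effective
    level, so their ERROR/WARNING records are dropped before any handler
    sees them.
    """
    logging.getLogger(NOISY_LOGGER_NAME).setLevel(level)


quiet_trafilatura()
```

The downside is that this also hides genuine parsing errors for non-empty documents, which is why skipping empty input inside the component seems preferable.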
Expected behavior
HTMLToDocument should skip any ByteStream whose data is empty, either silently or with a single DEBUG/WARNING level message of its own, without invoking lxml at all.
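The kind of guard I have in mind could look roughly like the sketch below. It is only an illustration of the expected behavior, not the actual HTMLToDocument code: drop_empty_streams is a hypothetical helper, and FakeByteStream is a stand-in for haystack.dataclasses.ByteStream so the snippet runs without Haystack installed.

```python
import logging
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional

logger = logging.getLogger(__name__)


@dataclass
class FakeByteStream:
    """Stand-in for haystack.dataclasses.ByteStream (illustration only)."""

    data: bytes
    mime_type: Optional[str] = None


def drop_empty_streams(sources: Iterable[Any]) -> List[Any]:
    """Hypothetical guard: keep every source except byte streams with no data.

    Sources without a `data` attribute (such as file paths given as strings)
    are passed through unchanged.
    """
    kept = []
    for source in sources:
        data = getattr(source, "data", None)
        if data is not None and len(data) == 0:
            # Log quietly instead of handing empty bytes to the HTML parser.
            logger.debug("Skipping source with empty data: %r", source)
            continue
        kept.append(source)
    return kept


sources = [
    FakeByteStream(data=b"", mime_type="text/html"),
    FakeByteStream(data=b"<html><body>hi</body></html>", mime_type="text/html"),
    "page.html",
]
non_empty = drop_empty_streams(sources)
```

The same check can be applied by callers before converter.run() as a workaround, but doing it inside the component would cover every pipeline that feeds it from LinkContentFetcher.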
To Reproduce
from haystack.components.converters.html import HTMLToDocument
from haystack.dataclasses import ByteStream
converter = HTMLToDocument()
# Simulate what LinkContentFetcher emits on failure/redirect with raise_on_failure=False
empty_stream = ByteStream(data=b"")
empty_stream.mime_type = "text/html"
result = converter.run(sources=[empty_stream])
# Four lxml ERROR/WARNING log lines are emitted even though result == {"documents": []}
FAQ Check
System:
- Haystack version: haystack-ai 2.28.0