fix: HTMLToDocument generates noisy ERROR logs when receiving empty ByteStream objects #11668

@deep-rloebbert

Describe the bug
When HTMLToDocument receives a ByteStream with data=b"" (empty), it calls trafilatura.extract() on the empty bytes. This triggers lxml to emit several ERROR/WARNING level log events that surface as noise in production logs, even though the component already has a try/except that gracefully skips the source.

Error message

  lxml parsing failed: Document is empty
  lxml parser bytestring Document is empty
  empty HTML tree: None
  discarding data: None

The errors appear in pairs (lxml fires them during multiple internal parse attempts), and the surrounding pipeline continues normally, but the log output looks like a serious failure.
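Until the converter is fixed, one possible stopgap is to raise the threshold of the logger that emits these records. The sketch below is only that: it assumes the messages go through Python loggers under the `trafilatura` namespace, which this report does not confirm, and `quiet_logger` is a hypothetical helper name, not part of Haystack.

```python
import logging


def quiet_logger(name: str, level: int = logging.CRITICAL) -> logging.Logger:
    """Raise the threshold of the named logger.

    Child loggers that have no level of their own inherit this threshold.
    """
    target = logging.getLogger(name)
    target.setLevel(level)
    return target


# Assumption: the noisy records originate from loggers named "trafilatura.*".
quiet_logger("trafilatura")
```

This hides every ERROR and WARNING record from that namespace, including genuine ones, so it is a blunt workaround rather than a fix.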

Expected behavior
HTMLToDocument should silently skip (or log at DEBUG/WARNING level) any ByteStream whose data is empty, without triggering lxml at all.
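A minimal sketch of the kind of guard this would need, not the actual Haystack implementation. `StubByteStream` is a stand-in so the example runs without Haystack installed, and `drop_empty_streams` is a hypothetical helper name; the only assumption about real sources is that byte streams expose a `data` attribute, as in the reproduction below.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)


@dataclass
class StubByteStream:
    """Stand-in for haystack.dataclasses.ByteStream (illustration only)."""
    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)
    mime_type: Optional[str] = None


def drop_empty_streams(sources: List[Any]) -> List[Any]:
    """Return the sources that should reach the HTML parser.

    Objects with an empty `data` payload are skipped and logged at DEBUG.
    Anything without a `data` attribute (e.g. a file path) passes through.
    """
    kept = []
    for source in sources:
        data = getattr(source, "data", None)
        if data is not None and len(data) == 0:
            logger.debug("Skipping source with empty data: %r", source)
            continue
        kept.append(source)
    return kept
```

Applied before the extraction step, a check like this keeps empty payloads away from the parser entirely, so none of the four messages above would be produced for them.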

To Reproduce

from haystack.components.converters.html import HTMLToDocument
from haystack.dataclasses import ByteStream

converter = HTMLToDocument()
# Simulate what LinkContentFetcher emits on failure/redirect with raise_on_failure=False
empty_stream = ByteStream(data=b"")
empty_stream.mime_type = "text/html"

result = converter.run(sources=[empty_stream])
# Four lxml ERROR/WARNING log lines are emitted even though result == {"documents": []}

System:

  • Haystack version: haystack-ai 2.28.0