fix: HTMLToDocument generates noisy ERROR logs when receiving empty ByteStream objects

**Describe the bug**
When `HTMLToDocument` receives a `ByteStream` with `data=b""` (empty), it calls `trafilatura.extract()` on the empty bytes. This causes lxml to emit several ERROR/WARNING-level log messages that show up as noise in production logs, even though the component already has a try/except that gracefully skips the source.

**Error message**
```
lxml parsing failed: Document is empty
lxml parser bytestring Document is empty
empty HTML tree: None
discarding data: None
```

The errors appear in pairs because lxml emits them during multiple internal parse attempts. The surrounding pipeline continues normally, but the log output looks like a serious failure.

**Expected behavior**
HTMLToDocument should silently skip (or log at DEBUG/WARNING level) any ByteStream whose data is empty, without triggering lxml at all.
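
A minimal sketch of the kind of guard this would need, written as a standalone helper so it runs without Haystack installed. `StubByteStream` and `drop_empty_streams` are hypothetical names used only for illustration; they are not part of Haystack, and the actual fix inside `HTMLToDocument.run()` may look different.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)


@dataclass
class StubByteStream:
    """Hypothetical stand-in for haystack.dataclasses.ByteStream (illustration only)."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)
    mime_type: Optional[str] = None


def drop_empty_streams(sources: List[Any]) -> List[Any]:
    """Return only the sources that carry data; log skipped ones at DEBUG."""
    kept = []
    for source in sources:
        # File paths and strings have no `data` attribute, so they pass through untouched.
        data = getattr(source, "data", None)
        if data is not None and len(data) == 0:
            # Skip before any HTML parsing is attempted, so the parser never sees b"".
            logger.debug("Skipping source with empty data: %r", source)
            continue
        kept.append(source)
    return kept


streams = [StubByteStream(data=b""), StubByteStream(data=b"<p>hello</p>")]
non_empty = drop_empty_streams(streams)
```

The same check could also be applied by the caller as a workaround, by filtering the list before passing it to `converter.run(sources=...)`.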


**To Reproduce**
```python
from haystack.components.converters.html import HTMLToDocument
from haystack.dataclasses import ByteStream

converter = HTMLToDocument()
# Simulate what LinkContentFetcher emits on failure/redirect with raise_on_failure=False
empty_stream = ByteStream(data=b"")
empty_stream.mime_type = "text/html"

result = converter.run(sources=[empty_stream])
# Four lxml ERROR/WARNING log lines are emitted even though result == {"documents": []}
```

**FAQ Check**
- [x] Have you had a look at [our new FAQ page](https://docs.haystack.deepset.ai/docs/faq)?

**System:**
- Haystack version: haystack-ai 2.28.0 


fix: HTMLToDocument generates noisy ERROR logs when receiving empty ByteStream objects #11668
