Describe the bug
InMemoryDocumentStore.load_from_disk reconstructs documents with the plain Document constructor instead of Document.from_dict:
cls_object.write_documents(
    documents=[Document(**doc) for doc in documents], policy=DuplicatePolicy.OVERWRITE
)
save_to_disk serializes documents with Document.to_dict(flatten=False), which converts nested objects to plain dicts (blob → ByteStream.to_dict(), sparse_embedding → SparseEmbedding.to_dict()). Document.from_dict is the inverse and restores the proper types, but the constructor performs no such conversion. As a result, any document saved with a blob or a sparse_embedding is loaded back with those fields as raw dicts.
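The asymmetry can be illustrated without Haystack installed. The sketch below uses two hypothetical stand-in dataclasses (Blob and Doc are invented here for illustration and are not Haystack's implementation) to show why passing a serialized dict to the constructor leaves nested fields as dicts, while a from_dict classmethod rebuilds them. By analogy, the change in load_from_disk would presumably be to call Document.from_dict(doc) instead of Document(**doc).

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for a nested value object such as ByteStream."""

    data: bytes
    mime_type: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Bytes are not JSON-serializable, so encode them as a list of ints.
        return {"data": list(self.data), "mime_type": self.mime_type}

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Blob":
        return cls(data=bytes(d["data"]), mime_type=d.get("mime_type"))


@dataclass
class Doc:
    """Hypothetical stand-in for Document (assumption, not the real class)."""

    content: str
    blob: Optional[Blob] = None

    def to_dict(self) -> Dict[str, Any]:
        return {
            "content": self.content,
            "blob": self.blob.to_dict() if self.blob else None,
        }

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Doc":
        # Inverse of to_dict(): rebuild the nested object from its dict form.
        blob = Blob.from_dict(d["blob"]) if d.get("blob") else None
        return cls(content=d["content"], blob=blob)


original = Doc(content="image doc", blob=Blob(b"\x89PNG", "image/png"))
serialized = original.to_dict()

via_constructor = Doc(**serialized)        # nested field stays a plain dict
via_from_dict = Doc.from_dict(serialized)  # nested field is rebuilt as Blob

print(type(via_constructor.blob).__name__)
print(type(via_from_dict.blob).__name__)
print(via_from_dict == original)
```

A dataclass constructor assigns whatever it is given without validating or converting it, which is why the dict slips through silently and only fails later when something calls a method on it.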
The corrupted documents then fail in a cascade:
blob type after load: dict (expected ByteStream)
sparse type after load: dict (expected SparseEmbedding)
repr(doc) -> AttributeError: 'dict' object has no attribute 'data'
doc.to_dict() -> AttributeError: 'dict' object has no attribute 'to_dict'
doc == other -> AttributeError: 'dict' object has no attribute 'to_dict'
store.save_to_disk(...) on the reloaded store -> AttributeError: 'dict' object has no attribute 'to_dict'
Anything that touches document.blob.data downstream (e.g. image pipelines via DocumentToImageContent) crashes too, and a save → load → save cycle is impossible.
Error message
AttributeError: 'dict' object has no attribute 'data' / AttributeError: 'dict' object has no attribute 'to_dict'
Expected behavior
load_from_disk restores documents exactly as they were saved: blob as ByteStream, sparse_embedding as SparseEmbedding; the loaded documents compare equal to the originals and the store can be saved again.
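The expected behavior amounts to a round-trip property: save, load, save again, load again, and end up with equal documents. A minimal sketch of that property follows, using a hypothetical TinyStore and stand-in Blob and Doc classes (all invented for illustration, not Haystack code) whose load path goes through from_dict.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for ByteStream."""

    data: bytes

    def to_dict(self) -> Dict[str, Any]:
        return {"data": list(self.data)}  # JSON-friendly encoding of the bytes

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Blob":
        return cls(data=bytes(d["data"]))


@dataclass
class Doc:
    """Hypothetical stand-in for Document."""

    content: str
    blob: Optional[Blob] = None

    def to_dict(self) -> Dict[str, Any]:
        blob = self.blob.to_dict() if self.blob else None
        return {"content": self.content, "blob": blob}

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Doc":
        blob = Blob.from_dict(d["blob"]) if d.get("blob") else None
        return cls(content=d["content"], blob=blob)


class TinyStore:
    """Hypothetical in-memory store with JSON persistence."""

    def __init__(self, docs: Optional[Iterable[Doc]] = None) -> None:
        self.docs = list(docs) if docs is not None else []

    def save_to_disk(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"documents": [d.to_dict() for d in self.docs]}, f)

    @classmethod
    def load_from_disk(cls, path: str) -> "TinyStore":
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        # Use the inverse of to_dict() rather than Doc(**raw).
        return cls(Doc.from_dict(raw) for raw in payload["documents"])


def round_trip(store: TinyStore) -> TinyStore:
    """Save, load, save again, and load again; returns the final store."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.json")
        store.save_to_disk(path)
        loaded = TinyStore.load_from_disk(path)
        loaded.save_to_disk(path)  # the second save must not raise
        return TinyStore.load_from_disk(path)
```

A regression test for the real store could assert the same property with InMemoryDocumentStore, a Document carrying a blob and a sparse_embedding, and filter_documents() to read the result back.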
To Reproduce
import tempfile
from pathlib import Path

from haystack.dataclasses import ByteStream, Document, SparseEmbedding
from haystack.document_stores.in_memory import InMemoryDocumentStore

# A document with both nested fields that to_dict() converts to plain dicts.
doc = Document(
    content="image doc",
    blob=ByteStream(data=b"\x89PNG fake image bytes", mime_type="image/png"),
    sparse_embedding=SparseEmbedding(indices=[0, 5], values=[0.1, 0.9]),
)

store = InMemoryDocumentStore()
store.write_documents([doc])

# Save the store, then load it back from the same file.
path = str(Path(tempfile.gettempdir()) / "store.json")
store.save_to_disk(path)
loaded = InMemoryDocumentStore.load_from_disk(path)

ldoc = loaded.filter_documents()[0]
print(type(ldoc.blob))  # <class 'dict'>
repr(ldoc)  # AttributeError
FAQ Check
System:
- OS: Windows 11
- Haystack version: main (2.x)