fix: InMemoryDocumentStore.load_from_disk corrupts documents with blob or sparse_embedding (loaded as raw dicts) #11593

@Ayushhgit

Describe the bug
InMemoryDocumentStore.load_from_disk reconstructs documents with the plain Document constructor instead of Document.from_dict:

cls_object.write_documents(
    documents=[Document(**doc) for doc in documents], policy=DuplicatePolicy.OVERWRITE
)

save_to_disk serializes documents with Document.to_dict(flatten=False), which converts nested objects to plain dicts (blob → ByteStream.to_dict(), sparse_embedding → SparseEmbedding.to_dict()). Document.from_dict is the inverse and restores the proper types, but the constructor performs no such conversion. So any document saved with a blob or a sparse_embedding is loaded back with those fields as raw dicts.
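The asymmetry between the constructor and from_dict can be reproduced without Haystack. The sketch below is only an illustration: Blob and Doc are hypothetical stand-ins written for this example, not the Haystack ByteStream and Document classes, and their serialization format is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for ByteStream (not the Haystack class)."""

    data: bytes
    mime_type: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Store bytes as a list of ints so the result is JSON-friendly.
        return {"data": list(self.data), "mime_type": self.mime_type}

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Blob":
        return cls(data=bytes(d["data"]), mime_type=d.get("mime_type"))


@dataclass
class Doc:
    """Hypothetical stand-in for Document (not the Haystack class)."""

    content: str
    blob: Optional[Blob] = None

    def to_dict(self) -> Dict[str, Any]:
        blob = self.blob.to_dict() if self.blob is not None else None
        return {"content": self.content, "blob": blob}

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Doc":
        d = dict(d)  # copy so the caller's dict is not mutated
        if d.get("blob") is not None:
            d["blob"] = Blob.from_dict(d["blob"])
        return cls(**d)


original = Doc(content="image doc", blob=Blob(data=b"\x89PNG", mime_type="image/png"))
saved = original.to_dict()

via_constructor = Doc(**saved)        # dataclasses do not convert: blob stays a dict
via_from_dict = Doc.from_dict(saved)  # nested field is rebuilt as a Blob

print(type(via_constructor.blob).__name__)  # dict
print(type(via_from_dict.blob).__name__)    # Blob
print(via_from_dict == original)            # True
```

Under that assumption, the fix in load_from_disk would presumably amount to building each document with Document.from_dict(doc) instead of Document(**doc).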

The corrupted documents then fail in a cascade:

blob type after load: dict (expected ByteStream)
sparse type after load: dict (expected SparseEmbedding)
repr(doc)     -> AttributeError: 'dict' object has no attribute 'data'
doc.to_dict() -> AttributeError: 'dict' object has no attribute 'to_dict'
doc == other  -> AttributeError: 'dict' object has no attribute 'to_dict'
store.save_to_disk(...) on the reloaded store -> AttributeError: 'dict' object has no attribute 'to_dict'

Anything that touches document.blob.data downstream (e.g. image pipelines via DocumentToImageContent) crashes too, and a save → load → save cycle is impossible.
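Why the second save fails can be shown with a small stdlib-only sketch. Everything here is hypothetical: SparseVec and Doc stand in for SparseEmbedding and Document, and save/load are simplified helpers written for this illustration, not the Haystack save_to_disk/load_from_disk implementations.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class SparseVec:
    """Hypothetical stand-in for SparseEmbedding."""

    indices: list
    values: list

    def to_dict(self) -> dict:
        return {"indices": self.indices, "values": self.values}


@dataclass
class Doc:
    """Hypothetical stand-in for Document."""

    content: str
    sparse_embedding: Optional[SparseVec] = None

    def to_dict(self) -> dict:
        sparse = self.sparse_embedding
        return {
            "content": self.content,
            "sparse_embedding": sparse.to_dict() if sparse is not None else None,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Doc":
        sparse = data.get("sparse_embedding")
        return cls(
            content=data["content"],
            sparse_embedding=SparseVec(**sparse) if sparse is not None else None,
        )


def save(docs: List[Doc], path: Path) -> None:
    # Serialize first, then write, so a failure leaves the file untouched.
    path.write_text(json.dumps([d.to_dict() for d in docs]))


def load(path: Path, rebuild: Callable[[dict], Doc]) -> List[Doc]:
    return [rebuild(d) for d in json.loads(path.read_text())]


original = [Doc("image doc", SparseVec([0, 5], [0.1, 0.9]))]
path = Path(tempfile.mkdtemp()) / "store.json"
save(original, path)

# Loading through the constructor leaves sparse_embedding as a dict,
# so the next save calls .to_dict() on a dict and fails.
broken = load(path, lambda d: Doc(**d))
second_save_error = None
try:
    save(broken, path)
except AttributeError as exc:
    second_save_error = str(exc)

# Loading through from_dict restores the type, so the cycle completes.
fixed = load(path, Doc.from_dict)
save(fixed, path)
reloaded = load(path, Doc.from_dict)
```

In this sketch the constructor-based load only fails on the second save, which mirrors the report above: the corruption is silent at load time and surfaces later.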

Error message
AttributeError: 'dict' object has no attribute 'data' / AttributeError: 'dict' object has no attribute 'to_dict'

Expected behavior
load_from_disk restores documents exactly as they were saved: blob as ByteStream, sparse_embedding as SparseEmbedding; the loaded documents compare equal to the originals and the store can be saved again.

To Reproduce

import tempfile
from pathlib import Path

from haystack.dataclasses import ByteStream, Document, SparseEmbedding
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc = Document(
    content="image doc",
    blob=ByteStream(data=b"\x89PNG fake image bytes", mime_type="image/png"),
    sparse_embedding=SparseEmbedding(indices=[0, 5], values=[0.1, 0.9]),
)

store = InMemoryDocumentStore()
store.write_documents([doc])

path = str(Path(tempfile.gettempdir()) / "store.json")
store.save_to_disk(path)
loaded = InMemoryDocumentStore.load_from_disk(path)
ldoc = loaded.filter_documents()[0]

print(type(ldoc.blob))              # <class 'dict'>
repr(ldoc)                          # AttributeError

System:

  • OS: Windows 11
  • Haystack version: main (2.x)
