Path-only / by-reference source storage ('the notion of the PDF') — don't re-ingest bytes #1775

Description

@markmhendrickson

A non-developer evaluator on a 256 GB Mac Mini was worried about source blobs filling his disk and asked explicitly for a path-only ingestion mode: register a source by its file path WITHOUT copying its bytes into Neotoma (his term: "the notion of the PDF"). Current behavior always ingests the bytes: per MCP_SPEC, file_path reads the file and ingests its contents (errors: FILE_NOT_FOUND, FILE_READ_ERROR). Mark committed to investigate.

Note: the v0.17.0 discard-by-default intake (overflow sink, collapse_by sightings) addresses firehose pollution but is NOT this ask: it discards, whereas this needs a durable by-reference source row. This is also distinct from the disk-to-entity write-back feature (bidirectional mirror sync). It is his strongest open source-storage pain point.

Design

Goal: a durable, queryable source row that records WHERE the bytes are, not the bytes — so a large/local file is first-class in the graph without inflating the DB.

API. store (and parse_file) gain source_storage: "inline" | "reference" (default "inline" = today's behavior; fully backward-compatible). With "reference", the server reads the file ONCE to compute content_hash (SHA-256) and metadata, then persists a sources row WITHOUT the blob bytes.
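A minimal sketch of how the new parameter could be validated so that existing callers keep today's behavior. The function name `resolve_source_storage` and the dict-shaped `params` are assumptions for illustration, not Neotoma's actual handler code.

```python
from typing import Any, Literal, get_args

# The two modes proposed in this issue; "inline" is today's behavior.
SourceStorage = Literal["inline", "reference"]
DEFAULT_SOURCE_STORAGE: SourceStorage = "inline"


def resolve_source_storage(params: dict[str, Any]) -> SourceStorage:
    """Return the requested storage mode, defaulting to "inline".

    Hypothetical helper: callers that omit source_storage are unaffected.
    """
    value = params.get("source_storage", DEFAULT_SOURCE_STORAGE)
    if value not in get_args(SourceStorage):
        allowed = ", ".join(get_args(SourceStorage))
        raise ValueError(f"source_storage must be one of: {allowed}; got {value!r}")
    return value
```

Rejecting unknown values outright, rather than silently falling back to inline, keeps a typo from quietly copying a large file into the DB.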

Sources row (additive columns):

  • storage_mode: "inline" | "reference" (default inline)
  • path (absolute), host_id (which machine owns the path), size_bytes, mime_type, mtime
  • content_hash retained — so content-addressing, dedup, and interpretation linkage all work unchanged.
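The columns above could be modeled roughly as follows. This is a sketch under assumptions: `SourceRow` and `build_reference_row` are hypothetical names, the real sources row has other columns not shown here, and the hash is passed in because computing it is a separate step.

```python
import mimetypes
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRow:
    """Illustrative subset of a sources row (not the real schema)."""
    content_hash: str
    storage_mode: str = "inline"
    path: Optional[str] = None
    host_id: Optional[str] = None
    size_bytes: Optional[int] = None
    mime_type: Optional[str] = None
    mtime: Optional[float] = None


def build_reference_row(path: str, host_id: str, content_hash: str) -> SourceRow:
    """Record where the bytes live and what they looked like, not the bytes."""
    absolute = os.path.abspath(path)
    stat = os.stat(absolute)
    # Guess from the file extension only; may be None for unknown types.
    mime_type, _encoding = mimetypes.guess_type(absolute)
    return SourceRow(
        content_hash=content_hash,
        storage_mode="reference",
        path=absolute,
        host_id=host_id,
        size_bytes=stat.st_size,
        mime_type=mime_type,
        mtime=stat.st_mtime,
    )
```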

Why read-to-hash (not client-supplied hash): preserves content-addressed dedup + integrity at near-zero storage cost (hash a stream, discard bytes). A client hash is an optional fast-path but must be verifiable.
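"Hash a stream, discard bytes" can be sketched with the standard library alone. The function names and chunk size are illustrative assumptions. Note that this sketch verifies a client-supplied hash by re-reading the file, which gives up the speed benefit; how to keep a verifiable fast path is left open here.

```python
import hashlib
from typing import Optional

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def hash_file_streaming(path: str) -> str:
    """Compute SHA-256 over a file without holding it all in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_content_hash(path: str, client_hash: Optional[str] = None) -> str:
    """Return the server-computed hash, rejecting a mismatched client hash."""
    actual = hash_file_streaming(path)
    if client_hash is not None and client_hash.lower() != actual:
        raise ValueError("client-supplied content_hash does not match file contents")
    return actual
```

Memory use stays bounded by the chunk size regardless of file size, so a multi-gigabyte PDF costs one sequential read and a 64-character hex string.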

Retrieval. retrieve_file_url / byte fetch resolves path on host_id at read time. If the file is gone/moved → structured SOURCE_UNAVAILABLE (returns path + last-known hash + host_id), never a misleading empty blob. Optional re-hash on access detects drift → SOURCE_REFERENCE_STALE warning (warn-first posture).
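The read-time behavior described above might look like the following. `ResolveResult` and `resolve_reference` are hypothetical names, and treating a row owned by another host as unavailable is an assumption, since multi-host resolution is still an open question.

```python
import hashlib
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResolveResult:
    ok: bool
    path: str
    host_id: str
    content_hash: str  # last-known hash, returned even on failure
    error: Optional[str] = None
    warnings: list[str] = field(default_factory=list)


def resolve_reference(path: str, host_id: str, content_hash: str,
                      local_host_id: str, rehash: bool = False) -> ResolveResult:
    """Resolve a reference source at read time; never return an empty blob."""
    result = ResolveResult(ok=True, path=path, host_id=host_id,
                           content_hash=content_hash)
    # Assumption: a path owned by another machine cannot be resolved locally.
    if host_id != local_host_id or not os.path.isfile(path):
        result.ok = False
        result.error = "SOURCE_UNAVAILABLE"
        return result
    if rehash:
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != content_hash:
            # Warn-first: the read still succeeds, but the drift is surfaced.
            result.warnings.append("SOURCE_REFERENCE_STALE")
    return result
```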

Size-management story: reference mode stores only metadata; bytes stay on disk. Tradeoff: NOT portable across machines (host-local), availability depends on the file staying put — in exchange for zero DB bloat. inline remains the default for anything that must be portable/durable in Neotoma itself.

Boundaries / non-goals: not a sync mechanism; not the overflow sink (that discards). Interpretations/observations reference a reference source exactly like an inline one.

Open questions: multi-host resolution (host_id registry); allowed-roots / path-traversal security; GC + orphan detection; behavior when an interpretation needs bytes that are now unavailable.
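For the allowed-roots question, one possible shape (not a decided design) is to resolve symlinks and `..` segments first and then check containment. `is_under_allowed_root` and the idea of a configured list of roots are assumptions for illustration.

```python
import os


def is_under_allowed_root(path: str, allowed_roots: list[str]) -> bool:
    """Return True if path, after resolving symlinks and "..", sits under a root."""
    resolved = os.path.realpath(path)
    for root in allowed_roots:
        resolved_root = os.path.realpath(root)
        try:
            # commonpath compares whole path components, so "/data-evil"
            # is not mistaken for a child of "/data".
            if os.path.commonpath([resolved, resolved_root]) == resolved_root:
                return True
        except ValueError:
            # Raised for paths on different drives (Windows); treat as outside.
            continue
    return False
```

A plain string-prefix check would accept sibling directories that share a prefix, which is why the comparison is done on path components.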

Rollout: additive columns + a new enum value; default inline ⇒ no behavior change for existing callers.
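As an illustration of the additive rollout, here is a sketch that assumes a SQLite-backed `sources` table. The storage engine, table name, and column types are assumptions made only so the example runs; the point is that existing rows pick up the inline default.

```python
import sqlite3

# Column definitions mirror the additive columns listed in this issue;
# the SQL types are guesses for this sketch.
ADDITIVE_COLUMNS = [
    "storage_mode TEXT NOT NULL DEFAULT 'inline'",
    "path TEXT",
    "host_id TEXT",
    "size_bytes INTEGER",
    "mime_type TEXT",
    "mtime REAL",
]


def migrate(conn: sqlite3.Connection) -> None:
    """Add any missing columns; safe to run more than once."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(sources)")}
    for column in ADDITIVE_COLUMNS:
        name = column.split()[0]
        if name not in existing:
            conn.execute(f"ALTER TABLE sources ADD COLUMN {column}")
    conn.commit()
```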


Surfaced from developer-release evaluator feedback. Tracked in Neotoma as ent_7612a94d2b6274e10875e930 (private).

Metadata

Assignees

No one assigned

    Labels

    bug, enhancement, incident, lanius-triage, neotoma, question
