A non-developer evaluator on a 256GB Mac Mini was worried about source blobs filling his disk and asked explicitly for a path-only ingestion mode: register a source by its file path WITHOUT copying its bytes into Neotoma (his term: "the notion of the PDF"). Current behavior (confirmed in MCP_SPEC: file_path reads + ingests bytes; errors are FILE_NOT_FOUND/FILE_READ_ERROR) always re-ingests the bytes. Mark committed to investigate.
Note: v0.17.0 discard-by-default intake (overflow sink, collapse_by sightings) addresses firehose-pollution but is NOT this ask: that path discards, while this needs a durable by-reference source row. It is also distinct from the disk-to-entity write-back feature (bidirectional mirror sync). This is his strongest unresolved source-storage pain point.
Design
Goal: a durable, queryable source row that records WHERE the bytes are, not the bytes — so a large/local file is first-class in the graph without inflating the DB.
API. store (and parse_file) gain source_storage: "inline" | "reference" (default "inline" = today's behavior; fully backward-compatible). With "reference", the server reads the file ONCE to compute content_hash (SHA-256) and metadata, then persists a sources row WITHOUT the blob bytes.
Sources row (additive columns):
storage_mode: "inline" | "reference" (default inline)
path (absolute), host_id (which machine owns the path), size_bytes, mime_type, mtime
content_hash retained — so content-addressing, dedup, and interpretation linkage all work unchanged.
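A minimal sketch of what building a reference row could look like, assuming the columns listed above. `SourceRow` and `build_reference_row` are hypothetical names for illustration, not Neotoma's actual schema or API; the file is streamed once to compute the SHA-256 and its bytes are never retained.

```python
import hashlib
import mimetypes
import os
from dataclasses import dataclass
from typing import Literal, Optional

StorageMode = Literal["inline", "reference"]


@dataclass(frozen=True)
class SourceRow:
    """Hypothetical shape of a sources row with the proposed additive columns."""
    content_hash: str                      # SHA-256 hex digest, kept in both modes
    storage_mode: StorageMode = "inline"   # default preserves today's behavior
    path: Optional[str] = None             # absolute path (reference mode only)
    host_id: Optional[str] = None          # machine that owns the path
    size_bytes: Optional[int] = None
    mime_type: Optional[str] = None
    mtime: Optional[float] = None


def build_reference_row(file_path: str, host_id: str,
                        chunk_size: int = 1 << 20) -> SourceRow:
    """Read the file once, hash it as a stream, and keep only metadata."""
    abs_path = os.path.abspath(file_path)
    digest = hashlib.sha256()
    with open(abs_path, "rb") as fh:
        stat = os.fstat(fh.fileno())       # size and mtime of the file being hashed
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)           # each chunk is discarded after hashing
    mime_type, _ = mimetypes.guess_type(abs_path)  # extension-based guess, may be None
    return SourceRow(
        content_hash=digest.hexdigest(),
        storage_mode="reference",
        path=abs_path,
        host_id=host_id,
        size_bytes=stat.st_size,
        mime_type=mime_type,
        mtime=stat.st_mtime,
    )
```

Memory use stays bounded by `chunk_size` regardless of file size, which is the point of hashing a stream instead of loading the blob.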
Why read-to-hash (not client-supplied hash): preserves content-addressed dedup + integrity at near-zero storage cost (hash a stream, discard bytes). A client hash is an optional fast-path but must be verifiable.
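To illustrate the "must be verifiable" requirement, here is a hedged sketch in which the server recomputes the digest and compares it to the client's claim. `sha256_stream` and `verify_client_hash` are hypothetical helpers; when a client hash may be trusted without an immediate re-read is not settled by this design.

```python
import hashlib


def sha256_stream(file_path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so its bytes are never held in full."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_client_hash(file_path: str, client_hash: str) -> bool:
    """Return True only if the client-supplied hex digest matches the file."""
    return sha256_stream(file_path) == client_hash.strip().lower()
```

Because the digest depends only on content, two paths holding identical bytes yield the same `content_hash`, so content-addressed dedup behaves the same in reference mode as in inline mode.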
Retrieval. retrieve_file_url / byte fetch resolves path on host_id at read time. If the file is gone/moved → structured SOURCE_UNAVAILABLE (returns path + last-known hash + host_id), never a misleading empty blob. Optional re-hash on access detects drift → SOURCE_REFERENCE_STALE warning (warn-first posture).
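The read-time resolution described above could be sketched roughly as follows. `ReferenceSource`, `ResolveResult`, and `resolve_reference` are hypothetical names, and treating a row owned by a different host as unavailable is an assumption, since multi-host resolution is still an open question.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceSource:
    path: str
    host_id: str
    content_hash: str  # last-known SHA-256 hex digest


@dataclass(frozen=True)
class ResolveResult:
    ok: bool
    path: str
    host_id: str
    last_known_hash: str
    error: Optional[str] = None    # "SOURCE_UNAVAILABLE"
    warning: Optional[str] = None  # "SOURCE_REFERENCE_STALE"


def _sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_reference(src: ReferenceSource, local_host_id: str,
                      rehash: bool = False) -> ResolveResult:
    """Resolve a reference source at read time; never return an empty blob."""
    def result(ok: bool, error: Optional[str] = None,
               warning: Optional[str] = None) -> ResolveResult:
        return ResolveResult(ok, src.path, src.host_id, src.content_hash,
                             error=error, warning=warning)

    # Assumption: a path owned by another host cannot be resolved locally.
    if src.host_id != local_host_id or not os.path.isfile(src.path):
        return result(False, error="SOURCE_UNAVAILABLE")
    if rehash:
        try:
            current = _sha256_of(src.path)
        except OSError:
            # The file vanished or became unreadable between check and read.
            return result(False, error="SOURCE_UNAVAILABLE")
        if current != src.content_hash:
            # Warn-first: the bytes are still served, but drift is flagged.
            return result(True, warning="SOURCE_REFERENCE_STALE")
    return result(True)
```

Both outcomes carry the path, host_id, and last-known hash, so a caller can tell the user exactly which file is missing or has drifted.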
Size-management story: reference mode stores only metadata; bytes stay on disk. Tradeoff: a reference source is NOT portable across machines (host-local), and its availability depends on the file staying put. In exchange, the DB grows by one metadata row per source rather than by the size of the file. inline remains the default for anything that must be portable/durable in Neotoma itself.
Boundaries / non-goals: not a sync mechanism; not the overflow sink (that discards). Interpretations/observations reference a reference source exactly like an inline one.
Open questions: multi-host resolution (host_id registry); allowed-roots / path-traversal security; GC + orphan detection; behavior when an interpretation needs bytes that are now unavailable.
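For the allowed-roots question, one possible shape of the check is sketched below. This is only an illustration of the concern, not a decided policy: `is_under_allowed_root` is a hypothetical helper and the idea of a configured list of roots is an assumption.

```python
import os
from typing import Iterable


def is_under_allowed_root(candidate: str, allowed_roots: Iterable[str]) -> bool:
    """Return True if candidate resolves to a location inside an allowed root."""
    # realpath collapses ".." segments and follows symlinks before comparing.
    real = os.path.realpath(candidate)
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # Raised when the paths cannot be compared (e.g. different drives).
            continue
    return False
```

Comparing resolved paths rather than string prefixes avoids accepting `/data-evil/x` for a root of `/data`, and rejects `..` or symlink escapes from an allowed directory.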
Rollout: additive columns + a new enum value; default inline ⇒ no behavior change for existing callers.
Surfaced from developer-release evaluator feedback. Tracked in Neotoma as ent_7612a94d2b6274e10875e930 (private).