Multimodal RAG agent using Gemini embeddings, ChromaDB, and BAML.
Based on Karpathy's LLM wiki sketch.
- Docker
cp .env.example .env
# edit .env: set GEMINI_API_KEY (required), BOUNDARY_API_KEY (optional, for tracing)
docker compose up --build
App serves at http://localhost:8000. Ingested files persist to ./data, Chroma to a named volume.
- Open http://localhost:8000
- Upload documents in the Ingest tab (multimodal: PDF, text, markdown, audio, video)
- Chat with the agent in the Threads tab — it'll search, read, write, and edit wiki pages grounded in your uploads
System overview
Ingestion pipeline — how uploaded files are chunked, embedded, and stored per user.
Agent tools — how the agent loop decides between semantic_search, read, write, and edit, and how each call is sandboxed to the user.
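As a rough illustration of what sandboxing each tool call to the user could look like, here is a minimal sketch for the file-backed tools (read, write, edit). It is not taken from this repo: `resolve_user_path`, `run_tool`, the argument names, and the one-directory-per-user layout are all assumptions made for the example. semantic_search is left out; it would presumably be scoped by a per-user filter rather than by path.

```python
from pathlib import Path


def resolve_user_path(root: Path, user_id: str, relative: str) -> Path:
    """Resolve `relative` inside the user's directory and reject escapes.

    Assumes `user_id` comes from authentication and is already trusted.
    """
    user_root = (root / user_id).resolve()
    candidate = (user_root / relative).resolve()
    # Absolute paths and ".." segments end up outside user_root and are refused.
    if not candidate.is_relative_to(user_root):
        raise PermissionError(f"path escapes user sandbox: {relative}")
    return candidate


def run_tool(root: Path, user_id: str, name: str, **args) -> str:
    """Dispatch one tool call (hypothetical signature) within the user's sandbox."""
    path = resolve_user_path(root, user_id, args["path"])
    if name == "write":
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(args["content"], encoding="utf-8")
        return "ok"
    if name == "read":
        return path.read_text(encoding="utf-8")
    if name == "edit":
        text = path.read_text(encoding="utf-8")
        if args["old"] not in text:
            raise ValueError("old text not found")
        # Replace only the first occurrence so edits stay targeted.
        path.write_text(text.replace(args["old"], args["new"], 1), encoding="utf-8")
        return "ok"
    raise ValueError(f"unknown tool: {name}")
```

Resolving the path before checking it means that `../bob/secret.md` and `/etc/passwd` are both rejected by the same test, so the individual tools never have to validate paths themselves.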
The main system prompt lives in baml_src/agent.baml.
Test assets for trying out the ingest pipeline can be found in docs/assets/. They include a PDF, markdown file, image, audio clip, and video clip.
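To make the chunking step of the ingest pipeline more concrete, the sketch below splits a text file into overlapping chunks and tags each one with its owner. Everything here is an assumption for illustration: the `Chunk` record, the `chunk_text` and `build_records` helpers, character-based splitting, and the size and overlap values are hypothetical, and the project's actual chunker may differ. Embedding and storage are not shown; the records are simply shaped so that a `user_id` field could be used to scope them per user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    user_id: str  # owner, so retrieval can be restricted to this user
    source: str   # original file name
    index: int    # position of the chunk within the file
    text: str


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, stop, step)]


def build_records(user_id: str, source: str, text: str) -> list[Chunk]:
    """Attach owner and provenance metadata to every chunk of one file."""
    return [
        Chunk(user_id=user_id, source=source, index=i, text=piece)
        for i, piece in enumerate(chunk_text(text))
    ]
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbours, at the cost of storing some text twice.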
- Multimodal input in chat — let users paste images, audio, and video directly into a turn, not just during ingest.
- Conversation-level evals — build an eval harness over full threads (not just single turns) to iterate on the agent prompt with signal.


