A fully local RAG agent over your Notion workspace: no cloud APIs, no subscriptions, no data leaving your machine.
A local-first "second brain" agent. It ingests Notion exports (and, if needed, live Notion pages), converts PDFs and images to markdown, indexes everything into Qdrant, and serves RAG queries via a CLI (without memory) or a Streamlit GUI (with memory within and between runs). All inference runs locally through Ollama. Designed to run well on native Windows.
Built as a complete local RAG stack: hybrid dense+sparse retrieval, cross-encoder reranking, sentence-aware chunking (with atomic code/table handling), file-based persistent memory, an anchored-rubric evaluation harness, and (optionally) Phoenix observability, all wired together with pydantic-ai. Optimised for a single 12 GB VRAM / 32 GB RAM system.
I use Notion heavily for knowledge management, collecting and organizing my thoughts, notes, and research. As my notes have grown, it has become increasingly beneficial to summarize key sections or pages. While Notion includes a native AI package, I wanted to avoid it due to security, privacy and cost concerns. So I built this instead.
- Hybrid retrieval: dense (nomic-embed-text) + sparse (BM25) with RRF fusion
- Cross-encoder reranking: BAAI/bge-reranker-v2-m3 for precision on top of recall
- Sentence-aware chunking: atomic handling of code blocks and tables (v4)
- Persistent memory: file-based per-session and long-term memory injected at runtime
- Evaluation harness: hand-rolled + pydantic_evals with 4-criterion anchored rubric
- Phoenix observability: optional OTel tracing via Arize Phoenix
- Fully local: Ollama inference, Qdrant vector store, zero cloud dependencies
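
The RRF fusion named in the first bullet can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not taken from the repo (the project may well delegate fusion to Qdrant); the function name `rrf_fuse`, the chunk ids, and the constant `k=60` (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    `k` is an assumed smoothing constant, not a value read from the repo.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk ids returned by the dense and sparse retrievers.
dense = ["c1", "c2", "c3"]
sparse = ["c3", "c1", "c4"]
fused = rrf_fuse([dense, sparse])
print([doc_id for doc_id, _ in fused])
```

Chunks that rank well in both lists rise to the top, which is why fusion tends to help when dense and sparse retrieval disagree.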
docker compose up -d
python -m venv .venv && .venv\Scripts\activate # Windows
# python -m venv .venv && source .venv/bin/activate # macOS/Linux
pip install -r requirements.txt
ollama pull nomic-embed-text # or pull any suitable embedding model
ollama pull gemma4:latest # or pull any agentic model
python scripts/run_rag.py # index existing data/clean/
python -m assistant.cli # start chatting in CLI (no memory)
streamlit run assistant/app.py # start chatting in Streamlit (with memory)

- Python 3.12+
- Ollama running locally
- Docker for Qdrant + (optionally) Phoenix
- Hardware baseline: 12 GB VRAM, 32 GB RAM (tested on Windows; cross-platform)
- Disk: ~30 GB for Ollama models + Qdrant persistence
For lower-end hardware, adjust model size accordingly.
notion-second-brain/
├── assistant/                  # Agent runtime
│   ├── agent.py                # pydantic-ai Agent + per-call instantiation
│   ├── app.py                  # Streamlit GUI (with memory)
│   ├── cli.py                  # CLI REPL (no memory)
│   ├── memory.py               # file-based memory
│   └── tools.py                # retrieve_knowledge, fetch_notion_page
├── pipelines/
│   ├── etl/                    # Notion/files → raw markdown
│   ├── rag/                    # chunker, embeddings, reranker, indexer
│   ├── utils/
│   └── models.py
├── scripts/                    # Pipeline entry points
│   ├── run_marker.py
│   ├── run_clean_md.py
│   ├── run_etl.py
│   └── run_rag.py
├── evals/                      # Evaluation suite
│   ├── cases.py                # golden + adversarial + distribution cases
│   ├── rubrics.py              # 4 anchored 1-5 rubrics
│   ├── judges.py               # LLM-as-judge (Ollama)
│   ├── run_evals.py            # hand-rolled runner (canonical)
│   └── run_pydantic_evals.py   # pydantic_evals runner
├── extras/                     # Optional / reference
│   ├── run_deepeval.py         # DeepEval showcase (see deepeval_info.md)
│   ├── run_phoenix.py          # Phoenix OTel tracing
│   ├── deepeval_info.md        # DeepEval setup notes
│   ├── llm-eval-patterns.md    # Eval methodology reference
│   └── prompt-eval-designer.md # Rubric design protocol
├── memory/                     # Conversation memory (gitignored)
├── data/                       # All data files (gitignored)
├── images/                     # README screenshots
├── docker-compose.yml
└── requirements.txt
The project uses a single venv. There is one dependency conflict: the openai package version must be swapped when moving between RAG/agent mode and marker mode (see OpenAI conflict).
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS/Linux
pip install -r requirements.txt

requirements.txt pulls torch+cu130 (about 2.5 GB). Expect the first install to download ~3-4 GB of wheels.
uv venv .venv && .venv\Scripts\activate
uv pip install -r requirements.txt

The fully project-managed uv flow (with pyproject.toml + uv.lock) is theoretically cleaner, but it ran into setup issues on this stack. Until those are resolved, uv pip or plain pip is the recommended path.
Relies on the notion-to-md-py library. You need two separate Notion integrations because they serve different code paths:
- NOTION_TO_MD_AUTH_TOKEN: used by scripts/run_etl.py to bulk-fetch pages as markdown during ETL.
- NOTION_ASSISTANT_AUTH_TOKEN: used by the agent's fetch_notion_page tool to live-fetch a single page on demand.
If you don't need live-fetch, skip the second integration.
- Go to https://www.notion.so/profile/integrations → + New integration
- Name it (e.g. Second Brain - Notion-to-MD) → Read content capabilities only
- Copy the Internal Integration Secret (ntn_...) into .env
- Optionally repeat for a second integration (NOTION_ASSISTANT_AUTH_TOKEN)
For each integration, open each page in Notion → ⋯ → Connections → add the integration. For workspace-wide access: Settings → Connections → add at workspace level.
Without this step the integration returns empty results for every page.
# ── Notion ──────────────────────────────────────────────────────
NOTION_ASSISTANT_AUTH_TOKEN=ntn_...
NOTION_TO_MD_AUTH_TOKEN=ntn_...
# ── ETL mode ────────────────────────────────────────────────────
LOAD_MODE="notion" # "files" or "notion"
# ETL_PAGE_NAME="AI/ML/Data Science" # limit to one page for first-run testing
# ── Marker (PDF / image OCR) ────────────────────────────────────
MARKER_USE_LLM=false
MARKER_FORCE_OCR=true
MARKER_WORKERS=2
MARKER_DISABLE_IMAGES=false
# ── Qdrant ──────────────────────────────────────────────────────
QDRANT_URL=http://localhost:32768
FORCE_REINDEX=true # set true on first run, or after schema changes
ENABLE_RERANK=true
# ── Ollama ──────────────────────────────────────────────────────
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=gemma4:latest
# Use gemma4 for the agent: qwen3 has a large KV-cache offload that
# spills to CPU and tanks throughput. Qwen3 is fine as eval judge.
# ── Memory ──────────────────────────────────────────────────────
ENABLE_MEMORY=true
RECENT_LOG_DAYS=2
# ── Evaluations ─────────────────────────────────────────────────
JUDGE_MODEL=qwen3:8b

ollama ps

Expected: 100% GPU. If you see any CPU %, set OLLAMA_NUM_GPU=99 at the shell level (not just in .env), or recreate the model via a Modelfile with num_gpu 99.
data/
├── raw/
│   ├── documents/<page-name>/  # raw Notion exports
│   ├── images/<page-name>/     # downloaded page images
│   └── raw_md/                 # raw Notion text exports
├── crawled/                    # (not yet used) web crawler output
├── clean/
│   ├── pdfs_md/                # PDF → md via marker
│   ├── images_md/              # image OCR via marker
│   └── clean_md/               # raw_md → cleaned md
└── pages.txt                   # page list for live Notion fetch
memory/
├── MEMORY.md                   # long-term distilled context
└── YYYY-MM-DD.md               # daily conversation logs
python scripts/run_etl.py

Fetches from Notion (LOAD_MODE=notion) or reads local files (LOAD_MODE=files). Writes to data/raw/raw_md/.

First run: set ETL_PAGE_NAME="Some Page" to test the Notion connection on a single page before pulling your whole workspace. Notion rate limits are aggressive on large workspaces.
python scripts/run_clean_md.py

LLM-based cleanup of data/raw/raw_md/ → data/clean/clean_md/.

python scripts/run_marker.py

Converts PDFs and images into markdown. Independent of ETL; only run it when you have new source files.
Env vars: MARKER_STEP (pdfs / images / all), MARKER_TEST_SUBDIR (debug subset).
Alternatives that avoid the openai conflict: PyMuPDF4LLM (lightest, no OCR), MinerU, Kreuzberg, Docling.
python scripts/run_rag.py

Applies sentence-aware chunking → hybrid embeddings → optional reranking → Qdrant upload with RRF fusion. Collection visible at http://localhost:32768/dashboard#/collections.
Key env vars: FORCE_REINDEX=true (schema changes), RETRIEVAL_TOP_N=10, RERANK_TOP_K=3.
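
To make "sentence-aware chunking with atomic code/table handling" concrete, here is a rough sketch of the idea. It is not the chunker in pipelines/rag; the names `split_blocks`, `chunk`, and `max_chars`, the character-based size limit, and the regex sentence splitter are all assumptions chosen to keep the example short.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker, built this way to avoid nesting fences


def split_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into ("code" | "table" | "prose", text) blocks."""
    blocks: list[tuple[str, str]] = []
    buf: list[str] = []
    kind = "prose"
    in_code = False

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            blocks.append((kind, text))
        buf.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            if in_code:  # closing fence ends the code block
                buf.append(line)
                flush()
                kind, in_code = "prose", False
            else:  # opening fence starts a new code block
                flush()
                kind, in_code = "code", True
                buf.append(line)
            continue
        if not in_code:
            line_kind = "table" if line.lstrip().startswith("|") else "prose"
            if line_kind != kind:
                flush()
                kind = line_kind
        buf.append(line)
    flush()
    return blocks


def chunk(markdown: str, max_chars: int = 200) -> list[str]:
    """Pack prose sentences up to max_chars; never split code or tables."""
    chunks: list[str] = []
    current = ""
    for kind, text in split_blocks(markdown):
        if kind != "prose":
            # Atomic block: close the running prose chunk, then emit it whole.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(text)
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The point of the atomic handling is that a code block or table split mid-way embeds poorly and reads badly when retrieved, so an oversized atomic block is preferred over a broken one.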
Qdrant indexing note: indexed_vectors_count: 0 with points_count > 0 is normal. HNSW indexing is deferred until data exceeds indexing_threshold. Lower the threshold in pipelines/rag/indexer.py for immediate indexing.
python -m assistant.cli
python -m assistant.cli "What should I focus on this quarter?"
streamlit run assistant/app.py

Always run from the repo root.
File-based memory at memory/:
- MEMORY.md: long-term distilled context (curated facts, preferences)
- YYYY-MM-DD.md: daily conversation logs
Each turn is appended as:
### HH:MM
User: ...
Assistant: ...
The agent loads MEMORY.md + the most recent RECENT_LOG_DAYS daily logs into its system prompt per call, implemented via per-call Agent instantiation (pydantic-ai 1.x freezes system_prompt after construction).
Toggle: ENABLE_MEMORY=true|false.
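
The memory flow described above can be sketched roughly as follows. This is an illustration, not the contents of assistant/memory.py: the function names `append_turn` and `build_memory_context` are hypothetical, and "most recent RECENT_LOG_DAYS daily logs" is interpreted here as today plus the preceding calendar days, which is an assumption.

```python
from datetime import date, datetime, timedelta
from pathlib import Path


def append_turn(memory_dir: Path, user: str, assistant: str, now: datetime) -> Path:
    """Append one turn to the daily log in the '### HH:MM' format."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    log_path = memory_dir / f"{now:%Y-%m-%d}.md"
    entry = f"### {now:%H:%M}\nUser: {user}\nAssistant: {assistant}\n\n"
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return log_path


def build_memory_context(memory_dir: Path, today: date, recent_log_days: int = 2) -> str:
    """Concatenate MEMORY.md and recent daily logs for the system prompt."""
    parts: list[str] = []
    long_term = memory_dir / "MEMORY.md"
    if long_term.exists():
        parts.append(long_term.read_text(encoding="utf-8").strip())
    # Oldest first, so the most recent turns end up last in the prompt.
    for offset in range(recent_log_days - 1, -1, -1):
        log_path = memory_dir / f"{today - timedelta(days=offset):%Y-%m-%d}.md"
        if log_path.exists():
            parts.append(log_path.read_text(encoding="utf-8").strip())
    return "\n\n".join(parts)
```

Because the returned string changes between turns, it would be passed into a freshly constructed agent on each call, which matches the per-call instantiation described above.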
Two runners share the same anchored rubric and golden test set. Run either or both for cross-validation:
| Script | Framework | Purpose |
|---|---|---|
| python -m evals.run_evals | Hand-rolled | Canonical. 4-criterion rubric (relevance, correctness, citation_quality, safety), LLM-as-judge via Ollama. |
| python -m evals.run_pydantic_evals | pydantic_evals | Same dataset + rubric via Evaluator + EvaluationReason. Runs all 3 tiers by default. |
Results written to evals/results/*.json. Set judge model via .env: JUDGE_MODEL=qwen3:8b.
Follows the frameworks in extras/llm-eval-patterns.md and extras/prompt-eval-designer.md, moving from vibes-based to statistically anchored evaluation (pointwise rubrics, 3-tier test suites, CI gates). Built on: ai-engineering-from-scratch.
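
As a sketch of how pointwise rubric scores might feed a CI gate, the snippet below averages the four criteria and applies two thresholds. The criterion names come from the table above; everything else (the function names, the 4.0 mean threshold, and the stricter per-case floor on safety) is a hypothetical choice, not the logic in evals/run_evals.py.

```python
from statistics import mean

CRITERIA = ("relevance", "correctness", "citation_quality", "safety")


def validate_scores(scores: dict[str, int]) -> dict[str, int]:
    """Check that every criterion is present and on the 1-5 anchored scale."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be in 1..5, got {scores[name]}")
    return {name: scores[name] for name in CRITERIA}


def gate(
    results: list[dict[str, int]],
    min_mean: float = 4.0,  # assumed threshold on the overall mean
    min_safety: int = 4,    # assumed floor applied to every single case
) -> tuple[bool, dict[str, float]]:
    """Return (passed, per-criterion means) for a batch of judged cases."""
    cleaned = [validate_scores(result) for result in results]
    per_criterion = {name: mean(r[name] for r in cleaned) for name in CRITERIA}
    overall = mean(per_criterion.values())
    worst_safety = min(r["safety"] for r in cleaned)
    return overall >= min_mean and worst_safety >= min_safety, per_criterion
```

Applying the safety floor per case rather than on the average is one way to stop a single unsafe answer from being hidden by otherwise strong scores.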
Observability: Arize Phoenix (optional)
extras/run_phoenix.py wires Phoenix OTel tracing. Every agent.run(...), LLM call, and tool call appears as a span in the Phoenix UI at http://localhost:6006.
pip install openinference-instrumentation-pydantic-ai opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-api
docker compose up -d phoenix
python -m extras.run_phoenix "your query here"

Don't run Phoenix tracing and DeepEval's DeepEvalInstrumentationSettings simultaneously: both wrap the same pydantic-ai OTel hooks.
docker compose up -d # starts both Qdrant and Phoenix
docker compose up -d qdrant # Qdrant only
docker compose up -d phoenix # Phoenix only

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "32768:6333"
      - "6334:6334"
    volumes:
      - ./data/qdrant:/qdrant/storage
    restart: unless-stopped
  phoenix:
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"
      - "4317:4317"
    volumes:
      - ./data/phoenix:/mnt/data
    restart: unless-stopped

The openai version conflict is the single biggest setup friction. pip check surfaces it as:
marker-pdf 1.10.2 requires openai<2.0.0, but you have openai 2.41.0
- marker-pdf pins openai<2.0.0
- pydantic-ai pulls openai>=2.0.0
- They are incompatible in the same environment
Workaround: swap versions as needed (~10 seconds):
# Before running marker / ETL scripts
pip install "openai<2.0.0,>=1.65.2"
# Before running the agent / evals
pip install "openai>=2.0.0"

The swap is only needed when you have new PDFs or images to OCR. Re-indexing already-cleaned markdown (run_rag.py) doesn't touch marker and doesn't need the swap.
New PDFs / images to OCR? → marker mode (pip install openai<2, run_marker.py)
Re-index existing cleaned markdown? → agent mode (pip install openai>=2, run_rag.py)
Chat / develop / run evals? → agent mode
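
Since it is easy to forget which openai version is currently installed, a small check like the one below could be run before a script starts. It is a sketch rather than part of the repo: the helper names `mode_for_openai` and `current_mode` are hypothetical, and the rule is simplified to "major version below 2 means marker mode".

```python
from importlib.metadata import PackageNotFoundError, version


def mode_for_openai(version_string: str) -> str:
    """Map an openai version string to the mode it supports."""
    major = int(version_string.split(".")[0])
    return "marker" if major < 2 else "agent"


def current_mode() -> str:
    """Report the mode for the installed openai package, if any."""
    try:
        return mode_for_openai(version("openai"))
    except PackageNotFoundError:
        return "missing"


print(f"openai is installed for: {current_mode()} mode")
```

A script such as run_marker.py could call `current_mode()` at startup and exit early with a hint to run the matching pip install command.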
| Symptom | Fix |
|---|---|
| Notion returns nothing | Pages not shared with the integration: open page → ⋯ → Connections → add integration |
| Notion rate limits / 429s | Use ETL_PAGE_NAME="Single Page" for first-run validation |
| GPU offload (CPU %) | Set OLLAMA_NUM_GPU=99 at shell level, or recreate model via Modelfile |
| pydantic-ai 404 against Ollama | Use OllamaProvider(base_url="http://localhost:11434/v1"); the ollama:<model> shorthand routes to the wrong path |
| Streamlit import errors | Always cd to repo root before running |
| Eval JSON missing | Default output: evals/results/*.json; override with --output PATH |
| Memory not used | Check ENABLE_MEMORY=true; memory/MEMORY.md is auto-created on first run |
| Stale chunks after schema change | Set FORCE_REINDEX=true to drop and rebuild the Qdrant collection |
MIT; see LICENSE. This repo is meant to be used, adapted and improved upon based on individual user needs and system capabilities.

