The simplest, clearest implementation of a vector database from scratch. No Pinecone. No ChromaDB. No sentence-transformers. Just numpy.
Built as a learning resource. Every line is readable, every concept is spelled out. If you ever wondered what is actually inside a vector database, this is it.
git clone https://github.com/shivnathtathe/nanoVectorDB
cd nanoVectorDB
pip install numpy
python main.py
a complete vector database in 130 lines:
- tokenizer: split, lowercase, strip punctuation
- vocabulary: unique integer id per word
- co-occurrence matrix: meaning as neighborhood counts
- SVD embedder: compress 130-dim sparse rows into 32-dim dense vectors
- document embedder: average word vectors into sentence vectors
- vector store: insert and brute force cosine search
split on whitespace. lowercase. strip punctuation.
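a minimal sketch of the three tokenizer steps above. the function name `tokenize` and the use of `string.punctuation` are assumptions for illustration, not necessarily what `main.py` does.

```python
import string

def tokenize(text):
    """Split on whitespace, lowercase, strip leading/trailing punctuation."""
    tokens = []
    for raw in text.split():
        word = raw.lower().strip(string.punctuation)
        if word:  # drop tokens that were pure punctuation
            tokens.append(word)
    return tokens

print(tokenize("Vectors, vectors everywhere!"))
# ['vectors', 'vectors', 'everywhere']
```

note that `strip` only removes punctuation at the edges of a token, so a word like "co-occurrence" keeps its hyphen in this sketch.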
every unique token gets one integer id. repeated words are ignored.
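one way to sketch this, assuming the vocabulary is a plain dict from word to id (the name `build_vocab` is hypothetical): ids are handed out in order of first appearance, and a word that is already present is skipped.

```python
def build_vocab(token_lists):
    """Assign each unique token the next free integer id."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:  # repeated words keep their first id
                vocab[tok] = len(vocab)
    return vocab

vocab = build_vocab([["vector", "search"], ["vector", "database"]])
print(vocab)
# {'vector': 0, 'search': 1, 'database': 2}
```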
for every word, count how often each other word appears within a window of 3. a word is defined by the company it keeps.
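a small sketch of the counting step, assuming a symmetric window (3 tokens to the left and 3 to the right) and a dict vocabulary. the function name `cooccurrence_matrix` and the toy sentence are illustrative only.

```python
import numpy as np

def cooccurrence_matrix(token_lists, vocab, window=3):
    """Count how often each pair of words appears within `window` tokens."""
    n = len(vocab)
    M = np.zeros((n, n), dtype=np.float64)
    for tokens in token_lists:
        ids = [vocab[t] for t in tokens if t in vocab]
        for i, center in enumerate(ids):
            lo = max(0, i - window)
            hi = min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own neighbor
                    M[center, ids[j]] += 1.0
    return M

vocab = {"a": 0, "b": 1, "c": 2}
M = cooccurrence_matrix([["a", "b", "c"]], vocab, window=1)
print(M)
```

with `window=1`, "a" and "c" never co-occur in the toy sentence, so their entry stays 0. because every pair is counted from both sides, the matrix comes out symmetric.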
decompose the 130x130 co-occurrence matrix into themes. keep the top 32 dimensions. throw away the noise. L2 normalize so dot product == cosine similarity.
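a hedged sketch of this step using `np.linalg.svd`. scaling the left singular vectors by the singular values is an assumption here (using `U` alone is another common choice), and the random matrix only stands in for real co-occurrence counts.

```python
import numpy as np

def svd_embed(M, dim=32):
    """Reduce each row of M to `dim` dimensions and L2 normalize it."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    k = min(dim, len(S))
    E = U[:, :k] * S[:k]  # keep the top-k directions, weighted by strength
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, 1e-12)  # guard against all-zero rows

# stand-in for a 130x130 co-occurrence matrix (random symmetric counts)
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(130, 130)).astype(float)
M = counts + counts.T

E = svd_embed(M, dim=32)
print(E.shape)  # (130, 32)
```

once every row has unit length, `E[i] @ E[j]` is already the cosine similarity of words `i` and `j`, so no division is needed at search time.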
average the word vectors of all tokens in a sentence. L2 normalize.
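a minimal sketch of the averaging step, assuming `word_vectors` is a 2-D numpy array indexed by vocabulary id. the name `embed_document` and the toy 2-dim vectors are made up for illustration.

```python
import numpy as np

def embed_document(tokens, vocab, word_vectors):
    """Average the vectors of known tokens, then L2 normalize."""
    ids = [vocab[t] for t in tokens if t in vocab]  # unknown words are dropped
    if not ids:
        return np.zeros(word_vectors.shape[1])
    v = word_vectors[ids].mean(axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

vocab = {"vector": 0, "search": 1}
word_vectors = np.array([[1.0, 0.0],
                         [0.0, 1.0]])

doc = embed_document(["vector", "search", "unknown"], vocab, word_vectors)
print(doc)
```

the token "unknown" is not in the vocabulary, so it contributes nothing, and the result points halfway between the two known word vectors.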
insert stores (vector, text) pairs. search is brute force: one dot product per stored vector, O(n*d) for n vectors of dimension d.
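a sketch of what such a store could look like. the class name `VectorStore` and its method signatures are assumptions, and the vectors are expected to be L2 normalized already so that the dot product is the cosine similarity.

```python
import numpy as np

class VectorStore:
    """Keeps (vector, text) pairs and scans all of them on every query."""

    def __init__(self):
        self.vectors = []
        self.texts = []

    def insert(self, vector, text):
        self.vectors.append(np.asarray(vector, dtype=np.float64))
        self.texts.append(text)

    def search(self, query, k=3):
        if not self.vectors:
            return []
        matrix = np.stack(self.vectors)  # shape (n, d)
        scores = matrix @ np.asarray(query, dtype=np.float64)  # n dot products
        top = np.argsort(-scores)[:k]  # indices of the highest scores first
        return [(self.texts[i], float(scores[i])) for i in top]

store = VectorStore()
store.insert([1.0, 0.0], "about vectors")
store.insert([0.0, 1.0], "about training")
results = store.search([1.0, 0.0], k=1)
print(results)
# [('about vectors', 1.0)]
```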
vector vs search -> 0.9524
training vs loss -> 0.3913
attention vs model -> 0.2742
we never told the system these words are related. it figured that out purely by counting co-occurrences across 20 sentences.
word order is invisible. "the model encodes the input" and "the input encodes the model" produce identical vectors. negation is invisible. "not about transformers" still returns transformer results. unknown words are silently dropped.
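the word order failure follows directly from averaging, which does not care about order. a toy check, using random vectors as stand-ins for trained word vectors (the names `vectors` and `embed` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "model", "encodes", "input"]
vectors = {w: rng.normal(size=8) for w in words}  # stand-in word vectors

def embed(sentence):
    """Average known word vectors and L2 normalize; unknown words are skipped."""
    v = np.mean([vectors[w] for w in sentence.split() if w in vectors], axis=0)
    return v / np.linalg.norm(v)

a = embed("the model encodes the input")
b = embed("the input encodes the model")
c = embed("the model encodes the input zzz")  # "zzz" is out of vocabulary

print(np.allclose(a, b))  # True: same bag of words, same vector
print(np.allclose(a, c))  # True: the unknown word left no trace
```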
failures like these are part of what motivated later models, from Word2Vec to BERT to GPT.
this repo uses brute force search, O(n*d) per query. for scale, replace it with an approximate index such as HNSW (hierarchical navigable small world graphs), the algorithm behind many production vector DBs. next video: we build HNSW from scratch.
Let's build a Vector Database: from scratch, in code, spelled out.
MIT






