nanoVectorDB

The simplest, clearest implementation of a vector database from scratch. No Pinecone. No ChromaDB. No sentence-transformers. Just numpy.

Built as a learning resource. Every line is readable, every concept is spelled out. If you ever wondered what is actually inside a vector database, this is it.

install

git clone https://github.com/shivnathtathe/nanoVectorDB
cd nanoVectorDB
pip install numpy
python main.py

what it builds

a complete vector database in 130 lines:

  • tokenizer: split, lowercase, strip punctuation
  • vocabulary: unique integer id per word
  • co-occurrence matrix: meaning as neighborhood counts
  • SVD embedder: compress 130-dim sparse rows into 32-dim dense vectors
  • document embedder: average word vectors into sentence vectors
  • vector store: insert and brute force cosine search

1. tokenization

split on whitespace. lowercase. strip punctuation.
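
a minimal sketch of this step, assuming punctuation is stripped only from the edges of each token (so "don't" keeps its apostrophe). the function name `tokenize` is illustrative and may not match main.py:

```python
import string

def tokenize(text):
    """Split on whitespace, lowercase, strip punctuation from token edges."""
    tokens = []
    for raw in text.split():
        token = raw.lower().strip(string.punctuation)
        if token:  # drop tokens that were pure punctuation
            tokens.append(token)
    return tokens

print(tokenize("Vector search, explained!"))
# ['vector', 'search', 'explained']
```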

2. vocabulary

every unique token gets one integer id. a repeated word keeps the id it got the first time it was seen.
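
one way to sketch this, assuming ids are handed out in order of first appearance (the name `build_vocab` is hypothetical):

```python
def build_vocab(tokens):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:  # repeated words are skipped
            vocab[token] = len(vocab)
    return vocab

print(build_vocab(["the", "model", "encodes", "the", "input"]))
# {'the': 0, 'model': 1, 'encodes': 2, 'input': 3}
```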

3. co-occurrence matrix

for every word, count how often each other word appears within a window of 3. a word is defined by the company it keeps.
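
a rough sketch of the counting loop. it assumes "window of 3" means up to 3 tokens on each side of the center word and that windows do not cross sentence boundaries. both are assumptions, and `cooccurrence_matrix` is an illustrative name:

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=3):
    """Count neighbors within `window` positions on each side of every token.

    sentences: list of token lists; vocab: dict of token -> integer id.
    """
    n = len(vocab)
    counts = np.zeros((n, n), dtype=np.float64)
    for tokens in sentences:
        ids = [vocab[t] for t in tokens if t in vocab]
        for i, center in enumerate(ids):
            lo = max(0, i - window)
            hi = min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own neighbor
                    counts[center, ids[j]] += 1
    return counts
```

because every (center, neighbor) pair is also visited as (neighbor, center), the resulting matrix is symmetric.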

4. SVD

decompose the 130x130 co-occurrence matrix with SVD. keep the top 32 dimensions (the strongest "themes") and throw away the rest as noise. L2 normalize each row so dot product == cosine similarity.
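
a minimal sketch using `np.linalg.svd`. scaling the left singular vectors by their singular values is an assumption here (using `U` alone is another common choice), and `svd_embed` is a hypothetical name:

```python
import numpy as np

def svd_embed(cooc, dim=32):
    """Reduce each row of a co-occurrence matrix to a `dim`-dimensional unit vector."""
    U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
    k = min(dim, len(S))              # cannot keep more dimensions than exist
    vectors = U[:, :k] * S[:k]        # assumption: weight each axis by its singular value
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)  # L2 normalize, avoid divide-by-zero
```

after normalization, `vectors[i] @ vectors[j]` is the cosine similarity between word i and word j.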

5. document embedder

average the word vectors of all tokens in a sentence. L2 normalize.
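
sketched under the assumption that unknown tokens are skipped and an all-unknown sentence maps to the zero vector (the second part is a guess at a sensible fallback, and `embed_document` is an illustrative name):

```python
import numpy as np

def embed_document(tokens, vocab, word_vectors):
    """Average the vectors of known tokens, then L2 normalize.

    word_vectors: array of shape (len(vocab), dim), row i is the vector for id i.
    """
    ids = [vocab[t] for t in tokens if t in vocab]  # unknown words are dropped
    if not ids:
        return np.zeros(word_vectors.shape[1])      # assumption: zero vector fallback
    mean = word_vectors[ids].mean(axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean
```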

6. vector store

insert stores (vector, text) pairs. search is brute force: one dot product per stored vector, O(n*d) for n vectors of dimension d.
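
a small sketch of such a store, assuming all vectors are already L2 normalized so the dot product is the cosine similarity. the class and method names are illustrative and may differ from main.py:

```python
import numpy as np

class VectorStore:
    """Brute force store: keep every vector, score all of them on each query."""

    def __init__(self):
        self.vectors = []
        self.texts = []

    def insert(self, vector, text):
        self.vectors.append(np.asarray(vector, dtype=np.float64))
        self.texts.append(text)

    def search(self, query, k=3):
        if not self.vectors:
            return []
        matrix = np.stack(self.vectors)                       # shape (n, d)
        scores = matrix @ np.asarray(query, dtype=np.float64)  # n dot products
        top = np.argsort(-scores)[:k]                         # highest score first
        return [(self.texts[i], float(scores[i])) for i in top]

store = VectorStore()
store.insert([1.0, 0.0], "about vectors")
store.insert([0.0, 1.0], "about training")
print(store.search([1.0, 0.0], k=1))
# [('about vectors', 1.0)]
```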

results

vector   vs search   -> 0.9524
training vs loss     -> 0.3913
attention vs model   -> 0.2742

we never told the system these words are related. it figured that out purely by counting co-occurrences across 20 sentences.

where it breaks

word order is invisible. "the model encodes the input" and "the input encodes the model" produce identical vectors. negation is invisible. "not about transformers" still returns transformer results. unknown words are silently dropped.
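
the word order and unknown word failures are easy to reproduce. a minimal sketch, assuming random stand-in word vectors rather than the repo's trained embeddings, with a hypothetical `embed` helper:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in vectors: any fixed word -> vector table shows the same effect
vectors = {w: rng.normal(size=8) for w in ["the", "model", "encodes", "input"]}

def embed(sentence):
    """Average the known word vectors and L2 normalize (needs >= 1 known word)."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    mean = np.mean(known, axis=0)
    return mean / np.linalg.norm(mean)

a = embed("the model encodes the input")
b = embed("the input encodes the model")        # same words, swapped roles
c = embed("the model encodes the zebra input")  # "zebra" is unknown, so it is dropped

print(np.allclose(a, b))  # True: averaging ignores order
print(np.allclose(a, c))  # True: the unknown word leaves no trace
```

an average is a bag of words: any permutation of the same tokens gives the same sum, so the sentence vector cannot encode who did what to whom.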

these three failures motivated everything from Word2Vec to BERT to GPT.

what is next

this repo uses brute force search, O(n*d) per query. for scale, replace it with HNSW (hierarchical navigable small world graphs), an approximate nearest neighbor index used by many production vector DBs. next video: we build HNSW from scratch.

video

Let's build a Vector Database: from scratch, in code, spelled out.

license

MIT
