nanoVectorDB

The simplest, clearest implementation of a vector database from scratch. No Pinecone. No ChromaDB. No sentence-transformers. Just numpy.

Built as a learning resource. Every line is readable, every concept is spelled out. If you ever wondered what is actually inside a vector database, this is it.

install

git clone https://github.com/shivnathtathe/nanoVectorDB
cd nanoVectorDB
pip install numpy
python main.py

what it builds

a complete vector database in 130 lines:

tokenizer: split, lowercase, strip punctuation
vocabulary: unique integer id per word
co-occurrence matrix: meaning as neighborhood counts
SVD embedder: compress 130-dim sparse rows into 32-dim dense vectors
document embedder: average word vectors into sentence vectors
vector store: insert and brute force cosine search

1. tokenization

split on whitespace. lowercase. strip punctuation.
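
a minimal sketch of this step. the function name `tokenize` is an assumption, not necessarily what the repo uses, and "strip punctuation" is read here as removing leading and trailing punctuation from each token.

```python
import string

def tokenize(text):
    """Split on whitespace, lowercase, strip punctuation from token edges."""
    tokens = []
    for raw in text.split():
        tok = raw.lower().strip(string.punctuation)
        if tok:  # skip tokens that were nothing but punctuation
            tokens.append(tok)
    return tokens

print(tokenize("The model, encodes the INPUT!"))
```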

2. vocabulary

every unique token gets one integer id. a repeated word keeps the id it already has; no new entry is created.
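
one way to sketch it, assuming ids are handed out in order of first appearance (`build_vocab` is a hypothetical name):

```python
def build_vocab(tokenized_docs):
    """Map each unique token to an integer id."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in vocab:       # repeated words keep their existing id
                vocab[tok] = len(vocab)
    return vocab

print(build_vocab([["vector", "search", "vector"], ["search", "index"]]))
```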

3. co-occurrence matrix

for every word, count how often each other word appears within a window of 3. a word is defined by the company it keeps.
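
a sketch of the counting loop. it assumes the window is symmetric (3 tokens on each side) and that counts do not cross sentence boundaries; the repo may make different choices. `cooccurrence_matrix` is a hypothetical name.

```python
import numpy as np

def cooccurrence_matrix(tokenized_docs, vocab, window=3):
    """Count, for each word, how often every other word falls within `window` positions."""
    n = len(vocab)
    M = np.zeros((n, n), dtype=np.float64)
    for tokens in tokenized_docs:
        ids = [vocab[t] for t in tokens if t in vocab]
        for i, center in enumerate(ids):
            lo = max(0, i - window)
            hi = min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:             # a position does not count as its own neighbor
                    M[center, ids[j]] += 1.0
    return M
```

with a symmetric window the matrix is symmetric: if "vector" sees "search", then "search" sees "vector".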

4. SVD

factor the 130x130 co-occurrence matrix with SVD. each singular direction acts like a theme. keep the 32 directions with the largest singular values and throw away the rest as noise. L2 normalize each word vector so dot product == cosine similarity.
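
a sketch using `np.linalg.svd`. scaling the left singular vectors by their singular values (`U * S`) is an assumption; using `U` alone or `U * sqrt(S)` are also common, and the repo may do either. `svd_embed` is a hypothetical name.

```python
import numpy as np

def svd_embed(M, dim=32):
    """Reduce each row of M to `dim` dense dimensions, then L2 normalize."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)  # S is sorted largest first
    k = min(dim, len(S))
    vecs = U[:, :k] * S[:k]                          # keep the top-k directions
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)           # guard against all-zero rows
```

after normalization every row has length 1, so `a @ b` is already the cosine of the angle between `a` and `b`.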

5. document embedder

average the word vectors of all tokens in a sentence. L2 normalize.
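
a sketch of the averaging step. `embed_document` is a hypothetical name, and returning a zero vector when no token is in the vocabulary is an assumption about an edge case the text does not cover.

```python
import numpy as np

def embed_document(tokens, vocab, word_vecs):
    """Average the vectors of known tokens, then L2 normalize."""
    ids = [vocab[t] for t in tokens if t in vocab]  # unknown words are dropped
    if not ids:
        return np.zeros(word_vecs.shape[1])
    mean = word_vecs[ids].mean(axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean
```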

6. vector store

insert stores (vector, text) pairs. search is brute force: one dot product per stored vector, O(n*d) for n vectors of dimension d.
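
a sketch of such a store. the class name, method names and the `k` parameter are assumptions. it expects L2-normalized vectors, so the dot product it ranks by equals cosine similarity.

```python
import numpy as np

class VectorStore:
    def __init__(self):
        self.vectors = []
        self.texts = []

    def insert(self, vector, text):
        self.vectors.append(np.asarray(vector, dtype=np.float64))
        self.texts.append(text)

    def search(self, query, k=3):
        """Return the k best (score, text) pairs, highest score first."""
        if not self.vectors:
            return []
        matrix = np.stack(self.vectors)                        # shape (n, d)
        scores = matrix @ np.asarray(query, dtype=np.float64)  # n dot products
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.texts[i]) for i in top]

store = VectorStore()
store.insert([1.0, 0.0], "doc x")
store.insert([0.0, 1.0], "doc y")
store.insert([0.6, 0.8], "doc z")
print(store.search([1.0, 0.0], k=2))
```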

results

vector    vs search -> 0.9524
training  vs loss   -> 0.3913
attention vs model  -> 0.2742

we never told the system these words are related. it figured that out purely by counting co-occurrences across 20 sentences.

where it breaks

word order is invisible: "the model encodes the input" and "the input encodes the model" produce identical vectors.
negation is invisible: "not about transformers" still returns transformer results.
unknown words are silently dropped.
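
the word-order and unknown-word failures follow directly from averaging. a small self-contained demonstration, using random stand-in word vectors rather than the repo's trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in vectors: random, not learned from any corpus
vecs = {w: rng.normal(size=8) for w in ["the", "model", "encodes", "input"]}

def embed(sentence):
    known = [vecs[w] for w in sentence.split() if w in vecs]  # unknown words vanish
    mean = np.mean(known, axis=0)
    return mean / np.linalg.norm(mean)

a = embed("the model encodes the input")
b = embed("the input encodes the model")        # same words, different order
c = embed("the model encodes the zebra input")  # "zebra" is not in vecs

print(np.allclose(a, b), np.allclose(a, c))
```

an average does not care about order, and a word with no vector contributes nothing, so all three sentences land on the same point.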

failures like these helped motivate the move from count-based embeddings to learned ones (Word2Vec) and then to contextual models (BERT, GPT).

what is next

this repo uses brute force search, O(n*d) per query. for scale, replace it with an approximate index such as HNSW (hierarchical navigable small world graphs), the algorithm behind many production vector DBs. next video: we build HNSW from scratch.

video

Let's build a Vector Database: from scratch, in code, spelled out.

license

MIT
