BlitzKode

language: en
license: mit
library_name: llama-cpp-python
pipeline_tag: text-generation

BlitzKode is a local, API-first coding assistant built around a fine-tuned Qwen2.5 model and served with llama-cpp-python. The project is intentionally backend-only: FastAPI exposes generation, streaming, retrieval-augmented generation, health, and metadata endpoints while the model stays on your machine.

Production model: neuralbroker/blitzkode
1.5B LoRA adapter: neuralbroker/blitzkode-1.5b-lora
0.5B LoRA adapter: neuralbroker/blitzkode-lora-0.5b
License: MIT project code; also comply with the upstream Qwen2.5 license for model redistribution.

What this repo contains

Area	Description
Inference API	`server.py` FastAPI app around a local GGUF model
Runtime model	`blitzkode.gguf` Q8_0 artifact, mounted/provided locally
Evaluation	`scripts/evaluate_model.py` plus `docs/evaluation_results.json`
Training/export	Dataset builders, LoRA training scripts, GGUF export utilities
Deployment	Python runtime `Dockerfile` and CPU/GPU `docker-compose.yml`

Key features

Local GGUF inference with no external model API calls
/generate JSON generation endpoint
/generate/stream SSE token streaming endpoint
/generate/research web-search-augmented endpoint
/search/web DuckDuckGo-backed search endpoint
Optional bearer-token auth, CORS controls, request-size limits, and per-IP rate limiting
llama.cpp-oriented performance controls: mmap loading, GPU layer offload, batch/micro-batch tuning, thread tuning, and optional prompt cache
Lightweight model regression evaluation harness
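
To make the per-IP rate limiting feature concrete, here is a minimal sliding-window sketch. It is not taken from `server.py`; the class name, the in-memory storage, and the 60-second window are assumptions for illustration, and the default of 30 requests mirrors `BLITZKODE_RATE_LIMIT_MAX`.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def allow(self, client_ip):
        """Record the request and return True, or return False if the IP is over its limit."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A handler would call `allow(client_ip)` before doing any work and answer with HTTP 429 when it returns `False`. State kept in process memory like this is lost on restart and is not shared between worker processes.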

Quick start

Prerequisites:

Python 3.11+
blitzkode.gguf in the repo root, or BLITZKODE_MODEL_PATH pointing to the model
4 GB+ RAM for the Q8_0 1.5B GGUF

pip install -r requirements.txt
python server.py
curl http://localhost:7860/health

The root route returns API status JSON. Use /info for endpoint metadata.

Docker

docker build -t blitzkode .
docker run -p 7860:7860 -v "$(pwd)/blitzkode.gguf:/app/blitzkode.gguf:ro" blitzkode

GPU profile, assuming NVIDIA container runtime and a GPU-enabled llama-cpp-python build:

BLITZKODE_GPU_LAYERS=-1 docker compose --profile gpu up --build

API examples

Streaming generation

curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'
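
For clients that are not using curl, the stream can be consumed with a small server-sent-events parser. The sketch below assumes the endpoint emits standard SSE framing (`data:` fields, events separated by blank lines). What each `data:` payload contains is not documented here, so the parser returns raw strings. The function name `iter_sse_data` is illustrative.

```python
def iter_sse_data(lines):
    """Yield the data payload of each server-sent event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips a single leading space from the value
        if field == "data":
            data_parts.append(value)
    if data_parts:
        # Lenient choice: flush a final event even without a trailing blank line.
        yield "\n".join(data_parts)
```

With `urllib.request.urlopen`, the response object can be iterated line by line, so `iter_sse_data(line.decode("utf-8") for line in response)` yields each payload as it arrives.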

Non-streaming generation

curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'
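
The same call can be made from Python with only the standard library. This is a sketch under two assumptions: that the optional bearer token travels in a standard `Authorization: Bearer` header, and that the response body is JSON (its exact fields are not listed in this README). The helper names are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:7860", api_key=None, **params):
    """Build a POST request for /generate; extra keyword arguments become JSON fields."""
    payload = {"prompt": prompt, **params}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the optional bearer token is sent in the Authorization header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def generate(prompt, timeout=120, **kwargs):
    """Send the request and return the decoded JSON body."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `generate("Binary search in Python", max_tokens=128)` sends the same body as the curl command above.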

Research-augmented generation

curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

Search only

curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

Request parameters

Generation endpoints

Used by /generate and /generate/stream.

Parameter	Type	Default	Description
`prompt`	string	required	User request
`messages`	array	`[]`	Conversation history, max 20 messages
`temperature`	float	`0.5`	Sampling randomness, `0.0-2.0`
`max_tokens`	int	`256`	Max output tokens, capped at 512
`top_p`	float	`0.95`	Nucleus sampling
`top_k`	int	`20`	Top-k sampling
`repeat_penalty`	float	`1.05`	Repetition penalty
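
A client can apply the defaults and limits from the table before sending a request. The sketch below is one possible reading: it clamps `max_tokens` to 512 and rejects an out-of-range `temperature` or an over-long `messages` list. Whether the server clamps or rejects such values is not stated here, so treat that split as an assumption.

```python
def normalize_generation_params(params):
    """Return a copy of `params` with the documented defaults and limits applied."""
    if not params.get("prompt"):
        raise ValueError("prompt is required")
    messages = list(params.get("messages", []))
    if len(messages) > 20:
        raise ValueError("messages accepts at most 20 entries")
    temperature = float(params.get("temperature", 0.5))
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within 0.0-2.0")
    return {
        "prompt": params["prompt"],
        "messages": messages,
        "temperature": temperature,
        # Assumption: values above the cap are clamped rather than rejected.
        "max_tokens": min(int(params.get("max_tokens", 256)), 512),
        "top_p": float(params.get("top_p", 0.95)),
        "top_k": int(params.get("top_k", 20)),
        "repeat_penalty": float(params.get("repeat_penalty", 1.05)),
    }
```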

Research endpoint extras

Used by /generate/research.

Parameter	Type	Default	Description
`search_query`	string	prompt	Optional search-query override
`search_results`	int	`5`	Number of search snippets to inject
`deep_search`	bool	`false`	Also search docs/best-practice query variants
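
The interaction between `search_query` and `deep_search` can be pictured as follows. This is a hypothetical sketch, not the server's code: the function name and the exact wording of the extra query variants are assumptions. Only the fallback to the prompt and the existence of docs and best-practice variants come from the table.

```python
def plan_search_queries(prompt, search_query=None, deep_search=False):
    """Return the list of web queries a research request might issue."""
    # search_query overrides the prompt when provided.
    base = (search_query or prompt).strip()
    queries = [base]
    if deep_search:
        # Hypothetical variant wording; the real server may phrase these differently.
        queries.append(f"{base} documentation")
        queries.append(f"{base} best practices")
    return queries
```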

Search endpoint

Used by /search/web.

Parameter	Type	Default	Description
`query`	string	required	Web search query
`max_results`	int	`5`	Number of results to return
`deep`	bool	`false`	Multi-variant deep search

Configuration

Variable	Default	Description
`BLITZKODE_MODEL_PATH`	`blitzkode.gguf`	GGUF model path
`BLITZKODE_HOST`	`0.0.0.0`	Server bind host
`BLITZKODE_PORT`	`7860`	Server port
`BLITZKODE_GPU_LAYERS`	`0`	llama.cpp GPU-offload layers; use `-1` for all supported layers
`BLITZKODE_N_CTX`	`2048`	Context window
`BLITZKODE_THREADS`	auto	Decode threads
`BLITZKODE_THREADS_BATCH`	auto	Prompt-processing threads
`BLITZKODE_BATCH`	`256`	Prompt-processing batch size
`BLITZKODE_UBATCH`	`128`	llama.cpp micro-batch size
`BLITZKODE_PROMPT_CACHE`	`true`	Enable RAM prompt cache when supported
`BLITZKODE_PROMPT_CACHE_BYTES`	`67108864`	Prompt-cache capacity in bytes (64 MiB)
`BLITZKODE_USE_MMAP`	`true`	Memory-map model weights
`BLITZKODE_USE_MLOCK`	`false`	Attempt to lock model pages in RAM
`BLITZKODE_OFFLOAD_KQV`	`true`	Offload K/Q/V operations when GPU layers are enabled
`BLITZKODE_PRELOAD_MODEL`	`false`	Load model during startup
`BLITZKODE_API_KEY`	empty	Optional bearer token
`BLITZKODE_CORS_ORIGINS`	`http://localhost:7860`	Comma-separated allowed origins
`BLITZKODE_WEB_SEARCH`	`true`	Enable search/research endpoints
`BLITZKODE_SEARCH_TIMEOUT`	`8`	Search timeout in seconds
`BLITZKODE_MAX_SEARCH_RESULTS`	`5`	Search result cap
`BLITZKODE_SEARCH_CACHE_TTL`	`300`	Search cache TTL in seconds
`BLITZKODE_RATE_LIMIT`	`true`	Enable per-IP rate limiting
`BLITZKODE_RATE_LIMIT_MAX`	`30`	Requests per IP per minute
`BLITZKODE_MAX_REQUEST_BYTES`	`50000`	Request body limit in bytes
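
The variables above are plain environment strings, so booleans, integers, and the comma-separated origin list need parsing. The following sketch shows one way to read a handful of them with the documented defaults. The helper names and the accepted boolean spellings (`1`, `true`, `yes`, `on`) are assumptions, not the parsing rules used by `server.py`.

```python
import os


def env_bool(name, default, environ=os.environ):
    """Read a boolean flag; unset or empty values fall back to the default."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name, default, environ=os.environ):
    """Read an integer; unset or empty values fall back to the default."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


def load_settings(environ=os.environ):
    """Collect a subset of the documented settings into a dict."""
    origins = environ.get("BLITZKODE_CORS_ORIGINS", "http://localhost:7860")
    return {
        "model_path": environ.get("BLITZKODE_MODEL_PATH", "blitzkode.gguf"),
        "port": env_int("BLITZKODE_PORT", 7860, environ),
        "gpu_layers": env_int("BLITZKODE_GPU_LAYERS", 0, environ),
        "n_ctx": env_int("BLITZKODE_N_CTX", 2048, environ),
        "use_mmap": env_bool("BLITZKODE_USE_MMAP", True, environ),
        "rate_limit_max": env_int("BLITZKODE_RATE_LIMIT_MAX", 30, environ),
        "cors_origins": [o.strip() for o in origins.split(",") if o.strip()],
    }
```

Passing a dict as `environ` makes the loader easy to exercise in tests without touching the real process environment.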

Evaluation

Latest local GGUF smoke evaluation was run with:

python scripts/evaluate_model.py

Runtime: CPU, n_ctx=2048, threads=8, batch=256, gpu_layers=0.

Eval case	Result	Notes
Python factorial with negative-input handling	✅ Pass	Generated correct iterative code with `ValueError`
Iterative binary search	✅ Pass	Generated loop-based search returning index or `-1`
SQL top users by order count	✅ Pass	Generated `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT 5`
Unknown fictional API uncertainty	❌ Fail	Raw model hallucinated a plausible signature; API guardrails block this pattern in `/generate` and `/generate/stream`

Summary: 3 / 4 passed (75%). Full output is in docs/evaluation_results.json.

Scope of this evaluation: it is a small deterministic smoke test for regression tracking, not a full benchmark. A stronger evaluation would add executable unit tests for generated code and benchmark-style suites with HumanEval- or MBPP-like tasks.
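
As a rough illustration of what an executable check could look like for the factorial case, the sketch below runs generated source and asserts on its behaviour. It is not part of `scripts/evaluate_model.py`; the helper names and the expectation that the model defines a function called `factorial` are assumptions.

```python
def run_generated_code(source, function_name):
    """Execute generated source in a fresh namespace and return the named function."""
    namespace = {}
    # WARNING: exec runs arbitrary code; only use this inside a sandbox or container.
    exec(source, namespace)
    func = namespace.get(function_name)
    if not callable(func):
        raise LookupError(f"{function_name} was not defined by the generated code")
    return func


def check_factorial(source):
    """Return True when the generated factorial passes a few executable checks."""
    factorial = run_generated_code(source, "factorial")
    if [factorial(n) for n in (0, 1, 5)] != [1, 1, 120]:
        return False
    try:
        factorial(-1)
    except ValueError:
        return True  # negative input was rejected as the eval case requires
    return False
```

Checks like this catch code that looks plausible but computes the wrong result, which a text-pattern comparison can miss.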

Training and export

BlitzKode was developed through staged LoRA fine-tuning and export:

Stage	Script	Purpose
SFT v1	`scripts/train_sft.py`	Curated coding examples
Reward-SFT	`scripts/train_reward_sft.py`	Heuristic quality continuation
DPO	`scripts/train_dpo.py`	Preference pairs for better answers and fewer hallucinations
Resource-aware SFT	`scripts/train_available.py`	Practical LoRA training run
Export	`scripts/export_production.py`	Merge/export to GGUF

Rebuild from scratch:

pip install -r requirements-training.txt
python scripts/build_full_dataset.py
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 \
  --seq-len 384 \
  --batch-size 1 \
  --grad-accum 8
python scripts/export_production.py

Project layout

BlitzKode/
  server.py                    FastAPI inference and search API
  blitzkode.gguf               Local model artifact, ignored by git
  scripts/evaluate_model.py    Lightweight GGUF evaluation harness
  docs/evaluation_results.json Latest smoke-eval output
  tests/test_server.py         Backend endpoint tests
  docs/                        Architecture, deployment, and HF model-card docs
  datasets/                    Dataset manifests and raw data location
  Dockerfile                   Python runtime image
  docker-compose.yml           CPU/GPU service definitions

Verification

python -m pytest tests/ -v
python -m ruff check .
python -m ruff format --check .
python -m mypy server.py --ignore-missing-imports
python -m compileall server.py tests scripts

Run model smoke evaluation separately because it requires the GGUF artifact:

python scripts/evaluate_model.py

Limitations

This is a small local model; review and test generated code.
Raw GGUF prompting can hallucinate fictional APIs; the server includes guardrails for common unknown-signature prompts.
Default context is 2,048 tokens; increase BLITZKODE_N_CTX only if memory allows.
The optional research endpoint uses web snippets as untrusted context and should not be treated as authoritative without verification.

License

MIT. See LICENSE. Also comply with the upstream Qwen2.5 license when redistributing derived model weights.

Created by Sajad (neuralbroker).
