neuralbroker/blitzkode

language: en
license: mit
library_name: llama-cpp-python
pipeline_tag: text-generation
tags: code-generation, coding-assistant, gguf, llama.cpp, qwen2.5, local-inference, fastapi
base_model: Qwen/Qwen2.5-1.5B-Instruct

BlitzKode

BlitzKode is a local, API-first coding assistant built around a fine-tuned Qwen2.5 model and served with llama-cpp-python. The project is intentionally backend-only: FastAPI exposes generation, streaming, retrieval-augmented generation, health, and metadata endpoints while the model stays on your machine.

What this repo contains

| Area | Description |
|---|---|
| Inference API | server.py FastAPI app around a local GGUF model |
| Runtime model | blitzkode.gguf Q8_0 artifact, mounted/provided locally |
| Evaluation | scripts/evaluate_model.py plus docs/evaluation_results.json |
| Training/export | Dataset builders, LoRA training scripts, GGUF export utilities |
| Deployment | Python runtime Dockerfile and CPU/GPU docker-compose.yml |

Key features

  • Local GGUF inference with no external model API calls
  • /generate JSON generation endpoint
  • /generate/stream SSE token streaming endpoint
  • /generate/research web-search-augmented endpoint
  • /search/web DuckDuckGo-backed search endpoint
  • Optional bearer-token auth, CORS controls, request-size limits, and per-IP rate limiting
  • llama.cpp-oriented performance controls: mmap loading, GPU layer offload, batch/micro-batch tuning, thread tuning, and optional prompt cache
  • Lightweight model regression evaluation harness
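The per-IP rate limiting listed above could work roughly as follows. This is a minimal sketch of a sliding-window limiter, not the code in server.py, which the README does not show; the class name `SlidingWindowLimiter` and its parameters are hypothetical. The default of 30 requests per 60 seconds mirrors the documented BLITZKODE_RATE_LIMIT_MAX default.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`.

    Hypothetical helper for illustration; not taken from server.py.
    """

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A server would typically call `allow(request_ip)` before handling a request and answer with HTTP 429 when it returns False.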

Quick start

Prerequisites:

  • Python 3.11+
  • blitzkode.gguf in the repo root, or BLITZKODE_MODEL_PATH pointing to the model
  • 4 GB+ RAM for the bundled Q8_0 1.5B GGUF

Install dependencies, start the server, and check that it responds:

pip install -r requirements.txt
python server.py
curl http://localhost:7860/health

The root route returns API status JSON. Use /info for endpoint metadata.

Docker

docker build -t blitzkode .
docker run -p 7860:7860 -v "$(pwd)/blitzkode.gguf:/app/blitzkode.gguf:ro" blitzkode

GPU profile, assuming NVIDIA container runtime and a GPU-enabled llama-cpp-python build:

BLITZKODE_GPU_LAYERS=-1 docker compose --profile gpu up --build

API examples

Streaming generation

curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'
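On the client side, the SSE stream can be consumed by splitting it into events. The sketch below parses generic `data:` lines as defined by the Server-Sent Events format; the README does not specify what each event payload contains (plain text or JSON), so the helper returns payloads as raw strings. The name `iter_sse_data` is hypothetical.

```python
def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            # Per the SSE format, one leading space after the colon is dropped.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (":...") and other fields (event:, id:) are ignored here.
    if buffer:
        # Lenient choice: flush a final event that lacks a trailing blank line.
        yield "\n".join(buffer)


# Possible usage against a running server (not executed here):
# import json, urllib.request
# req = urllib.request.Request(
#     "http://localhost:7860/generate/stream",
#     data=json.dumps({"prompt": "Reverse a linked list"}).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
#     method="POST",
# )
# with urllib.request.urlopen(req) as resp:
#     for payload in iter_sse_data(line.decode("utf-8") for line in resp):
#         print(payload)
```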

Non-streaming generation

curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'
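The same call can be made from Python with only the standard library. This is a sketch under two assumptions: the optional bearer token is sent in a standard `Authorization: Bearer ...` header, and the response body is JSON whose schema is not documented here, so it is returned unmodified. The function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:7860", api_key=None, **params):
    """Build a POST request for /generate; extra keyword arguments become body fields."""
    body = {"prompt": prompt, **params}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the optional bearer token travels in the Authorization header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response as-is."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `generate("Binary search in Python", max_tokens=128)` sends the same body as the curl command above.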

Research-augmented generation

curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

Search only

curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'

Request parameters

Generation endpoints

Used by /generate and /generate/stream.

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | User request |
| messages | array | [] | Conversation history, max 20 messages |
| temperature | float | 0.5 | Sampling randomness, 0.0-2.0 |
| max_tokens | int | 256 | Max output tokens, capped at 512 |
| top_p | float | 0.95 | Nucleus sampling |
| top_k | int | 20 | Top-k sampling |
| repeat_penalty | float | 1.05 | Repetition penalty |
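A client can apply the documented defaults and limits before sending a request. The sketch below is one way to do that and is not the server's validation code. Whether the server rejects or silently clamps out-of-range values is not stated, so this hypothetical helper clamps `max_tokens` (documented as "capped at 512") and raises for the other violations.

```python
def normalize_generation_params(payload):
    """Return a copy of `payload` with defaults applied and documented limits enforced."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt is required")
    messages = payload.get("messages", [])
    if len(messages) > 20:
        raise ValueError("messages accepts at most 20 entries")
    temperature = float(payload.get("temperature", 0.5))
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within 0.0-2.0")
    return {
        "prompt": prompt,
        "messages": list(messages),
        "temperature": temperature,
        # Assumption: over-limit values are clamped rather than rejected.
        "max_tokens": min(int(payload.get("max_tokens", 256)), 512),
        "top_p": float(payload.get("top_p", 0.95)),
        "top_k": int(payload.get("top_k", 20)),
        "repeat_penalty": float(payload.get("repeat_penalty", 1.05)),
    }
```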

Research endpoint extras

Used by /generate/research.

| Parameter | Type | Default | Description |
|---|---|---|---|
| search_query | string | prompt | Optional search-query override |
| search_results | int | 5 | Number of search snippets to inject |
| deep_search | bool | false | Also search docs/best-practice query variants |

Search endpoint

Used by /search/web.

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | required | Web search query |
| max_results | int | 5 | Number of results to return |
| deep | bool | false | Multi-variant deep search |

Configuration

| Variable | Default | Description |
|---|---|---|
| BLITZKODE_MODEL_PATH | blitzkode.gguf | GGUF model path |
| BLITZKODE_HOST | 0.0.0.0 | Server bind host |
| BLITZKODE_PORT | 7860 | Server port |
| BLITZKODE_GPU_LAYERS | 0 | llama.cpp GPU-offload layers; use -1 for all supported layers |
| BLITZKODE_N_CTX | 2048 | Context window in tokens |
| BLITZKODE_THREADS | auto | Decode threads |
| BLITZKODE_THREADS_BATCH | auto | Prompt-processing threads |
| BLITZKODE_BATCH | 256 | Prompt-processing batch size |
| BLITZKODE_UBATCH | 128 | llama.cpp micro-batch size |
| BLITZKODE_PROMPT_CACHE | true | Enable RAM prompt cache when supported |
| BLITZKODE_PROMPT_CACHE_BYTES | 67108864 | Prompt-cache capacity in bytes (64 MiB) |
| BLITZKODE_USE_MMAP | true | Memory-map model weights |
| BLITZKODE_USE_MLOCK | false | Attempt to lock model pages in RAM |
| BLITZKODE_OFFLOAD_KQV | true | Offload K/Q/V operations when GPU layers are enabled |
| BLITZKODE_PRELOAD_MODEL | false | Load model during startup |
| BLITZKODE_API_KEY | (empty) | Optional bearer token |
| BLITZKODE_CORS_ORIGINS | http://localhost:7860 | Comma-separated allowed origins |
| BLITZKODE_WEB_SEARCH | true | Enable search/research endpoints |
| BLITZKODE_SEARCH_TIMEOUT | 8 | Search timeout in seconds |
| BLITZKODE_MAX_SEARCH_RESULTS | 5 | Search result cap |
| BLITZKODE_SEARCH_CACHE_TTL | 300 | Search cache TTL in seconds |
| BLITZKODE_RATE_LIMIT | true | Enable per-IP rate limiting |
| BLITZKODE_RATE_LIMIT_MAX | 30 | Requests per IP per minute |
| BLITZKODE_MAX_REQUEST_BYTES | 50000 | Request body limit in bytes |
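Settings like these are usually read from the environment once at startup. The sketch below shows one way to load a subset of them with the documented defaults; it is not taken from server.py. The helper names and the set of strings treated as "true" are assumptions.

```python
import os


def env_bool(name, default, env=os.environ):
    """Parse a boolean variable; the accepted truthy spellings are an assumption."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name, default, env=os.environ):
    """Parse an integer variable, falling back to `default` when unset or empty."""
    raw = env.get(name)
    return default if raw is None or raw.strip() == "" else int(raw)


def load_settings(env=os.environ):
    """Collect a subset of the BLITZKODE_* variables into a plain dict."""
    origins = env.get("BLITZKODE_CORS_ORIGINS", "http://localhost:7860")
    return {
        "model_path": env.get("BLITZKODE_MODEL_PATH", "blitzkode.gguf"),
        "port": env_int("BLITZKODE_PORT", 7860, env),
        "gpu_layers": env_int("BLITZKODE_GPU_LAYERS", 0, env),
        "n_ctx": env_int("BLITZKODE_N_CTX", 2048, env),
        "use_mmap": env_bool("BLITZKODE_USE_MMAP", True, env),
        "rate_limit_max": env_int("BLITZKODE_RATE_LIMIT_MAX", 30, env),
        # Comma-separated list; surrounding whitespace and empty entries are dropped.
        "cors_origins": [o.strip() for o in origins.split(",") if o.strip()],
    }
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.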

Evaluation

Latest local GGUF smoke evaluation was run with:

python scripts/evaluate_model.py

Runtime: CPU, n_ctx=2048, threads=8, batch=256, gpu_layers=0.

| Eval case | Result | Notes |
|---|---|---|
| Python factorial with negative-input handling | ✅ Pass | Generated correct iterative code with ValueError |
| Iterative binary search | ✅ Pass | Generated loop-based search returning index or -1 |
| SQL top users by order count | ✅ Pass | Generated JOIN, GROUP BY, ORDER BY, and LIMIT 5 |
| Unknown fictional API uncertainty | ❌ Fail | Raw model hallucinated a plausible signature; API guardrails block this pattern in /generate and /generate/stream |

Summary: 3 / 4 passed (75%). Full output is in docs/evaluation_results.json.

Scope: this is a small, deterministic smoke test for regression tracking, not a full benchmark. A stronger evaluation would add executable unit tests for the generated code and benchmark suites in the style of HumanEval or MBPP.
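An executable check for the factorial case could look like the sketch below. It is not part of scripts/evaluate_model.py. It assumes the generated code defines a function named `factorial`, and `SAMPLE` is a hand-written stand-in, not actual model output.

```python
def check_factorial(generated_source):
    """Execute generated code and test the `factorial` function it should define."""
    namespace = {}
    # WARNING: exec runs arbitrary code. Use only in a sandbox or throwaway container.
    exec(generated_source, namespace)
    factorial = namespace.get("factorial")
    if not callable(factorial):
        return False
    try:
        if [factorial(n) for n in (0, 1, 5)] != [1, 1, 120]:
            return False
    except Exception:
        return False
    # The eval case also expects negative input to raise ValueError.
    try:
        factorial(-1)
    except ValueError:
        return True
    except Exception:
        return False
    return False


# Hand-written stand-in for a model answer (illustration only).
SAMPLE = '''
def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
'''
```

Checks of this kind judge behavior, not surface patterns, so an answer that looks plausible but computes the wrong value fails.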

Training and export

BlitzKode was developed through staged LoRA fine-tuning and export:

| Stage | Script | Purpose |
|---|---|---|
| SFT v1 | scripts/train_sft.py | Curated coding examples |
| Reward-SFT | scripts/train_reward_sft.py | Heuristic quality continuation |
| DPO | scripts/train_dpo.py | Preference pairs for better answers and fewer hallucinations |
| Resource-aware SFT | scripts/train_available.py | Practical LoRA training run |
| Export | scripts/export_production.py | Merge/export to GGUF |

Rebuild from scratch:

pip install -r requirements-training.txt
python scripts/build_full_dataset.py
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 \
  --seq-len 384 \
  --batch-size 1 \
  --grad-accum 8
python scripts/export_production.py

Project layout

BlitzKode/
  server.py                    FastAPI inference and search API
  blitzkode.gguf               Local model artifact, ignored by git
  scripts/evaluate_model.py    Lightweight GGUF evaluation harness
  docs/evaluation_results.json Latest smoke-eval output
  tests/test_server.py         Backend endpoint tests
  docs/                        Architecture, deployment, and HF model-card docs
  datasets/                    Dataset manifests and raw data location
  Dockerfile                   Python runtime image
  docker-compose.yml           CPU/GPU service definitions

Verification

python -m pytest tests/ -v
python -m ruff check .
python -m ruff format --check .
python -m mypy server.py --ignore-missing-imports
python -m compileall server.py tests scripts

Run model smoke evaluation separately because it requires the GGUF artifact:

python scripts/evaluate_model.py

Limitations

  • This is a small local model; review and test generated code.
  • Raw GGUF prompting can hallucinate fictional APIs; the server includes guardrails for common unknown-signature prompts.
  • Default context is 2,048 tokens; increase BLITZKODE_N_CTX only if memory allows.
  • The optional research endpoint uses web snippets as untrusted context and should not be treated as authoritative without verification.

License

MIT. See LICENSE. Also comply with the upstream Qwen2.5 license when redistributing derived model weights.


Created by Sajad (neuralbroker).
