| Field | Value |
|---|---|
| license | mit |
| library_name | llama-cpp-python |
| pipeline_tag | text-generation |

BlitzKode is a local, API-first coding assistant built around a fine-tuned Qwen2.5 model and served with llama-cpp-python. The project is intentionally backend-only: FastAPI exposes generation, streaming, retrieval-augmented generation, health, and metadata endpoints while the model stays on your machine.
- Production model: `neuralbroker/blitzkode`
- 1.5B LoRA adapter: `neuralbroker/blitzkode-1.5b-lora`
- 0.5B LoRA adapter: `neuralbroker/blitzkode-lora-0.5b`
- License: MIT project code; also comply with the upstream Qwen2.5 license for model redistribution.

| Area | Description |
|---|---|
| Inference API | server.py FastAPI app around a local GGUF model |
| Runtime model | blitzkode.gguf Q8_0 artifact, mounted/provided locally |
| Evaluation | scripts/evaluate_model.py plus docs/evaluation_results.json |
| Training/export | Dataset builders, LoRA training scripts, GGUF export utilities |
| Deployment | Python runtime Dockerfile and CPU/GPU docker-compose.yml |
- Local GGUF inference with no external model API calls
- `/generate` JSON generation endpoint
- `/generate/stream` SSE token streaming endpoint
- `/generate/research` web-search-augmented endpoint
- `/search/web` DuckDuckGo-backed search endpoint
- Optional bearer-token auth, CORS controls, request-size limits, and per-IP rate limiting
- llama.cpp-oriented performance controls: mmap loading, GPU layer offload, batch/micro-batch tuning, thread tuning, and optional prompt cache
- Lightweight model regression evaluation harness
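
The `/generate/stream` endpoint listed above emits Server-Sent Events. The following is a minimal client-side sketch of reading the `data:` fields of an SSE stream in Python. The event payload shape (a JSON object with a `token` key) is a hypothetical stand-in, since this README does not document what the server actually sends; inspect a real response before relying on it.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if field == "data":
            # The SSE format drops one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:
        # Lenient: flush a trailing event that lacks a final blank line.
        yield "\n".join(buffer)


# Hypothetical sample stream; the real payload shape may differ.
sample = [
    'data: {"token": "def"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"token": " reverse"}\n',
    "\n",
]
tokens = [json.loads(chunk)["token"] for chunk in iter_sse_data(sample)]
print("".join(tokens))
```

In practice the `lines` argument would come from a streaming HTTP response decoded line by line rather than from a list.
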
Prerequisites:
- Python 3.11+
- `blitzkode.gguf` in the repo root, or `BLITZKODE_MODEL_PATH` pointing to the model
- 4 GB+ RAM for the bundled Q8_0 1.5B GGUF

```bash
pip install -r requirements.txt
python server.py
curl http://localhost:7860/health
```

The root route returns API status JSON. Use `/info` for endpoint metadata.

```bash
docker build -t blitzkode .
docker run -p 7860:7860 -v ./blitzkode.gguf:/app/blitzkode.gguf:ro blitzkode
```

GPU profile, assuming NVIDIA container runtime and a GPU-enabled llama-cpp-python build:

```bash
BLITZKODE_GPU_LAYERS=-1 docker compose --profile gpu up --build
```

Example requests:

```bash
# Stream tokens over SSE
curl -X POST http://localhost:7860/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python function to reverse a linked list"}'

# JSON generation
curl -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Binary search in Python","max_tokens":128}'

# Web-search-augmented generation
curl -X POST http://localhost:7860/generate/research \
  -H "Content-Type: application/json" \
  -d '{"prompt":"How do I use async generators in Python 3.12?","deep_search":true}'

# Web search only
curl -X POST http://localhost:7860/search/web \
  -H "Content-Type: application/json" \
  -d '{"query":"FastAPI dependency injection","max_results":3}'
```

Used by `/generate` and `/generate/stream`.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | User request |
| `messages` | array | `[]` | Conversation history, max 20 messages |
| `temperature` | float | `0.5` | Sampling randomness, 0.0-2.0 |
| `max_tokens` | int | `256` | Max output tokens, capped at 512 |
| `top_p` | float | `0.95` | Nucleus sampling |
| `top_k` | int | `20` | Top-k sampling |
| `repeat_penalty` | float | `1.05` | Repetition penalty |

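
The parameter table above can be mirrored on the client side. The sketch below builds a `/generate` request body with the documented defaults and limits, and adds a bearer header when an API key is set. It rests on several assumptions: the helper names are hypothetical, the `Authorization: Bearer` scheme is inferred from the "bearer-token auth" feature, the server may reject or clamp out-of-range values differently, and the response body is treated as opaque JSON because its fields are not documented here.

```python
import json
import urllib.request
from typing import Any, Optional

API_URL = "http://localhost:7860"  # default host/port used in this README


def build_generate_payload(
    prompt: str,
    messages: Optional[list] = None,
    temperature: float = 0.5,
    max_tokens: int = 256,
    top_p: float = 0.95,
    top_k: int = 20,
    repeat_penalty: float = 1.05,
) -> dict[str, Any]:
    """Build a request body using the documented defaults and limits."""
    if not prompt:
        raise ValueError("prompt is required")
    history = list(messages or [])
    if len(history) > 20:
        raise ValueError("messages is limited to 20 entries")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within 0.0-2.0")
    return {
        "prompt": prompt,
        "messages": history,
        "temperature": temperature,
        "max_tokens": min(max_tokens, 512),  # client-side mirror of the 512 cap
        "top_p": top_p,
        "top_k": top_k,
        "repeat_penalty": repeat_penalty,
    }


def build_headers(api_key: Optional[str] = None) -> dict[str, str]:
    """JSON headers, plus a bearer token when a key is configured (assumed scheme)."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


def generate(payload: dict[str, Any], api_key: Optional[str] = None, timeout: float = 120.0) -> Any:
    """POST the payload to /generate and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{API_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=build_headers(api_key),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_generate_payload("Binary search in Python", max_tokens=1024)
print(payload["max_tokens"])
```

Calling `generate(payload)` requires a running server; the builder functions can be exercised without one.
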
Used by /generate/research.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_query` | string | prompt | Optional search-query override |
| `search_results` | int | `5` | Number of search snippets to inject |
| `deep_search` | bool | `false` | Also search docs/best-practice query variants |

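
To illustrate what "injecting" search snippets can look like, here is a minimal sketch that folds up to `search_results` snippets into a prompt and marks them as untrusted context. This is not the server's actual template: the function name, the snippet keys (`title`, `url`, `snippet`), and the wording of the context header are all hypothetical.

```python
from typing import Mapping, Sequence


def build_research_prompt(
    prompt: str,
    snippets: Sequence[Mapping[str, str]],
    search_results: int = 5,
) -> str:
    """Prepend numbered web snippets to the user prompt (illustrative layout only)."""
    selected = list(snippets)[: max(0, search_results)]
    if not selected:
        return prompt  # nothing to inject; fall back to the plain prompt
    lines = ["Web search context (untrusted; verify before relying on it):"]
    for index, item in enumerate(selected, start=1):
        title = item.get("title", "untitled")
        url = item.get("url", "")
        text = " ".join(item.get("snippet", "").split())  # collapse whitespace
        lines.append(f"[{index}] {title} ({url}): {text}")
    lines.append("")
    lines.append(f"Question: {prompt}")
    return "\n".join(lines)


# Hypothetical snippets for demonstration.
example = build_research_prompt(
    "How do I use async generators?",
    [
        {"title": "Doc A", "url": "https://example.com/a", "snippet": "Use  async def\nwith yield."},
        {"title": "Doc B", "url": "https://example.com/b", "snippet": "Iterate with async for."},
    ],
    search_results=1,
)
print(example)
```

Numbering the snippets lets the model, and a human reviewer, trace a claim back to its source.
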
Used by /search/web.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Web search query |
| `max_results` | int | `5` | Number of results to return |
| `deep` | bool | `false` | Multi-variant deep search |

| Variable | Default | Description |
|---|---|---|
| `BLITZKODE_MODEL_PATH` | `blitzkode.gguf` | GGUF model path |
| `BLITZKODE_HOST` | `0.0.0.0` | Server bind host |
| `BLITZKODE_PORT` | `7860` | Server port |
| `BLITZKODE_GPU_LAYERS` | `0` | llama.cpp GPU-offload layers; use `-1` for all supported layers |
| `BLITZKODE_N_CTX` | `2048` | Context window |
| `BLITZKODE_THREADS` | auto | Decode threads |
| `BLITZKODE_THREADS_BATCH` | auto | Prompt-processing threads |
| `BLITZKODE_BATCH` | `256` | Prompt-processing batch size |
| `BLITZKODE_UBATCH` | `128` | llama.cpp micro-batch size |
| `BLITZKODE_PROMPT_CACHE` | `true` | Enable RAM prompt cache when supported |
| `BLITZKODE_PROMPT_CACHE_BYTES` | `67108864` | Prompt-cache capacity in bytes (64 MiB) |
| `BLITZKODE_USE_MMAP` | `true` | Memory-map model weights |
| `BLITZKODE_USE_MLOCK` | `false` | Attempt to lock model pages in RAM |
| `BLITZKODE_OFFLOAD_KQV` | `true` | Offload K/Q/V operations when GPU layers are enabled |
| `BLITZKODE_PRELOAD_MODEL` | `false` | Load model during startup |
| `BLITZKODE_API_KEY` | empty | Optional bearer token |
| `BLITZKODE_CORS_ORIGINS` | `http://localhost:7860` | Comma-separated allowed origins |
| `BLITZKODE_WEB_SEARCH` | `true` | Enable search/research endpoints |
| `BLITZKODE_SEARCH_TIMEOUT` | `8` | Search timeout in seconds |
| `BLITZKODE_MAX_SEARCH_RESULTS` | `5` | Search result cap |
| `BLITZKODE_SEARCH_CACHE_TTL` | `300` | Search cache TTL in seconds |
| `BLITZKODE_RATE_LIMIT` | `true` | Enable per-IP rate limiting |
| `BLITZKODE_RATE_LIMIT_MAX` | `30` | Requests per IP per minute |
| `BLITZKODE_MAX_REQUEST_BYTES` | `50000` | Request body limit in bytes |

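
As an illustration of how the llama.cpp-related variables above could be turned into loader arguments, the sketch below reads them with the documented defaults and produces a keyword dictionary. It is an assumption about the mapping, not a copy of `server.py`: the helper names are hypothetical, the accepted boolean spellings are a guess, and the keyword names follow the `llama_cpp.Llama` constructor as commonly documented, so check them against your installed llama-cpp-python version.

```python
import os
from typing import Any, Mapping, Optional


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}  # assumed spellings


def _env_int(env: Mapping[str, str], name: str, default: int) -> int:
    raw = env.get(name, "").strip()
    return int(raw) if raw else default


def _env_threads(env: Mapping[str, str], name: str) -> Optional[int]:
    # "auto" (or unset) maps to None, leaving thread selection to the library.
    raw = env.get(name, "auto").strip().lower()
    return None if raw in {"", "auto"} else int(raw)


def llama_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Translate BLITZKODE_* variables into loader keyword arguments."""
    return {
        "model_path": env.get("BLITZKODE_MODEL_PATH", "blitzkode.gguf"),
        "n_ctx": _env_int(env, "BLITZKODE_N_CTX", 2048),
        "n_gpu_layers": _env_int(env, "BLITZKODE_GPU_LAYERS", 0),
        "n_threads": _env_threads(env, "BLITZKODE_THREADS"),
        "n_threads_batch": _env_threads(env, "BLITZKODE_THREADS_BATCH"),
        "n_batch": _env_int(env, "BLITZKODE_BATCH", 256),
        "n_ubatch": _env_int(env, "BLITZKODE_UBATCH", 128),
        "use_mmap": _env_bool(env, "BLITZKODE_USE_MMAP", True),
        "use_mlock": _env_bool(env, "BLITZKODE_USE_MLOCK", False),
        "offload_kqv": _env_bool(env, "BLITZKODE_OFFLOAD_KQV", True),
    }


defaults = llama_kwargs_from_env({})
gpu = llama_kwargs_from_env(
    {"BLITZKODE_GPU_LAYERS": "-1", "BLITZKODE_THREADS": "8", "BLITZKODE_USE_MMAP": "false"}
)
print(defaults["n_ctx"], gpu["n_gpu_layers"])
```

With llama-cpp-python installed, the result could be passed as `Llama(**llama_kwargs_from_env())`; the sketch itself does not import the library.
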
Latest local GGUF smoke evaluation was run with:
```bash
python scripts/evaluate_model.py
```

Runtime: CPU, `n_ctx=2048`, `threads=8`, `batch=256`, `gpu_layers=0`.

| Eval case | Result | Notes |
|---|---|---|
| Python factorial with negative-input handling | ✅ Pass | Generated correct iterative code with ValueError |
| Iterative binary search | ✅ Pass | Generated loop-based search returning index or -1 |
| SQL top users by order count | ✅ Pass | Generated JOIN, GROUP BY, ORDER BY, and LIMIT 5 |
| Unknown fictional API uncertainty | ❌ Fail | Raw model hallucinated a plausible signature; API guardrails block this pattern in /generate and /generate/stream |
Summary: 3 / 4 passed (75%). Full output is in docs/evaluation_results.json.
Evaluation-of-evaluation: this is a small deterministic smoke test for regression tracking, not a full benchmark. Stronger evaluation should add executable unit tests for generated code and benchmark-style suites such as HumanEval/MBPP-like tasks.
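
One way to approach the "executable unit tests for generated code" idea is sketched below: execute a candidate answer in a fresh namespace and run assertions against the function it defines. This is a hypothetical helper, not part of `scripts/evaluate_model.py`, and the sample source is hand-written rather than real model output. Because `exec` runs arbitrary code, model output should only be executed inside a sandbox or disposable container.

```python
from typing import Callable


def run_generated_code(source: str, check: Callable[[dict], None]) -> bool:
    """Return True if the code executes and the check raises no exception."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # UNSAFE outside a sandbox: runs arbitrary code
        check(namespace)
    except Exception:
        return False
    return True


def check_factorial(namespace: dict) -> None:
    """Assertions mirroring the factorial case in the results table."""
    factorial = namespace["factorial"]
    assert factorial(0) == 1
    assert factorial(5) == 120
    try:
        factorial(-1)
    except ValueError:
        return
    raise AssertionError("negative input should raise ValueError")


# Hand-written stand-in for a model response.
candidate = '''
def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
'''

passed = run_generated_code(candidate, check_factorial)
print("pass" if passed else "fail")
```

A real harness would also need a timeout, since generated code can loop forever.
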
BlitzKode was developed through staged LoRA fine-tuning and export:
| Stage | Script | Purpose |
|---|---|---|
| SFT v1 | `scripts/train_sft.py` | Curated coding examples |
| Reward-SFT | `scripts/train_reward_sft.py` | Heuristic quality continuation |
| DPO | `scripts/train_dpo.py` | Preference pairs for better answers and fewer hallucinations |
| Resource-aware SFT | `scripts/train_available.py` | Practical LoRA training run |
| Export | `scripts/export_production.py` | Merge/export to GGUF |

Rebuild from scratch:
```bash
pip install -r requirements-training.txt
python scripts/build_full_dataset.py
python scripts/train_available.py \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --quantization none \
  --dataset datasets/raw/blitzkode_full_training.json \
  --max-steps 100 \
  --seq-len 384 \
  --batch-size 1 \
  --grad-accum 8
python scripts/export_production.py
```

Repository layout:

```text
BlitzKode/
  server.py                      FastAPI inference and search API
  blitzkode.gguf                 Local model artifact, ignored by git
  scripts/evaluate_model.py      Lightweight GGUF evaluation harness
  docs/evaluation_results.json   Latest smoke-eval output
  tests/test_server.py           Backend endpoint tests
  docs/                          Architecture, deployment, and HF model-card docs
  datasets/                      Dataset manifests and raw data location
  Dockerfile                     Python runtime image
  docker-compose.yml             CPU/GPU service definitions
```

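
As a rough sizing check for the rebuild command above, the flag values imply a small training budget. The arithmetic below assumes a single device, that `--max-steps` counts optimizer steps, and that every sequence is at most `--seq-len` tokens; the actual semantics of `scripts/train_available.py` are not documented here and may differ.

```python
# Values taken from the rebuild command in this README.
batch_size = 1
grad_accum = 8
max_steps = 100
seq_len = 384

effective_batch = batch_size * grad_accum     # sequences per optimizer step
sequences_seen = effective_batch * max_steps  # sequences over the whole run
token_upper_bound = sequences_seen * seq_len  # tokens if every sequence is full length

print(effective_batch, sequences_seen, token_upper_bound)  # 8 800 307200
```

Under these assumptions the run touches at most about 0.3M tokens, which is consistent with describing it as a practical, resource-aware run rather than a full fine-tune.
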
```bash
python -m pytest tests/ -v
python -m ruff check .
python -m ruff format --check .
python -m mypy server.py --ignore-missing-imports
python -m compileall server.py tests scripts
```

Run the model smoke evaluation separately because it requires the GGUF artifact:

```bash
python scripts/evaluate_model.py
```

- This is a small local model; review and test generated code.
- Raw GGUF prompting can hallucinate fictional APIs; the server includes guardrails for common unknown-signature prompts.
- Default context is 2,048 tokens; increase `BLITZKODE_N_CTX` only if memory allows.
- The optional research endpoint uses web snippets as untrusted context and should not be treated as authoritative without verification.

MIT. See LICENSE. Also comply with the upstream Qwen2.5 license when redistributing derived model weights.
Created by Sajad (neuralbroker).