I don't follow tutorials. I derive equations.
I don't ship demos. I ship systems that survive production.
I don't guess. I benchmark, instrument, and iterate.
I'm an AI engineer who builds from first principles → scratch implementation → hardened deployment. Every system I create is mathematically grounded, rigorously tested, and engineered for failure resilience.
```python
class Amman:
    stack = ["LLMs", "RAG", "MLOps", "Transformers", "Evaluation"]
    languages = ["Python", "SQL", "Bash"]
    approach = "derive → implement from scratch → harden → benchmark → ship"
    based_in = "India"
    available = "remote, globally"
    building = True  # always
```

ML-from-Scratch  ·  completed
10 ML algorithms. Pure NumPy. Zero sklearn. Every formula derived by hand.
Before touching any framework, I sat down with the mathematics and built everything from scratch: linear models, kernel methods, ensemble methods, dimensionality reduction. Each algorithm comes with a full derivation document, visual comparisons against sklearn, and benchmarks proving identical outputs.
This repo exists to prove one thing: I understand the math, not just the API.
```
algorithms → Linear Regression (OLS + gradient descent + Ridge + Lasso)
             Logistic Regression (binary + multiclass + regularized)
             K-Nearest Neighbors (classification + regression)
             K-Means Clustering (elbow method + silhouette analysis)
             Naive Bayes (Gaussian + Multinomial + Bernoulli)
             Decision Trees (CART + pruning)
             Random Forests (bagging + feature importance)
             Support Vector Machines (linear + kernel)
             Principal Component Analysis
             Gradient Boosting
testing    → 100% unit tested against sklearn; identical outputs verified
docs       → every algorithm has derivation → intuition → code → result
```

Python · NumPy · Matplotlib · Math-first · Unit tested
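The "identical outputs" claim reduces to checks like the following: derive the estimator, implement it two independent ways, and assert they agree. This is a minimal illustrative sketch for OLS (not the repo's actual code), where minimizing the squared error gives the normal equations XᵀXw = Xᵀy:

```python
import numpy as np

# synthetic noiseless data, so both solvers should recover true_w exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

# closed form: solve the normal equations X^T X w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on the same loss; the gradient is (2/n) X^T (Xw - y)
w_gd = np.zeros(3)
for _ in range(5000):
    w_gd -= 0.1 * (2 / len(y)) * X.T @ (X @ w_gd - y)

# both routes land on the same weights
assert np.allclose(w_closed, true_w, atol=1e-6)
assert np.allclose(w_gd, w_closed, atol=1e-6)
```

The same pattern (closed form vs. iterative vs. sklearn, asserted equal) generalizes to the other algorithms.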
```
┌──────────────────────────────────────────────────┐
│  PRODUCTION SYSTEMS · ACTIVE DEVELOPMENT · LIVE  │
└──────────────────────────────────────────────────┘
```
🔥  Agentic-Ai-Production-System  ·  active
A multi-agent orchestration system built for production: not a demo, not a prototype.
Most "agentic AI" projects are chains wrapped in Streamlit. This is different. It's a full production system with instrumentation, safety, evaluation gates, and a feedback loop that fine-tunes the model on real user interactions.
```
Request ──▶ FastAPI ──▶ LangGraph Orchestrator
                                   │
             ┌─────────────────────┼─────────────────────┐
             ▼                     ▼                     ▼
          Planner              Executor             Reflector
             │                     │                     │
             └─────────────────────┼─────────────────────┘
                                   │
             ┌─────────────────────┼─────────────────────┐
             ▼                     ▼                     ▼
       RAG Pipeline          Tool Sandbox             Safety
      (hybrid search)          (Docker)               Guards
             │                     │                     │
             └─────────────────────┼─────────────────────┘
                                   │
              Prometheus ── Langfuse ── Audit Logs (S3)
                                   │
                         Human Approval Gate
                                   │
                   LoRA Fine-tuning on Feedback
```
What makes it production-grade:

- ✓ Circuit breakers on every external call → no silent failures
- ✓ PII scrubbing before any data touches the LLM
- ✓ RAGAS evaluation runs on every PR → merge blocked on faithfulness regression
- ✓ Human-in-the-loop approval gate before irreversible tool actions
- ✓ Every interaction logged to S3 for compliance and replay
- ✓ LoRA fine-tuning loop trained on collected thumbs-up/down feedback
- ✓ Multi-tenant rate limiting with a token bucket per API key
LangGraph FastAPI Qdrant Docker Kubernetes RAGAS Prometheus Langfuse LoRA Redis
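The per-key token bucket mentioned above fits in a few lines. This is an illustrative sketch only (the names `TokenBucket` and `allow_request` are mine, not the repo's); a real multi-replica deployment would back the buckets with Redis rather than process memory:

```python
import time

class TokenBucket:
    """Token-bucket limiter: bursts up to `capacity`, refills `rate` tokens/sec."""

    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}  # one bucket per API key

def allow_request(api_key: str) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate=5.0, capacity=10))
    return bucket.allow()
```

The bucket allows short bursts up to `capacity` while enforcing `rate` as the sustained average, which is why it is the usual choice for per-tenant API limits.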
🔥  LLM-Gateway-Platform  ·  active
A routing layer that sits in front of any LLM provider. Optimized routing. Semantic caching. Automatic fallback.
The problem: you're calling OpenAI directly, paying full price on cacheable queries, and one provider outage takes your whole system down. This gateway solves all three.
```python
# route by strategy: the gateway picks the optimal provider automatically
response = gateway.complete(prompt, strategy="cost")   # cheapest model available
response = gateway.complete(prompt, strategy="speed")  # lowest p99 latency
response = gateway.complete(prompt, strategy="safe")   # circuit-broken fallback chain

# semantic cache: similar queries return the cached response
# "what is gradient descent?" and "explain gradient descent" → same cache hit
```

How the routing works:
```
Incoming Request
       │
       ▼
Auth + Rate Limit
       │
       ▼
Semantic Cache ── HIT ──▶ Return cached response
       │
      MISS
       │
       ▼
Router (cost / speed / safe)
       │
       ├──▶ OpenAI
       ├──▶ Anthropic
       ├──▶ Together AI
       └──▶ Local vLLM
       │
       ▼
Circuit Breaker ── OPEN ──▶ Fallback chain
       │
     CLOSED
       │
       ▼
Response + Metrics (Prometheus) + Traces (OpenTelemetry)
```
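The semantic-cache step can be sketched like this (my own toy version, not the gateway's code): embed each query, and serve a stored answer when cosine similarity to a previously seen query clears a threshold. The `embed` callable stands in for a real embedding model:

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache keyed on embedding cosine similarity."""

    def __init__(self, embed, threshold: float = 0.8):
        self.embed, self.threshold = embed, threshold
        self.keys: list[np.ndarray] = []   # query embeddings
        self.values: list[str] = []        # cached answers

    def get(self, query: str):
        if not self.keys:
            return None
        q = self.embed(query)
        sims = np.array(
            [k @ q / (np.linalg.norm(k) * np.linalg.norm(q)) for k in self.keys]
        )
        best = int(sims.argmax())
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str):
        self.keys.append(self.embed(query))
        self.values.append(answer)
```

In production the linear scan over `self.keys` would be replaced by an approximate nearest-neighbor index (e.g. in Redis or a vector store), but the hit/miss logic is the same.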
Chaos engineering included: a test suite that randomly kills providers mid-run, verifies circuit breakers open, and confirms fallback activates within SLA. Because a gateway you haven't deliberately broken isn't a gateway you can trust.
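The circuit-breaker pattern those chaos tests exercise looks roughly like this (a minimal sketch with names of my choosing; real implementations add per-provider state, metrics, and jittered cooldowns):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; after `cooldown` seconds
    it lets one probe call through (half-open) to test recovery."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open")  # fail fast, trigger fallback
            self.opened_at = None  # half-open: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast while open is the point: callers get an immediate error they can route to the fallback chain instead of stacking up timeouts against a dead provider.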
FastAPI Redis OpenTelemetry Grafana Locust Terraform Kubernetes
🔥  GPT-Engineer-Kit  ·  active
GPT-2 implemented twice. Once for clarity, once for performance. BPE tokenizer from scratch. Benchmarked.
Two complete implementations in one repo:
```
legacy/     clean, annotated, readable
            every operation mapped to the paper
            for understanding the architecture

optimized/  FlashAttention v2
            Rotary Position Embeddings (RoPE)
            PagedAttention-style KV cache
            SwiGLU MLP
            torch.compile
            FP8 quantization stubs
            FSDP distributed training wrapper
```
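As a toy illustration of the KV-cache idea listed above: at each decode step, the new key/value pair is appended to the cache and the query attends over all cached positions, so past keys are never recomputed. Single-head, NumPy, illustrative only:

```python
import numpy as np

def attend_with_cache(q, k_new, v_new, cache):
    """One decode step of single-head attention with a KV cache.

    q, k_new, v_new: (d,) vectors for the current position.
    cache: {"k": [...], "v": [...]} lists of past (d,) vectors, mutated in place.
    """
    cache["k"].append(k_new)
    cache["v"].append(v_new)
    K = np.stack(cache["k"])             # (t, d): all keys so far
    V = np.stack(cache["v"])             # (t, d): all values so far
    scores = K @ q / np.sqrt(len(q))     # (t,) scaled dot-product scores
    w = np.exp(scores - scores.max())    # numerically stable softmax
    w /= w.sum()
    return w @ V                         # (d,) attention output
```

This turns per-token decode cost from quadratic (recomputing all keys) to linear in sequence length; PagedAttention-style caches additionally manage the `K`/`V` storage in fixed-size blocks.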
A BPE tokenizer built from scratch (merge rules, vocabulary, encode/decode) before touching HuggingFace tokenizers.
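The core of BPE merge learning fits in a short sketch (illustrative only; real tokenizers add end-of-word markers, byte-level fallback, and frequency-weighted corpora): count adjacent symbol pairs, merge the most frequent pair everywhere, repeat.

```python
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words (each starts as characters)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair across the corpus
        pairs = Counter()
        for sym in corpus:
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        corpus = [_apply(sym, best, merged) for sym in corpus]
    return merges

def _apply(sym, pair, merged):
    """Replace every occurrence of `pair` in a symbol sequence with `merged`."""
    out, i = [], 0
    while i < len(sym):
        if i + 1 < len(sym) and (sym[i], sym[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(sym[i])
            i += 1
    return out
```

Encoding then replays the learned merges in order on new text; decoding is just concatenation, which is why BPE round-trips losslessly.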
Also includes stubs for alternative architectures: Mamba (selective SSMs), Hyena operators, RWKV, for when attention isn't the answer.
```
benchmarks vs nanoGPT:
  perplexity  → WikiText-2, measured at every checkpoint
  throughput  → tokens/sec at batch sizes 1, 8, 32, 128
  memory      → peak GPU memory per optimization added
  compilation → torch.compile speedup measured independently
```
PyTorch CUDA FlashAttention FSDP FP8 torch.compile Mamba RWKV
⚡  LLM-Evaluation-Framework  ·  building
Evaluate any LLM system in 3 lines. Block any deployment that regresses.
Most teams deploy LLMs and hope quality holds. This framework makes quality a hard gate.
```python
from llm_eval import Evaluator

# run evaluation
results = Evaluator(
    metrics=["faithfulness", "hallucination", "relevancy", "answer_correctness"]
).run(predictions, references)

# block CI on regression
results.assert_threshold(faithfulness=0.85, hallucination=0.05)

# compare two model versions
dashboard.compare(results_v1, results_v2)  # opens a Streamlit diff view
```

What it evaluates:
```
offline → RAGAS (faithfulness, context recall, answer relevancy)
          DeepEval (GEval, answer correctness, hallucination detection)
          custom metrics (tool-call accuracy, cost per query, latency)
online  → stream real queries to Kafka/S3
          monitor input distribution drift
          log real-world failure cases
ci/cd   → assert_threshold() blocks merges on regression
          nightly benchmark runs with variance analysis
          Streamlit dashboard to compare any two model versions
```
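One plausible semantics for an `assert_threshold()`-style gate (my sketch of the idea, not the framework's implementation): higher-is-better metrics must clear a floor, lower-is-better metrics like hallucination must stay under a ceiling, and any violation raises so CI fails.

```python
def assert_threshold(scores: dict[str, float], **limits: float) -> None:
    """Raise AssertionError if any metric violates its limit.

    For metrics in `lower_is_better`, the limit is a ceiling;
    for everything else, it is a floor.
    """
    lower_is_better = {"hallucination"}
    failures = []
    for metric, limit in limits.items():
        score = scores[metric]
        ok = score <= limit if metric in lower_is_better else score >= limit
        if not ok:
            failures.append(f"{metric}={score:.3f} vs limit {limit}")
    if failures:
        raise AssertionError("quality gate failed: " + ", ".join(failures))
```

Wired into CI, a raised `AssertionError` is all it takes to block a merge, which is the whole "quality as a hard gate" idea.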
Why this closes the loop: this framework runs against every other repo I build. The agentic system is evaluated here. The gateway is benchmarked here. The GPT kit's generations are scored here. One place to know if quality is holding.
RAGAS DeepEval Streamlit Kafka Prometheus FastAPI Langfuse
────────────────────────────────────────────────────────────
CORE
────────────────────────────────────────────────────────────
Python · PyTorch · NumPy · HuggingFace Transformers
LangChain · LangGraph · FastAPI · Pydantic
────────────────────────────────────────────────────────────
LLM ENGINEERING
────────────────────────────────────────────────────────────
Fine-tuning (LoRA · QLoRA) · RLHF · RAG Pipelines
Prompt Engineering · LLM-as-judge · Speculative Decoding
FlashAttention · RoPE · KV Cache · FP8 Quantization
────────────────────────────────────────────────────────────
EVALUATION
────────────────────────────────────────────────────────────
RAGAS · DeepEval · Custom Metrics · Langfuse · Prometheus
────────────────────────────────────────────────────────────
VECTOR SEARCH
────────────────────────────────────────────────────────────
FAISS · Qdrant · Pinecone · Weaviate
Dense + Sparse + Hybrid Retrieval · Cross-encoder Reranking
────────────────────────────────────────────────────────────
MLOPS & INFRA
────────────────────────────────────────────────────────────
Docker · Kubernetes · Helm · GitHub Actions · Terraform
Prometheus · Grafana · OpenTelemetry · Locust
────────────────────────────────────────────────────────────
CLOUD
────────────────────────────────────────────────────────────
AWS → EC2 · S3 · Lambda · SageMaker · EKS
────────────────────────────────────────────────────────────
DATA
────────────────────────────────────────────────────────────
PostgreSQL · MongoDB · Redis · Celery · Kafka
────────────────────────────────────────────────────────────
MATHEMATICS
────────────────────────────────────────────────────────────
Calculus · Linear Algebra · Probability Theory · Statistics
I contribute to the ecosystem, not just consume it. Every repo I build is designed to be forked, extended, and built on, with derivations others can follow, benchmarks others can reproduce, and post-mortems others can learn from.
Actively looking to contribute to:
HuggingFace Transformers → evaluation, documentation, reproducibility
RAGAS → custom metrics, edge-case coverage
DeepEval → metric implementations, CI integrations
vLLM → inference optimization experiments
LangGraph → production patterns, reliability improvements
The goal: leave every project I touch more testable, more documented, and more honest about its failure modes than I found it.
Every repo I ship clears five gates before merge:
```
┌────────────────────────────────────────────────────┐
│                                                    │
│  01  WHY DOES THIS WORK?                           │
│      Mathematical derivation lives in docs/        │
│      No black boxes. No "trust the framework."     │
│                                                    │
│  02  HOW DOES IT WORK?                             │
│      Scratch implementation before any library     │
│      If I can't write it in NumPy, I don't use it  │
│                                                    │
│  03  DOES IT ACTUALLY WORK?                        │
│      Benchmarks with real numbers, not vibes       │
│      Tested against reference implementations      │
│                                                    │
│  04  WHAT BROKE?                                   │
│      Post-mortems documented in docs/failures.md   │
│      Failures are first-class content, not hidden  │
│                                                    │
│  05  CAN IT HANDLE PRODUCTION?                     │
│      Failure modes mapped. Fallbacks implemented.  │
│      Load tested. Circuit breakers in place.       │
│                                                    │
└────────────────────────────────────────────────────┘
```
Some projects live in private repos. Some are in closed beta. Some are being stress-tested with real users before the world sees them.
```
┌──────────────────────────────────────────────────────────┐
│  🎯 CURRENT FOCUS: PRODUCTION-GRADE AI PLATFORM          │
│                                                          │
│  • Multi-tenant architecture with usage metering         │
│  • Real-user feedback loops driving model iteration      │
│  • End-to-end observability: logs, traces, metrics       │
│  • Auth, billing, and rate limiting baked in from day 1  │
│  • Frontend that doesn't suck, because UX matters        │
│                                                          │
│  Status: 🚧 Private beta · Invite-only · Real traffic    │
└──────────────────────────────────────────────────────────┘
```
Research Explorations (prototypes)
Ideas I'm stress-testing in isolated repos. Not public yet. Not polished. But mathematically sound.
```
🧪 multimodal-data-interpreter
├─ PDF + Excel + images + audio → unified query interface
├─ Natural language → SQL / Python / charts
├─ Auto-dashboard generation with live data refresh
└─ Scalable backend: DuckDB/Spark for larger-than-RAM datasets

🧪 autonomous-code-reviewer
├─ Agentic PR analysis: bugs, perf, security, style
├─ Test generation + sandboxed execution
├─ Human-in-the-loop approval gates (reusing production patterns)
└─ GitHub API integration + CI/CD hooks

🧪 real-time-meeting-copilot
├─ Live transcription + action-item extraction
├─ Sentiment + engagement analytics
├─ Post-meeting RAG: "What did John say about the deadline?"
└─ Privacy-first: local inference + on-prem LLM fallback
```
These are research prototypes. If they survive benchmarking, hardening, and real-user testing, they'll graduate to production repos.
─────────────────────────────────────────────────────
AMMAN HUSSAIN ANSARI
AI Engineer · MLOps · Open Source Contributor
India · Remote · Globally Available
─────────────────────────────────────────────────────

