Standards and patterns for integrating with the LLM infrastructure module (infrastructure/llm/). This guide covers Ollama integration, response modes, validation, and manuscript review patterns.
flowchart LR
LLM[infrastructure/llm/]
LLM --> INIT[__init__.py<br/>Public API exports]
LLM --> CORE[core<br/>LLMClient · configuration · context]
LLM --> TPL[templates<br/>ResearchTemplate system]
LLM --> VAL[validation<br/>validate_complete · is_off_topic · checks]
LLM --> UT[utils/ollama.py<br/>Model discovery & Ollama helpers]
LLM --> CLI[cli/main.py<br/>Command-line interface]
LLM --> DOCS[AGENTS.md · README.md · SKILL.md]
classDef d fill:#0f172a,stroke:#0f172a,color:#fff
classDef code fill:#1e3a8a,stroke:#0f172a,color:#fff
classDef doc fill:#0f766e,stroke:#0f172a,color:#fff
class LLM d
class INIT,CORE,TPL,VAL,UT,CLI code
class DOCS doc
| Class | Purpose | Import |
|---|---|---|
LLMClient |
Main interface for queries | from infrastructure.llm import LLMClient |
OllamaClientConfig |
Configuration management | from infrastructure.llm import OllamaClientConfig |
GenerationOptions |
Per-query generation control | from infrastructure.llm import GenerationOptions |
validate_complete / is_off_topic |
Response validation (function-based; the old OutputValidator class was removed) |
from infrastructure.llm import validate_complete, is_off_topic |
from infrastructure.llm import OllamaClientConfig, LLMClient
# Load from environment (recommended)
config = OllamaClientConfig.from_env()
client = LLMClient(config)Environment Variables:
| Variable | Default | Description |
|---|---|---|
OLLAMA_HOST |
http://localhost:11434 |
Ollama server URL |
OLLAMA_MODEL |
gemma3:4b |
Default model (quality-oriented local default) |
LLM_TEMPERATURE |
0.7 |
Generation temperature |
LLM_MAX_TOKENS |
2048 |
Max tokens per response |
LLM_SEED |
None |
Seed for reproducibility |
LLM_TIMEOUT |
60 |
Request timeout (seconds) |
# Custom configuration
config = OllamaClientConfig(
base_url="http://localhost:11434",
default_model="gemma3:4b",
temperature=0.7,
max_tokens=2048,
seed=42,
system_prompt="You are an expert researcher.",
)
# Override specific settings
fast_config = config.with_overrides(temperature=0.0, timeout=30.0)| Mode | Use When | Token Limit | Method |
|---|---|---|---|
| Short | Quick answers, definitions | < 200 | query_short() |
| Standard | General queries | Default config | query() |
| Long | Detailed analysis, reviews | > 4096 | query_long() |
| Structured | JSON output needed | Varies | query_structured() |
| Raw | No system prompt needed | Varies | query_raw() |
# Short response (< 200 tokens)
answer = client.query_short("What is machine learning?")
# Standard response with context
response = client.query("Explain this concept...")
# Long detailed response (> 4096 tokens)
analysis = client.query_long("Provide analysis...")
# Structured JSON response
result = client.query_structured(
"Extract key points...",
schema={"type": "object", "properties": {...}}
)
# Raw (no system prompt)
completion = client.query_raw("Complete: The quick brown fox")from infrastructure.llm import GenerationOptions
# Deterministic output
opts = GenerationOptions(
temperature=0.0,
seed=42,
)
response = client.query(prompt, options=opts)
# Full control
opts = GenerationOptions(
temperature=0.7, # Creativity (0.0=deterministic, 2.0=creative)
max_tokens=4096, # Max output tokens
top_p=0.9, # Nucleus sampling
top_k=40, # Top-k sampling
seed=42, # Reproducibility seed
stop=["END", "STOP"], # Stop sequences
format_json=True, # Force JSON output
repeat_penalty=1.1, # Repetition penalty
num_ctx=8192, # Context window
)- Per-query
GenerationOptions(highest priority) OllamaClientConfigdefaults- Environment variables (via
from_env())
# Top-level convenience entrypoints
from infrastructure.llm import validate_complete, is_off_topic
# Individual checks live in the validation subpackage
from infrastructure.llm.validation import (
validate_json,
validate_short_response,
validate_long_response,
validate_structure,
)
# Validate JSON
data = validate_json(response)
# Validate response length
validate_short_response(response) # max_tokens default 150
validate_long_response(response) # min_tokens default 500
# Validate structure against a schema
validate_structure(data, schema)
# Mode-aware validation (ResponseMode is a str-enum; "structured" also works)
validate_complete(response, mode="structured", schema=schema)For manuscript reviews, use the validation functions from the review subpackage:
from infrastructure.llm.review import validate_review_quality
from infrastructure.llm import is_off_topic
# Check for off-topic responses first
if is_off_topic(response):
# Response is confused - retry with format enforcement
pass
# Validate review quality (includes format compliance checks)
is_valid, issues, details = validate_review_quality(response, "executive_summary")
if not is_valid:
for issue in issues:
print(f" Issue: {issue}")
# Check format compliance details
format_info = details.get("format_compliance", {})
if format_info.get("emojis_found"):
print(f" Emojis detected: {format_info['emojis_found']}")Detect when the LLM has gone off-topic:
from infrastructure.llm import is_off_topic
# Patterns detected:
# - Email format (Re:, Dear, Subject:, From:, To:)
# - AI assistant phrases ("I'm happy to help", "As an AI")
# - Code responses when not expected
# - Casual greetings (Hi, Hello)
if is_off_topic(response):
# Reset context and retry
client.reset()
response = retry_with_format_enforcement(prompt)The validation system uses structure-only validation for general-purpose use. Emojis and tables are allowed to give LLMs formatting flexibility.
from infrastructure.llm.validation import (
detect_conversational_phrases,
check_format_compliance,
is_off_topic,
)
# Check for off-topic content first (most critical)
if is_off_topic(response):
client.reset() # Reset context
# Retry with reinforced prompt
# Check conversational phrases (warning only)
phrases = detect_conversational_phrases(response)
# Format compliance check (now only conversational phrases)
is_compliant, issues, details = check_format_compliance(response)
# Issues are logged as warnings, not hard failures
if not is_compliant:
logger.warning(f"Format warnings: {', '.join(issues)}")Validation Approach:
| Check | Type | Action on Fail |
|---|---|---|
| Off-topic content | Critical | Retry with context reset |
| Word count below minimum | Structural | Retry |
| Missing required headers | Structural | Retry |
| Conversational phrases | Warning | Log only |
| Emojis | Allowed | No action |
| Tables | Allowed | No action |
Allowed Formats:
- Emojis (✅ 🔑 📊 etc.) - useful for visual emphasis
- Markdown tables - useful for structured data
- Any valid markdown formatting
from infrastructure.llm import LLMClient, OllamaClientConfig, GenerationOptions
# Initialize client
config = OllamaClientConfig.from_env()
client = LLMClient(config)
# Generation options for reviews
options = GenerationOptions(
temperature=0.3, # Low temperature for consistency
max_tokens=4096, # Allow long responses
seed=42, # Reproducibility
)
# Use query() directly with detailed templates
# Avoid query_long() which adds conflicting instructions
response = client.query(template_prompt, options=options)When off-topic content is detected, the retry includes a reinforced prompt:
# On retry after off-topic detection
if had_off_topic:
current_prompt = (
"IMPORTANT: You must review the ACTUAL manuscript text provided below. "
"Do NOT generate hypothetical content, generic book descriptions, or "
"unrelated topics. Your review must reference specific content from "
"the manuscript.\n\n" + prompt
)
client.reset() # Always reset context for off-topic retriesThe prompt templates now use a manuscript-first structure with clear boundary markers:
=== MANUSCRIPT BEGIN ===
{full manuscript text}
=== MANUSCRIPT END ===
TASK: [specific review task]
REQUIREMENTS: [structural requirements]
This structure ensures the LLM focuses on the actual manuscript content.
def validate_review_quality(response: str, review_type: str) -> Tuple[bool, List[str]]:
issues = []
# Always check off-topic first
if is_off_topic(response):
issues.append("Response appears off-topic or confused")
return False, issues
# Type-specific validation
if review_type == "executive_summary":
# Check for required sections
headers = ["Overview", "Key Contributions", "Methodology", "Results", "Significance"]
found = sum(1 for h in headers if h.lower() in response.lower())
if found < 3:
issues.append(f"Missing expected structure (found {found}/5)")
elif review_type == "quality_review":
# Check for scoring
if not re.search(r'\*\*Score:\s*\d\*\*|Score:\s*\d', response, re.I):
issues.append("Missing scoring rubric")
elif review_type == "improvement_suggestions":
# Check for priority sections
priorities = ["high priority", "medium priority", "low priority"]
found = sum(1 for p in priorities if p in response.lower())
if found < 2:
issues.append(f"Missing priority sections (found {found}/3)")
elif review_type == "translation":
# Check for English abstract and translation sections
has_english = any([
"english abstract" in response.lower(),
"## english" in response.lower(),
])
has_translation = any([
"translation" in response.lower(),
"chinese" in response.lower(),
"hindi" in response.lower(),
"russian" in response.lower(),
])
if not has_english:
issues.append("Missing English abstract section")
if not has_translation:
issues.append("Missing translation section")
return len(issues) == 0, issuesfrom infrastructure.core.exceptions import (
LLMConnectionError,
LLMTemplateError,
ValidationError,
ContextLimitError,
)
try:
response = client.query(prompt)
except LLMConnectionError as e:
# Ollama not running or unreachable
logger.error(f"Connection failed: {e.context}")
# Fallback or graceful degradation
except ContextLimitError as e:
# Context too long
client.reset()
response = client.query(prompt) # Retry with fresh context
except ValidationError as e:
# Response validation failed
logger.warning(f"Validation issue: {e}")from infrastructure.llm import OllamaClientConfig
from infrastructure.llm.utils import is_ollama_running, select_best_model
# Check before starting
if not is_ollama_running():
print("Ollama not running - start with: ollama serve")
sys.exit(1)
# Select best available model
model = select_best_model()
if not model:
print("No Ollama models found - pull with: ollama pull gemma3:4b")
sys.exit(1)
config = OllamaClientConfig(default_model=model)Following the project's No Mocks Policy, LLM tests use:
- Pure Logic Tests - Configuration, validation, context (no network)
- Integration Tests - Marked with
@pytest.mark.requires_ollama
import pytest
# Pure logic test (no Ollama required)
def test_config_from_env(clean_llm_env):
"""Test configuration loading."""
os.environ["OLLAMA_HOST"] = "http://test:11434"
config = OllamaClientConfig.from_env()
assert config.base_url == "http://test:11434"
# Integration test (requires Ollama)
@pytest.mark.requires_ollama
class TestLLMIntegration:
@pytest.fixture(autouse=True)
def check_ollama(self, ensure_ollama_for_tests):
assert ensure_ollama_for_tests
def test_query(self):
client = LLMClient()
response = client.query("Hello")
assert len(response) > 0# Pure logic tests only (fast, no Ollama required)
uv run pytest tests/infra_tests/llm/ -m "not requires_ollama" -v
# All tests (requires running Ollama)
uv run pytest tests/infra_tests/llm/ -v
# Integration tests only
uv run pytest tests/infra_tests/llm/ -m requires_ollama -v
# With coverage
uv run pytest tests/infra_tests/llm/ --cov=infrastructure/llm --cov-report=html# Stream response to stdout
for chunk in client.stream_query("Write a poem about AI"):
print(chunk, end="", flush=True)
print() # Final newlineopts = GenerationOptions(temperature=0.9, seed=42)
for chunk in client.stream_query("Creative writing...", options=opts):
print(chunk, end="", flush=True)# Short mode streaming
for chunk in client.stream_short("Quick answer..."):
process_chunk(chunk)
# Long mode streaming
for chunk in client.stream_long("Detailed explanation..."):
process_chunk(chunk)| Template | Purpose | Parameters |
|---|---|---|
summarize_abstract |
Summarize research abstracts | text |
literature_review |
Synthesize multiple summaries | summaries |
code_doc |
Generate Python docstrings | code |
data_interpret |
Interpret statistical results | stats |
manuscript_translation_abstract |
Generate technical abstract and translate to target language | text, target_language |
# Apply a template
summary = client.apply_template(
"summarize_abstract",
text=abstract_text
)
# List available templates
from infrastructure.llm.templates import TEMPLATES # noqa: docs-lint
print(list(TEMPLATES.keys()))from infrastructure.llm.templates import ResearchTemplate
class MethodologyReview(ResearchTemplate):
template_str = """Analyze the methodology of this research:
${content}
Provide your analysis with these sections:
## Strengths
## Weaknesses
## Suggestions for Improvement
"""
# Register and use
template = MethodologyReview()
result = template.render(content=methodology_text)The system includes a translation template for generating technical abstracts in multiple languages:
from infrastructure.llm.templates import ManuscriptTranslationAbstract, TRANSLATION_LANGUAGES # noqa: docs-lint
# Supported languages
TRANSLATION_LANGUAGES = {
"zh": "Chinese (Simplified)",
"hi": "Hindi",
"ru": "Russian",
}
# Generate translation
template = ManuscriptTranslationAbstract()
prompt = template.render(
text=manuscript_text,
target_language=TRANSLATION_LANGUAGES["zh"]
)
response = client.query(prompt, options=options)Configuration:
Translations are configured in projects/{name}/manuscript/config.yaml:
llm:
translations:
enabled: true
languages:
- zh # Chinese (Simplified)
- hi # Hindi
- ru # RussianThe translation template generates:
- A 200-400 word technical abstract in English
- A translation in the target language
- Both sections with proper headers
client = LLMClient()
# Multi-turn conversation (context maintained)
response1 = client.query("What is X?")
response2 = client.query("Can you elaborate?") # Has context from response1# Reset context (clears history, re-injects system prompt)
client.reset()
# Query with fresh context
response = client.query("New topic", reset_context=True)
# Change system prompt
client.set_system_prompt("You are now a different persona...")The most critical failure mode is when the LLM generates content unrelated to the manuscript. Off-topic patterns include:
| Pattern Type | Examples | Action |
|---|---|---|
| AI refusal | "I can't help", "I don't have access" | Retry with context reset |
| Generic content | "This book is", "Chapter 1 deals with" | Retry with reinforced prompt |
| External URLs | http://, https:// links | Retry (indicates hallucination) |
| Unrelated code | CMakeLists.txt, .cpp files | Retry |
| Form/registration | "must be 18 years old", "valid email" | Retry |
LLMs can sometimes get stuck in output loops, repeating the same content. The validation system detects this:
| Unique Ratio | Severity | Action |
|---|---|---|
| > 80% | Normal | Pass validation |
| 50-80% | Warning | Log warning, continue |
| < 50% | Error | Fail validation, retry |
from infrastructure.llm.validation import detect_repetition, deduplicate_sections # noqa: docs-lint
# Check for repetition
has_rep, duplicates, unique_ratio = detect_repetition(response)
if unique_ratio < 0.5:
# Severe repetition - retry with different temperature
client.reset()
response = client.query(prompt, options=GenerationOptions(temperature=0.5))
# Post-process to clean up any remaining repetition
cleaned = deduplicate_sections(response, max_repetitions=2)The validation is automatically integrated into validate_review_quality():
is_valid, issues, details = validate_review_quality(response, "executive_summary")
# Check repetition details
if details.get("repetition", {}).get("has_repetition"):
print(f"Warning: {details['repetition']['unique_ratio']:.0%} unique content")Issue: Response is completely off-topic (hallucinated content)
# Detected by is_off_topic() - checks for:
# - AI refusal phrases ("I can't help")
# - Generic book/guide language
# - External URLs
# - Unrelated programming content
# Resolution: Context reset + reinforced prompt on retry
if had_off_topic:
client.reset()
current_prompt = "IMPORTANT: Review the ACTUAL manuscript..." + promptIssue: Response too short
The validation system uses minimum word counts to ensure substantive reviews:
| Review Type | Minimum Words | Template Target | Notes |
|---|---|---|---|
| executive_summary | 250 | 400-600 | Full manuscript overview |
| quality_review | 300 | 500-700 | Detailed quality assessment |
| methodology_review | 300 | 500-700 | Methods and structure analysis |
| improvement_suggestions | 200 | 500-800 | Focused actionable items |
| translation | 400 | 200-400 (English) + translation | Technical abstract + target language |
# Small models (3B-8B) get 20% lower thresholds automatically
# e.g., improvement_suggestions minimum: 200 → 160 for small modelsIssue: Missing required headers
# Each review type expects specific headers
# Validation is flexible - accepts variations:
# "Overview" OR "Summary" OR "Introduction"
# "Methodology" OR "Methods" OR "Approach"
# Only requires 1+ sections to pass (very flexible)Issue: Conversational AI phrases
# Detected phrases like:
# "Based on the document you shared"
# "I'll help you understand"
# "Let me know if you need more"
# These are logged as warnings only - not retry triggers# Use environment configuration
config = OllamaClientConfig.from_env()
# Use GenerationOptions for per-query control
opts = GenerationOptions(temperature=0.0, seed=42)
# Validate responses before using
is_valid, issues = validate_review_quality(response, review_type)
# Check off-topic before structural validation
if is_off_topic(response):
# Retry immediately
# Use format enforcement on retry
if attempt > 0:
prompt = FORMAT_ENFORCEMENT[type] + original_prompt
# Check connection before long operations
if not is_ollama_running():
handle_no_ollama()# Don't use query_long() with detailed templates (double instructions)
response = client.query_long(detailed_template) # BAD
# Don't skip off-topic detection
is_valid, issues = validate_review_quality(response, type) # Check off-topic first
# Don't ignore validation failures
if not is_valid:
pass # BAD - should retry or handle
# Don't hardcode model names unnecessarily
config = OllamaClientConfig(default_model="some-model") # Use env vars or from_env() instead- Ollama running (
ollama serve) - Model available (
ollama list) - Environment variables configured
- Error handling in place
- Use
query()notquery_long()with detailed templates - Implement off-topic detection
- Add format enforcement for retries
- Validate response structure
- Track best response across retries
- Log validation issues
- Pure logic tests without network
- Deterministic HTTP via
pytest_httpserver(tests/infra_tests/llm/conftest.py,ollama_stub_server.py); default pytest selection uses-m "not requires_ollama" - Integration tests marked
@pytest.mark.requires_ollamafor a real local daemon - No
MagicMock/unittest.mock.patch; use real subprocesses, real stub scripts (e.g.pull_ollama_model(..., which=...), shell stubs intmp_path), or environment-only control - Clean environment fixtures where isolation matters
- Coverage for validation paths that are user-facing
Logging: avoid emitting raw prompts or full manuscript text at INFO; keep detailed content at DEBUG or behind explicit diagnostics (see python_logging.md). Input hygiene lives in infrastructure/llm/core/sanitization.py; post-hoc checks in infrastructure/llm/validation/.
- infrastructure_modules.md - Module development standards
- testing_standards.md - Testing patterns
- error_handling.md - Exception handling
- python_logging.md - Logging standards
- infrastructure/llm/AGENTS.md - Detailed LLM documentation
- infrastructure/llm/README.md - Quick reference
Version: 1.2.0 Last Updated: 2026-04-08 Status: (includes translation feature; testing guidance aligned with pytest_httpserver suite) Maintainer: Template Team