A natural language to JQL (Jira Query Language) generator using RAG (Retrieval-Augmented Generation) with pgvector. Supports multiple LLM backends: local Ollama, vLLM (GPU inference server), Groq cloud, Anthropic Claude direct API, and AWS Bedrock-compatible endpoints. Returns structured JSON with a JQL query, a chart specification, and a plain-text answer. A two-stage router answers general questions immediately without touching the JQL pipeline. Try it live: atlasmind.de
- PostgreSQL with the pgvector extension
- One of the following LLM backends:
  - Ollama running locally with a model loaded (default: qwen2.5:3b-instruct-q4_K_M)
  - A Groq API key (GROQ_API_KEY)
  - A vLLM inference server (VLLM_URL)
  - An Anthropic API key (CLAUDE_API_KEY) for Claude direct
  - An AWS Bedrock-compatible endpoint + bearer token for --model bedrock
- Python 3.12+, uv
uv sync
Run these once against your active Jira profile before starting the server for the first time, or after switching to a new Jira instance:
# Fetch all Jira field metadata and cache locally
uv run python -c "from jira.jira_field_api import fetch_and_save_fields; fetch_and_save_fields()"
# Fetch allowed values for all eligible fields (status, priority, issue types, custom options)
uv run python -c "
import asyncio
from jira.jira_field_api import fetch_and_save_allowed_values
asyncio.run(fetch_and_save_allowed_values())
"
Files are written to data/{domain_slug}/ and used to seed the pgvector tables on next startup. AtlasMind will also fetch them automatically on first run if they are absent.
Set the following environment variables (or rely on the defaults in settings.py):
| Variable | Default | Description |
|---|---|---|
| DATABASE_URL | postgresql://postgres:postgres@localhost:5432/jql_vectordb | pgvector connection string |
| EMBEDDING_MODEL | BAAI/bge-small-en-v1.5 | SentenceTransformer model name |
| LLM_BACKEND | ollama | LLM backend: ollama, groq, vllm, claude, or bedrock (overrides --model when set) |
| JQL_OLLAMA_URL | http://localhost:11434 | Ollama base URL |
| JQL_LOCAL_MODEL | qwen2.5:3b-instruct-q4_K_M | Ollama model to use |
| JQL_OLLAMA_TIMEOUT | 120 | Read timeout in seconds for LLM inference |
| GROQ_API_KEY | — | Groq API key (local dev) |
| GROQ_API_KEY_OCID | — | OCI Vault secret OCID for GROQ_API_KEY (takes priority over GROQ_API_KEY) |
| GROQ_MODEL | meta-llama/llama-4-scout-17b-16e-instruct | Groq model name |
| JQL_ANNOTATION_FILE | data/jira_jql_annotated_queries.md | Path to JQL annotation file |
| MAX_JIRA_RESULTS | 2000 | Maximum number of Jira issues fetched per query (paginated automatically) |
| JQL_MAX_ATTEMPTS | 4 | Total JQL attempts per query: 1 initial + (JQL_MAX_ATTEMPTS − 1) retries on Jira validation errors |
| MAX_INTENT_FIELDS | 5 | Maximum extra fields the LLM may propose per query |
| STANDARD_FIELD_IDS | key,summary,assignee,priority,issuetype,created,resolutiondate | Comma-separated list of Jira field IDs always shown in results; override per project or Docker deployment |
| VALUE_AUTO_CORRECT_THRESHOLD | 0.15 | Cosine distance below which the sanitizer silently auto-corrects a bad value to the nearest known allowed value (e.g. typo correction) |
| VALUE_HINT_THRESHOLD | 0.40 | Cosine distance threshold for JQL value correction; bad values within this distance of a known allowed value are flagged |
| VALUE_HINT_MAX_CANDIDATES | 3 | Maximum candidate values surfaced per field for JQL sanitizer corrections |
| VALUE_PROMPT_MAX_CANDIDATES | 3 | Maximum candidate values injected into the retry prompt as hints for the LLM |
| EMBEDDING_BATCH_SIZE | 256 | Batch size for SentenceTransformer encoding during seeding; higher values reduce seeding time on CPU/GPU |
| MAX_VALUES_FOR_EMBEDDING | 50 | Maximum allowed values embedded per field in jira_field_values. High-cardinality fields (versions, components) are capped here; the in-memory exact-match dict always holds all values so casing correction is unaffected |
| VLLM_URL | — | vLLM server base URL (e.g. http://100.x.x.x:8002) |
| VLLM_FALLBACK | ollama | Backend to use if vLLM is unreachable at startup (ollama, groq, claude, bedrock) |
| VLLM_TIMEOUT | 240 | Read timeout in seconds for vLLM inference |
| VLLM_MAX_TOKENS | — | Max tokens for vLLM responses |
| VLLM_API_KEY | — | API key if the vLLM server requires authentication |
| CLAUDE_API_KEY | — | Anthropic API key (local dev); used when --model claude |
| CLAUDE_API_KEY_OCID | — | OCI Vault secret OCID for CLAUDE_API_KEY (takes priority) |
| CLAUDE_MODEL | claude-sonnet-4-6 | Anthropic model name |
| AWS_BEARER_TOKEN_BEDROCK | — | Bearer token for the Bedrock-compatible endpoint; used when --model bedrock |
| CUSTOM_ENDPOINT | — | Bedrock-compatible API endpoint URL; required for --model bedrock |
| BEDROCK_REGION | custom | Region name passed to boto3 (gateway may override internally) |
| BEDROCK_MODEL | claude-sonnet-4.6 | Model ID sent to the Bedrock endpoint |
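The two value thresholds in the table can be read as bands on cosine distance. The sketch below is only an illustration of that reading: the function names are hypothetical, and whether the real sanitizer treats the boundaries as inclusive or exclusive is an assumption.

```python
import math

AUTO_CORRECT = 0.15  # VALUE_AUTO_CORRECT_THRESHOLD default
HINT = 0.40          # VALUE_HINT_THRESHOLD default


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def classify(distance: float) -> str:
    """Map a distance to the action suggested by the two thresholds."""
    if distance < AUTO_CORRECT:
        return "auto_correct"   # silently replace with the nearest allowed value
    if distance <= HINT:        # boundary handling is an assumption
        return "hint"           # flag and surface as a candidate
    return "no_match"


print(classify(0.05))  # auto_correct
print(classify(0.30))  # hint
print(classify(0.90))  # no_match
```

Lowering VALUE_AUTO_CORRECT_THRESHOLD makes silent corrections rarer; raising VALUE_HINT_THRESHOLD surfaces more distant candidates.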
All modes are accessed through app.py.
Setting env variables on Windows: the VAR=value command inline syntax is Linux/macOS only. On Windows use:
- CMD: set JQL_MAX_ATTEMPTS=5 && uv run python app.py --server
- PowerShell: $env:JQL_MAX_ATTEMPTS=5; uv run python app.py --server
uv run python app.py --query # local Ollama (default)
uv run python app.py --query --model groq # Groq cloud
uv run python app.py --query --model vllm # vLLM inference server
uv run python app.py --query --model claude # Anthropic Claude direct
uv run python app.py --query --model bedrock # AWS Bedrock-compatible endpoint
Starts a Rich terminal loop with the AtlasMind banner. Type a natural language query and press Enter to get JQL and an answer.
[atlasmind]> list open bugs assigned to me
Route : JQL pipeline
JQL : assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC
Chart : {"type": "bar", "x_field": "status", "y_field": "count", "title": "Open bugs by status"}
Answer : Open bugs currently assigned to you
Response time : 2.34s
General questions are answered directly without going through the JQL pipeline:
[atlasmind]> what is the difference between a bug and a task?
Route : General answer
Answer : A bug represents a defect or unexpected behaviour in the software...
Response time : 0.81s
REPL commands:
| Command | Description |
|---|---|
| am help | Show example queries and command list |
| am history | Show query history for this session |
| exit / quit / q / am quit | Exit the REPL |
| Ctrl+C at prompt | Exit cleanly |
| Ctrl+C during query | Interrupt the current query, return to prompt |
uv run python app.py --query "list open bugs assigned to me"
Runs one query, prints JQL and Answer, then exits. Useful for scripting.
uv run python app.py --server # Ollama backend, port 8000
uv run python app.py --server --model groq --port 9000 # Groq backend, port 9000
uv run python app.py --server --model vllm --port 9000 # vLLM backend, port 9000
uv run python app.py --server --model claude # Anthropic Claude direct
uv run python app.py --server --model bedrock # AWS Bedrock-compatible endpoint
Starts the REST API on http://0.0.0.0:8000 by default; pass --port to use a different port.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Liveness check; returns {"status": "ok"} |
| GET | /meta | Server metadata: active model name, LLM backend, and timeout |
| GET | /query | Generate JQL from natural language (query via q URL param) |
| POST | /query | Same as GET but query in request body ({"query": "..."}) |
| POST | /event | Client events: {"event": "cancel", "request_id": "..."} to abort an in-flight query |
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "list open bugs assigned to me"}'
{
"jql": "assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC",
"chart_spec": {"type": "bar", "x_field": "status", "y_field": "count", "title": "Open bugs by status"},
"answer": "Open bugs currently assigned to you"
}
The query router automatically classifies each query as JQL or general. If the router misclassifies a query, or you need to control which Jira search API is used, append a flag to your query:
| Flag | Effect |
|---|---|
| /jql | Forces the JQL pipeline regardless of LLM classification |
| /general | Forces the general answer path, skipping the JQL pipeline |
| /cloud | Forces POST /rest/api/3/search/jql (Jira Cloud API) for this request |
| /server | Forces GET /rest/api/2/search (Jira Server API) for this request |
All flags are stripped from the query before it is sent to the LLM, so they do not affect the generated JQL or answer.
/cloud and /server override the search method and path for the current request only — they do not switch the active profile or change the Jira base URL. Use them when the active profile's jira_type is correct but you need to temporarily force a different API version. Flags can be combined:
[atlasmind]> list open issues in KAFKA /cloud
[atlasmind]> project = KAFKA AND status = Open /raw /cloud
Examples:
[atlasmind]> how many states are there in India /general
Route : General answer
Answer : India has 28 states and 8 union territories.
[atlasmind]> list issues in KAFKA /jql
Route : JQL pipeline
JQL : project = KAFKA ORDER BY created DESC
Overrides work across all LLM backends (Ollama, Groq, vLLM, Claude, Bedrock).
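As a rough illustration of the flag handling described above, the following sketch strips the four documented flags from a query before it would be sent to the LLM. The function name split_flags is hypothetical, and the parsing rules (whitespace-separated tokens, case-insensitive, accepted anywhere in the query) are assumptions rather than AtlasMind's actual behaviour.

```python
# Flag names come from the table above; everything else is an assumption.
KNOWN_FLAGS = {"/jql", "/general", "/cloud", "/server"}


def split_flags(query: str) -> tuple[str, set[str]]:
    """Return the query with known flags removed, plus the set of flags found."""
    kept: list[str] = []
    flags: set[str] = set()
    for token in query.split():
        if token.lower() in KNOWN_FLAGS:
            flags.add(token.lower())
        else:
            kept.append(token)
    return " ".join(kept), flags


text, flags = split_flags("list open issues in KAFKA /jql /cloud")
print(text)           # list open issues in KAFKA
print(sorted(flags))  # ['/cloud', '/jql']
```

Unknown slash-prefixed tokens are left in the query text, so they would still reach the LLM in this sketch.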
Data flow:
- JQL_Embeddings.run() seeds pgvector with (annotation, JQL) pairs parsed from the annotation file
- Jira_Field_Embeddings.run() seeds pgvector with Jira field metadata (name, type, allowed values), auto-fetched from the Jira REST API on first run if the file is absent
- At query time, QueryRouter makes a single fast LLM call to classify the query:
  - General query → answered immediately; no embeddings or Jira API calls
  - JQL query → full RAG pipeline: encode → similarity search → prompt → LLM → Jira API
- The assembled prompt is split on the ## Available Jira Fields marker and sent to the active LLM. For Claude and Bedrock the stable system instructions (before the marker) are sent as a cached system block, billed at ~90% less on subsequent requests in the same session. For Groq the split maps to OpenAI system/user roles. Ollama and vLLM receive the full prompt as a single string. Actual token counts (including cache hits) are logged from every API response
- LLM returns structured JSON with jql, intent_fields, chart_spec, and answer
- intent_fields (LLM-proposed display columns) are resolved via FieldResolver. Exact name matches are tried first; unknown names fall back to embedding similarity search against the field metadata vector store, catching LLM variants like fixVersion → Fix Version/s without any extra LLM call
- JQL is validated by JqlSanitizer before execution. Invalid field values are detected by cosine similarity against the jira_field_values vector store and corrected deterministically; no LLM call is needed for known-value fields
- The validated JQL is executed against the Jira REST API. On failure, the Jira error is appended to the accumulated retry prompt and the LLM is asked to correct the JQL. Each successive retry carries the full failure history of all prior attempts so the model sees every error at once. Up to JQL_MAX_ATTEMPTS total attempts (default 4). Certain errors are fixed deterministically without an LLM call: invalid field values are stripped, comment IS NOT EMPTY is rewritten to comment ~ '.', and unsupported IS [NOT] EMPTY operators on fields like issueLinkType are stripped inline
- Token usage (system prompt, field block, examples, and cumulative retry tokens) is tracked per query and returned in token_usage on every response, including error responses
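The retry step in the data flow above can be pictured as a loop that grows the prompt with every failure. This is a minimal sketch under stated assumptions: run_with_retries, generate, and execute are hypothetical names, the history format is invented for illustration, and a ValueError stands in for a Jira validation error.

```python
from typing import Callable


def run_with_retries(
    base_prompt: str,
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> JQL
    execute: Callable[[str], list],   # hypothetical Jira call: raises ValueError on bad JQL
    max_attempts: int = 4,            # mirrors the JQL_MAX_ATTEMPTS default
) -> tuple[list, int]:
    """Return (issues, attempts_used); every retry sees all earlier failures."""
    failures: list[tuple[str, str]] = []
    for attempt in range(1, max_attempts + 1):
        history = "".join(
            f"\nAttempt {n} JQL: {jql}\nJira error: {err}"
            for n, (jql, err) in enumerate(failures, start=1)
        )
        jql = generate(base_prompt + history)
        try:
            return execute(jql), attempt
        except ValueError as err:
            failures.append((jql, str(err)))
    raise RuntimeError(f"JQL still invalid after {max_attempts} attempts")


# Stand-ins so the sketch runs without an LLM or a Jira instance.
def fake_generate(prompt: str) -> str:
    return "project = KAFKA" if "Jira error" in prompt else "projet = KAFKA"


def fake_execute(jql: str) -> list:
    if jql.startswith("projet"):
        raise ValueError("Field 'projet' does not exist")
    return [{"key": "KAFKA-1"}]


issues, attempts = run_with_retries("list issues in KAFKA", fake_generate, fake_execute)
print(attempts)  # 2
```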
Both seeding steps are hash-gated — re-encoding is skipped if the source files have not changed since the last run.
A third vector table (jira_field_values) stores one embedding per (field_id, allowed_value) pair. It is seeded from the jira_allowed_values.json file at startup and used by the JqlSanitizer for pure-DB value correction — no LLM call, no token cost. High-cardinality fields are capped at MAX_VALUES_FOR_EMBEDDING values (default 50) to keep seeding fast; the full value list is still held in-memory for exact-match correction. The seed key encodes the cap value (::cap50) so changing MAX_VALUES_FOR_EMBEDDING automatically triggers a re-seed on next startup without any manual DB intervention.
A fourth vector table (jira_asset_values) stores one embedding per (field_id, label) pair for Jira Assets (formerly Insight) object fields. Asset object labels (e.g. "Sample Domain (ABCD-1234)") come from a different API than standard Jira field options and are stored in a dedicated table. At prompt-assembly time, the closest asset labels to the user query are injected as value hints so the LLM generates the exact label string on the first pass. Asset fields are optional — if jira_assets.json is absent or the Assets API is unreachable, the server starts normally without asset hints; an error is logged but startup is never blocked.
Jira fields are stored per domain under data/{domain_slug}/ (e.g. data/issues_apache_org/jira_fields.json). Switching the active profile in config/profiles.json automatically uses the correct set of files for that Jira instance.
Key files:
| File | Role |
|---|---|
| app.py | CLI entry point: --query (REPL / single-shot), --server, --model, --host, --port |
| server.py | FastAPI app with /health and /query endpoints |
| core/atlasmind.py | Top-level orchestrator: run() seeds both DBs, generate_jql() is the query entry point |
| core/router.py | Two-stage query router: fast LLM classify before triggering RAG pipeline |
| core/ollama_client.py | Sync test_connection() and async generate_jql() against the Ollama API |
| core/groq_client.py | Async Groq REST client (OpenAI-compatible); splits prompt into system / user roles at the ## Available Jira Fields marker; logs token usage; used when --model groq |
| core/vllm_client.py | Async vLLM REST client (OpenAI-compatible); auto-detects model from /v1/models; used when --model vllm |
| core/claude_client.py | Async Anthropic SDK client; caches system prompt via cache_control: ephemeral + anthropic-beta header; logs input/output/cache token counts; used when --model claude |
| core/bedrock_claude_client.py | boto3 converse() client for Bedrock-compatible endpoints; caches system prompt via cacheConfig: default in the system block; used when --model bedrock |
| core/jira_auth.py | Per-request Jira auth: X-Jira-Token and X-Jira-Url FastAPI dependencies; JiraProfile / JiraCredential Pydantic models |
| cloud/oci_vault.py | OCI Vault secret fetching via Instance Principal; fallback to plain env var |
| rag/jql_embeddings.py | Seeds and searches the JQL annotation pgvector table |
| rag/jira_field_embeddings.py | Seeds and searches the Jira field metadata pgvector table; find_similar_field_name() provides embedding fallback for unknown intent field names |
| rag/jira_field_value_embeddings.py | Seeds and searches the jira_field_values pgvector table, one embedding per (field_id, allowed_value) pair; used by JqlSanitizer for value correction without LLM calls |
| rag/jira_asset_embeddings.py | Seeds and searches the jira_asset_values pgvector table, one embedding per (field_id, label) pair for Jira Assets object fields; used to inject query-ranked asset labels as value hints before LLM generation |
| core/jql_sanitizer.py | Deterministic JQL pre-execution corrections: strips invalid field values, rewrites unsupported operators, injects value-hint candidates into retry prompts |
| jira/jira_field_api.py | Fetches field metadata and allowed values from the Jira REST API |
| jira/jira_assets_api.py | Fetches Jira Assets object labels via the Assets AQL API; list_asset_fields() prints detected Insight/Assets custom fields from the cached jira_fields.json |
| rag/seed_manager.py | MD5 hash-based seeding gate stored in a seed_metadata pgvector table |
| config/profiles.json | Jira connection profiles (URL, credentials); default key selects the active one |
| config/system_prompt.md | JQL-only system prompt (general answers handled by router) |
| config/router_prompt.md | Router prompt template with Jira vocabulary list and few-shot examples |
| settings.py | All defaults and env-overridable settings for both Ollama and Groq backends |
Edit config/profiles.json to configure your Jira instance:
{
"default": "work",
"profiles": {
"work": {
"jira_url": "https://issues.apache.org/jira",
"email": "",
"token": "",
"jira_type": "server",
"search_path": ""
},
"personal": {
"jira_url": "https://myorg.atlassian.net",
"email": "me@example.com",
"token": "",
"jira_type": "cloud",
"search_path": ""
}
}
}
Change "default" to switch the active instance. Jira fields are auto-fetched and stored in data/{domain_slug}/ on first run.
The jira_type field of a profile controls the Jira search API used for every query:
| jira_type | Method | Endpoint | Pagination |
|---|---|---|---|
| cloud | POST | /rest/api/3/search/jql | Cursor (nextPageToken) |
| server | GET | /rest/api/2/search | Offset (startAt) |
Defaults to cloud when omitted.
The search_path field is an optional override for the search endpoint path. Leave it empty to use the default for jira_type. Set it to change the search path without rebuilding or redeploying, which is useful if the Jira API version changes:
"search_path": "/rest/api/4/search/jql"
When search_path is set, it overrides the default path but jira_type still determines the HTTP method and pagination strategy (cloud → POST + cursor, server → GET + offset).
Frontends can override credentials per request using HTTP headers — no server restart needed:
| Header | Description |
|---|---|
| X-Jira-Token | PAT or API token; takes precedence over the profile token field |
| X-Jira-Url | Jira base URL; overrides the profile jira_url (must be a valid http/https URL) |
Both headers are optional. When absent, the active profile values are used as fallback.
If your Jira instance uses the Assets module (formerly Insight), AtlasMind automatically detects Assets-type fields, fetches their object labels, and embeds them so the LLM generates the correct aqlFunction JQL pattern instead of guessing a raw value.
For example, a user query like "show issues in domain Sample Domain" generates:
"Domain" IN aqlFunction('Name = "Sample Domain"')
instead of the incorrect domain = "Sample Domain" that a plain LLM would produce.
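For illustration, the clause shape shown above can be produced from a field name and an asset label as follows. The helper name asset_clause is hypothetical, and because the source does not describe how quotes inside labels are escaped, this sketch simply rejects them.

```python
def asset_clause(field_name: str, label: str) -> str:
    """Build an aqlFunction clause in the shape shown in the example above."""
    # Quote escaping is not covered by the source, so refuse such labels here.
    if '"' in label or "'" in label:
        raise ValueError("quotes in asset labels are not handled by this sketch")
    return f'"{field_name}" IN aqlFunction(\'Name = "{label}"\')'


print(asset_clause("Domain", "Sample Domain"))
# "Domain" IN aqlFunction('Name = "Sample Domain"')
```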
At startup, AtlasMind:
- Reads config/jira_assets_fields.json to get asset field detection keywords (default: [".insight", ".cmdb"]); the file is read at runtime, so no rebuild is needed.
- Reads jira_fields.json and detects every field whose schema.custom contains any of the configured keywords, the definitive Jira Assets/Insight indicator.
- Fetches all object labels for each detected field from the Jira Assets AQL API (GET /rest/assets/1.0/object/aql).
- Writes the results to data/<hostname>/jira_assets.json and seeds the jira_asset_values pgvector table.
- On subsequent startups, re-fetch is skipped if jira_fields.json has not changed (hash-gated).
No manual configuration is required. A log line confirms the load:
INFO Detected 1 Assets field(s): customfield_10200
INFO Asset fields loaded: 1 field(s) — customfield_10200
If the Assets API is unreachable, the server starts normally without asset hints and logs an error — startup is never blocked.
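The keyword-based detection step can be sketched as a substring check on schema.custom. This is an assumption-laden illustration: detect_asset_fields is a hypothetical name, the input is assumed to be a dict keyed by field ID, matching is assumed to be case-sensitive, and the sample field entries are made up.

```python
DEFAULT_KEYWORDS = [".insight", ".cmdb"]  # defaults named in the text


def detect_asset_fields(fields: dict[str, dict],
                        keywords: list[str] = DEFAULT_KEYWORDS) -> list[str]:
    """Return IDs of fields whose schema.custom contains any keyword."""
    detected = []
    for field_id, meta in fields.items():
        custom = (meta.get("schema") or {}).get("custom", "")
        if any(keyword in custom for keyword in keywords):
            detected.append(field_id)
    return detected


# Made-up sample shaped like Jira field metadata.
sample = {
    "customfield_10200": {
        "name": "Domain",
        "schema": {"custom": "com.riadalabs.jira.plugins.insight:rlabs-customfield-default-object"},
    },
    "customfield_10300": {
        "name": "Severity",
        "schema": {"custom": "com.atlassian.jira.plugin.system.customfieldtypes:select"},
    },
    "summary": {"name": "Summary", "schema": {"type": "string"}},
}
print(detect_asset_fields(sample))  # ['customfield_10200']
```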
Run this after bulk changes to asset objects in Jira (bypasses the hash gate):
uv run python -c "
import asyncio
from jira.jira_assets_api import refresh_asset_values
asyncio.run(refresh_asset_values())
"
AtlasMind detects Assets/Insight fields by looking for keywords in schema.custom. Default keywords: [".insight", ".cmdb"].
To add a custom keyword (e.g., for a vendor-specific plugin), edit config/jira_assets_fields.json:
{
"asset_field_keywords": [".insight", ".cmdb", ".custom-plugin"]
}
Keywords are read from the config file at runtime on every startup, so no rebuild or redeploy is needed. Restart AtlasMind after changing this file to re-seed the asset vector table with the updated detection rules.
By default, AtlasMind uses the field display name as the AQL object type (e.g. field named "Domain" → objectType = "Domain"). This covers most cases. When the display name differs from the AQL object type, add an override to config/jira_assets_fields.json:
{
"customfield_10200": {
"display_name": "Domain",
"object_type": "CustomerDomain"
}
}
Entries in this file take precedence over auto-detected values. If no override file exists, auto-detection runs without it.
uv run python -c "from jira.jira_assets_api import list_asset_fields; list_asset_fields()"
Prints all Assets-type fields detected in jira_fields.json with their field IDs and schema keys.
The /query endpoint returns a QueryResponse Pydantic model:
class QueryResponse(BaseModel):
    type: str # "jql" or "general"
    profile: str # active Jira profile name
    jira_base_url: str
    jira_type: str | None # effective search API: "cloud" or "server"
    answer: str | None
    jql: str | None # None for general queries
    total: int # total matching issues in Jira
    shown: int # issues returned in this response
    display_fields: list[str] # ordered column headers for the frontend
    issues: list[dict] # normalised issue dicts
    chart_spec: ChartSpec | None
    filters: dict[str, list[str]] | None # facet values for filter dropdowns
    meta: ServerMeta | None # model name, backend, timeout
    token_usage: TokenUsage | None # prompt token estimates for this query
jira_type reflects the search API actually used for this response: either the profile default or the per-request /cloud or /server override. The UI can use this to display which Jira API version was active, or to adapt behaviour for cloud vs server responses.
TokenUsage breaks down prompt size per query:
class TokenUsage(BaseModel):
    system_tokens: int # system prompt character count ÷ 4
    fields_tokens: int # field context block ÷ 4
    examples_tokens: int # RAG examples block ÷ 4
    total_tokens: int # total prompt tokens for the initial LLM call
    retry_tokens: int # cumulative tokens added across all retry extensions
token_usage is present on every response including error responses, so the frontend can track cost even when a query fails.
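The character-count heuristic behind these fields is easy to reproduce. The sketch below assumes integer division (rounding down) and assumes the total is the sum of the three blocks; the real total may also count the user query or use numbers reported by the API. Function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: character count divided by 4, rounded down."""
    return len(text) // 4


def estimate_usage(system: str, fields: str, examples: str) -> dict[str, int]:
    """Fill TokenUsage-like fields for the initial call (no retries yet)."""
    usage = {
        "system_tokens": estimate_tokens(system),
        "fields_tokens": estimate_tokens(fields),
        "examples_tokens": estimate_tokens(examples),
    }
    # Assumption: the total is the sum of the three blocks above.
    usage["total_tokens"] = sum(usage.values())
    usage["retry_tokens"] = 0
    return usage


print(estimate_usage("s" * 400, "f" * 200, "e" * 40))
# {'system_tokens': 100, 'fields_tokens': 50, 'examples_tokens': 10, 'total_tokens': 160, 'retry_tokens': 0}
```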
For general (non-Jira) questions, jql and chart_spec are None and answer contains the plain-text response.
For JQL queries, answer always includes a result-count suffix appended by the server after the Jira search completes — for example Found 42 result(s)., Found 500 result(s); showing 500. (when paginated), or No results found.
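The annotation file (data/jira_jql_annotated_queries.md by default) is a Markdown file of pairs, each a /* comment */ line followed by the JQL on the next line, used as few-shot examples: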
/* open bugs assigned to me */
assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC
/* high priority tickets created this week */
priority = High AND created >= startOfWeek() ORDER BY created DESC
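A file in this shape can be split into (annotation, JQL) pairs with a short regular expression. This is a sketch, not the project's parser: parse_annotations is a hypothetical name, and it assumes each comment and each JQL statement fits on a single line.

```python
import re

# A /* comment */ line followed by one line of JQL.
PAIR = re.compile(r"/\*\s*(.*?)\s*\*/\s*\n(.+)")


def parse_annotations(text: str) -> list[tuple[str, str]]:
    """Return (annotation, jql) pairs in file order."""
    return [(m.group(1), m.group(2).strip()) for m in PAIR.finditer(text)]


sample = (
    "/* open bugs assigned to me */\n"
    "assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC\n"
    "/* high priority tickets created this week */\n"
    "priority = High AND created >= startOfWeek() ORDER BY created DESC\n"
)
for annotation, jql in parse_annotations(sample):
    print(annotation, "->", jql)
```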
jira_fields.json is fetched automatically on first run from /rest/api/2/field and is keyed by field ID. A companion jira_allowed_values.json is also fetched and merged in to enrich descriptions with discrete option lists (e.g. status values, issue types).
config/jira_assets_fields.json contains asset_field_keywords (keywords to detect Assets/Insight fields in schema.custom) and optional per-field object type overrides. See the Jira Assets section.
jira_assets.json is written automatically at startup by the Assets auto-detect flow and contains all object labels per detected asset field. Re-run refresh_asset_values() whenever asset objects change in Jira; the server detects the hash change and re-seeds the jira_asset_values table on next startup.
AtlasMind on OCI A1 can offload all LLM inference to a local GPU system over Tailscale. Only vLLM needs to run on the GPU system — no database, no AtlasMind installation required there.
| Machine | What runs |
|---|---|
| GPU system | vLLM only — serves the model over HTTP |
| OCI A1 (always-on) | AtlasMind + Postgres + Ollama (fallback) + frontend |
AtlasMind on OCI A1 sends prompts to vLLM on the GPU system over Tailscale. When the GPU system is off, AtlasMind falls back to its local Ollama automatically.
vLLM does not run natively on Windows. You need WSL2 with Ubuntu.
Open PowerShell as Administrator and run:
wsl --install
Restart when prompted. After restart, Ubuntu opens and asks you to create a username and password. This is your Linux environment; all remaining steps run inside WSL2.
To open WSL2 later: search for Ubuntu in the Start menu, or run wsl in any terminal.
The NVIDIA driver is automatically bridged from Windows into WSL2 — no separate CUDA toolkit installation needed. vLLM's pip package bundles the CUDA runtime libraries it needs.
Run inside WSL2:
nvidia-smi
You should see your GPU listed with driver version and VRAM. If this command fails, reinstall the latest NVIDIA driver on Windows first, then retry.
Ubuntu 24.04 does not allow system-wide pip installs. Use a virtual environment:
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install vllm
After activation you will see (vllm-env) in your prompt. This download is large (~5 GB); let it complete fully before continuing.
Always activate the environment before running vLLM in future sessions:
source ~/vllm-env/bin/activate
With 8 GB VRAM, use a quantized 7B model. AWQ quantization gives the best quality-to-size ratio and is natively supported by vLLM.
Recommended for AtlasMind (reliable structured JSON and JQL output):
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--port 8002 \
--host 0.0.0.0
Qwen2.5-Coder is preferred over the general instruct variant because JQL is a query language (similar to SQL). The Coder model is trained on code and structured DSLs, making it more reliable at generating syntactically correct JQL and strictly following the JSON output format (jql, intent_fields, chart_spec, answer).
--gpu-memory-utilization 0.85 reserves 85% of VRAM for vLLM. The default is 0.9 (90%) which can exceed available VRAM on 8 GB cards due to Windows/WSL2 overhead. Lower to 0.80 if startup still fails.
--max-model-len 8192 caps the context window at 8192 tokens. The model's default (32768) requires more KV cache than fits in 8 GB after loading weights. 8192 is sufficient for AtlasMind — typical prompts (system prompt + RAG examples + query) are 1500–2500 tokens.
On first run, this downloads the model weights from HuggingFace (~4.5 GB). Subsequent runs load from the local cache. Wait until you see:
INFO: Application startup complete.
The server is now listening on port 8002.
Alternative models (all fit in 8 GB VRAM with AWQ):
| Model | VRAM | Notes |
|---|---|---|
| Qwen/Qwen2.5-7B-Instruct-AWQ | ~4.5 GB | General instruct, solid fallback |
| meta-llama/Llama-3.1-8B-Instruct-AWQ | ~5.5 GB | Strong reasoning, good alternative |
From WSL2, confirm the API responds:
curl http://localhost:8002/v1/models
You should see a JSON response listing the loaded model name.
On the OCI A1 machine, set the following environment variable before starting AtlasMind:
export VLLM_URL=http://<gpu-system-tailscale-ip>:8002
Replace <gpu-system-tailscale-ip> with the GPU system's Tailscale IP address (find it by running tailscale ip in PowerShell on the GPU system, or by clicking the Tailscale tray icon).
Then start AtlasMind with the vLLM backend:
uv run python app.py --server --model vllm
AtlasMind auto-detects the loaded model from vLLM's /v1/models endpoint, so there is no need to set the model name explicitly.
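Model auto-detection of this kind usually amounts to reading the first entry of the models list. The sketch below assumes the OpenAI-compatible response shape (a data array of objects with an id) and assumes the first entry is the one to use; first_model_id is a hypothetical helper, not AtlasMind's code, and the sample payload is hand-written.

```python
import json


def first_model_id(models_response: str) -> str:
    """Pick the first model ID from a /v1/models JSON response body."""
    payload = json.loads(models_response)
    models = payload.get("data", [])
    if not models:
        raise ValueError("vLLM reported no loaded models")
    return models[0]["id"]


# Hand-written sample in the OpenAI-compatible list shape.
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", "object": "model"}]}'
print(first_model_id(sample))  # Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
```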
WSL2 shuts down when you close the terminal. To keep vLLM running in the background:
source ~/vllm-env/bin/activate
nohup vllm serve Qwen/Qwen2.5-Coder-7B-Instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--port 8002 \
--host 0.0.0.0 > ~/vllm.log 2>&1 &
Logs go to ~/vllm.log. Check them with tail -f ~/vllm.log.
Tailscale creates a private network between your GPU system and OCI A1, so AtlasMind can reach vLLM securely without exposing any ports to the internet.
Download and install Tailscale from tailscale.com/download. Run the installer and sign in with your Tailscale account (Google, GitHub, or Microsoft login).
Once signed in, Tailscale assigns your Windows machine a private IP in the 100.x.x.x range. You will see the Tailscale icon in the system tray.
Edit (or create) C:\Users\<username>\.wslconfig and add:
[wsl2]
networkingMode=mirrored
firewall=false
Restart WSL2 to apply:
wsl --shutdown
networkingMode=mirrored makes WSL2 share the Windows network stack directly, so vLLM is reachable at the Windows machine's IP without any port proxy. firewall=false disables the WSL2 Hyper-V firewall layer, which otherwise blocks inbound connections independently of other firewall rules.
Add an inbound allow rule for TCP port 8002.
If you use Windows Defender Firewall only:
- Press Win + R → type wf.msc → Enter
- Inbound Rules → New Rule → Port → TCP → 8002 → Allow the connection → All profiles → Finish
If you use a third-party firewall suite (e.g. Norton 360, McAfee):
Third-party firewall suites include their own firewall engine that runs alongside Windows Defender Firewall. Add the port 8002 allow rule in your firewall suite's settings — for Norton: Settings → Firewall → Traffic Rules → Add → Action: Allow, Direction: Inbound, Protocol: TCP, Local port: 8002, Profile: All.
Note: If inbound connections are still blocked after adding the rule, both firewall engines may be active simultaneously and conflicting. If your third-party suite is the intended firewall, disable Windows Defender Firewall so only one engine is enforcing rules. Run the following in PowerShell as Administrator:
Set-NetFirewallProfile -Profile Domain,Public,Private -Enabled False
This disables Windows Defender Firewall across all profiles. Your third-party firewall (Norton, McAfee, etc.) remains active.
WSL2 resets completely on every shutdown (wsl --shutdown, PC restart, or closing the terminal) — all running processes including vLLM are killed. You must restart vLLM each time WSL2 comes back up.
To make this less tedious, add a shell alias to your ~/.bashrc:
echo "alias start-vllm='source ~/vllm-env/bin/activate && vllm serve Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --quantization awq --gpu-memory-utilization 0.85 --max-model-len 8192 --port 8002 --host 0.0.0.0'" >> ~/.bashrc
source ~/.bashrc
Then to start vLLM in any future session:
start-vllm
Or in the background:
start-vllm > ~/vllm.log 2>&1 &
Then follow the logs:
tail -f ~/vllm.log
In PowerShell on Windows, run:
tailscale ip
Or click the Tailscale system tray icon; your IP is shown at the top. It will look like 100.x.x.x.
Note this IP — you will set it as VLLM_URL on OCI A1.
On the OCI A1 instance, run:
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
Follow the authentication link printed in the terminal to connect OCI A1 to the same Tailscale account. Once authenticated, OCI A1 and your GPU system are on the same private network.
From OCI A1, confirm it can reach vLLM on the GPU system (replace with your actual Tailscale IP):
curl http://100.x.x.x:8002/v1/models
You should get back a JSON response listing the loaded model. If the request times out, check that:
- vLLM is running in WSL2 with --host 0.0.0.0
- .wslconfig has networkingMode=mirrored and firewall=false, and WSL2 was restarted after the change
- Both machines show as Connected in the Tailscale admin console at login.tailscale.com
- The firewall allow rule for port 8002 is in place (Step 3)
- If using a third-party firewall suite, check whether both firewall engines are conflicting (see Step 3 note)
On OCI A1, set the Tailscale IP before starting the server:
export VLLM_URL=http://100.x.x.x:8002
uv run python app.py --server --model vllm
uv run python -m pytest tests/ -v