Atlasmind-Lite

A natural language to JQL (Jira Query Language) generator using RAG (Retrieval-Augmented Generation) with pgvector. Supports multiple LLM backends: local Ollama, vLLM (GPU inference server), Groq cloud, Anthropic Claude direct API, and AWS Bedrock-compatible endpoints. Returns structured JSON with a JQL query, a chart specification, and a plain-text answer. A two-stage router answers general questions immediately without touching the JQL pipeline. Try it live: atlasmind.de

Prerequisites

  • PostgreSQL with the pgvector extension
  • One of the following LLM backends:
    • Ollama running locally with a model loaded (default: qwen2.5:3b-instruct-q4_K_M)
    • A Groq API key (GROQ_API_KEY)
    • A vLLM inference server (VLLM_URL)
    • An Anthropic API key (CLAUDE_API_KEY) for Claude direct
    • An AWS Bedrock-compatible endpoint + bearer token for --model bedrock
  • Python 3.12+, uv

Setup

uv sync

One-time Jira field fetch

Run these once against your active Jira profile before starting the server for the first time, or after switching to a new Jira instance:

# Fetch all Jira field metadata and cache locally
uv run python -c "from jira.jira_field_api import fetch_and_save_fields; fetch_and_save_fields()"

# Fetch allowed values for all eligible fields (status, priority, issue types, custom options)
uv run python -c "
import asyncio
from jira.jira_field_api import fetch_and_save_allowed_values
asyncio.run(fetch_and_save_allowed_values())
"

Files are written to data/{domain_slug}/ and used to seed the pgvector tables on next startup. AtlasMind will also fetch them automatically on first run if they are absent.

Set the following environment variables (or rely on the defaults in settings.py):

| Variable | Default | Description |
| --- | --- | --- |
| DATABASE_URL | postgresql://postgres:postgres@localhost:5432/jql_vectordb | pgvector connection string |
| EMBEDDING_MODEL | BAAI/bge-small-en-v1.5 | SentenceTransformer model name |
| LLM_BACKEND | ollama | LLM backend: ollama, groq, vllm, claude, or bedrock (overrides --model when set) |
| JQL_OLLAMA_URL | http://localhost:11434 | Ollama base URL |
| JQL_LOCAL_MODEL | qwen2.5:3b-instruct-q4_K_M | Ollama model to use |
| JQL_OLLAMA_TIMEOUT | 120 | Read timeout in seconds for LLM inference |
| GROQ_API_KEY | | Groq API key (local dev) |
| GROQ_API_KEY_OCID | | OCI Vault secret OCID for GROQ_API_KEY (takes priority over GROQ_API_KEY) |
| GROQ_MODEL | meta-llama/llama-4-scout-17b-16e-instruct | Groq model name |
| JQL_ANNOTATION_FILE | data/jira_jql_annotated_queries.md | Path to JQL annotation file |
| MAX_JIRA_RESULTS | 2000 | Maximum number of Jira issues fetched per query (paginated automatically) |
| JQL_MAX_ATTEMPTS | 4 | Total JQL attempts per query: 1 initial + (JQL_MAX_ATTEMPTS − 1) retries on Jira validation errors |
| MAX_INTENT_FIELDS | 5 | Maximum extra fields the LLM may propose per query |
| STANDARD_FIELD_IDS | key,summary,assignee,priority,issuetype,created,resolutiondate | Comma-separated list of Jira field IDs always shown in results; override per project or Docker deployment |
| VALUE_AUTO_CORRECT_THRESHOLD | 0.15 | Cosine distance below which the sanitizer silently auto-corrects a bad value to the nearest known allowed value (e.g. typo correction) |
| VALUE_HINT_THRESHOLD | 0.40 | Cosine distance threshold for JQL value correction: bad values within this distance of a known allowed value are flagged |
| VALUE_HINT_MAX_CANDIDATES | 3 | Maximum candidate values surfaced per field for JQL sanitizer corrections |
| VALUE_PROMPT_MAX_CANDIDATES | 3 | Maximum candidate values injected into the retry prompt as hints for the LLM |
| EMBEDDING_BATCH_SIZE | 256 | Batch size for SentenceTransformer encoding during seeding; higher values reduce seeding time on CPU/GPU |
| MAX_VALUES_FOR_EMBEDDING | 50 | Maximum allowed values embedded per field in jira_field_values. High-cardinality fields (versions, components) are capped here; the in-memory exact-match dict always holds all values so casing correction is unaffected |
| VLLM_URL | | vLLM server base URL (e.g. http://100.x.x.x:8002) |
| VLLM_FALLBACK | ollama | Backend to use if vLLM is unreachable at startup (ollama, groq, claude, bedrock) |
| VLLM_TIMEOUT | 240 | Read timeout in seconds for vLLM inference |
| VLLM_MAX_TOKENS | | Max tokens for vLLM responses |
| VLLM_API_KEY | | API key if the vLLM server requires authentication |
| CLAUDE_API_KEY | | Anthropic API key (local dev); used when --model claude |
| CLAUDE_API_KEY_OCID | | OCI Vault secret OCID for CLAUDE_API_KEY (takes priority) |
| CLAUDE_MODEL | claude-sonnet-4-6 | Anthropic model name |
| AWS_BEARER_TOKEN_BEDROCK | | Bearer token for the Bedrock-compatible endpoint; used when --model bedrock |
| CUSTOM_ENDPOINT | | Bedrock-compatible API endpoint URL; required for --model bedrock |
| BEDROCK_REGION | custom | Region name passed to boto3 (gateway may override internally) |
| BEDROCK_MODEL | claude-sonnet-4.6 | Model ID sent to the Bedrock endpoint |
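
The two value-correction thresholds in the table split a rejected JQL value into three outcomes. The following is a minimal sketch of that decision, not AtlasMind's actual code: the function name `classify_correction` and the outcome labels are hypothetical, and treating the hint threshold as inclusive is an assumption.

```python
# Illustrative only: classify_correction is a hypothetical helper, not AtlasMind's API.
VALUE_AUTO_CORRECT_THRESHOLD = 0.15  # default from the settings table
VALUE_HINT_THRESHOLD = 0.40          # default from the settings table


def classify_correction(distance: float) -> str:
    """Map the cosine distance to the nearest allowed value onto an action."""
    if distance < VALUE_AUTO_CORRECT_THRESHOLD:
        return "auto_correct"  # close enough to be a typo: replace silently
    if distance <= VALUE_HINT_THRESHOLD:
        return "hint"          # plausible match: surface it as a candidate
    return "no_match"          # too far away to suggest anything


for d in (0.05, 0.30, 0.70):
    print(d, classify_correction(d))
```

Lowering VALUE_AUTO_CORRECT_THRESHOLD makes silent rewrites rarer; raising VALUE_HINT_THRESHOLD surfaces more (and weaker) candidates.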

Running the app

All modes are accessed through app.py.

Setting env variables on Windows

The VAR=value command inline syntax is Linux/macOS only. On Windows use:

  • CMD: set "JQL_MAX_ATTEMPTS=5" && uv run python app.py --server (the quotes keep the space before && out of the value)
  • PowerShell: $env:JQL_MAX_ATTEMPTS=5; uv run python app.py --server

Interactive REPL

uv run python app.py --query                    # local Ollama (default)
uv run python app.py --query --model groq       # Groq cloud
uv run python app.py --query --model vllm       # vLLM inference server
uv run python app.py --query --model claude     # Anthropic Claude direct
uv run python app.py --query --model bedrock    # AWS Bedrock-compatible endpoint

Starts a Rich terminal loop with the AtlasMind banner. Type a natural language query and press Enter to get JQL and an answer.

[atlasmind]> list open bugs assigned to me

  Route   : JQL pipeline
  JQL     : assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC
  Chart   : {"type": "bar", "x_field": "status", "y_field": "count", "title": "Open bugs by status"}
  Answer  : Open bugs currently assigned to you
  Response time : 2.34s

General questions are answered directly without going through the JQL pipeline:

[atlasmind]> what is the difference between a bug and a task?

  Route   : General answer
  Answer  : A bug represents a defect or unexpected behaviour in the software...
  Response time : 0.81s

REPL commands:

| Command | Description |
| --- | --- |
| am help | Show example queries and command list |
| am history | Show query history for this session |
| exit / quit / q / am quit | Exit the REPL |
| Ctrl+C at prompt | Exit cleanly |
| Ctrl+C during query | Interrupt the current query, return to prompt |

Single-shot query

uv run python app.py --query "list open bugs assigned to me"

Runs one query, prints JQL and Answer, then exits. Useful for scripting.

FastAPI server

uv run python app.py --server                             # Ollama backend, port 8000
uv run python app.py --server --model groq --port 9000    # Groq backend, port 9000
uv run python app.py --server --model vllm --port 9000    # vLLM backend, port 9000
uv run python app.py --server --model claude              # Anthropic Claude direct
uv run python app.py --server --model bedrock             # AWS Bedrock-compatible endpoint

Starts the REST API on http://0.0.0.0:8000.

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Liveness check; returns {"status": "ok"} |
| GET | /meta | Server metadata: active model name, LLM backend, and timeout |
| GET | /query | Generate JQL from natural language (query via q URL param) |
| POST | /query | Same as GET but query in request body ({"query": "..."}) |
| POST | /event | Client events: {"event": "cancel", "request_id": "..."} to abort an in-flight query |

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "list open bugs assigned to me"}'

{
  "jql": "assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC",
  "chart_spec": {"type": "bar", "x_field": "status", "y_field": "count", "title": "Open bugs by status"},
  "answer": "Open bugs currently assigned to you"
}
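
The same POST request can be issued from Python. This is a minimal standard-library sketch, assuming the server runs on the default http://localhost:8000; `build_query_request` is a hypothetical helper name. The request is only built here, and the commented lines show how it could be sent.

```python
import json
import urllib.request


def build_query_request(base_url: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("http://localhost:8000", "list open bugs assigned to me")
print(req.get_method(), req.full_url)
# To send it against a running server:
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(json.load(resp)["jql"])
```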

Routing overrides

The query router automatically classifies each query as JQL or general. If the router misclassifies a query, or you need to control which Jira search API is used, append a flag to your query:

| Flag | Effect |
| --- | --- |
| /jql | Forces the JQL pipeline regardless of LLM classification |
| /general | Forces the general answer path, skipping the JQL pipeline |
| /cloud | Forces POST /rest/api/3/search/jql (Jira Cloud API) for this request |
| /server | Forces GET /rest/api/2/search (Jira Server API) for this request |

All flags are stripped from the query before it is sent to the LLM, so they do not affect the generated JQL or answer.

/cloud and /server override the search method and path for the current request only — they do not switch the active profile or change the Jira base URL. Use them when the active profile's jira_type is correct but you need to temporarily force a different API version. Flags can be combined:

[atlasmind]> list open issues in KAFKA /cloud
[atlasmind]> project = KAFKA AND status = Open /raw /cloud

Examples:

[atlasmind]> how many states are there in India /general

  Route   : General answer
  Answer  : India has 28 states and 8 union territories.

[atlasmind]> list issues in KAFKA /jql

  Route   : JQL pipeline
  JQL     : project = KAFKA ORDER BY created DESC

Overrides work across all LLM backends (Ollama, Groq, vLLM, Claude, Bedrock).
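
Flag handling of this kind can be sketched in a few lines. The example below is an illustration under assumptions, not the project's parser: `split_routing_flags` is a hypothetical name, it only knows the four flags from the table, and it removes a flag wherever it appears in the query rather than only at the end.

```python
KNOWN_FLAGS = {"/jql", "/general", "/cloud", "/server"}


def split_routing_flags(query: str) -> tuple[str, set[str]]:
    """Return the query without known flags, plus the set of flags found."""
    kept, flags = [], set()
    for token in query.split():
        if token.lower() in KNOWN_FLAGS:
            flags.add(token.lower())
        else:
            kept.append(token)
    return " ".join(kept), flags


text, flags = split_routing_flags("list open issues in KAFKA /jql /cloud")
print(text)           # list open issues in KAFKA
print(sorted(flags))  # ['/cloud', '/jql']
```

Only the cleaned text would be passed to the LLM; the flag set drives routing and the choice of search API.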

Architecture

Data flow:

  1. JQL_Embeddings.run() seeds pgvector with (annotation, JQL) pairs parsed from the annotation file
  2. Jira_Field_Embeddings.run() seeds pgvector with Jira field metadata (name, type, allowed values) — auto-fetched from the Jira REST API on first run if the file is absent
  3. At query time, QueryRouter makes a single fast LLM call to classify the query:
    • General query → answered immediately; no embeddings or Jira API calls
    • JQL query → full RAG pipeline: encode → similarity search → prompt → LLM → Jira API
  4. The assembled prompt is split on the ## Available Jira Fields marker and sent to the active LLM. For Claude and Bedrock the stable system instructions (before the marker) are sent as a cached system block — billed at ~90% less on subsequent requests in the same session. For Groq the split maps to OpenAI system / user roles. Ollama and vLLM receive the full prompt as a single string. Actual token counts (including cache hits) are logged from every API response
  5. LLM returns structured JSON with jql, intent_fields, chart_spec, and answer
  6. intent_fields (LLM-proposed display columns) are resolved via FieldResolver. Exact name matches are tried first; unknown names fall back to embedding similarity search against the field metadata vector store, catching LLM variants like fixVersion → Fix Version/s without any extra LLM call
  7. JQL is validated by JqlSanitizer before execution. Invalid field values are detected by cosine similarity against the jira_field_values vector store and corrected deterministically — no LLM call needed for known-value fields
  8. The validated JQL is executed against the Jira REST API. On failure, the Jira error is appended to the accumulated retry prompt and the LLM is asked to correct the JQL — each successive retry carries the full failure history of all prior attempts so the model sees every error at once. Up to JQL_MAX_ATTEMPTS total attempts (default 4). Certain errors are fixed deterministically without an LLM call: invalid field values are stripped, comment IS NOT EMPTY is rewritten to comment ~ '.', and unsupported IS [NOT] EMPTY operators on fields like issueLinkType are stripped inline
  9. Token usage (system prompt, field block, examples, and cumulative retry tokens) is tracked per query and returned in token_usage on every response — including error responses
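
The retry behaviour in step 8 can be pictured as a loop that keeps appending failures to one growing prompt. This is a minimal sketch with stand-in callables, not the real orchestrator: `generate_with_retries`, `ask_llm`, and `run_jql` are hypothetical names, and the text appended to the prompt is invented for illustration.

```python
from typing import Callable, Optional

JQL_MAX_ATTEMPTS = 4  # default from the settings table


def generate_with_retries(
    prompt: str,
    ask_llm: Callable[[str], str],
    run_jql: Callable[[str], Optional[str]],
    max_attempts: int = JQL_MAX_ATTEMPTS,
) -> tuple[Optional[str], int]:
    """Return (working JQL or None, attempts used).

    run_jql returns None on success or the Jira error text on failure.
    """
    for attempt in range(1, max_attempts + 1):
        jql = ask_llm(prompt)
        error = run_jql(jql)
        if error is None:
            return jql, attempt
        # Append rather than replace, so later attempts see every earlier failure.
        prompt += f"\n\nAttempt {attempt} produced: {jql}\nJira rejected it with: {error}"
    return None, max_attempts


# Stand-ins for the LLM and Jira: the fake model fixes the field once it sees an error.
def fake_llm(prompt: str) -> str:
    return "status = Open" if "Jira rejected" in prompt else "state = Open"


def fake_jira(jql: str) -> Optional[str]:
    return None if jql.startswith("status") else "Field 'state' does not exist."


print(generate_with_retries("list open issues", fake_llm, fake_jira))
```

With the defaults, a query that never validates costs four LLM calls, each with a longer prompt than the last, which is why retry tokens are tracked separately.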

Both seeding steps are hash-gated — re-encoding is skipped if the source files have not changed since the last run.

A third vector table (jira_field_values) stores one embedding per (field_id, allowed_value) pair. It is seeded from the jira_allowed_values.json file at startup and used by the JqlSanitizer for pure-DB value correction — no LLM call, no token cost. High-cardinality fields are capped at MAX_VALUES_FOR_EMBEDDING values (default 50) to keep seeding fast; the full value list is still held in-memory for exact-match correction. The seed key encodes the cap value (::cap50) so changing MAX_VALUES_FOR_EMBEDDING automatically triggers a re-seed on next startup without any manual DB intervention.
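
A hash gate with a cap-encoded key can be sketched as follows. This is an assumption-laden illustration: a plain dict stands in for the seed_metadata table, `seed_key` and `needs_seed` are hypothetical names, and the exact key format beyond the ::cap50 suffix is guessed.

```python
import hashlib

seed_metadata: dict[str, str] = {}  # stand-in for the seed_metadata table


def seed_key(table: str, cap: int) -> str:
    """Embed the cap in the key so a changed cap looks like a new, unseeded source."""
    return f"{table}::cap{cap}"


def needs_seed(key: str, content: bytes) -> bool:
    """True when content differs from what was last seeded under this key."""
    digest = hashlib.md5(content).hexdigest()
    if seed_metadata.get(key) == digest:
        return False
    seed_metadata[key] = digest  # a real gate would record this after seeding succeeds
    return True


data = b'{"priority": ["High", "Low"]}'
print(needs_seed(seed_key("jira_field_values", 50), data))   # True: first run
print(needs_seed(seed_key("jira_field_values", 50), data))   # False: unchanged
print(needs_seed(seed_key("jira_field_values", 100), data))  # True: cap changed
```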

A fourth vector table (jira_asset_values) stores one embedding per (field_id, label) pair for Jira Assets (formerly Insight) object fields. Asset object labels (e.g. "Sample Domain (ABCD-1234)") come from a different API than standard Jira field options and are stored in a dedicated table. At prompt-assembly time, the closest asset labels to the user query are injected as value hints so the LLM generates the exact label string on the first pass. Asset fields are optional — if jira_assets.json is absent or the Assets API is unreachable, the server starts normally without asset hints; an error is logged but startup is never blocked.

Jira fields are stored per domain under data/{domain_slug}/ (e.g. data/issues_apache_org/jira_fields.json). Switching the active profile in config/profiles.json automatically uses the correct set of files for that Jira instance.

Key files:

| File | Role |
| --- | --- |
| app.py | CLI entry point: --query (REPL / single-shot), --server, --model, --host, --port |
| server.py | FastAPI app with the /health, /meta, /query, and /event endpoints |
| core/atlasmind.py | Top-level orchestrator: run() seeds both DBs, generate_jql() is the query entry point |
| core/router.py | Two-stage query router: fast LLM classify before triggering RAG pipeline |
| core/ollama_client.py | Sync test_connection() and async generate_jql() against the Ollama API |
| core/groq_client.py | Async Groq REST client (OpenAI-compatible); splits prompt into system / user roles at the ## Available Jira Fields marker; logs token usage; used when --model groq |
| core/vllm_client.py | Async vLLM REST client (OpenAI-compatible); auto-detects model from /v1/models; used when --model vllm |
| core/claude_client.py | Async Anthropic SDK client; caches system prompt via cache_control: ephemeral + anthropic-beta header; logs input/output/cache token counts; used when --model claude |
| core/bedrock_claude_client.py | boto3 converse() client for Bedrock-compatible endpoints; caches system prompt via cacheConfig: default in the system block; used when --model bedrock |
| core/jira_auth.py | Per-request Jira auth: X-Jira-Token and X-Jira-Url FastAPI dependencies; JiraProfile / JiraCredential Pydantic models |
| cloud/oci_vault.py | OCI Vault secret fetching via Instance Principal; fallback to plain env var |
| rag/jql_embeddings.py | Seeds and searches the JQL annotation pgvector table |
| rag/jira_field_embeddings.py | Seeds and searches the Jira field metadata pgvector table; find_similar_field_name() provides embedding fallback for unknown intent field names |
| rag/jira_field_value_embeddings.py | Seeds and searches the jira_field_values pgvector table (one embedding per (field_id, allowed_value) pair); used by JqlSanitizer for value correction without LLM calls |
| rag/jira_asset_embeddings.py | Seeds and searches the jira_asset_values pgvector table (one embedding per (field_id, label) pair for Jira Assets object fields); used to inject query-ranked asset labels as value hints before LLM generation |
| core/jql_sanitizer.py | Deterministic JQL pre-execution corrections: strips invalid field values, rewrites unsupported operators, injects value-hint candidates into retry prompts |
| jira/jira_field_api.py | Fetches field metadata and allowed values from the Jira REST API |
| jira/jira_assets_api.py | Fetches Jira Assets object labels via the Assets AQL API; list_asset_fields() prints detected Insight/Assets custom fields from the cached jira_fields.json |
| rag/seed_manager.py | MD5 hash-based seeding gate stored in a seed_metadata pgvector table |
| config/profiles.json | Jira connection profiles (URL, credentials); default key selects the active one |
| config/system_prompt.md | JQL-only system prompt (general answers handled by router) |
| config/router_prompt.md | Router prompt template with Jira vocabulary list and few-shot examples |
| settings.py | All defaults and env-overridable settings for all LLM backends |

Jira connection profiles

Edit config/profiles.json to configure your Jira instance:

{
  "default": "work",
  "profiles": {
    "work": {
      "jira_url": "https://issues.apache.org/jira",
      "email": "",
      "token": "",
      "jira_type": "server",
      "search_path": ""
    },
    "personal": {
      "jira_url": "https://myorg.atlassian.net",
      "email": "me@example.com",
      "token": "",
      "jira_type": "cloud",
      "search_path": ""
    }
  }
}

Change "default" to switch the active instance. Jira fields are auto-fetched and stored in data/{domain_slug}/ on first run.
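
Selecting the active profile and deriving its data directory might look like the sketch below. It is a guess at the mechanics, not the project's loader: `active_profile` and `domain_slug` are hypothetical names, and the slug rule (hostname with dots replaced by underscores) is inferred only from the example path data/issues_apache_org/jira_fields.json.

```python
import json
from urllib.parse import urlparse

PROFILES_JSON = """
{
  "default": "work",
  "profiles": {
    "work": {"jira_url": "https://issues.apache.org/jira", "jira_type": "server"},
    "personal": {"jira_url": "https://myorg.atlassian.net", "jira_type": "cloud"}
  }
}
"""


def active_profile(config: dict) -> dict:
    """Return the profile named by the top-level "default" key."""
    return config["profiles"][config["default"]]


def domain_slug(jira_url: str) -> str:
    """Assumed rule: the URL's hostname with dots replaced by underscores."""
    host = urlparse(jira_url).hostname or ""
    return host.replace(".", "_")


profile = active_profile(json.loads(PROFILES_JSON))
print(f"data/{domain_slug(profile['jira_url'])}/jira_fields.json")
```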

jira_type

Controls the Jira search API used for every query:

| jira_type | Method | Endpoint | Pagination |
| --- | --- | --- | --- |
| cloud | POST | /rest/api/3/search/jql | Cursor (nextPageToken) |
| server | GET | /rest/api/2/search | Offset (startAt) |

Defaults to cloud when omitted.

search_path

Optional override for the search endpoint path. Leave empty to use the default for jira_type. Set this to change the search path without rebuilding or redeploying — useful if the Jira API version changes:

"search_path": "/rest/api/4/search/jql"

When search_path is set, it overrides the default path but jira_type still determines the HTTP method and pagination strategy (cloud → POST + cursor, server → GET + offset).

Per-request auth headers

Frontends can override credentials per request using HTTP headers — no server restart needed:

| Header | Description |
| --- | --- |
| X-Jira-Token | PAT or API token; takes precedence over the profile token field |
| X-Jira-Url | Jira base URL; overrides the profile jira_url (must be a valid http/https URL) |

Both headers are optional. When absent, the active profile values are used as fallback.
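
The precedence rule (header first, profile as fallback) can be sketched like this. It is illustrative only: `effective_credentials` is a hypothetical name, a plain dict stands in for the request headers (real HTTP header lookup is case-insensitive), and the URL check is one possible reading of "valid http/https URL".

```python
from urllib.parse import urlparse


def effective_credentials(profile: dict, headers: dict[str, str]) -> tuple[str, str]:
    """Return (jira_url, token), preferring request headers over the profile."""
    url = headers.get("X-Jira-Url") or profile["jira_url"]
    token = headers.get("X-Jira-Token") or profile["token"]
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid http/https URL: {url!r}")
    return url, token


profile = {"jira_url": "https://issues.apache.org/jira", "token": "profile-token"}
print(effective_credentials(profile, {}))
print(effective_credentials(profile, {"X-Jira-Token": "request-token"}))
```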

Jira Assets (optional)

If your Jira instance uses the Assets module (formerly Insight), AtlasMind automatically detects Assets-type fields, fetches their object labels, and embeds them so the LLM generates the correct aqlFunction JQL pattern instead of guessing a raw value.

For example, a user query like "show issues in domain Sample Domain" generates:

"Domain" IN aqlFunction('Name = "Sample Domain"')

instead of the incorrect domain = "Sample Domain" that a plain LLM would produce.

How it works

At startup, AtlasMind:

  1. Reads config/jira_assets_fields.json to get asset field detection keywords (default: [".insight", ".cmdb"]) — read at runtime, no rebuild needed.
  2. Reads jira_fields.json and detects every field whose schema.custom contains any of the configured keywords — the definitive Jira Assets/Insight indicator.
  3. Fetches all object labels for each detected field from the Jira Assets AQL API (GET /rest/assets/1.0/object/aql).
  4. Writes the results to data/{domain_slug}/jira_assets.json and seeds the jira_asset_values pgvector table.
  5. On subsequent startups, re-fetch is skipped if jira_fields.json has not changed (hash-gated).

No manual configuration is required. A log line confirms the load:

INFO  Detected 1 Assets field(s): customfield_10200
INFO  Asset fields loaded: 1 field(s) — customfield_10200

If the Assets API is unreachable, the server starts normally without asset hints and logs an error — startup is never blocked.

Forcing a refresh

Run this after bulk changes to asset objects in Jira (bypasses the hash gate):

uv run python -c "
import asyncio
from jira.jira_assets_api import refresh_asset_values
asyncio.run(refresh_asset_values())
"

Configuring asset field detection keywords

AtlasMind detects Assets/Insight fields by looking for keywords in schema.custom. Default keywords: [".insight", ".cmdb"].

To add a custom keyword (e.g., for a vendor-specific plugin), edit config/jira_assets_fields.json:

{
    "asset_field_keywords": [".insight", ".cmdb", ".custom-plugin"]
}

Keywords are read from the config file at runtime on every startup — no rebuild or re-deploy needed. Restart AtlasMind after changing this file to re-seed the asset vector table with the updated detection rules.
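
Keyword-based detection amounts to a substring check on each field's schema.custom value. The sketch below assumes jira_fields.json is a dict keyed by field ID with a nested schema object; `detect_asset_fields` is a hypothetical name and the schema.custom string in the sample is made up.

```python
DEFAULT_KEYWORDS = (".insight", ".cmdb")


def detect_asset_fields(
    fields: dict[str, dict], keywords: tuple[str, ...] = DEFAULT_KEYWORDS
) -> list[str]:
    """Return the IDs of fields whose schema.custom contains any keyword."""
    found = []
    for field_id, meta in fields.items():
        custom = (meta.get("schema") or {}).get("custom") or ""
        if any(keyword in custom for keyword in keywords):
            found.append(field_id)
    return found


sample_fields = {
    "customfield_10200": {
        "name": "Domain",
        # Made-up plugin key, used only to exercise the ".insight" keyword.
        "schema": {"type": "array", "custom": "com.example.plugin.insight:object-field"},
    },
    "priority": {"name": "Priority", "schema": {"type": "priority"}},
}
print(detect_asset_fields(sample_fields))  # ['customfield_10200']
```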

Overriding the object type name

By default, AtlasMind uses the field display name as the AQL object type (e.g. field named "Domain" → objectType = "Domain"). This covers most cases. When the display name differs from the AQL object type, add an override to config/jira_assets_fields.json:

{
    "customfield_10200": {
        "display_name": "Domain",
        "object_type": "CustomerDomain"
    }
}

Entries in this file take precedence over auto-detected values. If no override file exists, auto-detection runs without it.

Discovering asset field IDs

uv run python -c "from jira.jira_assets_api import list_asset_fields; list_asset_fields()"

Prints all Assets-type fields detected in jira_fields.json with their field IDs and schema keys.

Response model

The /query endpoint returns a QueryResponse Pydantic model:

class QueryResponse(BaseModel):
    type:           str                        # "jql" or "general"
    profile:        str                        # active Jira profile name
    jira_base_url:  str
    jira_type:      str | None                 # effective search API: "cloud" or "server"
    answer:         str | None
    jql:            str | None                 # None for general queries
    total:          int                        # total matching issues in Jira
    shown:          int                        # issues returned in this response
    display_fields: list[str]                  # ordered column headers for the frontend
    issues:         list[dict]                 # normalised issue dicts
    chart_spec:     ChartSpec | None
    filters:        dict[str, list[str]] | None  # facet values for filter dropdowns
    meta:           ServerMeta | None          # model name, backend, timeout
    token_usage:    TokenUsage | None          # prompt token estimates for this query

jira_type reflects the search API actually used for this response: either the profile default or the per-request /cloud or /server override. The UI can use this to display which Jira API version was active, or to adapt behaviour for cloud vs server responses.

TokenUsage breaks down prompt size per query:

class TokenUsage(BaseModel):
    system_tokens:   int   # system prompt character count ÷ 4
    fields_tokens:   int   # field context block ÷ 4
    examples_tokens: int   # RAG examples block ÷ 4
    total_tokens:    int   # total prompt tokens for the initial LLM call
    retry_tokens:    int   # cumulative tokens added across all retry extensions

token_usage is present on every response including error responses, so the frontend can track cost even when a query fails.
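
The "character count ÷ 4" estimate from the TokenUsage comments can be reproduced as below. This is a rough sketch: `estimate_tokens` and `estimate_usage` are hypothetical names, and both the integer division and the assumption that total_tokens is the sum of the three blocks are guesses.

```python
def estimate_tokens(text: str) -> int:
    """Approximate the token count as the character count divided by 4."""
    return len(text) // 4


def estimate_usage(system: str, fields: str, examples: str) -> dict[str, int]:
    parts = {
        "system_tokens": estimate_tokens(system),
        "fields_tokens": estimate_tokens(fields),
        "examples_tokens": estimate_tokens(examples),
    }
    # Assumption: the total is the sum of the three blocks; no retries yet.
    return {**parts, "total_tokens": sum(parts.values()), "retry_tokens": 0}


usage = estimate_usage("s" * 4000, "f" * 2000, "e" * 1200)
print(usage)
```

The figures are estimates of prompt size; actual counts reported by a provider's API can differ.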

For general (non-Jira) questions, jql and chart_spec are None and answer contains the plain-text response.

For JQL queries, answer always includes a result-count suffix appended by the server after the Jira search completes. Examples: "Found 42 result(s).", "Found 500 result(s); showing 500." (when paginated), or "No results found."

Data files

JQL annotation file (data/jira_jql_annotated_queries.md)

Markdown file of annotated examples. Each pair is a /* comment */ line followed by its JQL on the next line; the pairs are used as few-shot examples:

/* open bugs assigned to me */
assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC

/* high priority tickets created this week */
priority = High AND created >= startOfWeek() ORDER BY created DESC
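
A file in this format can be split into (annotation, JQL) pairs with a short regular expression. The sketch below is one possible parser, not the one in rag/jql_embeddings.py; `parse_annotations` is a hypothetical name, and it assumes each comment fits on one line.

```python
import re

# A one-line /* comment */ followed by the next non-empty line (the JQL).
PAIR = re.compile(r"/\*\s*(.*?)\s*\*/\s*\n([^\n]+)")


def parse_annotations(text: str) -> list[tuple[str, str]]:
    """Return (annotation, JQL) pairs in file order."""
    return [(note, jql.strip()) for note, jql in PAIR.findall(text)]


SAMPLE = """\
/* open bugs assigned to me */
assignee = currentUser() AND issuetype = Bug AND status != Done ORDER BY created DESC

/* high priority tickets created this week */
priority = High AND created >= startOfWeek() ORDER BY created DESC
"""

for note, jql in parse_annotations(SAMPLE):
    print(f"{note!r} -> {jql!r}")
```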

Jira fields (data/{domain_slug}/jira_fields.json)

Fetched automatically on first run from /rest/api/2/field. Keyed by field ID. A companion jira_allowed_values.json is also fetched and merged in to enrich descriptions with discrete option lists (e.g. status values, issue types).

Jira Assets override config (config/jira_assets_fields.json)

Contains asset_field_keywords (keywords to detect Assets/Insight fields in schema.custom) and optional per-field object type overrides. See the Jira Assets section.

Jira Assets cache (data/{domain_slug}/jira_assets.json)

Written automatically at startup by the Assets auto-detect flow. Contains all object labels per detected asset field. Re-run refresh_asset_values() whenever asset objects change in Jira — the server detects the hash change and re-seeds the jira_asset_values table on next startup.

Running vLLM on a GPU system (GPU inference server)

AtlasMind on OCI A1 can offload all LLM inference to a local GPU system over Tailscale. Only vLLM needs to run on the GPU system — no database, no AtlasMind installation required there.

What runs where

| Machine | What runs |
| --- | --- |
| GPU system | vLLM only; serves the model over HTTP |
| OCI A1 (always-on) | AtlasMind + Postgres + Ollama (fallback) + frontend |

AtlasMind on OCI A1 sends prompts to vLLM on the GPU system over Tailscale. When the GPU system is off, AtlasMind falls back to its local Ollama automatically.
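
A startup-time fallback of this kind can be sketched as a reachability probe followed by a choice. This is an illustration, not AtlasMind's implementation: `vllm_reachable` and `choose_backend` are hypothetical names, the 3-second timeout is arbitrary, and probing /v1/models is an assumption based on vLLM's OpenAI-compatible API.

```python
import urllib.error
import urllib.request
from typing import Callable


def vllm_reachable(base_url: str, timeout: float = 3.0) -> bool:
    """Probe the OpenAI-compatible /v1/models endpoint that vLLM serves."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/v1/models", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


def choose_backend(
    vllm_url: str,
    fallback: str = "ollama",
    probe: Callable[[str], bool] = vllm_reachable,
) -> str:
    """Use vLLM when it answers, otherwise the configured VLLM_FALLBACK backend."""
    return "vllm" if vllm_url and probe(vllm_url) else fallback


# The probe is injectable so the decision can be exercised without a network.
print(choose_backend("http://gpu-host:8002", probe=lambda url: False))  # ollama
print(choose_backend("http://gpu-host:8002", probe=lambda url: True))   # vllm
```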

Step 1 — Install WSL2 (Windows only)

vLLM does not run natively on Windows. You need WSL2 with Ubuntu.

Open PowerShell as Administrator and run:

wsl --install

Restart when prompted. After restart, Ubuntu opens and asks you to create a username and password. This is your Linux environment — all remaining steps run inside WSL2.

To open WSL2 later: search for Ubuntu in the Start menu, or run wsl in any terminal.

Step 2 — Verify the GPU is visible in WSL2

The NVIDIA driver is automatically bridged from Windows into WSL2 — no separate CUDA toolkit installation needed. vLLM's pip package bundles the CUDA runtime libraries it needs.

Run inside WSL2:

nvidia-smi

You should see your GPU listed with driver version and VRAM. If this command fails, reinstall the latest NVIDIA driver on Windows first, then retry.

Step 3 — Install vLLM in a virtual environment

Ubuntu 24.04 does not allow system-wide pip installs. Use a virtual environment:

python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install vllm

After activation you will see (vllm-env) in your prompt. This download is large (~5 GB) — let it complete fully before continuing.

Always activate the environment before running vLLM in future sessions:

source ~/vllm-env/bin/activate

Step 4 — Choose and run a model

With 8 GB VRAM, use a quantized 7B model. AWQ quantization gives the best quality-to-size ratio and is natively supported by vLLM.

Recommended for AtlasMind (reliable structured JSON and JQL output):

vllm serve Qwen/Qwen2.5-Coder-7B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --port 8002 \
  --host 0.0.0.0

Qwen2.5-Coder is preferred over the general instruct variant because JQL is a query language (similar to SQL). The Coder model is trained on code and structured DSLs, making it more reliable at generating syntactically correct JQL and strictly following the JSON output format (jql, intent_fields, chart_spec, answer).

--gpu-memory-utilization 0.85 reserves 85% of VRAM for vLLM. The default is 0.9 (90%) which can exceed available VRAM on 8 GB cards due to Windows/WSL2 overhead. Lower to 0.80 if startup still fails.

--max-model-len 8192 caps the context window at 8192 tokens. The model's default (32768) requires more KV cache than fits in 8 GB after loading weights. 8192 is sufficient for AtlasMind — typical prompts (system prompt + RAG examples + query) are 1500–2500 tokens.

On first run, this downloads the model weights from HuggingFace (~4.5 GB). Subsequent runs load from the local cache. Wait until you see:

INFO:     Application startup complete.

The server is now listening on port 8002.

Alternative models (all fit in 8 GB VRAM with AWQ):

| Model | VRAM | Notes |
| --- | --- | --- |
| Qwen/Qwen2.5-7B-Instruct-AWQ | ~4.5 GB | General instruct, solid fallback |
| meta-llama/Llama-3.1-8B-Instruct-AWQ | ~5.5 GB | Strong reasoning, good alternative |

Step 5 — Verify the server is running

From WSL2, confirm the API responds:

curl http://localhost:8002/v1/models

You should see a JSON response listing the loaded model name.

Step 6 — Configure AtlasMind on OCI A1

On the OCI A1 machine, set the following environment variable before starting AtlasMind:

export VLLM_URL=http://<gpu-system-tailscale-ip>:8002

Replace <gpu-system-tailscale-ip> with the GPU system's Tailscale IP address (find it by running tailscale ip in PowerShell on the GPU system, or clicking the Tailscale tray icon).

Then start AtlasMind with the vLLM backend:

uv run python app.py --server --model vllm

AtlasMind auto-detects the loaded model from vLLM's /v1/models endpoint — no need to set the model name explicitly.
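
Auto-detection can be as simple as reading the first entry of the /v1/models response, which follows the OpenAI list format. The sketch below parses an already-fetched response; `first_model_id` is a hypothetical name, and picking the first listed model is an assumption about how the client chooses.

```python
def first_model_id(models_response: dict) -> str:
    """Return the id of the first model listed in a /v1/models response."""
    models = models_response.get("data") or []
    if not models:
        raise ValueError("vLLM reported no loaded models")
    return models[0]["id"]


# Shape of an OpenAI-compatible model list, trimmed to the relevant keys.
sample = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ", "object": "model"}],
}
print(first_model_id(sample))  # Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
```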

Keeping vLLM running across WSL2 sessions

WSL2 shuts down when you close the terminal. To keep vLLM running in the background:

source ~/vllm-env/bin/activate
nohup vllm serve Qwen/Qwen2.5-Coder-7B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --port 8002 \
  --host 0.0.0.0 > ~/vllm.log 2>&1 &

Logs go to ~/vllm.log. Check them with tail -f ~/vllm.log.


Setting up Tailscale for vLLM access

Tailscale creates a private network between your GPU system and OCI A1, so AtlasMind can reach vLLM securely without exposing any ports to the internet.

Step 1 — Install Tailscale on Windows

Download and install Tailscale from tailscale.com/download. Run the installer and sign in with your Tailscale account (Google, GitHub, or Microsoft login).

Once signed in, Tailscale assigns your Windows machine a private IP in the 100.x.x.x range. You will see the Tailscale icon in the system tray.

Step 2 — Configure WSL2 networking

Edit (or create) C:\Users\<username>\.wslconfig and add:

[wsl2]
networkingMode=mirrored
firewall=false

Restart WSL2 to apply:

wsl --shutdown

networkingMode=mirrored makes WSL2 share the Windows network stack directly — vLLM is reachable at the Windows machine's IP without any port proxy. firewall=false disables the WSL2 Hyper-V firewall layer, which otherwise blocks inbound connections independently of other firewall rules.

Step 3 — Configure the Windows firewall

Add an inbound allow rule for TCP port 8002.

If you use Windows Defender Firewall only:

  1. Press Win + R → type wf.msc → Enter
  2. Inbound Rules → New Rule → Port → TCP → 8002 → Allow the connection → All profiles → Finish

If you use a third-party firewall suite (e.g. Norton 360, McAfee):

Third-party firewall suites include their own firewall engine that runs alongside Windows Defender Firewall. Add the port 8002 allow rule in your firewall suite's settings — for Norton: Settings → Firewall → Traffic Rules → Add → Action: Allow, Direction: Inbound, Protocol: TCP, Local port: 8002, Profile: All.

Note: If inbound connections are still blocked after adding the rule, both firewall engines may be active simultaneously and conflicting. If your third-party suite is the intended firewall, disable Windows Defender Firewall so only one engine is enforcing rules. Run the following in PowerShell as Administrator:

Set-NetFirewallProfile -Profile Domain,Public,Private -Enabled False

This disables Windows Defender Firewall across all profiles. Your third-party firewall (Norton, McAfee, etc.) remains active.

Step 4 — Restarting vLLM after WSL2 shutdown

WSL2 resets completely on every shutdown (wsl --shutdown, PC restart, or closing the terminal) — all running processes including vLLM are killed. You must restart vLLM each time WSL2 comes back up.

To make this less tedious, add a shell alias to your ~/.bashrc:

echo "alias start-vllm='source ~/vllm-env/bin/activate && vllm serve Qwen/Qwen2.5-Coder-7B-Instruct-AWQ --quantization awq --gpu-memory-utilization 0.85 --max-model-len 8192 --port 8002 --host 0.0.0.0'" >> ~/.bashrc
source ~/.bashrc

Then to start vLLM in any future session:

start-vllm

Or in the background:

start-vllm > ~/vllm.log 2>&1 &

Then follow the logs:

tail -f ~/vllm.log

Step 5 — Find your Tailscale IP (on the GPU system)

In PowerShell on Windows, run:

tailscale ip

Or click the Tailscale system tray icon — your IP is shown at the top. It will look like 100.x.x.x.

Note this IP — you will set it as VLLM_URL on OCI A1.

Step 6 — Install Tailscale on OCI A1

On the OCI A1 instance, run:

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

Follow the authentication link printed in the terminal to connect OCI A1 to the same Tailscale account. Once authenticated, OCI A1 and your GPU system are on the same private network.

Step 7 — Verify connectivity

From OCI A1, confirm it can reach vLLM on the GPU system (replace with your actual Tailscale IP):

curl http://100.x.x.x:8002/v1/models

You should get back a JSON response listing the loaded model. If the request times out, check that:

  • vLLM is running in WSL2 with --host 0.0.0.0
  • .wslconfig has networkingMode=mirrored and firewall=false, and WSL2 was restarted after the change
  • Both machines show as Connected in the Tailscale admin console at login.tailscale.com
  • The firewall allow rule for port 8002 is in place (Step 3)
  • If using a third-party firewall suite, check whether both firewall engines are conflicting (see Step 3 note)

Step 8 — Configure AtlasMind

On OCI A1, set the Tailscale IP before starting the server:

export VLLM_URL=http://100.x.x.x:8002
uv run python app.py --server --model vllm

Running tests

uv run python -m pytest tests/ -v
