The most capable open-source desktop automation agent — 75.0% on OSWorld, surpassing human performance. Supports OpenAI GPT-5.4 and Anthropic Claude. Cross-platform. Production-ready.
Coming soon: MCP-first architecture with sandboxed code execution, plug & play model backends, and isolated environments — in active development.
Want to use OS AI without coding? Download the latest release for your platform:
Available for:
- macOS (Intel + Apple Silicon)
- Windows (x64)
- Linux (x64)
- Web
New to OS AI? Read the User Guide for installation and setup instructions.
Key Features:
- 🧠 Multi-provider AI — OpenAI GPT-5.4 and Anthropic Claude, switchable in Settings
- 🖥️ AI controls your desktop: clicks, types, scrolls, drags, takes screenshots
- 🔒 Secure API key storage in system keychain
- 💬 Chat-based interface with visual feedback
- 📊 Real-time cost tracking for both providers
- 🎨 Cross-platform Flutter UI (macOS, Windows, Linux, Web)
- 🖼️ Image upload and clipboard paste
- 💬 Multiple chat sessions with persistent history
- 🔄 Conversation context resume after app restart
| Provider | Model | Computer Use | Status |
|---|---|---|---|
| OpenAI | GPT-5.4 | Batched actions, `previous_response_id` continuity | Fully supported |
| Anthropic | Claude Sonnet 4.6 / Opus 4.6 | Single actions, zoom, full message history | Fully supported |
Switch providers in Settings — enter your API key and select the active provider from the dropdown.
- OS AI Computer Use
Local agent for desktop automation with multi-provider AI support. Currently supports OpenAI GPT-5.4 Computer Use and Anthropic Claude Computer Use. The LLM layer is abstracted behind LLMClient, making it easy to add new providers.
What this project is:
- A multi-provider Computer Use agent (OpenAI + Anthropic) with a stable tool interface
- An OS-agnostic execution layer using ports/drivers (macOS, Windows, and Linux)
- A CLI you can bundle into a single executable for local use
What it is not (yet):
- A remote SaaS; this is a local agent
Highlights:
- OpenAI GPT-5.4 with batched actions and `previous_response_id` for efficient multi-step workflows
- Anthropic Claude Sonnet 4.6 / Opus 4.6 with single-action precision, zoom support, and full message history
- Provider selection in UI Settings with per-provider API key management
- Smooth mouse movement, clicks, drag-and-drop with easing and timing controls
- Reliable keyboard input, hotkeys and hold sequences
- Screenshots (Quartz on macOS or PyAutoGUI fallback), on-disk saving and base64 tool_result
- Detailed logs and running cost estimation, per iteration and in total
- Multiple chats, image upload, persistent chat history with context resume
See provider architecture in docs/architecture-universal-llm.md, OS ports/drivers in docs/os-architecture.md, and packaging notes in docs/ci-packaging.md.
Requirements:
- macOS 13+ or Windows 10/11 or Linux (X11/XWayland)
- Python 3.12+
- API key for at least one provider:
  - OpenAI: `OPENAI_API_KEY` (for GPT-5.4 Computer Use)
  - Anthropic: `ANTHROPIC_API_KEY` (for Claude Computer Use)
Linux system dependencies (if applicable):
```bash
sudo apt-get install -y scrot gnome-screenshot xdotool xclip python3-tk
```

Install:

```bash
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install dependencies
make install
# (optional) install local packages in editable mode (mono-repo dev)
make dev-install
```

macOS permissions (for GUI automation):

```bash
make macos-perms  # opens System Settings → Privacy & Security panels
```

Grant permissions to Terminal/iTerm and your venv Python under: Accessibility, Input Monitoring, Screen Recording.
Requirements:
- macOS 13+ or Windows 10/11 or Linux (X11/XWayland); unit tests run on any OS, GUI tests on macOS or self-hosted Windows/Linux runners
- Python 3.12+
- API key: `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`
Install:

```bash
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install top-level dependencies
make install
```

macOS permissions (required for GUI automation):

```bash
# open System Settings → Privacy & Security panels
make macos-perms
```

Grant permissions to Terminal/iTerm and your venv Python under: Accessibility, Input Monitoring, Screen Recording.
Run the agent (CLI):
```bash
# With OpenAI GPT-5.4
export OPENAI_API_KEY=sk-...
python -m os_ai_cli --provider openai --task "Open Safari and search for 'AI news'"

# With Anthropic Claude
export ANTHROPIC_API_KEY=sk-ant-...
python -m os_ai_cli --provider anthropic --task "Open Safari and search for 'AI news'"

# Or use LLM_PROVIDER env var (default: anthropic)
export LLM_PROVIDER=openai
python -m os_ai_cli --task "Take a screenshot and describe what you see"
```

Examples:

```bash
# 1) OpenAI: Open browser, search, take screenshot
python -m os_ai_cli --provider openai --task "Open Chrome, search for 'computer use AI', open first result, scroll down and take a screenshot"

# 2) Anthropic: Copy/paste workflow
python -m os_ai_cli --provider anthropic --task "Open TextEdit, type 'Hello world!', select all and copy, create another document and paste"

# 3) Window management + hotkeys
python -m os_ai_cli --task "Open System Settings, search for 'Privacy', navigate to Privacy & Security"

# 4) Drag operations (OpenAI supports multi-point paths for drawing)
python -m os_ai_cli --provider openai --task "In Finder, open Downloads, switch to icon view, drag the first file to Desktop"
```

Useful make targets:
```bash
make install                   # install top-level dependencies
make test                      # unit tests
RUN_CURSOR_TESTS=1 make itest  # GUI integration tests (macOS; requires permissions)
make itest-local-keyboard      # run keyboard harness
make itest-local-click         # run click/drag harness
```

For development with backend + frontend (Flutter UI):
```bash
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install Python dependencies
make install
# install local packages in editable mode for mono-repo dev
make dev-install
```

```bash
# Set API key for your provider
export OPENAI_API_KEY=sk-...           # for OpenAI
# export ANTHROPIC_API_KEY=sk-ant-...  # for Anthropic

# Select default provider (optional, can also set in Flutter UI Settings)
export LLM_PROVIDER=openai

# (optional) enable debug mode
export OS_AI_BACKEND_DEBUG=1

# Start backend on 127.0.0.1:8765
os-ai-backend

# Or run directly via Python module
# python -m os_ai_backend
```

Backend environment variables (optional):
- `LLM_PROVIDER` - default AI provider: `openai` or `anthropic` (default: `anthropic`)
- `OS_AI_BACKEND_HOST` - host address (default: `127.0.0.1`)
- `OS_AI_BACKEND_PORT` - port number (default: `8765`)
- `OS_AI_BACKEND_DEBUG` - enable debug logging (default: `0`)
- `OS_AI_BACKEND_TOKEN` - authentication token (optional)
- `OS_AI_BACKEND_CORS_ORIGINS` - allowed CORS origins (default: `http://localhost,http://127.0.0.1`)
Backend endpoints:
- `GET /healthz` - health check
- `WS /ws` - WebSocket for JSON-RPC commands
- `POST /v1/files` - file upload
- `GET /v1/files/{file_id}` - file download
- `GET /metrics` - metrics snapshot
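A minimal sketch of framing a command for the WebSocket endpoint. The JSON-RPC envelope follows the standard 2.0 shape; the method name `agent.run` and its parameters are hypothetical placeholders, not the backend's real command names:

```python
import json
from itertools import count

_next_id = count(1)


def jsonrpc_request(method: str, params: dict) -> str:
    """Frame a JSON-RPC 2.0 request for WS /ws with a monotonically increasing id."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": next(_next_id), "method": method, "params": params}
    )


# Hypothetical method/params - consult the backend source for the real command names.
msg = jsonrpc_request("agent.run", {"task": "Take a screenshot", "provider": "openai"})
```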
```bash
cd frontend_flutter

# Install Flutter dependencies
flutter pub get

# Run on macOS
flutter run -d macos

# Or run on other platforms
# flutter run -d chrome   # web
# flutter run -d windows  # Windows
# flutter run -d linux    # Linux
```

Frontend config (in code):
- Default backend WebSocket: `ws://127.0.0.1:8765/ws`
- Default REST base: `http://127.0.0.1:8765`
See frontend_flutter/README.md for more details on the Flutter app architecture and features.
The project uses a provider-agnostic architecture:
```
llm/              ← Domain types (Message, ToolCall, LLMClient interface)
llm_anthropic/    ← Anthropic adapter (Claude, Messages API)
llm_openai/       ← OpenAI adapter (GPT-5.4, Responses API)
core/             ← Application logic (Orchestrator, ToolRegistry)
backend/          ← Interface adapter (WebSocket/REST)
cli/              ← Interface adapter (CLI)
frontend_flutter/ ← Presentation layer (Flutter UI)
```
Adding a new provider requires only creating a new llm_<provider>/ package that implements the LLMClient interface — no changes to core, backend, or frontend needed.
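To illustrate the extension point, here is a toy adapter satisfying a protocol of the same shape. The method signature and `Message` fields below are assumptions for the sketch; the real interface lives in `llm/`:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    role: str  # "user" | "assistant" | "tool"
    content: str


class LLMClient(Protocol):
    """Assumed shape of the provider interface; see llm/ for the real one."""

    def complete(self, messages: list[Message]) -> Message: ...


class EchoClient:
    """Toy llm_<provider> adapter: satisfies the protocol without network calls."""

    def complete(self, messages: list[Message]) -> Message:
        return Message(role="assistant", content=f"echo: {messages[-1].content}")


client: LLMClient = EchoClient()
reply = client.complete([Message(role="user", content="hi")])
```

Because the orchestrator only sees the protocol, a real adapter swaps the echo body for Anthropic or OpenAI API calls without touching core, backend, or frontend.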
Key design decisions:
- ProviderPart — typed content blocks for provider-specific data (replaces text-based markers)
- provider_context — opaque state passed between iterations (e.g., OpenAI's `previous_response_id`)
- ToolCall.metadata — internal routing separated from clean action data
- Batch handler — unified entry point for single (Anthropic) and batched (OpenAI) actions
See docs/architecture-universal-llm.md for details.
| | OpenAI GPT-5.4 | Anthropic Claude Sonnet 4.6 | Anthropic Claude Opus 4.6 |
|---|---|---|---|
| OSWorld (desktop tasks) | 75.0% | 72.5% | 72.7% |
| SWE-Bench Verified (coding) | ~80% | — | 80.8% |
| Input price (per 1M tokens) | $2.50 | $3.00 | $5.00 |
| Output price (per 1M tokens) | $15.00 | $15.00 | $25.00 |
| Context window | 1.05M | 1M | 1M |
| Actions per call | Batched (multiple) | Single | Single |
| Maturity | 1st generation | 18 months iterated | 18 months iterated |
TL;DR: GPT-5.4 leads on desktop automation benchmarks and is cheaper for heavy use. Claude Sonnet 4.6 is the best value at similar quality. Claude Opus 4.6 excels at complex coding tasks.
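Using the prices in the table, a quick back-of-the-envelope comparison (plain arithmetic, not the project's cost tracker):

```python
# (input $/1M tokens, output $/1M tokens) from the comparison table above
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# A screenshot-heavy session: 2M input tokens, 50k output tokens
gpt = run_cost("gpt-5.4", 2_000_000, 50_000)               # 5.00 + 0.75 = 5.75
sonnet = run_cost("claude-sonnet-4.6", 2_000_000, 50_000)  # 6.00 + 0.75 = 6.75
```

Screenshot-heavy automation is input-dominated, which is why the input rate drives the "cheaper for heavy use" claim above.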
- Multi-provider AI: OpenAI GPT-5.4 (batched actions) and Anthropic Claude Sonnet/Opus 4.6 (single actions, zoom)
- Smooth mouse motion: easing, distance-based durations
- Clicks with modifiers: `modifiers: "cmd+shift"` for click/down/up
- Drag control: multi-point paths for drawing, `hold_before_ms`, `hold_after_ms`, `steps`
- Keyboard input: `key`, `hold_key`; cross-platform key mapping (cmd/ctrl/win/alt/option)
- Screenshots: Quartz (macOS), scrot (Linux), or PyAutoGUI fallback; optional downscale for model display
- Logging and cost: per-iteration and total usage/cost with retry logic
- Provider-aware cost estimation (GPT-5.4, Claude Sonnet 4.6, Opus 4.6, o4-mini, Haiku)
- OS-agnostic execution: core depends only on OS ports; drivers are loaded per OS (see `docs/os-architecture.md`).
- macOS (supported):
  - Full driver set with overlay (AppKit), robust Enter (Quartz), screenshots (Quartz/PyAutoGUI), sounds (NSSound).
  - Integration tests available; requires Accessibility, Input Monitoring, Screen Recording.
  - Single-file CLI bundle via `make build-macos-bundle`.
- Windows (implemented, not yet integration-tested):
  - Drivers for mouse/keyboard/screen via PyAutoGUI; overlay/sound are no-ops baseline.
  - Unit contract tests exist; for GUI tests use a self-hosted Windows runner (see `docs/windows-integration-testing.md`).
  - Single-file CLI bundle via `make build-windows-bundle` (build on Windows).
- Linux (supported, X11):
  - Drivers for mouse/keyboard/screen via PyAutoGUI (X11 backend); overlay/sound are no-ops.
  - Requires X11 display (XWayland works). Pure Wayland without XWayland is not yet supported.
  - System dependencies: `scrot` or `gnome-screenshot` (screenshots), `xdotool`, `xclip` (clipboard), `python3-tk`. For system tray: `python3-gi`, `gir1.2-appindicator3-0.1` (optional — app runs without tray if unavailable).
  - Unit contract tests and CI with xvfb. Single-file bundle via PyInstaller.
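The per-OS driver loading described above can be sketched as a platform switch (the package names below are hypothetical, not the project's real module paths; see `docs/os-architecture.md` for the actual layout):

```python
import sys


def driver_package() -> str:
    """Pick a driver package by platform, as a ports/drivers layer might.
    Package names are illustrative only."""
    if sys.platform == "darwin":
        return "os_ai_drivers.macos"    # Quartz/AppKit-backed drivers
    if sys.platform.startswith("win"):
        return "os_ai_drivers.windows"  # PyAutoGUI-backed; overlay/sound no-ops
    if sys.platform.startswith("linux"):
        return "os_ai_drivers.linux"    # PyAutoGUI over X11/XWayland
    raise RuntimeError(f"unsupported platform: {sys.platform}")


pkg = driver_package()
```

Because core code only imports through the ports, swapping the driver package is the only platform-specific step.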
Key options (partial list):
- Coordinates/calibration: `COORD_X_SCALE`, `COORD_Y_SCALE`, `COORD_X_OFFSET`, `COORD_Y_OFFSET`
- Post-move correction: `POST_MOVE_VERIFY`, `POST_MOVE_TOLERANCE_PX`, `POST_MOVE_CORRECTION_DURATION`
- Screenshots: `SCREENSHOT_MODE` (native|downscale); `VIRTUAL_DISPLAY_ENABLED`, `VIRTUAL_DISPLAY_WIDTH_PX`, `VIRTUAL_DISPLAY_HEIGHT_PX`; `SCREENSHOT_FORMAT` (PNG|JPEG), `SCREENSHOT_JPEG_QUALITY`
- Overlay: `PREMOVE_HIGHLIGHT_ENABLED`, `PREMOVE_HIGHLIGHT_DEFAULT_DURATION`, `PREMOVE_HIGHLIGHT_RADIUS`, colors
- Model/tool: `MODEL_NAME`, `COMPUTER_TOOL_TYPE`, `COMPUTER_BETA_FLAG`, `MAX_TOKENS`, `ALLOW_PARALLEL_TOOL_USE`
See the config file for the full list and inline comments.
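The calibration options suggest a linear map from model-space coordinates to screen pixels; a sketch under that assumption (the exact formula the agent uses is not confirmed here):

```python
def to_screen(x: float, y: float,
              x_scale: float = 1.0, y_scale: float = 1.0,
              x_offset: float = 0.0, y_offset: float = 0.0) -> tuple[int, int]:
    """Apply COORD_*_SCALE / COORD_*_OFFSET style calibration to a model coordinate."""
    return round(x * x_scale + x_offset), round(y * y_scale + y_offset)


# e.g. a 2x Retina scale plus a 10 px menu-bar offset on the y axis
pt = to_screen(400, 300, x_scale=2.0, y_scale=2.0, y_offset=10)  # (800, 610)
```

`POST_MOVE_VERIFY` then checks the cursor landed within `POST_MOVE_TOLERANCE_PX` of the target and corrects if not.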
The agent expects blocks with action and parameters:
- Mouse movement

```json
{"action":"mouse_move","coordinate":[x,y],"coordinate_space":"auto|screen|model","duration":0.35,"tween":"linear"}
```

- Clicks

```json
{"action":"left_click","coordinate":[x,y],"modifiers":"cmd+shift"}
```

- Key press / hold

```json
{"action":"key","key":"cmd+l"}
{"action":"hold_key","key":"ctrl+shift+t"}
```

- Drag-and-drop (supports multi-point paths)

```json
{
  "action":"left_click_drag",
  "start":[x1,y1],
  "end":[x2,y2],
  "path":[[x1,y1],[x2,y2],[x3,y3]],
  "modifiers":"shift",
  "hold_before_ms":80,
  "hold_after_ms":80,
  "steps":4,
  "step_delay":0.02
}
```

- Scroll

```json
{"action":"scroll","coordinate":[x,y],"scroll_direction":"down|up|left|right","scroll_amount":3}
```

- Typing

```json
{"action":"type","text":"Hello, world!"}
```

- Screenshot

```json
{"action":"screenshot"}
```

Responses are returned as a list of `tool_result` content blocks (text/image). Screenshots are base64-encoded.
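A minimal dispatcher over such action blocks might look like this (the handlers are stubs that only describe the action; the real executor lives in `core/`):

```python
def execute(block: dict) -> str:
    """Route one action block by its "action" field; stubs stand in for drivers."""
    action = block.get("action")
    if action == "mouse_move":
        x, y = block["coordinate"]
        return f"move to ({x}, {y})"
    if action == "type":
        return f"type {len(block['text'])} chars"
    if action == "screenshot":
        return "screenshot (base64 in tool_result)"
    raise ValueError(f"unknown action: {action}")


result = execute({"action": "mouse_move", "coordinate": [100, 200]})
```

Unknown actions fail loudly rather than silently no-op, which keeps provider/tool-schema drift visible in the logs.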
Unit tests (no real GUI):
```bash
make test
```

Integration (real OS tests, macOS; Windows via self-hosted runner):

```bash
export RUN_CURSOR_TESTS=1
make itest
```

If macOS blocks automation, tests are skipped. Grant permissions with `make macos-perms` and retry.
Windows integration testing options are described in docs/windows-integration-testing.md.
Recommended setup: Flutter as pure UI, local Python service:
- Transport: WebSocket + JSON-RPC for chat/commands, REST for files
- Streams: screenshots (JPEG/PNG), logs, events
- Example notes: `docs/flutter.md`
To run backend + frontend in development mode, see the Development Mode section above.
Note: project code and docs use English.
- Fork → feature branch → PR
- Code style: readable, explicit names, avoid deep nesting
- Tests: add unit tests and integration tests when applicable
- Before PR:
```bash
make test
RUN_CURSOR_TESTS=1 make itest  # optional if GUI interactions changed
```

- Commit messages: clear and atomic
Architecture, packaging and testing docs:
- OS Ports & Drivers: `docs/os-architecture.md`
- Packaging & CI: `docs/ci-packaging.md`
- Windows integration testing: `docs/windows-integration-testing.md`
- Code style: `CODE_STYLE.md`
- Contributing: `CONTRIBUTING.md`
Packaging (single executable bundles):
- macOS: `make build-macos-bundle` → `dist/agent_core/agent_core`
- Windows: `make build-windows-bundle` → `dist/agent_core/agent_core.exe`
Apache License 2.0. Preserve NOTICE when distributing.
- See `LICENSE` and `NOTICE` at repository root.
- Cursor/keyboard don't work (macOS): grant permissions in System Settings → Privacy & Security (Accessibility, Input Monitoring, Screen Recording) for Terminal and current Python.
- Linux "no display" error: ensure X11 is running (`echo $DISPLAY`). Under Wayland, XWayland must be enabled. Install deps: `sudo apt-get install scrot gnome-screenshot xdotool xclip python3-tk`.
- Integration tests skipped: restart the terminal and ensure the same interpreter (`which python`, `python -c 'import sys; print(sys.executable)'`).
- Screenshots empty/missing overlay: enable Screen Recording; check screenshot mode settings.
Issues/PR in this repository. Attribution is listed in NOTICE.