OS AI Computer Use

The most capable open-source desktop automation agent — 75.0% on OSWorld, surpassing human performance. Supports OpenAI GPT-5.4 and Anthropic Claude. Cross-platform. Production-ready.

Coming soon: an MCP-first architecture with sandboxed code execution, plug-and-play model backends, and isolated environments (in active development).

For End Users

Want to use OS AI without coding? Download the latest release for your platform:

Download Latest Release

Available for:

  • macOS (Intel + Apple Silicon)
  • Windows (x64)
  • Linux (x64)
  • Web

New to OS AI? Read the User Guide for installation and setup instructions.

Key Features:

  • 🧠 Multi-provider AI — OpenAI GPT-5.4 and Anthropic Claude, switchable in Settings
  • 🖥️ AI controls your desktop: clicks, types, scrolls, drags, takes screenshots
  • 🔒 Secure API key storage in system keychain
  • 💬 Chat-based interface with visual feedback
  • 📊 Real-time cost tracking for both providers
  • 🎨 Cross-platform Flutter UI (macOS, Windows, Linux, Web)
  • 🖼️ Image upload and clipboard paste
  • 💬 Multiple chat sessions with persistent history
  • 🔄 Conversation context resume after app restart

Supported AI Providers

| Provider | Model | Computer Use | Status |
|---|---|---|---|
| OpenAI | GPT-5.4 | Batched actions, previous_response_id continuity | Fully supported |
| Anthropic | Claude Sonnet 4.6 / Opus 4.6 | Single actions, zoom, full message history | Fully supported |

Switch providers in Settings — enter your API key and select the active provider from the dropdown.


For Developers

Local agent for desktop automation with multi-provider AI support. Currently supports OpenAI GPT-5.4 Computer Use and Anthropic Claude Computer Use. The LLM layer is abstracted behind LLMClient, making it easy to add new providers.

What this project is:

  • A multi-provider Computer Use agent (OpenAI + Anthropic) with a stable tool interface
  • An OS-agnostic execution layer using ports/drivers (macOS, Windows, and Linux)
  • A CLI you can bundle into a single executable for local use

What it is not (yet):

  • A remote SaaS; this is a local agent

Highlights:

  • OpenAI GPT-5.4 with batched actions and previous_response_id for efficient multi-step workflows
  • Anthropic Claude Sonnet 4.6 / Opus 4.6 with single-action precision, zoom support, and full message history
  • Provider selection in UI Settings with per-provider API key management
  • Smooth mouse movement, clicks, drag-and-drop with easing and timing controls
  • Reliable keyboard input, hotkeys and hold sequences
  • Screenshots (Quartz on macOS, PyAutoGUI fallback), saved to disk and returned as base64 tool_result
  • Detailed logs and running cost estimation per iteration and total
  • Multiple chats, image upload, persistent chat history with context resume

See provider architecture in docs/architecture-universal-llm.md, OS ports/drivers in docs/os-architecture.md, and packaging notes in docs/ci-packaging.md.

Installation & Setup

Requirements:

  • macOS 13+ or Windows 10/11 or Linux (X11/XWayland)
  • Python 3.12+
  • API key for at least one provider:
    • OpenAI: OPENAI_API_KEY (for GPT-5.4 Computer Use)
    • Anthropic: ANTHROPIC_API_KEY (for Claude Computer Use)

Linux system dependencies (if applicable):

sudo apt-get install -y scrot gnome-screenshot xdotool xclip python3-tk

Install:

# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate

# install dependencies
make install

# (optional) install local packages in editable mode (mono-repo dev)
make dev-install

macOS permissions (for GUI automation):

make macos-perms  # opens System Settings → Privacy & Security panels

Grant permissions to Terminal/iTerm and your venv Python under: Accessibility, Input Monitoring, Screen Recording.


Quick start

Requirements, installation, and macOS permissions are covered in Installation & Setup above; unit tests run on any OS, while GUI integration tests need macOS or a self-hosted Windows/Linux runner (see Tests).

Run the agent (CLI):

# With OpenAI GPT-5.4
export OPENAI_API_KEY=sk-...
python -m os_ai_cli --provider openai --task "Open Safari and search for 'AI news'"

# With Anthropic Claude
export ANTHROPIC_API_KEY=sk-ant-...
python -m os_ai_cli --provider anthropic --task "Open Safari and search for 'AI news'"

# Or use LLM_PROVIDER env var (default: anthropic)
export LLM_PROVIDER=openai
python -m os_ai_cli --task "Take a screenshot and describe what you see"

CLI Examples

# 1) OpenAI: Open browser, search, take screenshot
python -m os_ai_cli --provider openai --task "Open Chrome, search for 'computer use AI', open first result, scroll down and take a screenshot"

# 2) Anthropic: Copy/paste workflow
python -m os_ai_cli --provider anthropic --task "Open TextEdit, type 'Hello world!', select all and copy, create another document and paste"

# 3) Window management + hotkeys
python -m os_ai_cli --task "Open System Settings, search for 'Privacy', navigate to Privacy & Security"

# 4) Drag operations (OpenAI supports multi-point paths for drawing)
python -m os_ai_cli --provider openai --task "In Finder, open Downloads, switch to icon view, drag the first file to Desktop"

Useful make targets:

make install                     # install top-level dependencies
make test                        # unit tests
RUN_CURSOR_TESTS=1 make itest    # GUI integration tests (macOS; requires permissions)
make itest-local-keyboard        # run keyboard harness
make itest-local-click           # run click/drag harness

Development Mode

For development with backend + frontend (Flutter UI):

1. Install dependencies

# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate

# install Python dependencies
make install

# install local packages in editable mode for mono-repo dev
make dev-install

2. Start the backend

# Set API key for your provider
export OPENAI_API_KEY=sk-...        # for OpenAI
# export ANTHROPIC_API_KEY=sk-ant-... # for Anthropic

# Select default provider (optional, can also set in Flutter UI Settings)
export LLM_PROVIDER=openai

# (optional) enable debug mode
export OS_AI_BACKEND_DEBUG=1

# Start backend on 127.0.0.1:8765
os-ai-backend

# Or run directly via Python module
# python -m os_ai_backend

Backend environment variables (optional):

  • LLM_PROVIDER - default AI provider: openai or anthropic (default: anthropic)
  • OS_AI_BACKEND_HOST - host address (default: 127.0.0.1)
  • OS_AI_BACKEND_PORT - port number (default: 8765)
  • OS_AI_BACKEND_DEBUG - enable debug logging (default: 0)
  • OS_AI_BACKEND_TOKEN - authentication token (optional)
  • OS_AI_BACKEND_CORS_ORIGINS - allowed CORS origins (default: http://localhost,http://127.0.0.1)

Backend endpoints:

  • GET /healthz - health check
  • WS /ws - WebSocket for JSON-RPC commands
  • POST /v1/files - file upload
  • GET /v1/files/{file_id} - file download
  • GET /metrics - metrics snapshot
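
A quick way to smoke-test a running backend is to probe /healthz and open the WebSocket. This is a hypothetical client sketch: the endpoint paths come from the list above, but the JSON-RPC method name "ping" and the websockets dependency are assumptions, not part of this repo.

import asyncio
import json
import urllib.request

import websockets  # third-party: pip install websockets

BASE = "http://127.0.0.1:8765"

# REST health check (GET /healthz)
with urllib.request.urlopen(f"{BASE}/healthz", timeout=5) as resp:
    print("healthz:", resp.status)

# WebSocket JSON-RPC round-trip (WS /ws); "ping" is an assumed method name
async def rpc_roundtrip() -> None:
    async with websockets.connect("ws://127.0.0.1:8765/ws") as ws:
        await ws.send(json.dumps({"jsonrpc": "2.0", "id": 1, "method": "ping"}))
        print(await ws.recv())

asyncio.run(rpc_roundtrip())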

3. Start the frontend (in a new terminal)

cd frontend_flutter

# Install Flutter dependencies
flutter pub get

# Run on macOS
flutter run -d macos

# Or run on other platforms
# flutter run -d chrome   # web
# flutter run -d windows  # Windows
# flutter run -d linux    # Linux

Frontend config (in code):

  • Default backend WebSocket: ws://127.0.0.1:8765/ws
  • Default REST base: http://127.0.0.1:8765

See frontend_flutter/README.md for more details on the Flutter app architecture and features.


Architecture

The project uses a provider-agnostic architecture:

llm/              ← Domain types (Message, ToolCall, LLMClient interface)
llm_anthropic/    ← Anthropic adapter (Claude, Messages API)
llm_openai/       ← OpenAI adapter (GPT-5.4, Responses API)
core/             ← Application logic (Orchestrator, ToolRegistry)
backend/          ← Interface adapter (WebSocket/REST)
cli/              ← Interface adapter (CLI)
frontend_flutter/ ← Presentation layer (Flutter UI)

Adding a new provider requires only creating a new llm_<provider>/ package that implements the LLMClient interface — no changes to core, backend, or frontend needed.
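
As a concrete illustration of that extension point, here is a hedged sketch of what a new provider package might look like. The import path and the complete() signature are assumptions for illustration; the real LLMClient interface is defined in the llm/ package.

from llm import LLMClient, Message  # domain types; import path assumed

class ExampleClient(LLMClient):
    """Hypothetical adapter for a new provider (llm_example/)."""

    def __init__(self, api_key: str, model: str = "example-model-1"):
        self.api_key = api_key
        self.model = model

    def complete(self, messages: list[Message], tools: list[dict]) -> Message:
        # 1. Translate domain Messages into the provider's wire format.
        # 2. Call the provider's API with self.api_key / self.model.
        # 3. Map the response back into Message / ToolCall domain types.
        raise NotImplementedError("wire up the provider SDK here")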

Key design decisions:

  • ProviderPart — typed content blocks for provider-specific data (replaces text-based markers)
  • provider_context — opaque state passed between iterations (e.g., OpenAI's previous_response_id)
  • ToolCall.metadata — internal routing separated from clean action data
  • Batch handler — unified entry point for single (Anthropic) and batched (OpenAI) actions

See docs/architecture-universal-llm.md for details.
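
To make the first two decisions concrete, a rough shape (field names are hypothetical; the real definitions are in the architecture doc):

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ProviderPart:
    """Typed content block carrying provider-specific payloads."""
    provider: str                  # "openai" or "anthropic"
    kind: str                      # e.g. "reasoning"
    payload: dict[str, Any] = field(default_factory=dict)

# provider_context is an opaque dict threaded between iterations,
# e.g. {"previous_response_id": "resp_..."} for OpenAI continuity.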


Provider Comparison (March 2026)

| | OpenAI GPT-5.4 | Anthropic Claude Sonnet 4.6 | Anthropic Claude Opus 4.6 |
|---|---|---|---|
| OSWorld (desktop tasks) | 75.0% | 72.5% | 72.7% |
| SWE-Bench Verified (coding) | ~80% | 80.8% | |
| Input price (per 1M tokens) | $2.50 | $3.00 | $5.00 |
| Output price (per 1M tokens) | $15.00 | $15.00 | $25.00 |
| Context window | 1.05M | 1M | 1M |
| Actions per call | Batched (multiple) | Single | Single |
| Maturity | 1st generation | 18 months of iteration | 18 months of iteration |

TL;DR: GPT-5.4 leads on desktop automation benchmarks and is cheaper for heavy use. Claude Sonnet 4.6 is the best value at similar quality. Claude Opus 4.6 excels at complex coding tasks.
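
The prices above make back-of-envelope cost math straightforward. A minimal estimator using the table's per-1M-token rates (the model keys are illustrative labels, not API identifiers):

# Prices per 1M tokens (input, output) from the comparison table
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. a 50k-input / 2k-output iteration on GPT-5.4 costs about $0.155
print(estimate_cost("gpt-5.4", 50_000, 2_000))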


Features

  • Multi-provider AI: OpenAI GPT-5.4 (batched actions) and Anthropic Claude Sonnet/Opus 4.6 (single actions, zoom)
  • Smooth mouse motion: easing, distance-based durations
  • Clicks with modifiers: modifiers: "cmd+shift" for click/down/up
  • Drag control: multi-point paths for drawing, hold_before_ms, hold_after_ms, steps
  • Keyboard input: key, hold_key; cross-platform key mapping (cmd/ctrl/win/alt/option)
  • Screenshots: Quartz (macOS), scrot (Linux), or PyAutoGUI fallback; optional downscale for model display
  • Logging and cost: per-iteration and total usage/cost with retry logic
  • Provider-aware cost estimation (GPT-5.4, Claude Sonnet 4.6, Opus 4.6, o4-mini, Haiku)

Supported Platforms

  • OS-agnostic execution: core depends only on OS ports; drivers are loaded per OS (see docs/os-architecture.md; a minimal port sketch follows this list).
  • macOS (supported):
    • Full driver set with overlay (AppKit), robust Enter (Quartz), screenshots (Quartz/PyAutoGUI), sounds (NSSound).
    • Integration tests available; requires Accessibility, Input Monitoring, Screen Recording.
    • Single-file CLI bundle via make build-macos-bundle.
  • Windows (implemented, not yet integration-tested):
    • Drivers for mouse/keyboard/screen via PyAutoGUI; overlay/sound are no-ops baseline.
    • Unit contract tests exist; for GUI tests use a self-hosted Windows runner (see docs/windows-integration-testing.md).
    • Single-file CLI bundle via make build-windows-bundle (build on Windows).
  • Linux (supported, X11):
    • Drivers for mouse/keyboard/screen via PyAutoGUI (X11 backend); overlay/sound are no-ops.
    • Requires X11 display (XWayland works). Pure Wayland without XWayland is not yet supported.
    • System dependencies: scrot or gnome-screenshot (screenshots), xdotool, xclip (clipboard), python3-tk. For system tray: python3-gi, gir1.2-appindicator3-0.1 (optional — app runs without tray if unavailable).
    • Unit contract tests and CI with xvfb. Single-file bundle via PyInstaller.

Configuration (config/settings.py)

Key options (partial list):

  • Coordinates/calibration
    • COORD_X_SCALE, COORD_Y_SCALE, COORD_X_OFFSET, COORD_Y_OFFSET
    • Post-move correction: POST_MOVE_VERIFY, POST_MOVE_TOLERANCE_PX, POST_MOVE_CORRECTION_DURATION
  • Screenshots
    • SCREENSHOT_MODE (native|downscale)
    • VIRTUAL_DISPLAY_ENABLED, VIRTUAL_DISPLAY_WIDTH_PX, VIRTUAL_DISPLAY_HEIGHT_PX
    • SCREENSHOT_FORMAT (PNG|JPEG), SCREENSHOT_JPEG_QUALITY
  • Overlay
    • PREMOVE_HIGHLIGHT_ENABLED, PREMOVE_HIGHLIGHT_DEFAULT_DURATION, PREMOVE_HIGHLIGHT_RADIUS, colors
  • Model/tool
    • MODEL_NAME, COMPUTER_TOOL_TYPE, COMPUTER_BETA_FLAG, MAX_TOKENS
    • ALLOW_PARALLEL_TOOL_USE

See file for full list and comments.
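
For orientation, a hypothetical excerpt of config/settings.py; the option names come from the list above, but the values here are illustrative, not the project's actual defaults:

# Coordinates / calibration
COORD_X_SCALE = 1.0
COORD_Y_SCALE = 1.0
POST_MOVE_VERIFY = True
POST_MOVE_TOLERANCE_PX = 2

# Screenshots
SCREENSHOT_MODE = "downscale"     # "native" or "downscale"
SCREENSHOT_FORMAT = "PNG"         # "PNG" or "JPEG"
SCREENSHOT_JPEG_QUALITY = 85

# Model / tool
MAX_TOKENS = 4096
ALLOW_PARALLEL_TOOL_USE = False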


Tool input (API)

The agent expects blocks with action and parameters:

  • Mouse movement
{"action":"mouse_move","coordinate":[x,y],"coordinate_space":"auto|screen|model","duration":0.35,"tween":"linear"}
  • Clicks
{"action":"left_click","coordinate":[x,y],"modifiers":"cmd+shift"}
  • Key press / hold
{"action":"key","key":"cmd+l"}
{"action":"hold_key","key":"ctrl+shift+t"}
  • Drag-and-drop (supports multi-point paths)
{
  "action":"left_click_drag",
  "start":[x1,y1],
  "end":[x2,y2],
  "path":[[x1,y1],[x2,y2],[x3,y3]],
  "modifiers":"shift",
  "hold_before_ms":80,
  "hold_after_ms":80,
  "steps":4,
  "step_delay":0.02
}
  • Scroll
{"action":"scroll","coordinate":[x,y],"scroll_direction":"down|up|left|right","scroll_amount":3}
  • Typing
{"action":"type","text":"Hello, world!"}
  • Screenshot
{"action":"screenshot"}

Responses are returned as a list of tool_result content blocks (text/image). Screenshots are base64-encoded.
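
To illustrate how such blocks map onto concrete input events, here is a hedged dispatcher sketch built on PyAutoGUI (the library the fallback drivers use). The function is hypothetical; the real agent routes actions through its OS ports/drivers, adding easing, calibration, and overlays on top.

import pyautogui

def execute(block: dict) -> None:
    """Dispatch one tool-input block (illustrative subset of actions)."""
    action = block["action"]
    if action == "mouse_move":
        x, y = block["coordinate"]
        pyautogui.moveTo(x, y, duration=block.get("duration", 0.35))
    elif action == "left_click":
        x, y = block["coordinate"]
        pyautogui.click(x, y)
    elif action == "type":
        pyautogui.write(block["text"])
    elif action == "scroll":
        sign = -1 if block["scroll_direction"] == "down" else 1
        pyautogui.scroll(sign * block["scroll_amount"])
    elif action == "screenshot":
        pyautogui.screenshot("screenshot.png")
    else:
        raise ValueError(f"unsupported action: {action}")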


Tests

Unit tests (no real GUI):

make test

Integration (real OS tests, macOS; Windows via self-hosted runner):

export RUN_CURSOR_TESTS=1
make itest

If macOS blocks automation, tests are skipped. Grant permissions with make macos-perms and retry.

Windows integration testing options are described in docs/windows-integration-testing.md.


Flutter integration

Recommended setup: Flutter as a pure UI over a local Python service:

  • Transport: WebSocket + JSON-RPC for chat/commands, REST for files
  • Streams: screenshots (JPEG/PNG), logs, events
  • Example notes: docs/flutter.md

To run backend + frontend in development mode, see the Development Mode section above.

Note: project code and docs use English.


Contributing

  • Fork → feature branch → PR
  • Code style: readable, explicit names, avoid deep nesting
  • Tests: add unit tests and integration tests when applicable
  • Before PR:
make test
RUN_CURSOR_TESTS=1 make itest   # optional if GUI interactions changed
  • Commit messages: clear and atomic

Architecture, packaging and testing docs:

  • OS Ports & Drivers: docs/os-architecture.md
  • Packaging & CI: docs/ci-packaging.md
  • Windows integration testing: docs/windows-integration-testing.md
  • Code style: CODE_STYLE.md
  • Contributing: CONTRIBUTING.md

Packaging (single executable bundles):

  • macOS: make build-macos-bundle → dist/agent_core/agent_core
  • Windows: make build-windows-bundle → dist/agent_core/agent_core.exe

License

Apache License 2.0. Preserve NOTICE when distributing.

  • See LICENSE and NOTICE at repository root.

Troubleshooting

  • Cursor/keyboard don't work (macOS): grant permissions in System Settings → Privacy & Security (Accessibility, Input Monitoring, Screen Recording) for Terminal and current Python.
  • Linux "no display" error: ensure X11 is running (echo $DISPLAY). Under Wayland, XWayland must be enabled. Install deps: sudo apt-get install scrot gnome-screenshot xdotool xclip python3-tk.
  • Integration tests skipped: restart terminal, ensure same interpreter (which python, python -c 'import sys; print(sys.executable)').
  • Screenshots empty/missing overlay: enable Screen Recording; check screenshot mode settings.

Contact

Issues/PR in this repository. Attribution is listed in NOTICE.