Merged
2 changes: 1 addition & 1 deletion config/agent/GPT-5.4-computer-use.yaml
@@ -39,7 +39,7 @@ custom_actions:
use_html: false
use_axtree: false
use_screenshot: true
use_som: false
save_som: false
extract_visible_tag: false
extract_clickable_tag: false
extract_coords: false
2 changes: 1 addition & 1 deletion config/agent/UI-TARS-1.5-7B.yaml
@@ -28,7 +28,7 @@ custom_actions:
use_html: false
use_axtree: false
use_screenshot: true
use_som: false
save_som: false
extract_visible_tag: false
extract_clickable_tag: false
extract_coords: false
2 changes: 1 addition & 1 deletion config/agent/axtree-only.yaml
@@ -5,7 +5,7 @@ custom_actions: ["click", "fill", "dblclick", "clear", "select_option", "drag_an
# --- observation flags ---
use_axtree: True # enable AXTREE observation
use_screenshot: False # enable screenshot observation
use_som: False # Add a set of marks to the screenshot.
save_som: False # Save set-of-marks coordinates to set_of_marks_coordinates.json.
extract_coords: False # Add the coordinates of the elements.

# --- Prompt Flags ---
2 changes: 1 addition & 1 deletion config/agent/default.yaml
@@ -50,7 +50,7 @@ use_screenshot: True # enable screenshot observation

# ---- these are not really changed, but leaving it here for future reference ----
# use_html: False # enable HTML observation
use_som: False # Add a set of marks to the screenshot.
save_som: False # Save set-of-marks coordinates to set_of_marks_coordinates.json.
# extract_visible_tag: False # Add a "visible" tag to visible elements in the AXTree.
# extract_clickable_tag: False # Add a "clickable" tag to clickable elements in the AXTree.
extract_coords: False # Add the coordinates of the elements.
1 change: 1 addition & 0 deletions config/agent/dummy.yaml
@@ -7,5 +7,6 @@ model_pretty_name: dummy # for wandb-logging
use_html: True
use_axtree: True
use_screenshot: True
save_som: False # set to True to save set_of_marks_coordinates.json
hostname: "no host name for dumb dumbs" # dummy agent does not use hostname
client_type: dummy
2 changes: 1 addition & 1 deletion config/agent/screenshot-only.yaml
@@ -1,7 +1,7 @@
custom_actions: ["go_back", "go_forward", "goto", "mouse_click", "mouse_dblclick", "scroll", "mouse_move", "mouse_down", "mouse_up", "mouse_click", "mouse_dblclick", "mouse_drag_and_drop", "mouse_upload_file", "keyboard_down", "keyboard_up", "keyboard_press", "keyboard_type", "keyboard_insert_text"]
use_axtree: False # enable AXTREE observation
use_screenshot: True # enable screenshot observation
use_som: False # Add a set of marks to the screenshot.
save_som: False # Save set-of-marks coordinates to set_of_marks_coordinates.json.
extract_coords: False # Add the coordinates of the elements.
prompt_txt:
system_prompt: null # takes default system prompt from dp lib
9 changes: 9 additions & 0 deletions docs/Intro to UI Agents.md
@@ -19,6 +19,15 @@ Agents receive a screenshot of the current apps (in the same manner a human sees

The agent then outputs an action such as `click` or `type` that directly affects the apps. After each action, we check whether the task has been completed; the loop terminates once it has, or when `max_steps` is reached.
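
The loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the OpenApps implementation: `run_episode`, `StepResult`, `ToyCounterApp`, and `always_click` are hypothetical names, and the observation is a plain string standing in for a screenshot.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    observation: str  # stand-in for a screenshot and/or text observation
    task_done: bool   # reward signal: did the last action complete the task?


def run_episode(
    policy: Callable[[str, str], str],
    env_step: Callable[[str], StepResult],
    first_observation: str,
    goal: str,
    max_steps: int,
) -> tuple[bool, int]:
    """Run the agent loop and return (task_completed, steps_taken)."""
    observation = first_observation
    for step in range(1, max_steps + 1):
        action = policy(goal, observation)  # e.g. "click" or "type"
        result = env_step(action)           # the action changes the app state
        if result.task_done:
            return True, step               # stop as soon as the task is done
        observation = result.observation
    return False, max_steps                 # step budget exhausted


class ToyCounterApp:
    """Hypothetical app: the task is complete after three clicks."""

    def __init__(self) -> None:
        self.clicks = 0

    def step(self, action: str) -> StepResult:
        if action == "click":
            self.clicks += 1
        return StepResult(f"clicks={self.clicks}", self.clicks >= 3)


def always_click(goal: str, observation: str) -> str:
    """Hypothetical policy that ignores its inputs and always clicks."""
    return "click"


print(run_episode(always_click, ToyCounterApp().step, "clicks=0", "click three times", max_steps=10))
# (True, 3)
```

With `max_steps=2` the same toy episode ends without completing the task, which is the second termination condition.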

The agent config at `config/agent/default.yaml` contains:

- the list of available actions
- `use_axtree`: provides the accessibility tree (AXTree), a simplified text representation of each app's state, as an input
- `use_screenshot`: provides a screenshot of the app as an input
- `save_som`: if true, saves the set of marks to `log_outputs/<timestamp>/set_of_marks_coordinates.json` (see example below)


See `UI-TARS-1.5-7B.yaml` for an example of a native computer-use agent, which takes screenshots as input and outputs actions such as click and type with coordinates. For an example of a multimodal agent that accepts simplified text inputs, see `GPT-5.1.yaml`.
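
To make the `save_som` flag concrete, here is a rough Python sketch of writing set-of-marks coordinates to a timestamped log directory. Only the file name `set_of_marks_coordinates.json` and the `log_outputs/<timestamp>/` layout come from the docs; the function `save_set_of_marks`, the timestamp format, and the JSON schema (element id mapped to a bounding box) are assumptions made for illustration.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_set_of_marks(marks: dict, log_root: Path, save_som: bool) -> Optional[Path]:
    """Write marks to <log_root>/<timestamp>/set_of_marks_coordinates.json."""
    if not save_som:
        return None  # nothing is written when the flag is off
    # Assumption: this timestamp format is illustrative only.
    run_dir = log_root / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    out_path = run_dir / "set_of_marks_coordinates.json"
    out_path.write_text(json.dumps(marks, indent=2))
    return out_path


# Hypothetical schema: element id -> bounding box in screenshot pixels.
example_marks = {
    "12": {"x": 40, "y": 120, "width": 96, "height": 32},
    "27": {"x": 200, "y": 310, "width": 64, "height": 24},
}

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_set_of_marks(example_marks, Path(tmp) / "log_outputs", save_som=True)
    print(saved.name)
```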


## OpenApps: building blocks for digital agent research
1 change: 0 additions & 1 deletion mkdocs.yml
@@ -7,7 +7,6 @@ theme:
pygments_style: # default styles
light: shadcn-light
dark: github-dark
icon: heroicons:rectangle-stack # use the shadcn svg if not defined
topbar_sections: false # NEW!
show_datetime: false

36 changes: 36 additions & 0 deletions site/Intro to UI Agents.md
@@ -1,5 +1,41 @@

Digital agents open up the possibility of AI systems completing tedious tasks on your behalf, such as `"add an event to my calendar"` or more complex multi-step tasks. Yet today's agents are still not reliable enough for many applications. Getting there requires large amounts of data for training and evaluation, plus research into new recipes for training and deploying reliable agents.

A few definitions to settle you in:

!!! note "Digital (UI) Agent:"
    completes tasks by directly interacting with apps in the same manner as humans (by clicking, scrolling, typing on your behalf)

!!! note "Reward:"
    measures whether the agent completed the given task


![landing](images/pomdp.png)

## Agents under the hood

Digital agents are powered by a foundation model that can understand both text and image inputs.
Agents receive a screenshot of the current apps (in the same manner a human sees them) and the task goal ("delete Brooklyn Bridge from my favorite places"); depending on how you configure the agent, the agent can also track past actions or observations.

The agent then outputs an action such as `click` or `type` that directly affects the apps. After each action, we check whether the task has been completed; the loop terminates once it has, or when `max_steps` is reached.

The agent config at `config/agent/default.yaml` contains:

- the list of available actions
- `use_axtree`: provides the accessibility tree (AXTree), a simplified text representation of each app's state, as an input
- `use_screenshot`: provides a screenshot of the app as an input
- `save_som`: if true, saves the set of marks to `log_outputs/<timestamp>/set_of_marks_coordinates.json` (see example below)


See `UI-TARS-1.5-7B.yaml` for an example of a native computer-use agent, which takes screenshots as input and outputs actions such as click and type with coordinates. For an example of a multimodal agent that accepts simplified text inputs, see `GPT-5.1.yaml`.
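
As a rough illustration of how the observation flags differ between agent configs, the sketch below maps flags to the inputs an agent would receive. The flag values mirror `screenshot-only.yaml` and `axtree-only.yaml`; the helper `enabled_observations` and the observation names are hypothetical, and the real project may read its configs differently.

```python
def enabled_observations(agent_config: dict) -> list[str]:
    """Return the observation types switched on by an agent config."""
    # Hypothetical mapping from config flag to observation name.
    flag_to_observation = {
        "use_axtree": "axtree",
        "use_screenshot": "screenshot",
    }
    return [
        observation
        for flag, observation in flag_to_observation.items()
        if agent_config.get(flag, False)  # a missing flag counts as disabled
    ]


# Flag values as they appear in the two example configs.
screenshot_only = {"use_axtree": False, "use_screenshot": True, "save_som": False}
axtree_only = {"use_axtree": True, "use_screenshot": False, "save_som": False}

print(enabled_observations(screenshot_only))  # ['screenshot']
print(enabled_observations(axtree_only))      # ['axtree']
```

Note that `save_som` does not appear in the mapping: per the list above, it controls what is saved to the logs rather than what the agent receives as input.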


## OpenApps: building blocks for digital agent research

OpenApps offers an easy-to-use environment, written in Python, that runs on a single CPU for studying digital agents. OpenApps comes with six configurable apps for generating limitless data for training and evaluating digital agents.


### Hands on with OpenApps

Learn how to set up OpenApps, run a GPT-5 agent, and make changes to the environment.

67 changes: 65 additions & 2 deletions site/Intro to UI Agents/index.html
@@ -142,7 +142,14 @@
<span class="size-8 flex flex-row justify-center items-center">


<svg xmlns="http://www.w3.org/2000/svg" width="20px" height="20px" viewBox="0 0 24 24"><path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M6 6.878V6a2.25 2.25 0 0 1 2.25-2.25h7.5A2.25 2.25 0 0 1 18 6v.878m-12 0q.354-.126.75-.128h10.5q.396.002.75.128m-12 0A2.25 2.25 0 0 0 4.5 9v.878m13.5-3A2.25 2.25 0 0 1 19.5 9v.878m0 0a2.3 2.3 0 0 0-.75-.128H5.25q-.396.002-.75.128m15 0A2.25 2.25 0 0 1 21 12v6a2.25 2.25 0 0 1-2.25 2.25H5.25A2.25 2.25 0 0 1 3 18v-6c0-.98.626-1.813 1.5-2.122"/></svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 256"
class="size-5">
<rect width="256" height="256" fill="none"></rect>
<line x1="208" y1="128" x2="128" y2="208" fill="none" stroke="currentColor" stroke-linecap="round"
stroke-linejoin="round" stroke-width="32"></line>
<line x1="192" y1="40" x2="40" y2="192" fill="none" stroke="currentColor" stroke-linecap="round"
stroke-linejoin="round" stroke-width="32"></line>
</svg>


</span>
@@ -463,7 +470,33 @@ <h1 class="pr-2">OpenApps</h1>
</div>
</div>
<div class="typography w-full flex-1 *:data-[slot=alert]:first:mt-0">
<p>Learn how to set up OpenApps, run a GPT-5 agent, and make changes to the environment.</p>
<p>Digital agents open up the possibility of AI systems completing tedious tasks on your behalf, such as <code>"add an event to my calendar"</code> or more complex multi-step tasks. Yet today's agents are still not reliable enough for many applications. Getting there requires large amounts of data for training and evaluation, plus research into new recipes for training and deploying reliable agents.</p>
<p>A few definitions to settle you in:</p>
<div class="admonition note">
<p class="admonition-title">Digital (UI) Agent:</p>
<p>completes tasks by directly interacting with apps in the same manner as humans (by clicking, scrolling, typing on your behalf)</p>
</div>
<div class="admonition note">
<p class="admonition-title">Reward:</p>
<p>measures whether the agent completed the given task</p>
</div>
<p><img alt="landing" src="../images/pomdp.png" /></p>
<h2 id="agents-under-the-hood">Agents under the hood</h2>
<p>Digital agents are powered by a foundation model that can understand both text and image inputs.
Agents receive a screenshot of the current apps (in the same manner a human sees them) and the task goal ("delete Brooklyn Bridge from my favorite places"); depending on how you configure the agent, the agent can also track past actions or observations.</p>
<p>The agent then outputs an action such as <code>click</code> or <code>type</code> that directly affects the apps. After each action, we check whether the task has been completed; the loop terminates once it has, or when <code>max_steps</code> is reached.</p>
<p>The agent config at <code>config/agent/default.yaml</code> contains:</p>
<ul>
<li>the list of available actions</li>
<li><code>use_axtree</code>: provides the accessibility tree (AXTree), a simplified text representation of each app's state, as an input</li>
<li><code>use_screenshot</code>: provides a screenshot of the app as an input</li>
<li><code>save_som</code>: if true, saves the set of marks to <code>log_outputs/&lt;timestamp&gt;/set_of_marks_coordinates.json</code> (see example below)</li>
</ul>
<p>See <code>UI-TARS-1.5-7B.yaml</code> for an example of a native computer-use agent, which takes screenshots as input and outputs actions such as click and type with coordinates. For an example of a multimodal agent that accepts simplified text inputs, see <code>GPT-5.1.yaml</code>.</p>
<h2 id="openapps-building-blocks-for-digital-agent-research">OpenApps: building blocks for digital agent research</h2>
<p>OpenApps offers an easy-to-use environment, written in Python, that runs on a single CPU for studying digital agents. OpenApps comes with six configurable apps for generating limitless data for training and evaluating digital agents.</p>
<h3 id="hands-on-with-openapps">Hands on with OpenApps</h3>
<p>Learn how to set up OpenApps, run a GPT-5 agent, and make changes to the environment.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/gzNW_LXE7OE?si=qLh-r_CvheMIgIWd" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>
</article>
@@ -561,6 +594,36 @@ <h1 class="pr-2">OpenApps</h1>
<div class="flex flex-col gap-2 p-4 pt-0 text-sm">
<p class="text-muted-foreground bg-background sticky top-0 h-6 text-xs">On This Page</p>




<a href="#agents-under-the-hood"
class="text-muted-foreground hover:text-foreground data-[active=true]:text-foreground text-[0.8rem] no-underline transition-colors data-[depth=3]:pl-4 data-[depth=4]:pl-6"
data-active="false" data-depth="2">
Agents under the hood
</a>



<a href="#openapps-building-blocks-for-digital-agent-research"
class="text-muted-foreground hover:text-foreground data-[active=true]:text-foreground text-[0.8rem] no-underline transition-colors data-[depth=3]:pl-4 data-[depth=4]:pl-6"
data-active="false" data-depth="2">
OpenApps: building blocks for digital agent research
</a>


<a href="#hands-on-with-openapps"
class="text-muted-foreground hover:text-foreground data-[active=true]:text-foreground text-[0.8rem] no-underline transition-colors data-[depth=3]:pl-4 data-[depth=4]:pl-6"
data-active="false" data-depth="3">
Hands on with OpenApps
</a>








</div>
<div class="h-12"></div>
9 changes: 8 additions & 1 deletion site/agents/index.html
@@ -142,7 +142,14 @@
<span class="size-8 flex flex-row justify-center items-center">


<svg xmlns="http://www.w3.org/2000/svg" width="20px" height="20px" viewBox="0 0 24 24"><path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M6 6.878V6a2.25 2.25 0 0 1 2.25-2.25h7.5A2.25 2.25 0 0 1 18 6v.878m-12 0q.354-.126.75-.128h10.5q.396.002.75.128m-12 0A2.25 2.25 0 0 0 4.5 9v.878m13.5-3A2.25 2.25 0 0 1 19.5 9v.878m0 0a2.3 2.3 0 0 0-.75-.128H5.25q-.396.002-.75.128m15 0A2.25 2.25 0 0 1 21 12v6a2.25 2.25 0 0 1-2.25 2.25H5.25A2.25 2.25 0 0 1 3 18v-6c0-.98.626-1.813 1.5-2.122"/></svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 256"
class="size-5">
<rect width="256" height="256" fill="none"></rect>
<line x1="208" y1="128" x2="128" y2="208" fill="none" stroke="currentColor" stroke-linecap="round"
stroke-linejoin="round" stroke-width="32"></line>
<line x1="192" y1="40" x2="40" y2="192" fill="none" stroke="currentColor" stroke-linecap="round"
stroke-linejoin="round" stroke-width="32"></line>
</svg>


</span>
Binary file added site/images/pomdp.png
44 changes: 34 additions & 10 deletions site/index.html
@@ -142,7 +142,14 @@
<span class="size-8 flex flex-row justify-center items-center">


<svg xmlns="http://www.w3.org/2000/svg" width="20px" height="20px" viewBox="0 0 24 24"><path fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5" d="M6 6.878V6a2.25 2.25 0 0 1 2.25-2.25h7.5A2.25 2.25 0 0 1 18 6v.878m-12 0q.354-.126.75-.128h10.5q.396.002.75.128m-12 0A2.25 2.25 0 0 0 4.5 9v.878m13.5-3A2.25 2.25 0 0 1 19.5 9v.878m0 0a2.3 2.3 0 0 0-.75-.128H5.25q-.396.002-.75.128m15 0A2.25 2.25 0 0 1 21 12v6a2.25 2.25 0 0 1-2.25 2.25H5.25A2.25 2.25 0 0 1 3 18v-6c0-.98.626-1.813 1.5-2.122"/></svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 256"
class="size-5">
<rect width="256" height="256" fill="none"></rect>
<line x1="208" y1="128" x2="128" y2="208" fill="none" stroke="currentColor" stroke-linecap="round"
stroke-linejoin="round" stroke-width="32"></line>
<line x1="192" y1="40" x2="40" y2="192" fill="none" stroke="currentColor" stroke-linecap="round"
stroke-linejoin="round" stroke-width="32"></line>
</svg>


</span>
@@ -461,6 +468,7 @@ <h3 id="run-openapps">Run OpenApps</h3>
</code></pre></div>

<p><img alt="landing" src="images/landing.png" /></p>
<p>For an overview, check out our <a href="https://www.youtube.com/watch?v=gzNW_LXE7OE">video tutorial</a>.</p>
<h3 id="app-variations">App variations</h3>
<p>Each app can be modified with variables available in <code>config/apps</code>. You can override any of these via command line:</p>
<div class="codehilite"><pre><span></span><code>uv<span class="w"> </span>run<span class="w"> </span>launch.py<span class="w"> </span><span class="s1">&#39;apps.todo.init_todos=[[&quot;Call Mom&quot;, false]]&#39;</span>
@@ -562,18 +570,34 @@ <h2 id="launch-agents-across-multiple-tasks">Launch Agent(s) Across Multiple Tas
<blockquote>
<p>launch thousands of app variations to study agent behaviors in parallel</p>
</blockquote>
<p>coming soon!</p>
<!-- To launch one (or multiple) agents to solve many tasks in parallel, each in an isolated deployment of OpenApps:


<div class="codehilite"><pre><span></span><code>uv run launch_sweep.py
<div class="admonition info">
<p class="admonition-title">Note:</p>
<p>Parallel launching works with SLURM. Be sure to update configs in <code>config/mode/slurm_cluster.yaml</code>.</p>
</div>
<p>You can launch one (or multiple) agents to solve many tasks in parallel, each in an isolated deployment of OpenApps, using SLURM:</p>
<div class="codehilite"><pre><span></span><code>uv run launch_parallel_agents.py mode=slurm_cluster agent=dummy use_wandb=True
</code></pre></div>

<p>This launches six independent random-click agents in parallel, one for each task in each app variation (3 tasks × 2 app variations), as defined in <code>config_parallel_tasks.yaml</code>:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">parallel_tasks</span><span class="p">:</span>
<span class="w"> </span><span class="nt">_target_</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">open_apps.tasks.parallel_tasks.AppVariationParallelTasksConfig</span>
<span class="w"> </span><span class="nt">task_names</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">add_meeting_with_dennis</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">add_call_mom_to_my_todo</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">save_paris_to_my_favorite_places</span>
<span class="w"> </span><span class="nt">app_variations</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="s">&quot;apps/start_page/content=default&quot;</span><span class="p p-Indicator">,</span><span class="w"> </span><span class="s">&quot;apps/calendar/content=german&quot;</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">[</span>
<span class="w"> </span><span class="s">&quot;apps/start_page/appearance=dark_theme&quot;</span><span class="p p-Indicator">,</span>
<span class="w"> </span><span class="s">&quot;apps/calendar/appearance=dark_theme&quot;</span><span class="p p-Indicator">,</span>
<span class="w"> </span><span class="p p-Indicator">]</span>
</code></pre></div>


* Note each deployment of OpenApps can have different appearance and content
* Note each task is launched in an isolated environment to ensure reproducible results. -->

<p>You can modify the set of tasks or app variations by updating <code>config_parallel_tasks.yaml</code>. We ensure:</p>
<ul>
<li>Each deployment of OpenApps can have different appearance and content per app.</li>
<li>Each task is launched in an isolated environment for reproducible results.</li>
</ul>
<h2 id="testing">Testing</h2>
<p>Run all tests via:</p>
<div class="codehilite"><pre><span></span><code><span class="n">uv</span> <span class="n">run</span> <span class="o">-</span><span class="n">m</span> <span class="n">pytest</span> <span class="n">tests</span><span class="o">/</span>