🏆 Leaderboard · 📄 Paper · 🤗 Models
Alem is a JAX benchmark for open-ended multi-agent coordination. Building on Craftax and Multi-Agent Craftax / Craftax-Coop, Alem introduces procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. The same world is exposed through symbolic, pixel, and text interfaces, making it usable by MARL agents, language agents, and humans.
Alem means world in Amharic.
RL Playing · LLM Playing · Install · Quick Start · Configure · RL Agents · LLM Agents · Baselines · Human Play · Docker · Package Layout · Development · RL vs LLM Interfaces · Reproduce the Paper · Contributing · Citation · License
A team of MARL agents controlling the three players from symbolic observations, each acting from its own egocentric view.
Fast to train end-to-end in JAX. Full MARL training code and reference baselines live in baselines/ — see Baselines.
The same world through the text interface. Each agent gets its own observation, broadcasts a free-form message to teammates every step and stores important information in scratchpad memory.
Gemini 3.1 Pro (medium). THINKING = the agent's private plan; MESSAGE = what it broadcasts to the team.
Each panel is held for several seconds so the reasoning is readable — this is not the agents' real decision speed.
The warrior plans turns ahead and predicts how a teammate will react — then it happens:
- Plans ahead. A turn-indexed plan T87→T95, with a fallback for the warrior's crafting penalty.
- Theory of mind. "A2 will get my message at T88, so they'll cancel their T90 action and wait for T95", and at T88, A2 does exactly that.
- Coordinates out loud. Lines all three up for a synchronous mine to earn the Coord Mine Stone Hard bonus.
See LLM Agents to run it yourself.
👁️ What the agents actually see
Every step a language agent gets a system prompt (the rules, sent once) and a text observation (its current view), and must reply with an <action>, an optional <communication> broadcast, and an optional private <scratchpad>. The prompt uses progressive disclosure: it includes only the information relevant to the current level and adds more as agents reach new levels.
System prompt template — placeholders in {…} are filled per agent/run (abridged; the full rules are sent verbatim):
You are Agent {id} ({role}) in a {num_agents}-agent cooperative survival game. Your goal is to gather resources, craft gear, fight monsters, and descend through {num_levels} dungeon levels, while coordinating with teammates. You must survive — if your health reaches zero, you die, and if all agents die the game ends. Maximize achievements while alive.
[ … full game rules: movement & facing, Do/interaction, crafting recipes, roles & specialist penalties, coordination (sync / handover / construction), the resource chain, and the {num_achievements} achievements …]
<output_format>
1. (Required) Exactly one action from the available list:
<action>YOUR_CHOSEN_ACTION</action>
2. (Optional) Broadcast to teammates, up to {comm_char_limit} chars:
<communication>YOUR_MESSAGE</communication>
3. (Optional) Private notes, up to {scratchpad_char_limit} chars — not shared; your only memory:
<scratchpad>YOUR_NOTES</scratchpad>
Token budget: {token_budget} tokens for the full response (including reasoning).
</output_format>
View the full system prompt, filled in (a concrete 3-agent example on overworld).
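The output format above can be handled with a small tag parser. The following is a minimal sketch, not the parser shipped in alem/llm/: `AgentReply`, `parse_reply`, the `Noop` fallback for a missing or illegal action, and the example action names are assumptions made for illustration. The 400-character default mirrors the free-form text limit quoted in this README.

```python
import re
from typing import NamedTuple, Optional


class AgentReply(NamedTuple):
    action: str
    communication: Optional[str]
    scratchpad: Optional[str]


def _tag_bodies(text: str, tag: str) -> list[str]:
    """Return the stripped body of every <tag>...</tag> pair in the reply."""
    pattern = rf"<{tag}>(.*?)</{tag}>"
    return [body.strip() for body in re.findall(pattern, text, flags=re.DOTALL)]


def parse_reply(
    text: str,
    legal_actions: list[str],
    comm_char_limit: int = 400,
    fallback: str = "Noop",  # assumed fallback, not confirmed by the source
) -> AgentReply:
    """Extract the action plus the optional communication and scratchpad."""
    actions = _tag_bodies(text, "action")
    # The format asks for exactly one action; anything else falls back.
    if len(actions) == 1 and actions[0] in legal_actions:
        action = actions[0]
    else:
        action = fallback
    comms = _tag_bodies(text, "communication")
    notes = _tag_bodies(text, "scratchpad")
    return AgentReply(
        action=action,
        communication=comms[0][:comm_char_limit] if comms else None,
        scratchpad=notes[0] if notes else None,
    )


reply = parse_reply(
    "<action>Do</action>\n<communication>Mining stone now</communication>",
    legal_actions=["Noop", "Do", "Sleep"],  # hypothetical action names
)
print(reply)
```

Falling back to a no-op keeps an episode running when a model produces malformed output; a stricter harness could instead re-prompt or count the step as a format error.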
Observation template — the structure every agent receives each step:
Step: {step}/{max_steps} ({steps_remaining} remaining, ends early if all agents die)
Position: (x={x}, y={y})
Role: {role}
Location: {dungeon_level}
Achievements: {unlocked}/{total} ({locked} unlock later)
You see:
- {object} {relative_position} (x={x}, y={y})
...one line per visible object...
Facing: {direction}. Do target: {object_in_front} (x={x}, y={y}).
Coordination:
- {object} (x={x}, y={y}): {how_this_object_must_be_coordinated}
...one line per coordination-relevant object...
Teammates:
Agent {id} ({role}): {relative_position} (x={x}, y={y}), health={hp}
...one line per teammate...
Your status: health {hp}, food {food}, drink {drink}, energy {energy}, mana {mana}, xp {xp}
Available actions: {legal_actions_this_step}
Filled-in example — what the warrior actually sees at step 0:
Step: 0/10000 (10000 remaining, ends early if all agents die)
Position: (x=24, y=24)
Role: warrior
Location: Overworld (surface)
Achievements: 0/93 (39 unlock later)
You see:
- stone 5 steps east (x=29, y=24)
- tree 1 step north and 2 steps west (x=22, y=23)
- construction_site 2 steps north (x=24, y=22)
- iron 3 steps south and 5 steps east (x=29, y=27)
Facing: north. Do target: grass (x=24, y=23).
Coordination:
- construction_site (x=24, y=22): requires 3 agents to Build simultaneously (fails alone).
- tree (x=26, y=23): one agent begins, another completes it within 6 steps (handover).
- stone (x=20, y=21): works solo, but a bonus when 3 agents Do simultaneously.
Teammates:
Agent 1 (forager): 1 step east (x=25, y=24), health=9
Agent 2 (miner): 1 step south (x=24, y=25), health=9
Your status: health 9, food 9, drink 9, energy 9, mana 9, xp 0
Available actions: Noop, Move {West,East,North,South}, Do, Sleep, Rest, Request {Food,Drink,Wood,Stone,Iron,Coal,Diamond,Ruby,Sapphire}
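Because every object line carries absolute coordinates, an agent wrapper can recover relative offsets without interpreting the "N steps east" wording. The sketch below is an illustration only: the helper names are hypothetical, and the regular expressions are inferred from the filled-in example above rather than taken from Alem's text renderer. In that example, north corresponds to a smaller y.

```python
import re

# Matches the "(x=24, y=24)" suffix used throughout the observation text.
COORD = r"\(x=(-?\d+), y=(-?\d+)\)"

OBS = """Position: (x=24, y=24)
You see:
- stone 5 steps east (x=29, y=24)
- tree 1 step north and 2 steps west (x=22, y=23)
- construction_site 2 steps north (x=24, y=22)
- iron 3 steps south and 5 steps east (x=29, y=27)
Facing: north. Do target: grass (x=24, y=23).
"""


def parse_position(obs: str) -> tuple[int, int]:
    """Read the agent's own coordinates from the Position line."""
    match = re.search(rf"^Position: {COORD}", obs, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no Position line found")
    return int(match.group(1)), int(match.group(2))


def parse_visible(obs: str) -> list[tuple[str, int, int]]:
    """Collect (name, x, y) for each line of the 'You see:' block."""
    objects = []
    in_block = False
    for line in obs.splitlines():
        if line.strip() == "You see:":
            in_block = True
        elif in_block:
            match = re.match(rf"- (\w+) .*{COORD}$", line.strip())
            if match is None:
                break  # first non-object line ends the block
            objects.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return objects


def relative_offsets(obs: str) -> list[tuple[str, int, int]]:
    """Return (name, dx, dy); negative dy is north in the example above."""
    px, py = parse_position(obs)
    return [(name, x - px, y - py) for name, x, y in parse_visible(obs)]


print(relative_offsets(OBS))
# [('stone', 5, 0), ('tree', -2, -1), ('construction_site', 0, -2), ('iron', 5, 3)]
```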
pip install alem-env # latest release from PyPI
Or from source for development (editable install):
uv venv --python 3.12
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate # Windows
uv pip install -e .
Optional extras (work with either alem-env or -e .):
uv pip install -e ".[llm]" # OpenAI-compatible LLM interface
uv pip install -e ".[play]" # pygame human-play example
uv pip install -e ".[gpu]" # NVIDIA CUDA 12 JAX wheels
uv pip install -e ".[baselines-rl]" # JAX MARL trainers (see Baselines)
uv pip install -e ".[baselines-llm]" # LLM-agent evaluation harness
uv pip install -e ".[dev]" # pytest, ruff, jaxtyping
Plain pip also works (no uv required):
pip install -e .
pip install -e ".[gpu]"
Running scripts with uv: Commands below use `uv run python …`, which uses `.venv` without a prior `source activate` (activating once and calling `python …` also works). Note: it's `uv run python script.py`; `uv python script.py` is not valid.
import jax
from alem.alem_env import make_alem_env_from_name

env = make_alem_env_from_name("Alem-Coop-Symbolic")
obs, state = env.reset(jax.random.PRNGKey(0))

rng_act = jax.random.split(jax.random.PRNGKey(1), env.num_agents)
actions = {
    agent: env.action_space(agent).sample(rng_act[i])
    for i, agent in enumerate(env.agents)
}
obs, state, rewards, dones, infos = env.step(jax.random.PRNGKey(2), state, actions)

Available environments:
| Name | Description |
|---|---|
| Alem-Coop-Symbolic | Full multi-agent environment, symbolic observations |
| Alem-Coop-Pixels | Full multi-agent environment, pixel observations |
| Alem-Coop-Symbolic-Debug | Smaller debug environment, only overworld (first floor) |
| Alem-SingleAgent-Symbolic | Single-agent variant (experimental) |
from alem.alem_env import make_alem_env_from_name
from alem.alem_coop.alem_state import EnvParams, StaticEnvParams, get_coordination_params

env_params = EnvParams().replace(
    **get_coordination_params("easy"),
    soft_specialization=True,
    shared_reward=False,
)
static_env_params = StaticEnvParams(player_count=3, num_comm_channels=4)
env = make_alem_env_from_name(
    "Alem-Coop-Symbolic",
    env_params=env_params,
    static_env_params=static_env_params,
)

Coordination difficulty can be "none", "easy", "medium", "hard", or a numeric alpha in [0, 1].
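When the difficulty comes from a config file or command line, it can help to check it against the accepted forms before building the environment. The helper below is a hypothetical sketch (it is not part of Alem, and it does not map preset names to alpha values, which this README does not specify); it only enforces the two forms listed above.

```python
from typing import Union

# Preset names as listed in this README.
PRESETS = ("none", "easy", "medium", "hard")


def validate_coordination(level: Union[str, float]) -> Union[str, float]:
    """Return the level unchanged if it is a preset name or an alpha in [0, 1]."""
    if isinstance(level, str):
        if level not in PRESETS:
            raise ValueError(f"unknown preset {level!r}; expected one of {PRESETS}")
        return level
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(level, bool) or not isinstance(level, (int, float)):
        raise TypeError("difficulty must be a preset name or a number")
    alpha = float(level)
    if not 0.0 <= alpha <= 1.0:
        raise ValueError(f"alpha must lie in [0, 1], got {alpha}")
    return alpha


print(validate_coordination("easy"))
print(validate_coordination(0.25))
```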
A minimal framework-free rollout using symbolic observations and legal action masks:
uv run python examples/random_rl_agent.py --coord easy --steps 100
uv run python examples/random_rl_agent.py --players 2 --coord hard --steps 200
The example uses a jitted lax.scan loop and can serve as a template for custom policies. Full MARL training recipes (IPPO, HyperMARL-IPPO, MAPPO, PQN-VDN) live in baselines/; see Baselines.
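A legal action mask restricts random sampling to the actions that are valid on the current step. The sketch below shows the idea in plain Python rather than in jitted JAX; `sample_legal_action` and the boolean-list mask format are assumptions for illustration and are not taken from the example script.

```python
import random


def sample_legal_action(mask: list[bool], rng: random.Random) -> int:
    """Pick an action index uniformly among the entries marked legal."""
    legal = [index for index, is_legal in enumerate(mask) if is_legal]
    if not legal:
        raise ValueError("mask contains no legal action")
    return rng.choice(legal)


rng = random.Random(0)
mask = [True, False, True, False]  # only actions 0 and 2 are legal
samples = [sample_legal_action(mask, rng) for _ in range(10)]
print(samples)  # every entry is 0 or 2
```

Sampling only among legal indices avoids wasting steps on actions the environment would reject, which matters most early in training when policies are close to uniform.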
Preview 3-agent text observations without any model calls:
uv run python examples/llm_text_smoke.py --coord easy --show-affordances
Run one 3-agent step with any OpenAI-compatible model:
export OPENAI_API_KEY=sk-...
uv run python examples/llm_openai_smoke.py --model gpt-4o-mini --steps 1
Local vLLM server:
uv run python examples/llm_openai_smoke.py \
  --base-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --steps 1
Full LLM evaluation runners live in baselines/llm/; see Baselines.
Reference MARL training code and the LLM-agent evaluation harness live in this repo under baselines/. Following the CleanRL philosophy — and JaxMARL, which these are adapted from — each RL algorithm is a single self-contained file with a matching Hydra config in baselines/config/.
Install only the set you need:
uv pip install -e ".[baselines-rl]" # JAX MARL trainers (IPPO / MAPPO / PQN-VDN / HyperMARL)
uv pip install -e ".[baselines-llm]" # LLM-agent evaluation harness

| Algorithm | Entry point | Reference |
|---|---|---|
| IPPO (RNN, shared params) | baselines/ippo_rnn.py | IPPO |
| IPPO (RNN, no param sharing) | baselines/ippo_rnn_nops.py | IPPO |
| HyperMARL-IPPO (RNN) | baselines/ippo_hypermarl_rnn.py | HyperMARL (code) |
| MAPPO (RNN) | baselines/mappo_rnn.py | MAPPO |
| PQN-VDN (RNN) | baselines/pqn_vdn_rnn.py | PQN (code) |
Run the baselines from the baselines/ directory and override config values on the command line:
cd baselines
python ippo_rnn.py
python mappo_rnn.py coordination_difficulty=hard # override any config value
Pretrained RL checkpoints from the paper are on the Hugging Face Hub at alem-world/alem-rl-baselines: 120 checkpoints = 2 training budgets (100M, 1B env steps) × 4 algorithms × 3 difficulties × 5 seeds, laid out as <budget>/<algorithm>/<difficulty>/seed<N>/.
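The layout and the count of 120 can be checked by enumerating the directory grid. This is a sketch under assumptions: only "1B", "ippo-rnn", "hard", and "seed0" appear as path components in this README, so the other budget, algorithm, difficulty, and seed names below are placeholders, and `checkpoint_dirs` is a hypothetical helper.

```python
from itertools import product
from pathlib import PurePosixPath

# Assumed directory names; see the note above on which ones are placeholders.
BUDGETS = ("100M", "1B")
ALGORITHMS = ("ippo-rnn", "<algorithm-2>", "<algorithm-3>", "<algorithm-4>")
DIFFICULTIES = ("easy", "medium", "hard")
SEEDS = range(5)


def checkpoint_dirs(root: str) -> list[PurePosixPath]:
    """Enumerate <root>/<budget>/<algorithm>/<difficulty>/seed<N> for the full grid."""
    grid = product(BUDGETS, ALGORITHMS, DIFFICULTIES, SEEDS)
    return [
        PurePosixPath(root, budget, algorithm, difficulty, f"seed{seed}")
        for budget, algorithm, difficulty, seed in grid
    ]


dirs = checkpoint_dirs("alem-rl-baselines")
print(len(dirs))  # 2 * 4 * 3 * 5 = 120
print(PurePosixPath("alem-rl-baselines/1B/ippo-rnn/hard/seed0") in dirs)  # True
```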
Each trainer can skip training and instead restore a saved checkpoint, then run the
same final evaluation (and visualization) used after training. Download the checkpoints,
then pass LOAD_CHECKPOINT pointing at the checkpoint directory:
# 1. Download the checkpoints (needs: pip install -U huggingface_hub)
hf download alem-world/alem-rl-baselines --local-dir alem-rl-baselines
# 2. Reload and evaluate an IPPO policy (note NUM_COMM_CHANNELS=4)
cd baselines
python ippo_rnn.py \
  +LOAD_CHECKPOINT=../alem-rl-baselines/1B/ippo-rnn/hard/seed0/checkpoint \
  NUM_COMM_CHANNELS=4 \
  EVAL_DIFFICULTIES=[hard] \
  +VISUALIZE=True
GIFs are saved in ./outputs/. Set VISUALIZE=False to skip rendering and run only the numeric evaluation.
Important: the config must match how the checkpoint was trained. Checkpoint shapes are fixed at training time, so the env config (number of agents, communication channels, etc.) must match or the restore will fail with a shape mismatch. The released checkpoints were all trained with 4 communication channels, so load them with `NUM_COMM_CHANNELS=4`. The exact overrides for any checkpoint are stored under `reload_overrides` in its `config.json`.
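Reading the stored overrides programmatically avoids retyping them for each checkpoint. The sketch below is hedged: this README does not describe the schema of `reload_overrides`, so the code assumes it is either a flat key-to-value mapping or a list of ready-made `KEY=value` strings, and `hydra_overrides` is a hypothetical helper rather than part of the baselines.

```python
import json
from pathlib import Path


def hydra_overrides(config_path: str) -> list[str]:
    """Turn the reload_overrides entry of a config.json into KEY=value strings."""
    config = json.loads(Path(config_path).read_text())
    overrides = config.get("reload_overrides", {})
    if isinstance(overrides, dict):
        # Assumed form 1: {"NUM_COMM_CHANNELS": 4, ...}
        return [f"{key}={value}" for key, value in overrides.items()]
    # Assumed form 2: ["NUM_COMM_CHANNELS=4", ...]
    return [str(item) for item in overrides]
```

The returned strings could then be appended to the trainer command line, after checking them against the actual file contents.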
The harness (derived from BALROG) drives 3 language agents through the text interface and supports vLLM, OpenAI, Anthropic, Gemini, and other OpenAI-compatible providers. See baselines/llm/README.md for full launch commands and configuration.
cd baselines/llm
export OPENAI_API_KEY=sk-...
python eval_alem.py \
  clients.0.client_name=openai \
  clients.1.client_name=openai \
  clients.2.client_name=openai
Install the optional play dependencies first:
uv pip install -e ".[play]"
uv run python examples/play_alem.py
uv run python examples/play_alem.py --players 3 --coord easy --seed 42
uv run python examples/play_alem.py --players 2 --coord hard --god

| Key | Action |
|---|---|
| W A S D | Move |
| Space | Do / interact |
| Tab | Sleep |
| E | Rest |
| . / , | Descend / ascend |
| Backspace | Give to teammate |
| Q | No-op |
The game advances after all players have chosen an action.
Build:
# CPU (default)
docker build -f docker/Dockerfile.env -t alem-env .
# GPU — NVIDIA CUDA 12
docker build -f docker/Dockerfile.env --build-arg ALEM_ACCELERATOR=cuda12 -t alem-env:gpu .
# With optional extras (e.g. LLM + play)
docker build -f docker/Dockerfile.env --build-arg ALEM_EXTRAS=llm,play -t alem-env:extras .
Run:
The image uses the system Python (UV_SYSTEM_PYTHON=1), so inside a container you call python directly — no uv run prefix needed.
# Smoke test — confirms the install works (default CMD)
docker run --rm alem-env
docker run --rm --gpus all alem-env:gpu
# Run examples
docker run --rm alem-env python examples/random_rl_agent.py --steps 20
docker run --rm alem-env python examples/llm_text_smoke.py --coord easy
# LLM smoke test (pass your API key)
docker run --rm -e OPENAI_API_KEY=$OPENAI_API_KEY alem-env:extras \
  python examples/llm_openai_smoke.py --model gpt-4o-mini --steps 1
# Interactive shell
docker run --rm -it alem-env bash
Human play is easiest natively: `uv pip install -e ".[play]"` then `uv run python examples/play_alem.py`. Pygame opens a real window with no display plumbing.
Running human play inside Docker (X11 setup)
Inside Docker, human play needs an X11 display and an image built with the play extra (so the SDL/X11 libs are present):
# Build with the play extra (adds pygame + SDL/X11 runtime libs)
docker build -f docker/Dockerfile.env --build-arg ALEM_EXTRAS=play -t alem-env:play .
- Linux: grant the container access to your X server first (this is the step that's usually missing when the window never appears), then run it:
  xhost +local:root
  docker run --rm -it --network host -e DISPLAY=$DISPLAY \
    -v /tmp/.X11-unix:/tmp/.X11-unix alem-env:play python examples/play_alem.py
  xhost -local:root # revoke when done
- macOS / Windows: start XQuartz (macOS) or VcXsrv (Windows), enable "allow connections from network clients", then set `DISPLAY` accordingly.
Repository map — where each piece lives
| Path | Purpose |
|---|---|
| alem/alem_env.py | Environment factory |
| alem/alem_coop/envs/ | Symbolic, pixel, debug, and single-agent env classes |
| alem/alem_coop/alem_state.py | State dataclasses and EnvParams / StaticEnvParams |
| alem/alem_coop/game_logic.py | Step logic: movement, resources, combat, crafting, coordination |
| alem/alem_coop/world_gen/ | Procedural world generation |
| alem/alem_coop/renderer/ | Symbolic, pixel, and text renderers |
| alem/alem_coop/constants.py | Actions, achievements, blocks, items, textures |
| alem/llm/ | Text observations, action parsing, ASCII maps, LLM evaluator adapters |
| examples/random_rl_agent.py | Masked-random symbolic rollout (RL template) |
| examples/llm_text_smoke.py | Preview text observations without model calls |
| examples/llm_openai_smoke.py | One-step OpenAI-compatible LLM smoke test |
| examples/play_alem.py | Human pygame player |
uv pip install -e ".[dev]" # pytest, ruff, jaxtyping
uv run pytest alem/tests/ # run the test suite
Lint & format (ruff)
Code style is enforced with ruff (config in pyproject.toml). CI runs these checks before the test suite, so run them locally first:
uv run ruff check . # lint
uv run ruff format --check . # verify formatting
To auto-fix before committing:
uv run ruff check --fix . # apply safe lint fixes
uv run ruff format . # format the code
Both interfaces drive the same environment but are not directly comparable; treat cross-paradigm scores as indicative, not head-to-head.
| | MARL (symbolic) | LLM (text) |
|---|---|---|
| Observation | Numeric vector | Natural-language text |
| Communication | A discrete signal on one of num_comm_channels (e.g. 4) | Free-form text, ≤ 400 chars |
| Comms vs acting | Costs your action that turn | Sent alongside the action |
| Memory | Recurrent hidden state | Private <scratchpad> notes, not shared; the agent's only memory across steps |
| Learning | Trained from reward | Zero-shot |
(Request/Give resource transfers are ordinary actions in both.)
Text observations apply lightweight preprocessing, including compact local-state summaries (inspired by BALROG) and action affordances. See the language wrapper for details.
The full experiments from the paper — the 13-LLM evaluation and the RL baselines (IPPO, HyperMARL-IPPO, MAPPO, PQN-VDN) — live in baselines/; see Baselines for launch commands and configs.
The paper's numbers were produced against Alem v0.1.0. For the exact settings to use when reporting an Alem number — seeds, episode count, metrics — see the canonical evaluation protocol.
Contributions are welcome — new baselines, bug fixes, docs, and coordination tasks. See CONTRIBUTING.md for the dev setup, lint/test workflow, and PR checklist, and CODE_OF_CONDUCT.md for community expectations. To put a result on the leaderboard, follow the submission instructions there.
@article{tessera2026alem,
  title = {Benchmarking Open-Ended Multi-Agent Coordination in Language Agents},
  author = {Tessera, {Kale-ab} Abebe and Szecsenyi, Andras and Barker, Cameron and
            Rutherford, Alexander and Paglieri, Davide and Scannell, Aidan and
            Gouk, Henry and Crowley, Elliot J. and Rockt\"{a}schel, Tim and
            Storkey, Amos},
  year = {2026},
  url = {https://arxiv.org/abs/2606.08340}
}

MIT. See LICENSE.

