Add PPO self-play implementation for OpenSpiel by Arahan-kujur · Pull Request #1519 · google-deepmind/open_spiel

Arahan-kujur · 2026-04-10T15:02:07Z

Adds a PPO (Proximal Policy Optimization) implementation in JAX/Flax (NNX) to OpenSpiel, including a self-play training loop for turn-based imperfect-information games.

Features
Actor-critic PPO agent implemented in JAX + Flax (NNX)
Supports self-play with a single agent controlling all players
Generalized Advantage Estimation (GAE) per player trajectory
Legal action masking for arbitrary OpenSpiel games
Example training script for Kuhn Poker and Leduc Poker
Unit tests covering training, evaluation mode, and self-play behavior
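As an illustration of the legal action masking listed above, here is a minimal sketch, not the code from this PR. The function name `masked_log_probs` and the constant `-1e9` are assumptions chosen for the example; the idea is only that logits of illegal actions are pushed to a very negative value before the softmax.

```python
import jax
import jax.numpy as jnp


def masked_log_probs(logits: jax.Array, legal_mask: jax.Array) -> jax.Array:
    """Log-probabilities over actions, with illegal actions suppressed.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Replace logits of illegal actions with a large negative finite value so
    # that they receive (numerically) zero probability after the softmax.
    masked_logits = jnp.where(legal_mask, logits, -1e9)
    return jax.nn.log_softmax(masked_logits, axis=-1)


logits = jnp.array([2.0, 0.5, -1.0])
legal_mask = jnp.array([True, False, True])  # action 1 is illegal here
probs = jnp.exp(masked_log_probs(logits, legal_mask))
```

In OpenSpiel, a mask like this could be built from the legal actions of the current state, so the same policy head can be reused across games with different action sets.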

Results

Tested on Kuhn Poker using the example script:

Exploitability: ~0.22 after 500 iterations (entropy_coef=0.1)
Average returns close to the game value for the first player (-1/18 ≈ -0.056)

This suggests the self-play setup and training loop are functioning as expected.

Notes
Designed as a reference implementation for policy gradient methods in OpenSpiel
PPO does not have convergence guarantees in imperfect-information games
Performance is sensitive to hyperparameters (e.g., entropy regularization)

Files Added
open_spiel/python/jax/ppo.py — PPO agent and training logic
open_spiel/python/examples/ppo_example_jax.py — example self-play training script
open_spiel/python/jax/ppo_jax_test.py — unit tests

Future Work
Scaling to larger games (e.g., Leduc Poker tuning)
Benchmarking against CFR-based methods
Multi-agent extensions or population-based training

Remove .vs/ IDE files, add .vs/ to .gitignore, improve PPO docstrings Made-with: Cursor

alexunderch · 2026-04-23T19:02:40Z

Hello! In general, a really cool implementation! It would also be great if you implemented the GAE calculation in a vectorised form (using jax.lax.scan), because it is a very resource-demanding operation; see, for example, here. Moreover, you mostly use the numpy.random generator, but it would be nice to take advantage of jax.random reproducibility.

Also, it would be nice if you provided some insights on how the algorithm performs for some known games.
Nice work!


Arahan-kujur · 2026-04-24T11:48:28Z

Hi! Thanks a lot for the thoughtful feedback — I really appreciate it.

I’ve made several updates based on your suggestions:

Vectorized GAE: Replaced the Python loop with a jax.lax.scan (reverse-time) implementation, making it fully JIT-compatible and more efficient.
JAX PRNG: Switched from numpy.random to jax.random throughout the codebase, with explicit key handling for reproducibility.
Benchmarks: Added evaluation on standard OpenSpiel environments (kuhn_poker, leduc_poker, matrix_pd) along with training curves (policy/value loss, entropy, exploitability).
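For readers following along, a reverse-time GAE computed with jax.lax.scan can be sketched as below. This is a minimal single-trajectory example under stated assumptions, not the PR's implementation: the function name `gae_scan`, its argument layout, and the default `gamma`/`lam` values are hypothetical.

```python
import jax
import jax.numpy as jnp


def gae_scan(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE for one trajectory (hypothetical helper, for illustration).

    rewards, values, dones: float arrays of shape [T].
    last_value: scalar bootstrap value for the state after the last step.
    """
    # V(s_{t+1}) for every step; the final entry is the bootstrap value.
    next_values = jnp.concatenate([values[1:], jnp.reshape(last_value, (1,))])
    not_done = 1.0 - dones

    def step(next_advantage, xs):
        reward, value, next_value, nd = xs
        # One-step TD error; nd zeroes the bootstrap at episode ends.
        delta = reward + gamma * next_value * nd - value
        advantage = delta + gamma * lam * nd * next_advantage
        return advantage, advantage

    # reverse=True walks the trajectory from the last step to the first while
    # keeping the stacked outputs aligned with the original time order.
    _, advantages = jax.lax.scan(
        step, jnp.zeros(()), (rewards, values, next_values, not_done),
        reverse=True,
    )
    return advantages, advantages + values


rewards = jnp.array([0.0, 0.0, 1.0])
values = jnp.array([0.5, 0.5, 0.5])
dones = jnp.array([0.0, 0.0, 1.0])
# With gamma = lam = 1 the returns reduce to the Monte Carlo return.
advantages, returns = jax.jit(gae_scan)(rewards, values, dones, 0.0, 1.0, 1.0)
```

Because the loop body is a pure function of the carry and the per-step inputs, the whole computation can be wrapped in jax.jit, which is the main practical gain over a Python loop.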

I also included tests for GAE correctness and PRNG reproducibility, and added documentation explaining the design choices.
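The explicit key handling mentioned above could look roughly like the following sketch. The helpers `sample_action` and `rollout_actions` are hypothetical names introduced for the example and are not taken from the PR; the point is only that each sampling call consumes a fresh subkey, so the same seed reproduces the same action sequence.

```python
import jax
import jax.numpy as jnp


def sample_action(key, logits):
    """Split the key, sample one action, and return the carried-forward key."""
    key, subkey = jax.random.split(key)
    action = jax.random.categorical(subkey, logits)
    return key, action


def rollout_actions(seed, logits, num_steps):
    """Sample num_steps actions from a fixed policy, starting from a seed."""
    key = jax.random.PRNGKey(seed)
    actions = []
    for _ in range(num_steps):
        # Never reuse a key: thread the updated key through every call.
        key, action = sample_action(key, logits)
        actions.append(int(action))
    return actions


logits = jnp.log(jnp.array([0.2, 0.5, 0.3]))
run_a = rollout_actions(0, logits, 5)
run_b = rollout_actions(0, logits, 5)  # same seed, so same actions as run_a
```

A reproducibility test can then simply compare two runs started from the same seed.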

Thanks again for the suggestions — they were really helpful in improving both performance and clarity.

alexunderch · 2026-04-29T23:36:23Z

Thank you, I will give the results a look!

alexunderch · 2026-05-09T08:35:16Z

Hey! The kuhn_poker result looks nice!
Can you also:

report exploitability for leduc_poker (in a similar manner)?
report cumulative return plots/entropy for RPS (pyspiel.load_game("matrix_rps")) and the breakout game?
delete the readme file?

P.S. Also, please use jax.Array/chex.Array for type annotations, not jnp.ndarray, because I believe that is the officially recommended way: https://docs.jax.dev/en/latest/_autosummary/jax.Array.html
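A small sketch of what the suggested annotation style looks like in practice. The function `normalize_advantages` is a hypothetical example, not a function from the PR; it only shows jax.Array used both as a type annotation and in an isinstance check.

```python
import jax
import jax.numpy as jnp


def normalize_advantages(advantages: jax.Array, eps: float = 1e-8) -> jax.Array:
    """Standardize advantages to zero mean and unit variance (illustrative)."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)


x = jnp.arange(4.0)
out = normalize_advantages(x)
# jax.Array also works for instance checks on concrete JAX arrays.
is_array = isinstance(out, jax.Array)
```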

Add PPO self-play implementation for OpenSpiel

0673c37

Remove .vs/ IDE files, add .vs/ to .gitignore, improve PPO docstrings Made-with: Cursor

Refactor PPO: vectorized GAE via scan, jax.random, benchmarks, plotting

5c375f5

Made-with: Cursor

Arahan-kujur wants to merge 2 commits into google-deepmind:master from Arahan-kujur:ppo-selfplay
