Deep RL Quick Reference & Cheat Sheet

Unreviewed: no one has checked this end to end. Treat the math, code, and citations as unverified.

Quick reference for deep reinforcement learning algorithms, hyperparameters, and best practices.

Algorithm Comparison Table
When to Use Which Algorithm
Hyperparameter Guide
Key Formulas
Common Bugs & Fixes
Implementation Checklist

Algorithm Comparison Table

Algorithm	Type	Actions	Policy	Sample Efficiency	Stability	Complexity	Use Case
REINFORCE	PG	Discrete/Continuous	Stochastic	Very Low	Medium	Low	Learning, simple tasks
DQN	Value	Discrete only	Deterministic	Medium	Medium	Medium	Atari, discrete control
A2C	Actor-Critic	Discrete/Continuous	Stochastic	Medium	Medium	Medium	General purpose
A3C	Actor-Critic	Discrete/Continuous	Stochastic	Medium	Medium	Medium	Parallel training
TRPO	PG + Trust Region	Discrete/Continuous	Stochastic	Medium	High	Very High	Research, safety-critical
PPO	PG + Clipping	Discrete/Continuous	Stochastic	Medium	High	Medium	General purpose
DDPG	Off-Policy AC	Continuous only	Deterministic	High	Low	Medium	Continuous control
TD3	Off-Policy AC	Continuous only	Deterministic	High	High	Medium	Robotics, continuous control
SAC	Off-Policy AC	Continuous only	Stochastic	Very High	Very High	Medium	Continuous control
Dyna-Q	Model-Based	Discrete	Deterministic	High	Medium	Medium	Sample-limited environments
MBPO	Model-Based + SAC	Continuous	Stochastic	Very High	High	High	Expensive real-world data
Dreamer	Model-Based	Discrete/Continuous	Stochastic	Very High	Medium	Very High	High-dim observations

Legend

PG = Policy Gradient
AC = Actor-Critic
Sample Efficiency: Low = millions of steps, High = tens of thousands
Stability: How robust to hyperparameters
Complexity: Implementation difficulty

When to Use Which Algorithm

Decision Tree

Are actions continuous or discrete?
│
├─ Discrete
│  │
│  ├─ Do you need off-policy (data reuse)?
│  │  ├─ Yes → DQN
│  │  └─ No → PPO
│  │
│  └─ Is sample efficiency critical?
│     ├─ Yes → DQN or Model-Based (Dyna-Q)
│     └─ No → PPO
│
└─ Continuous
   │
   ├─ Do you need maximum sample efficiency?
   │  ├─ Yes → SAC or MBPO
   │  └─ No → PPO
   │
   ├─ Is safety/stability critical?
   │  ├─ Yes → TRPO or PPO
   │  └─ No → SAC or TD3
   │
   └─ Do you have parallel environments?
      ├─ Yes → PPO
      └─ No → SAC

By Domain

Game playing (Atari, board games)

Discrete actions → DQN
Lots of compute → AlphaZero / MuZero (model-based)
Limited compute → PPO

Robotics

Continuous actions → SAC or TD3
Expensive real data → MBPO
Safety critical → TRPO (but slow)
Default choice: SAC

LLM alignment (RLHF)

PPO is the standard choice
On-policy training is needed for the KL penalty; handles sparse rewards well
See ../lectures/10-ppo-for-llms.md for details

Code generation

Use PPO with execution-based rewards
See ../lectures/13-rlhf-code-generation.md for details

Recommender systems

Discrete actions → DQN
Continuous → SAC
Need interpretability → Model-based

Hyperparameter Guide

Learning Rates

Algorithm	Actor LR	Critic LR	Notes
REINFORCE	1e-3 to 1e-4	N/A	Start high, decay
DQN	N/A	1e-4 to 1e-3	Lower for stability
A2C/A3C	3e-4	Same	Shared optimizer
PPO	3e-4	3e-4	Very robust
TD3	3e-4	3e-4	Same for both
SAC	3e-4	3e-4	Can use 1e-3 for faster learning

Rule of thumb: Start with 3e-4, decrease if unstable, increase if too slow.

Discount Factor (γ)

Episode Length	Recommended γ	Reasoning
Short (<100 steps)	0.99	Standard
Medium (100-1000)	0.99	Standard
Long (>1000 steps)	0.995 - 0.999	Need longer horizon
Infinite horizon	0.99	Prevent vanishing
Sparse rewards	0.99 - 0.999	Higher for delayed rewards

Default: γ = 0.99 works for most cases.

Buffer Sizes (Off-Policy Methods)

Environment	Replay Buffer Size	Reasoning
Simple (CartPole)	100K	Small state space
Complex (Atari)	1M	Need diverse experiences
Continuous Control	1M	Standard
Very complex	10M	If you have memory

Default: 1M for most off-policy methods.

Batch Sizes

Algorithm	Batch Size	Notes
REINFORCE	Full episode	By definition
DQN	32 - 128	Smaller for stability
A2C	Full episode or 5-20 steps	N-step returns
PPO	64 - 4096	Depends on parallel envs
TD3/SAC	256 - 1024	Larger is more stable

Default: 256 for off-policy, 64-256 for on-policy.

PPO Specific

Parameter	Recommended	Reasoning
ε (clip)	0.2	Standard, very robust
GAE λ	0.95	Balances bias-variance
Epochs	3-10	More for large batches
Minibatch size	64	For large batch training
Entropy coefficient	0.01	Lower for deterministic tasks
Value coefficient	0.5	Standard

SAC Specific

Parameter	Recommended	Reasoning
τ (soft update)	0.005	Slow target updates
Initial α (entropy)	0.2	Auto-tuned in practice
Target entropy	-dim(action)	Heuristic
Updates per step	1	Can use 2-4 for sample efficiency

Exploration Parameters

Method	Parameter	Recommended	Notes
ε-greedy (DQN)	ε_start	1.0	Full exploration initially
	ε_end	0.01	Minimum exploration
	ε_decay	0.995	Per episode
Gaussian noise (DDPG/TD3)	σ (std)	0.1 - 0.2	As fraction of action range
Entropy (SAC)	α	Auto-tuned	Let it learn

Key Formulas

Value Functions

State Value

V^π(s) = E_π[R_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k r_{t+k} | S_t = s]

Action Value (Q-function)

Q^π(s,a) = E_π[R_t | S_t = s, A_t = a] = E_π[Σ_{k=0}^∞ γ^k r_{t+k} | S_t = s, A_t = a]

Advantage

A^π(s,a) = Q^π(s,a) - V^π(s)

Bellman Equations

Bellman Expectation (Value)

V^π(s) = Σ_a π(a|s) [r(s,a) + γ Σ_{s'} p(s'|s,a) V^π(s')]

Bellman Optimality (Value)

V*(s) = max_a [r(s,a) + γ Σ_{s'} p(s'|s,a) V*(s')]

Bellman Optimality (Q-function)

Q*(s,a) = r(s,a) + γ Σ_{s'} p(s'|s,a) max_{a'} Q*(s',a')

Policy Gradient

REINFORCE

∇_θ J(θ) = E_τ~π_θ [Σ_t ∇_θ log π_θ(a_t|s_t) R_t]

With Baseline

∇_θ J(θ) = E_τ~π_θ [Σ_t ∇_θ log π_θ(a_t|s_t) (R_t - b(s_t))]

Actor-Critic

∇_θ J(θ) = E_τ~π_θ [Σ_t ∇_θ log π_θ(a_t|s_t) A^π(s_t, a_t)]

TD Learning

TD Error

δ_t = r_t + γV(s_{t+1}) - V(s_t)

Q-Learning Update

Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]

SARSA Update

Q(s,a) ← Q(s,a) + α[r + γ Q(s',a') - Q(s,a)]

GAE (Generalized Advantage Estimation)

A_t^GAE = Σ_{l=0}^∞ (γλ)^l δ_{t+l}

where δ_t = r_t + γV(s_{t+1}) - V(s_t)

PPO

Clipped Objective

L^CLIP(θ) = E_t[min(r_t(θ)Â_t, clip(r_t(θ), 1-ε, 1+ε)Â_t)]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

SAC (Maximum Entropy)

Objective

J(π) = E_τ~π[Σ_t r(s_t, a_t) + α H(π(·|s_t))]

where H(π(·|s)) = -Σ_a π(a|s) log π(a|s)

Soft Bellman Backup

Q*(s,a) = r(s,a) + γ E_{s'}[V*(s')]
V*(s) = E_{a~π}[Q*(s,a) - α log π(a|s)]

Common Bugs & Fixes

Bug 1: Training Doesn't Start / Stuck at Random Policy

Symptoms:

Reward stays at random baseline
Policy doesn't improve after many episodes

Possible Causes:

Learning rate too low
Rewards not normalized
Gradient vanishing
Network architecture too complex

Fixes:

# Fix 1: Increase learning rate
optimizer = Adam(params, lr=1e-3)  # Try 10x higher

# Fix 2: Normalize rewards
rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Fix 3: Check gradient norms
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
print(f"Grad norm: {grad_norm}")  # Should be 0.1-10

# Fix 4: Simplify network
# Use 2 layers with 64-128 hidden units instead of deeper

Bug 2: Training Collapses / Performance Drops

Symptoms:

Was learning, suddenly drops to random
Can't recover

Possible Causes:

Learning rate too high
No gradient clipping
PPO: too many epochs or no early stopping
Off-policy: target network not updating

Fixes:

# Fix 1: Lower learning rate or use scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)

# Fix 2: Clip gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)

# Fix 3: PPO early stopping on KL divergence
if kl > 1.5 * target_kl:
    break  # Stop epoch early

# Fix 4: Check target network updates
if step % target_update_freq == 0:
    target_net.load_state_dict(policy_net.state_dict())

Bug 3: High Variance / Unstable Learning

Symptoms:

Reward has huge swings
Can't tell if learning

Possible Causes:

No advantage normalization
Batch size too small
No entropy regularization
γ too high for sparse rewards

Fixes:

# Fix 1: Always normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Fix 2: Increase batch size
batch_size = 256  # Instead of 32

# Fix 3: Add entropy bonus
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

# Fix 4: Lower gamma for sparse rewards
gamma = 0.95  # Instead of 0.99

Bug 4: Off-Policy Methods Not Learning

Symptoms:

DQN/SAC/TD3 stuck at random

Possible Causes:

Replay buffer not full enough
Updating too early
Target network issues
Action scaling wrong

Fixes:

# Fix 1: Warmup period
if len(replay_buffer) < 10000:
    continue  # Don't train yet

# Fix 2: Update frequency
if step % 4 == 0:  # Update every 4 steps, not every step
    agent.update()

# Fix 3: Soft update target networks
for target_param, param in zip(target_net.parameters(), net.parameters()):
    target_param.data.copy_(0.005 * param.data + 0.995 * target_param.data)

# Fix 4: Check action range
# Network outputs in [-1, 1], scale to actual range
action = network(state) * max_action

Bug 5: Works in Training, Fails in Eval

Symptoms:

Good reward during training
Terrible at test time

Possible Causes:

Using training mode at test time
Exploration noise at test time
Overfitting to specific states

Fixes:

# Fix 1: Set eval mode
model.eval()  # Disables dropout, batch norm

# Fix 2: No noise during evaluation
if training:
    action = policy(state) + noise
else:
    action = policy(state)  # Deterministic

# Fix 3: More diverse training data
# Increase exploration, vary initial states

Implementation Checklist

Before Training

Environment wrapper: Normalized observations? Normalized rewards?
Action space: Continuous in [-1, 1]? Discrete properly handled?
Network architecture: Not too deep (2-3 layers usually enough)
Initialization: Orthogonal or Xavier initialization
Optimizer: Adam with lr=3e-4 is good default
Gradient clipping: Max norm = 0.5-10
Hyperparameters: Copied from paper or baseline?

During Training

Logging: Track episode reward, loss, entropy, grad norms
Visualization: Plot rewards every 10-50 episodes
Checkpointing: Save model every N episodes
Early stopping: If training diverges, stop and debug
Ablation: Test each component separately

Debugging

Sanity checks: Can it learn a trivial task? (e.g., CartPole)
Gradient check: Are gradients flowing? Not exploding?
Overfit single episode: Can it memorize one trajectory?
Hyperparameter sweep: Try 3-5 learning rates
Compare to baseline: Does it match published results?

Code Quality

Vectorized operations: Use batch processing, not loops
Reproducibility: Set random seeds
Documentation: Comment complex sections
Tests: Unit tests for network forward pass, advantage computation
Profiling: Is training time reasonable? Any bottlenecks?

Quick Troubleshooting

My algorithm is slower than paper says:

Check if you're using GPU (model.to('cuda'))
Vectorize environment (multiple parallel envs)
Profile code to find bottlenecks
Use mixed precision training

My results don't match the paper:

Verify exact same environment version
Check if paper uses pre-processing (frame stacking, normalization)
Try their exact hyperparameters
Check if paper does hyperparameter search (published = best of many runs)

It works on CartPole but not my task:

Your task might need more capacity (bigger network)
Or domain-specific tricks (frame stacking for vision)
Or longer training (Atari = 10M steps, not 1M)
Or different algorithm (continuous control needs SAC/TD3, not DQN)

Training takes forever:

Use vectorized environments (10-100x speedup)
Use GPU if available
Try more sample-efficient algorithm (off-policy > on-policy)
Consider distributed training (Ray RLlib, Stable-Baselines3)

Recommended Baselines

Prefer these over implementing from scratch:

Stable-Baselines3 (Python/PyTorch)

pip install stable-baselines3

Best for: Getting results quickly
Algorithms: PPO, A2C, DQN, SAC, TD3
Quality: Production-ready, well-tested

CleanRL (Python/PyTorch)

pip install cleanrl

Best for: Learning implementations
Algorithms: Most major algorithms
Quality: Single-file implementations, very readable

Ray RLlib (Python/PyTorch)

pip install ray[rllib]

Best for: Distributed training, scaling up
Algorithms: Everything
Quality: Production-ready, used in industry

Spinning Up (Python/PyTorch+TF)

Best for: Educational implementations
Algorithms: VPG, PPO, TRPO, DDPG, TD3, SAC
Quality: Clean, well-documented

Further Learning

Must-read papers (in order)

DQN (2015): Nature paper that started deep RL, arXiv:1312.5602
A3C (2016): Parallel actor-critic
TRPO (2015): Trust regions, arXiv:1502.05477
GAE (2015): Generalized advantage estimation, arXiv:1506.02438
PPO (2017): Practical policy gradients, arXiv:1707.06347
DDPG (2015): Deterministic continuous control, arXiv:1509.02971
TD3 (2018): Addressing function approximation errors, arXiv:1802.09477
SAC (2018): Maximum entropy RL, arXiv:1801.01290
Double DQN (2015): Reducing overestimation bias, arXiv:1509.06461
Dueling DQN (2015): Separate value/advantage streams, arXiv:1511.06581
DPO (2023): Direct preference optimization, arXiv:2305.18290
GRPO / DeepSeekMath (2024): Group relative policy optimization, arXiv:2402.03300
InstructGPT (2022): RLHF for LLM alignment, arXiv:2203.02155
Constitutional AI (2022): RLAIF, arXiv:2212.08073
DeepSeek-R1 (2025): RL for reasoning, arXiv:2501.12948

Resources

Courses:

CS285 (Berkeley): rail.eecs.berkeley.edu/deeprlcourse
Spinning Up: spinningup.openai.com

Books:

Sutton & Barto: "Reinforcement Learning: An Introduction" (Free online)
Sergey Levine's lecture notes (CS285)

Blogs:

Lilian Weng: lilianweng.github.io
OpenAI Spinning Up: spinningup.openai.com/en/latest/

Communities:

r/reinforcementlearning
RL Discord servers
Papers With Code (RL section)

Quick reference card

┌─────────────────────────────────────────────────────────┐
│                    DEEP RL CHEAT SHEET                  │
├─────────────────────────────────────────────────────────┤
│ ALGORITHM SELECTION                                     │
│                                                         │
│ Discrete actions:                                       │
│   • Sample efficient? → DQN                            │
│   • General purpose? → PPO                             │
│                                                         │
│ Continuous actions:                                     │
│   • Best performance? → SAC                            │
│   • General purpose? → PPO                             │
│   • Need deterministic? → TD3                          │
│                                                         │
│ Limited real data? → Model-Based (MBPO)                │
│ LLM training? → PPO                                     │
├─────────────────────────────────────────────────────────┤
│ HYPERPARAMETERS (defaults)                              │
│                                                         │
│ Learning rate:    3e-4                                  │
│ Batch size:       256 (off-policy), 64 (on-policy)    │
│ Gamma (γ):        0.99                                 │
│ Replay buffer:    1M                                    │
│ PPO clip (ε):     0.2                                  │
│ GAE lambda (λ):   0.95                                 │
│ Gradient clip:    0.5-10                               │
├─────────────────────────────────────────────────────────┤
│ DEBUGGING CHECKLIST                                     │
│                                                         │
│ ☐ Normalize observations & rewards                     │
│ ☐ Clip gradients                                       │
│ ☐ Normalize advantages                                 │
│ ☐ Add entropy bonus (0.01)                            │
│ ☐ Use orthogonal init                                  │
│ ☐ Check action scaling                                 │
│ ☐ Replay buffer warmup (off-policy)                   │
│ ☐ Test on CartPole first                              │
├─────────────────────────────────────────────────────────┤
│ COMMON FORMULAS                                         │
│                                                         │
│ Advantage: A(s,a) = Q(s,a) - V(s)                     │
│ TD error: δ = r + γV(s') - V(s)                       │
│ PPO ratio: r = π_new(a|s) / π_old(a|s)               │
│ Clipped obj: min(r·A, clip(r,1-ε,1+ε)·A)            │
└─────────────────────────────────────────────────────────┘

Part of the Deep RL lecture series. See ../lectures/ for full lecture notes.

FilesExpand file tree

RL-Quick-Reference.md

Latest commit

History

RL-Quick-Reference.md

File metadata and controls

Deep RL Quick Reference & Cheat Sheet

Table of Contents

Algorithm Comparison Table

Legend

When to Use Which Algorithm

Decision Tree

By Domain

Hyperparameter Guide

Learning Rates

Discount Factor (γ)

Buffer Sizes (Off-Policy Methods)

Batch Sizes

PPO Specific

SAC Specific

Exploration Parameters

Key Formulas

Value Functions

Bellman Equations

Policy Gradient

TD Learning

GAE (Generalized Advantage Estimation)

PPO

SAC (Maximum Entropy)

Common Bugs & Fixes

Bug 1: Training Doesn't Start / Stuck at Random Policy

Bug 2: Training Collapses / Performance Drops

Bug 3: High Variance / Unstable Learning

Bug 4: Off-Policy Methods Not Learning

Bug 5: Works in Training, Fails in Eval

Implementation Checklist

Before Training

During Training

Debugging

Code Quality

Quick Troubleshooting

Recommended Baselines

Further Learning

Must-read papers (in order)

Resources

Quick reference card