Reinforcement learning studies how agents learn to make decisions through interaction with an environment. Unlike supervised learning, where models learn from fixed input-label pairs, reinforcement learning focuses on sequential decision making, where actions affect future states and rewards.

For my PhD preparation, reinforcement learning is important because it connects perception with action. Although my main research focuses on 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models, autonomous agents ultimately need to use perception outputs for planning, control, and decision making.

This note is my long-term study record for reinforcement learning and decision making.

Roadmap

This note is organized into the following chapters:

1. Sequential Decision Making: agent-environment interaction, states, actions, rewards, policies, trajectories, and return.
2. Markov Decision Processes: MDP formulation, transition dynamics, reward functions, value functions, Bellman equations, and optimality.
3. Dynamic Programming: policy evaluation, policy improvement, policy iteration, value iteration, and planning with known models.
4. Model-Free Prediction and Control: Monte Carlo methods, Temporal-Difference learning, SARSA, Q-learning, and eligibility traces.
5. Deep Reinforcement Learning: DQN, policy gradients, actor-critic methods, PPO, SAC, and training stability issues.
6. Model-Based Reinforcement Learning: learning dynamics models, planning with learned models, world models, imagination rollouts, and uncertainty.
7. Multi-Agent Reinforcement Learning: cooperative and competitive settings, decentralized execution, centralized training, communication, and coordination.
8. Connections to Autonomous Driving and Embodied AI: planning, control, world models, perception-action loops, and safety-critical decision making.

1. Sequential Decision Making

Reinforcement learning models the problem of an agent interacting with an environment over time.

At each time step (t):

the agent observes a state (s_t);
the agent selects an action (a_t);
the environment transitions to the next state (s_{t+1});
the agent receives a reward (r_t).

This loop can be written as:

\[s_t \rightarrow a_t \rightarrow r_t, s_{t+1}.\]

The goal is to learn a policy that maximizes long-term reward.
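To make the interaction loop concrete for myself, here is a minimal sketch. The `LineWorld` environment, its `reset`/`step` methods, and the always-move-right policy are all hypothetical toy choices of mine, not taken from any library.

```python
class LineWorld:
    """Hypothetical 1D world with states 0..size-1 and a goal at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the state is clipped to the grid.
        self.state = min(max(self.state + action, 0), self.size - 1)
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=20):
    """Run the loop s_t -> a_t -> r_t, s_{t+1} and record (s_t, a_t, r_t)."""
    trajectory = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory


traj = run_episode(LineWorld(), policy=lambda s: +1)
print(traj)
```

Each recorded tuple is one pass through the loop: the state observed, the action chosen, and the reward received.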

1.1 Agent and Environment

An agent is the decision maker. An environment is everything the agent interacts with.

Examples:

In a game, the agent is the player and the environment is the game world.
In robotics, the agent is the robot and the environment is the physical world.
In autonomous driving, the agent is the vehicle and the environment includes roads, traffic participants, signals, and maps.

For embodied AI, this interaction loop is central because perception, memory, decision making, and action are coupled.

1.2 State, Action, and Reward

A state (s_t) describes the situation at time (t).

Examples:

robot pose and sensor observations;
vehicle speed and surrounding objects;
BEV occupancy representation;
memory state of an embodied agent.

An action (a_t) is what the agent chooses to do.

Examples:

move forward;
turn left;
accelerate or brake;
query another agent for information;
select a token budget for communication.

A reward (r_t) is a scalar signal that measures how good the immediate outcome of the action is.

Examples:

reaching a goal;
avoiding collision;
reducing travel time;
improving perception accuracy;
reducing communication cost.

1.3 Policy

A policy defines how the agent chooses actions.

A deterministic policy is:

\[a_t = \pi(s_t).\]

A stochastic policy is:

\[a_t \sim \pi(a|s_t).\]

In deep reinforcement learning, the policy is often represented by a neural network:

\[\pi_\theta(a|s).\]

The policy is the main object being learned.

1.4 Trajectory and Return

A trajectory is a sequence of states, actions, and rewards:

\[\tau = (s_0,a_0,r_0,s_1,a_1,r_1,\ldots).\]

The return is the discounted sum of future rewards:

\[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},\]

where (\gamma \in [0,1]) is the discount factor. For an infinite horizon, (\gamma < 1) keeps the sum finite when rewards are bounded.

The discount factor controls how much the agent cares about future rewards.

If (\gamma) is close to 0, the agent focuses on immediate reward.
If (\gamma) is close to 1, the agent considers long-term reward.
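A small sketch of how the return behaves for different discount factors, assuming a finite reward sequence (the function name `discounted_return` and the reward values are my own illustration):

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma^k * r_k for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 0.5))  # 1.75 = 1 + 0.5 + 0.25
print(discounted_return(rewards, 1.0))  # 3.0: undiscounted sum
```

The backward recursion G_t = r_t + \gamma G_{t+1} is the same structure that reappears in the Bellman equation.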

2. Markov Decision Processes

A Markov Decision Process, or MDP, is the standard mathematical framework for reinforcement learning.

An MDP is defined by:

\[\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),\]

where:

(\mathcal{S}) is the state space;
(\mathcal{A}) is the action space;
(P(s'|s,a)) is the transition probability;
(R(s,a,s')) is the reward function, written (R(s,a)) when the reward does not depend on the next state;
(\gamma) is the discount factor.

2.1 Markov Property

The Markov property means that the future depends only on the current state and action, not the full history:

\[P(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\ldots)=P(s_{t+1}|s_t,a_t).\]

This assumption lets the agent base its decisions on the current state alone, and it is what makes the recursive Bellman equations possible.

In real-world robotics and autonomous driving, the true environment may be partially observable. The agent may not know the full state due to occlusion, sensor noise, and limited field of view. In that case, memory and state estimation become important.

2.2 Value Function

The state-value function measures the expected return starting from state (s) and following policy (\pi):

\[V^\pi(s)=\mathbb{E}_\pi[G_t|s_t=s].\]

It tells us how good it is to be in a state.

The action-value function measures the expected return after taking action (a) in state (s) and following policy (\pi) afterwards:

\[Q^\pi(s,a)=\mathbb{E}_\pi[G_t|s_t=s,a_t=a].\]

It tells us how good an action is under a given policy.

2.3 Bellman Equation

The Bellman equation expresses value recursively.

For a policy (\pi):

\[V^\pi(s)=\sum_a \pi(a|s)\sum_{s'}P(s'|s,a) \left[R(s,a,s')+\gamma V^\pi(s')\right].\]

This equation says:

The value of a state equals the expected immediate reward plus the discounted expected value of the next state.

The Bellman equation is the foundation of dynamic programming, temporal-difference learning, and value-based RL.

2.4 Bellman Optimality Equation

The optimal value function satisfies:

\[V^*(s)=\max_a \sum_{s'}P(s'|s,a) \left[R(s,a,s')+\gamma V^*(s')\right].\]

The optimal action-value function satisfies:

\[Q^*(s,a)=\sum_{s'}P(s'|s,a) \left[R(s,a,s')+\gamma \max_{a'}Q^*(s',a')\right].\]

Once (Q^*) is known, the optimal policy is:

\[\pi^*(s)=\arg\max_a Q^*(s,a).\]

3. Dynamic Programming

Dynamic programming solves MDPs when the transition model and reward function are known.

It provides the conceptual foundation for many reinforcement learning algorithms.

3.1 Policy Evaluation

Policy evaluation computes (V^\pi) for a fixed policy (\pi).

The iterative update is:

\[V_{k+1}(s)=\sum_a \pi(a|s)\sum_{s'}P(s'|s,a) \left[R(s,a,s')+\gamma V_k(s')\right].\]

Repeating this update converges to (V^\pi), because the update is a contraction when (\gamma < 1).
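As a sketch of the iterative update, assume a hypothetical 2-state, 2-action MDP of my own: action 0 stays, action 1 switches state, and landing in state 1 pays reward 1. The arrays `P[s, a, s']` and `R[s, a, s']` and the uniform random policy are illustrative choices.

```python
import numpy as np


def policy_evaluation(P, R, pi, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V_{k+1}(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R + gamma V_k(s')]."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        # Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical toy MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0  # reward 1 whenever the next state is state 1

pi = np.full((2, 2), 0.5)  # uniform random policy
V = policy_evaluation(P, R, pi, gamma=0.9)
print(V)  # approximately [5, 5]
```

Under the random policy the agent lands in state 1 half the time, so the fixed point solves V = 0.5 + 0.9 V, which gives V = 5 for both states.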

3.2 Policy Improvement

After evaluating a policy, we can improve it by choosing actions with higher value:

\[\pi_{new}(s)=\arg\max_a Q^\pi(s,a).\]

The policy improvement theorem guarantees that the new policy is no worse than the old policy.

3.3 Policy Iteration

Policy iteration alternates between:

policy evaluation;
policy improvement.

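For finite MDPs, this process converges to an optimal policy in a finite number of iterations: each improvement step yields a strictly better policy until no further improvement is possible, and there are only finitely many deterministic policies.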
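A minimal sketch of the alternation, assuming a hypothetical 2-state MDP of my own (action 0 stays, action 1 switches, landing in state 1 pays 1). Here evaluation solves the linear Bellman system exactly instead of iterating; that is my simplification for a tiny state space.

```python
import numpy as np


def evaluate(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for a deterministic policy."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, policy]                          # [S, S'] under the policy
    r_pi = np.sum(P_pi * R[idx, policy], axis=1)   # expected one-step reward
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def improve(P, R, V, gamma):
    """Greedy improvement: pick argmax_a Q(s, a) in every state."""
    Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)


def policy_iteration(P, R, gamma, max_iters=100):
    policy = np.zeros(P.shape[0], dtype=int)  # start with "always action 0"
    for _ in range(max_iters):
        V = evaluate(P, R, policy, gamma)
        new_policy = improve(P, R, V, gamma)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V


# Hypothetical toy MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

policy, V = policy_iteration(P, R, gamma=0.9)
print(policy, V)
```

On this toy problem the loop stops after two evaluations: the policy becomes "switch in state 0, stay in state 1", and both states have value 10.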

3.4 Value Iteration

Value iteration combines evaluation and improvement into a single update:

\[V_{k+1}(s)=\max_a \sum_{s'}P(s'|s,a) \left[R(s,a,s')+\gamma V_k(s')\right].\]

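Each sweep applies the Bellman optimality backup, so the estimates converge to (V^*) without maintaining an explicit policy. A greedy policy can be read off from the final values.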
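A sketch of value iteration on a hypothetical 2-state MDP of my own (action 0 stays, action 1 switches, landing in state 1 pays 1). The tolerance and iteration cap are arbitrary illustrative values.

```python
import numpy as np


def q_from_v(P, R, V, gamma):
    """Q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])."""
    return np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        V_new = q_from_v(P, R, V, gamma).max(axis=1)  # max over actions
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    greedy_policy = q_from_v(P, R, V, gamma).argmax(axis=1)
    return V, greedy_policy


# Hypothetical toy MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

V_star, pi_star = value_iteration(P, R, gamma=0.9)
print(V_star, pi_star)  # values near [10, 10], greedy policy [1, 0]
```

The optimal behavior is to reach state 1 and stay there, collecting reward 1 per step, so the optimal value is 1 / (1 - 0.9) = 10.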

4. Model-Free Prediction and Control

In many real problems, the transition model (P(s'|s,a)) is unknown. Model-free reinforcement learning learns from sampled experience instead of relying on a known model.

4.1 Monte Carlo Methods

Monte Carlo methods estimate value functions from complete episodes.

For a state (s), the value can be estimated by averaging observed returns:

\[V(s) \leftarrow \frac{1}{N(s)}\sum_{i=1}^{N(s)}G_i.\]

Advantages:

simple;
model-free;
unbiased estimates.

Limitations:

requires complete episodes;
high variance;
slow learning.
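A sketch of first-visit Monte Carlo prediction. I assume each episode is a list of `(state, reward)` pairs, where the reward is the one received after acting in that state; the episodes below are made-up data for illustration.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t for every time step by scanning backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        # Only the first occurrence of each state in the episode contributes.
        seen = set()
        for state, g in returns:
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += g
            visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0), ("B", 2.0)],
]
V = first_visit_mc(episodes, gamma=0.5)
print(V)  # {'A': 0.5, 'B': 1.0}
```

Note that the estimate can only be updated once an episode has finished, which is the "requires complete episodes" limitation listed above.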

4.2 Temporal-Difference Learning

Temporal-Difference learning updates value estimates using bootstrapping.

TD(0) update:

\[V(s_t) \leftarrow V(s_t)+\alpha \left[r_t+\gamma V(s_{t+1})-V(s_t)\right].\]

The term:

\[\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)\]

is the TD error.

TD learning combines ideas from Monte Carlo methods and dynamic programming.
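The TD(0) update for a single transition can be sketched as follows. The dictionary-based value table and the `terminal` flag (which drops the bootstrap term at episode end) are my own assumptions for the example.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update in place and return the TD error delta_t."""
    # Bootstrapping: use the current estimate V[s_next] instead of the full return.
    bootstrap = 0.0 if terminal else gamma * V[s_next]
    delta = r + bootstrap - V[s]
    V[s] += alpha * delta
    return delta


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(delta, V["A"])  # delta is about 1.4, V["A"] moves to about 0.14
```

Unlike the Monte Carlo estimate, this update is available after every single step, at the cost of relying on a possibly inaccurate V(s_{t+1}).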

4.3 SARSA

SARSA is an on-policy control algorithm.

The update is:

\[Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left[r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right].\]

It is called SARSA because it uses the tuple:

\[(s_t,a_t,r_t,s_{t+1},a_{t+1}).\]

SARSA learns the value of the policy actually being followed.

4.4 Q-Learning

Q-learning is an off-policy control algorithm.

The update is:

\[Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha \left[r_t+\gamma \max_{a'}Q(s_{t+1},a')-Q(s_t,a_t)\right].\]

In the tabular setting, Q-learning converges to the optimal action-value function regardless of the behavior policy, provided that every state-action pair keeps being visited and the step size decays appropriately.
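The only difference between the SARSA and Q-learning updates is the bootstrap target. A minimal sketch on a made-up Q-table (all numbers and function names are illustrative):

```python
import numpy as np


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: uses the action a_next actually chosen in s_next."""
    return r + gamma * Q[s_next, a_next]


def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: uses the greedy action in s_next."""
    return r + gamma * np.max(Q[s_next])


def td_update(Q, s, a, target, alpha):
    """Shared update rule: move Q(s, a) a step of size alpha toward the target."""
    Q[s, a] += alpha * (target - Q[s, a])


Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])
r, gamma = 1.0, 0.5

# Same transition (s=0, a=0, r=1, s'=1); the behavior policy picked a'=0.
print(sarsa_target(Q, r, s_next=1, a_next=0, gamma=gamma))  # 1.5
print(q_learning_target(Q, r, s_next=1, gamma=gamma))       # 2.5

td_update(Q, s=0, a=0, target=q_learning_target(Q, r, 1, gamma), alpha=0.1)
print(Q[0, 0])  # 0.25
```

When the behavior policy happens to pick the greedy action, the two targets coincide; they differ exactly on exploratory steps.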

4.5 Exploration and Exploitation

Reinforcement learning must balance exploration and exploitation.

Exploration: trying new actions to gather information.
Exploitation: choosing actions that currently seem best.

A common strategy is (\epsilon)-greedy exploration:

with probability (\epsilon), choose a random action;
with probability (1-\epsilon), choose the best action.

This trade-off is central to decision making.
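A minimal sketch of (\epsilon)-greedy action selection over a list of action values. The function name and the optional `rng` argument (for reproducibility) are my own choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest action value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]
print(epsilon_greedy(q_values, epsilon=0.0))  # 1: always greedy
print(epsilon_greedy(q_values, epsilon=0.1, rng=random.Random(0)))
```

In practice (\epsilon) is often decayed over training so that the agent explores early and exploits later.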

5. Deep Reinforcement Learning

Deep reinforcement learning uses neural networks to approximate value functions, policies, or models.

This makes RL applicable to high-dimensional inputs such as images, LiDAR, BEV maps, and occupancy grids.

5.1 Deep Q-Network

DQN approximates the action-value function using a neural network:

\[Q_\theta(s,a).\]

The target is:

\[y = r + \gamma \max_{a'} Q_{\theta^-}(s',a'),\]

where (\theta^-) are target network parameters.

The loss is:

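\[\mathcal{L}(\theta)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\left[\left(y-Q_\theta(s,a)\right)^2\right],\]

where (\mathcal{D}) is the experience replay buffer from which minibatches are sampled.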

Key techniques:

experience replay;
target network;
reward clipping;
epsilon-greedy exploration.
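The target and loss computation for one minibatch can be sketched with plain NumPy arrays standing in for network outputs. The `dones` mask, which drops the bootstrap term at terminal transitions, is a common practical detail that I am assuming here; it does not appear in the formula above.

```python
import numpy as np


def dqn_targets(rewards, next_q_target, dones, gamma):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminals."""
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next


def dqn_loss(q_online, actions, targets):
    """Mean squared error between targets and Q_theta(s, a) for the taken actions."""
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)


# Made-up minibatch of two transitions with two actions each.
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
next_q_target = np.array([[0.0, 2.0], [1.0, 3.0]])  # from the target network
q_online = np.array([[1.0, 0.5], [0.0, 1.0]])       # from the online network
actions = np.array([0, 1])

y = dqn_targets(rewards, next_q_target, dones, gamma=0.5)
print(y)                                # [2. 0.]
print(dqn_loss(q_online, actions, y))   # 1.0
```

In a real implementation the targets are treated as constants, so gradients flow only through (Q_\theta(s,a)).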

5.2 Policy Gradient

Policy gradient methods directly optimize the policy.

The objective is:

\[J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[G(\tau)].\]

The policy gradient theorem gives:

\[\nabla_\theta J(\theta)= \mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi_\theta(a|s)Q^{\pi}(s,a)\right].\]

Policy gradient methods are useful for continuous control and stochastic policies.
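To see the score-function term concretely, here is a sketch for a softmax policy over a single state, with the logits as parameters. Replacing (Q^\pi(s,a)) with sampled returns gives a REINFORCE-style estimate; that substitution and the toy numbers are my assumptions.

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    grad = -softmax(logits)
    grad[action] += 1.0
    return grad


def reinforce_gradient(logits, actions, returns):
    """Monte Carlo estimate of E[grad log pi(a) * G] from sampled (a, G) pairs."""
    grads = [grad_log_softmax(logits, a) * g for a, g in zip(actions, returns)]
    return np.mean(grads, axis=0)


logits = np.zeros(2)  # uniform policy over two actions
g = reinforce_gradient(logits, actions=[0, 1], returns=[1.0, 0.0])
print(g)  # [ 0.25 -0.25]: raise the logit of the action that earned reward
```

A gradient ascent step along this estimate increases the probability of action 0, which received the higher return.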

5.3 Actor-Critic Methods

Actor-critic methods combine policy learning and value learning.

The actor represents the policy (\pi_\theta(a|s)).
The critic estimates a value function (V_\phi(s)) or (Q_\phi(s,a)).

The critic helps reduce variance in policy gradient estimation.

Common actor-critic algorithms include:

A2C;
A3C;
DDPG;
TD3;
SAC;
PPO.

5.4 PPO

Proximal Policy Optimization, or PPO, is a widely used policy optimization algorithm.

It uses a clipped objective to prevent overly large policy updates:

\[\mathcal{L}^{CLIP}(\theta)= \mathbb{E}_t \left[ \min \left( r_t(\theta)\hat{A}_t, \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t \right) \right],\]

where:

\[r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}.\]

PPO is popular because it is relatively stable and easy to implement compared with many earlier policy gradient methods.
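A sketch of the clipped surrogate objective in NumPy, taking log-probabilities and advantage estimates as given arrays. The inputs are made-up numbers, and `clip_eps=0.2` is just a commonly used value.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))


logp_old = np.zeros(2)
logp_new = np.log(np.array([1.5, 0.5]))  # probability ratios 1.5 and 0.5
advantages = np.array([1.0, -1.0])

# Sample 1: ratio 1.5, A > 0 -> clipped to 1.2 * 1.0 = 1.2
# Sample 2: ratio 0.5, A < 0 -> min(-0.5, 0.8 * -1.0) = -0.8
print(ppo_clip_objective(logp_new, logp_old, advantages))  # about 0.2
```

In both samples the clipped term is selected, so the gradient with respect to the new policy vanishes there and the update cannot push the ratio further outside the trust interval.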

5.5 SAC

Soft Actor-Critic, or SAC, maximizes both reward and entropy.

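The objective adds an entropy bonus, weighted by a temperature (\alpha) (not the learning rate (\alpha) of the tabular updates), which encourages exploration: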

\[J(\pi)=\sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi} \left[r(s_t,a_t)+\alpha \mathcal{H}(\pi(\cdot|s_t))\right].\]

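SAC is widely used in continuous control tasks because it is off-policy, which lets it reuse replayed experience and makes it relatively sample-efficient, and because it tends to train stably.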

6. Model-Based Reinforcement Learning

Model-based RL learns or uses a model of the environment dynamics.

Instead of learning only a policy or value function, the agent also learns a deterministic dynamics model:

\[\hat{s}_{t+1}=f_\theta(s_t,a_t),\]

or a probabilistic model:

\[p_\theta(s_{t+1}|s_t,a_t).\]

6.1 Planning with a Model

If the model is known or learned, the agent can plan by simulating future trajectories.

A typical finite-horizon planning objective, with each (s_{t+1}) obtained by rolling the model forward from (s_t) and (a_t), is:

\[a_{0:T}^* = \arg\max_{a_{0:T}} \sum_{t=0}^{T} \gamma^t r(s_t,a_t).\]

This connects RL with classical planning and control.
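For a tiny discrete problem, the planning objective can be solved by brute force. The sketch below enumerates every action sequence; the 1D integer world, the `model` and `reward_fn` functions, and the goal position are hypothetical choices of mine, and `horizon` counts the number of actions.

```python
from itertools import product


def plan_exhaustive(state, model, reward_fn, actions, horizon, gamma):
    """Return the action sequence with the highest discounted model return."""
    best_seq, best_return = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for t, a in enumerate(seq):
            total += (gamma ** t) * reward_fn(s, a)
            s = model(s, a)  # roll the model forward
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq, best_return


GOAL = 2


def model(s, a):
    return s + a  # hypothetical known dynamics on the integer line


def reward_fn(s, a):
    return -abs(model(s, a) - GOAL)  # negative distance to goal after moving


seq, ret = plan_exhaustive(0, model, reward_fn, actions=(-1, 0, 1),
                           horizon=3, gamma=1.0)
print(seq, ret)  # (1, 1, 0) -1.0
```

Enumeration grows exponentially with the horizon, which is why practical planners use sampling or gradient-based trajectory optimization instead.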

6.2 World Models

A world model learns a compact representation of environment dynamics.

A typical world model contains:

an encoder that maps observations to latent states;
a dynamics model that predicts future latent states;
a reward model;
sometimes a decoder that reconstructs observations.

For perception research, world models are especially interesting because they connect visual understanding with future prediction.

In occupancy world models, the latent state may be a 3D or 4D occupancy representation rather than a simple vector.

6.3 Imagination Rollouts

A learned dynamics model can generate imagined future trajectories:

\[z_t \rightarrow z_{t+1} \rightarrow z_{t+2} \rightarrow \cdots.\]

These rollouts can be used for planning or policy learning.

The challenge is that model errors may accumulate over time.

Therefore, uncertainty estimation is important in model-based RL.
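A toy illustration of how a small one-step model error compounds over a rollout. Both step functions are hypothetical: the "true" dynamics is a simple integrator, and the "learned" model has a 10% gain error.

```python
def rollout(step_fn, s0, actions):
    """Roll a one-step model forward over a fixed action sequence."""
    states = [s0]
    for a in actions:
        states.append(step_fn(states[-1], a))
    return states


def true_step(s, a):
    return s + a  # hypothetical true dynamics


def learned_step(s, a):
    return s + 1.1 * a  # hypothetical learned model with a small gain error


actions = [1.0] * 5
real = rollout(true_step, 0.0, actions)
imagined = rollout(learned_step, 0.0, actions)
errors = [abs(i - r) for i, r in zip(imagined, real)]
print(errors)  # grows by roughly 0.1 per step, reaching about 0.5
```

Here the one-step error is constant, yet the gap between imagined and real states keeps growing with the rollout length, which is one reason short imagination horizons are common.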

6.4 Connection to Occupancy Forecasting

Future occupancy forecasting can be viewed as a form of world modeling.

Given past observations:

\[O_{1:t},\]

the model predicts future occupancy states:

\[O_{t+1:t+H}.\]

This is not always trained with RL, but it provides the predictive state representation that can support planning and decision making.

7. Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning studies environments with multiple decision-making agents.

This is important for traffic, robotics, games, and collaborative perception.

7.1 Cooperative and Competitive Settings

In cooperative MARL, agents share a common goal.

Examples:

robot teams;
multi-vehicle coordination;
collaborative perception;
distributed sensing.

In competitive MARL, agents have conflicting goals.

Examples:

games;
adversarial driving scenarios;
pursuit-evasion tasks.

Mixed settings are also common.

7.2 Centralized Training and Decentralized Execution

A common MARL paradigm is centralized training with decentralized execution.

During training, agents may access global information. During execution, each agent acts based on local observations and communication.

This is useful because real-world agents often cannot access full global state at deployment.

7.3 Communication in Multi-Agent Systems

Communication is a central problem in multi-agent systems.

Key questions:

who should communicate?
when should agents communicate?
what information should be sent?
how much information should be sent?
how should received information influence decisions?

These questions are closely related to my research on communication-efficient collaborative perception.

Although my current work focuses on perception rather than RL control, the communication problem has similar structure: agents must exchange useful information under bandwidth constraints.

8. Connections to Autonomous Driving and Embodied AI

Reinforcement learning and decision making are important for autonomous agents, but they must be connected carefully with perception.

8.1 Autonomous Driving Decision Making

Autonomous driving decision making includes:

lane keeping;
lane changing;
car following;
obstacle avoidance;
merging;
intersection handling;
emergency braking.

These tasks require long-term reasoning and safety awareness.

However, pure RL is difficult to apply to real autonomous driving because of safety risks during exploration, poor sample efficiency, limited interpretability, and the sim-to-real gap.

Therefore, many systems combine learning-based perception with rule-based planning, optimization-based planning, or imitation learning.

8.2 Embodied AI

Embodied AI agents perceive, remember, decide, and act in physical or simulated environments.

Typical tasks include:

navigation;
object search;
manipulation;
instruction following;
exploration;
multi-agent cooperation.

Reinforcement learning provides a natural framework for embodied AI because agents learn through interaction.

8.3 Perception-Action Loop

The perception-action loop can be written as:

Observation → Representation → Memory → Decision → Action → New Observation

For my research, the key connection is that better 3D perception and world modeling can provide better state representations for downstream decision making.

For example:

semantic occupancy provides structured 3D scene state;
temporal memory provides history;
future occupancy forecasting provides prediction;
collaborative perception provides information beyond the ego view.

These representations can make planning and decision making more reliable.

8.4 Why RL Matters for My Research

Even if my main focus is perception, reinforcement learning helps me understand the broader role of perception in intelligent systems.

My research on occupancy prediction and world models can be connected to RL in several ways:

occupancy grids can serve as state representations;
future occupancy prediction can support model-based planning;
uncertainty-aware occupancy can improve risk-sensitive decision making;
multi-agent perception can support cooperative decision making;
communication-efficient representations can reduce distributed decision cost.

Thus, RL is not separate from perception. It provides the decision-making perspective that explains why predictive and structured perception matters.

9. Personal Study Plan

My reinforcement learning study plan has three layers.

9.1 Classical RL Layer

Main topics:

MDPs;
Bellman equations;
dynamic programming;
Monte Carlo methods;
TD learning;
SARSA;
Q-learning.

Goal:

understand the mathematical foundation of sequential decision making;
implement basic RL algorithms;
connect value functions with planning.

9.2 Deep RL Layer

Main topics:

DQN;
policy gradients;
actor-critic;
PPO;
SAC;
training stability;
exploration.

Goal:

understand how neural networks are used in RL;
learn practical training challenges;
understand why deep RL can be unstable.

9.3 World Model and Embodied Layer

Main topics:

model-based RL;
world models;
latent dynamics;
planning with learned models;
multi-agent RL;
embodied AI tasks;
autonomous driving decision making.

Goal:

connect perception and decision making;
understand how future prediction supports planning;
relate occupancy world models to intelligent agents.

Closing Remarks

Reinforcement learning provides the mathematical framework for learning to act. It connects perception, memory, prediction, planning, and control.

For my PhD preparation, the most important goal is not only to learn RL algorithms, but to understand how RL defines the role of perception in intelligent systems.

The key connections are:

MDPs formalize sequential decision making;
value functions evaluate long-term consequences;
dynamic programming provides planning foundations;
model-free RL learns from experience;
model-based RL connects to world models;
multi-agent RL connects to collaborative systems;
autonomous driving and embodied AI require perception-action loops.

This foundation will help me connect my research in 3D perception, collaborative occupancy prediction, and occupancy world models with downstream decision making and autonomous agents.
