AI Agents and Embodied Intelligence

AI agents and embodied intelligence are becoming increasingly important for future intelligent systems. While traditional computer vision mainly focuses on understanding images or videos, embodied intelligence asks a broader question:

How can an agent perceive, remember, reason, plan, and act in the physical world?

For my PhD preparation, I mainly focus on the perception and world-modeling layer of embodied agents. My current research interests, including 3D perception, semantic occupancy prediction, collaborative perception, and occupancy world models, can be viewed as building structured scene representations for autonomous and embodied agents.

This note is my long-term study record for AI agents and embodied intelligence.


Roadmap

This note is organized into the following chapters:

  1. AI Agents
    Agent definition, perception, memory, planning, action, tool use, autonomy, and interaction with environments.

  2. Embodied Intelligence
    Embodied perception, physical grounding, perception-action loops, navigation, manipulation, and interactive learning.

  3. Perception for Embodied Agents
    Visual perception, 3D scene understanding, semantic occupancy, spatial memory, and multi-modal sensing.

  4. Memory and World Models
    Short-term memory, long-term memory, spatial memory, predictive models, occupancy world models, and future scene prediction.

  5. Planning and Decision Making
    Classical planning, reinforcement learning, model-based planning, task planning, and hierarchical decision making.

  6. Multi-Agent and Collaborative Intelligence
    Cooperation, communication, shared perception, collaborative mapping, and multi-agent world modeling.

  7. Connections to Autonomous Driving
    Autonomous vehicles as embodied agents, perception-action pipelines, collaborative driving, occupancy prediction, and safety-critical scene understanding.

  8. Personal Study Plan
    A staged plan for studying AI agents and embodied intelligence from the perspective of computer vision and 3D perception.


1. AI Agents

An AI agent is a system that can perceive its environment, make decisions, and take actions to achieve goals.

A general agent loop can be written as:

Observation → Perception → Memory → Reasoning / Planning → Action → New Observation

This loop is different from static prediction tasks. In image classification, the model receives an image and outputs a label. In an agent system, the output action changes the future input distribution.
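
A minimal sketch of this loop, assuming a toy one-dimensional corridor. The names `CorridorEnv` and `run_agent` are hypothetical and only illustrate how each action changes the next observation:

```python
class CorridorEnv:
    """Toy 1D corridor: the agent starts at cell 0, the goal sits at cell `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def observe(self):
        # Observation: signed distance from the agent to the goal.
        return self.goal - self.position

    def step(self, action):
        # Action is -1 (left) or +1 (right). It changes the state,
        # and therefore the next observation.
        self.position += action


def run_agent(env, max_steps=10):
    memory = []                          # past observations
    for _ in range(max_steps):
        obs = env.observe()              # observation / perception
        memory.append(obs)               # memory
        if obs == 0:                     # goal reached
            break
        action = 1 if obs > 0 else -1    # trivial planning: move toward the goal
        env.step(action)                 # action, which produces a new observation
    return memory


observations = run_agent(CorridorEnv(goal=3))
```

Unlike a classifier, the sequence of inputs here is not fixed in advance: every entry in `observations` after the first depends on the action the agent took before it.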


1.1 What Is an Agent?

An agent can be defined by several components:

  • Observation: what the agent receives from the environment;
  • State representation: how the agent internally represents the environment;
  • Memory: how the agent stores past information;
  • Policy or planner: how the agent selects actions;
  • Action space: what the agent can do;
  • Goal or reward: what the agent is trying to optimize.

In embodied settings, observations may include:

  • RGB images;
  • depth maps;
  • LiDAR point clouds;
  • tactile signals;
  • proprioception;
  • language instructions;
  • maps or occupancy grids.

The key challenge is to convert these observations into useful representations for decision making.


1.2 Agent Architectures

A typical agent architecture contains several modules:

  1. Perception module
    Extracts useful information from raw sensor observations.

  2. Memory module
    Stores past observations, states, or learned knowledge.

  3. World model
    Predicts future states or simulates possible outcomes.

  4. Planner / policy
    Chooses actions based on goals and current state.

  5. Controller
    Executes low-level actions in the environment.

For real-world autonomous systems, these modules may be learned end-to-end, designed separately, or combined in a hybrid system.


1.3 Reactive Agents vs Deliberative Agents

A reactive agent maps observations directly to actions:

\[a_t = \pi(o_t).\]

This is simple and fast, but may fail when memory or long-term reasoning is required.

A deliberative agent maintains an internal state or model:

\[z_t = f(z_{t-1}, o_t),\]

then plans actions based on the state:

\[a_t = \pi(z_t, g),\]

where \(g\) is the goal.

Embodied AI often requires deliberative behavior because agents must remember past observations, reason about occluded regions, and plan over long horizons.
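
The difference can be sketched with a toy occlusion example. The assumption here is that the observation is the bearing of a target (-1, 0, or +1) or `None` when the target is occluded; all function names are illustrative:

```python
def reactive_policy(obs):
    """a_t = pi(o_t): act only on the current observation."""
    return 0 if obs is None else obs


def update_state(prev_state, obs):
    """z_t = f(z_{t-1}, o_t): remember the last bearing at which the target was seen."""
    return prev_state if obs is None else obs


def deliberative_policy(state, goal):
    """a_t = pi(z_t, g): steer toward or away from the remembered bearing."""
    return state if goal == "approach" else -state


observations = [1, 1, None, None]   # the target becomes occluded after two steps

reactive_actions = [reactive_policy(o) for o in observations]

state, deliberative_actions = 0, []
for o in observations:
    state = update_state(state, o)
    deliberative_actions.append(deliberative_policy(state, "approach"))
```

Once the target is occluded, the reactive agent stops moving (`[1, 1, 0, 0]`), while the deliberative agent keeps steering toward the remembered bearing (`[1, 1, 1, 1]`).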


1.4 Tool-Using and LLM-Based Agents

Recent AI agent research often studies large language model agents that can use tools, call APIs, write code, search information, or interact with software environments.

These agents usually involve:

  • language-based reasoning;
  • tool use;
  • memory retrieval;
  • planning;
  • feedback loops;
  • multi-step task solving.

Although this direction is different from my main research in 3D perception, it provides useful ideas about memory, planning, task decomposition, and multi-agent cooperation.

For embodied agents, language can also provide goals or instructions, such as:

Find the red chair in the room and move toward it.

This connects AI agents, embodied perception, and language-guided robotics.


2. Embodied Intelligence

Embodied intelligence studies agents that perceive and act in physical or simulated environments.

The key idea is that intelligence is grounded in interaction. It is not enough for an embodied agent to understand static data; it must use perception to support action.


2.1 What Makes Intelligence Embodied?

An intelligent system becomes embodied when it has:

  • a body or physical embodiment;
  • sensors to perceive the world;
  • actions that affect the world;
  • a need to reason about space, time, and interaction;
  • goals that require environmental feedback.

Examples include:

  • mobile robots;
  • humanoid robots;
  • autonomous vehicles;
  • drones;
  • robot arms;
  • virtual agents in simulated environments.

Autonomous driving can also be viewed as a form of embodied intelligence, because the vehicle perceives the world, predicts future states, and acts through control commands.


2.2 Perception-Action Loop

The perception-action loop is central to embodied intelligence:

Sense → Understand → Decide → Act → Sense Again

This loop creates several challenges:

  • perception errors affect future actions;
  • actions change future observations;
  • the agent must reason under uncertainty;
  • the environment may be dynamic;
  • decisions must be made in real time.

For example, if an autonomous vehicle incorrectly predicts a region as free space, the downstream planner may choose an unsafe path. This shows why reliable 3D perception is critical for embodied agents.


2.3 Embodied Perception

Embodied perception means perception for an acting agent.

Unlike image recognition, embodied perception must consider:

  • ego-motion;
  • partial observability;
  • occlusion;
  • active viewpoint selection;
  • temporal memory;
  • physical constraints;
  • action consequences.

For my research, semantic occupancy prediction is a natural representation for embodied perception because it provides structured 3D information about free, occupied, and semantic regions.


2.4 Embodied Tasks

Common embodied AI tasks include:

  • visual navigation;
  • object search;
  • embodied question answering;
  • instruction following;
  • manipulation;
  • exploration;
  • rearrangement;
  • multi-agent cooperation.

These tasks require different combinations of perception, memory, planning, and control.

For example, object search requires the agent to remember previously visited regions and reason about where the target object may be located.


3. Perception for Embodied Agents

Perception is the foundation of embodied intelligence. Without reliable perception, an agent cannot build accurate internal states for planning or control.


3.1 Visual Perception

Visual perception includes tasks such as:

  • object detection;
  • semantic segmentation;
  • depth estimation;
  • optical flow;
  • visual tracking;
  • scene recognition;
  • visual grounding.

For embodied agents, visual perception must often run online and support real-time decision making.


3.2 3D Scene Understanding

Embodied agents need 3D understanding because actions happen in physical space.

Important 3D representations include:

  • point clouds;
  • voxel grids;
  • BEV maps;
  • meshes;
  • implicit fields;
  • occupancy grids;
  • Gaussian primitives.

Different representations are useful for different tasks. For autonomous driving, BEV and occupancy grids are especially useful because they align with planning and navigation.


3.3 Semantic Occupancy for Agents

Semantic occupancy prediction estimates both geometry and semantics in 3D space.

The output can be written as:

\[O \in \{0,1,\ldots,K\}^{X \times Y \times Z}.\]

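Here I assume that label 0 denotes free space and labels 1 to K denote semantic classes; the exact convention varies between datasets. This representation tells the agent:

  • where space is free;
  • where space is occupied;
  • what semantic class each occupied region belongs to;
  • which regions may be occluded or uncertain, provided the model also outputs an unknown label, a visibility mask, or per-voxel class probabilities.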

For embodied agents, this is more informative than 2D segmentation because it directly describes the spatial structure of the world.
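
As a small illustration, a semantic occupancy grid can be stored as a nested list of integer labels. The label map below (0 = free, 1 = road, 2 = vehicle) is an assumption made for this sketch, not a dataset convention:

```python
FREE = 0
CLASS_NAMES = {0: "free", 1: "road", 2: "vehicle"}   # hypothetical label map, K = 2


def make_grid(x, y, z, fill=FREE):
    """Create an X x Y x Z grid of integer labels."""
    return [[[fill for _ in range(z)] for _ in range(y)] for _ in range(x)]


def is_free(grid, ix, iy, iz):
    return grid[ix][iy][iz] == FREE


def class_counts(grid):
    """Count how many voxels carry each label."""
    counts = {}
    for plane in grid:
        for row in plane:
            for label in row:
                counts[label] = counts.get(label, 0) + 1
    return counts


grid = make_grid(4, 4, 2)
for ix in range(4):
    for iy in range(4):
        grid[ix][iy][0] = 1        # the bottom layer is road
grid[2][1][1] = 2                  # one vehicle voxel above the road

counts = class_counts(grid)
```

A planner can query `is_free` directly in 3D, which a 2D segmentation mask does not allow without extra depth information.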


3.4 Spatial Memory

Because agents only observe part of the environment at each time step, memory is necessary.

Spatial memory stores information about previously observed regions.

Examples:

  • occupancy maps;
  • semantic maps;
  • topological maps;
  • BEV feature memories;
  • token memories;
  • object-centric memories.

For my research, tokenized spatio-temporal memory is an important direction because it provides a compact way to store and update 3D scene information over time.


3.5 Active Perception

Active perception means the agent chooses actions to improve future perception.

For example:

  • moving to a better viewpoint;
  • looking around to reduce uncertainty;
  • requesting information from another agent;
  • adjusting sensor orientation;
  • exploring unknown regions.

This connects naturally to collaborative perception. In multi-agent systems, an agent may actively request information from neighbors to reduce uncertainty in occluded regions.
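
One simple way to make this concrete is a greedy "next best view" rule: pick the viewpoint that reveals the most cells not yet observed. This is a sketch under strong assumptions (visibility sets are given, every unknown cell is equally valuable), and `best_viewpoint` is a hypothetical helper:

```python
def best_viewpoint(visible_from, known_cells):
    """Pick the viewpoint whose visible set contains the most unknown cells.

    visible_from: dict mapping a viewpoint name to the set of cells visible from it.
    known_cells:  set of cells that have already been observed.
    """
    def information_gain(viewpoint):
        return len(visible_from[viewpoint] - known_cells)

    return max(visible_from, key=information_gain)


visible_from = {
    "A": {1, 2, 3},
    "B": {3, 4, 5, 6},
    "C": {1, 2},
}
known_cells = {1, 2, 3, 4}

choice = best_viewpoint(visible_from, known_cells)
```

The same scoring idea carries over to collaboration: replacing viewpoints with neighboring agents gives a rule for which neighbor to request information from.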


4. Memory and World Models

Memory and world models are central to intelligent agents.

Memory stores past information. A world model predicts future information.

Together, they allow an agent to reason beyond the current observation.


4.1 Short-Term and Long-Term Memory

Short-term memory stores recent observations or hidden states. It is useful for temporal smoothing and immediate context.

Long-term memory stores information over longer horizons. It is useful for mapping, navigation, and repeated interaction.

Examples:

  • RNN hidden states;
  • Transformer memory tokens;
  • spatial maps;
  • object memories;
  • retrieval databases;
  • episodic memory.

For perception systems, memory helps handle occlusion, missing observations, and temporal inconsistency.


4.2 Spatial Memory

Spatial memory organizes information according to physical space.

Examples include:

  • 2D occupancy maps;
  • 3D voxel maps;
  • BEV semantic maps;
  • neural maps;
  • topological graphs;
  • tokenized scene memory.

Spatial memory is especially important for embodied agents because they move through the world and need to remember what has been observed.
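
A minimal sketch of such a memory, assuming observations arrive as ego-relative cell offsets and the ego pose is known exactly. `SpatialMemory` is an illustrative class, not an existing library API:

```python
UNKNOWN = -1


class SpatialMemory:
    """Stores the latest label observed for each world-frame cell."""

    def __init__(self):
        self.cells = {}   # (x, y) world cell -> (label, time last observed)

    def update(self, ego_xy, local_obs, t):
        # local_obs maps ego-relative offsets (dx, dy) to labels.
        # Adding the ego position converts them to world coordinates.
        ex, ey = ego_xy
        for (dx, dy), label in local_obs.items():
            self.cells[(ex + dx, ey + dy)] = (label, t)

    def query(self, xy):
        label, _ = self.cells.get(xy, (UNKNOWN, None))
        return label


memory = SpatialMemory()
memory.update((0, 0), {(1, 0): 0, (2, 0): 1}, t=0)   # obstacle two cells ahead
memory.update((3, 0), {(1, 0): 0}, t=1)              # later, from a new pose
```

After the agent has moved past the obstacle, `memory.query((2, 0))` still returns the occupied label even though the cell is no longer in view, and never-observed cells return `UNKNOWN`.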


4.3 World Models

A world model is an internal model that predicts how the environment evolves.

A general world model can be written as:

\[z_{t+1} = f_\theta(z_t, a_t),\]

where:

  • \(z_t\) is the latent state;
  • \(a_t\) is the action;
  • \(z_{t+1}\) is the predicted next state.

For perception-only world models, the action may be omitted or replaced by ego-motion, and the model predicts future scene states:

\[\hat{O}_{t+1:t+H}=f_\theta(O_{1:t}).\]
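
To make the action-conditioned form concrete, here is a toy rollout. The hand-written `step_model` stands in for a learned \(f_\theta\); the state layout (position, velocity) and the acceleration action are assumptions of this sketch:

```python
def step_model(state, action):
    """Stand-in for z_{t+1} = f_theta(z_t, a_t).

    state  = (position, velocity)
    action = acceleration applied during this step
    """
    position, velocity = state
    return (position + velocity, velocity + action)


def rollout(state, actions):
    """Apply the model repeatedly to predict a sequence of future states."""
    states = []
    for action in actions:
        state = step_model(state, action)
        states.append(state)
    return states


predicted = rollout((0, 1), [0, 1, -1])   # [(1, 1), (2, 2), (4, 1)]
```

Each predicted state is fed back in as the input for the next step, which is also why model errors can compound over long horizons.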

4.4 Occupancy World Models

Occupancy world models represent the world using 3D or 4D occupancy states.

Instead of predicting only the current scene, the model predicts future occupancy:

\[O_t \rightarrow O_{t+1}, O_{t+2}, \ldots, O_{t+H}.\]

This is useful because autonomous agents need to anticipate how the scene may change.

Examples of important questions:

  • Which regions will become occupied?
  • Which objects are moving?
  • How will occluded regions evolve?
  • How uncertain is the future prediction?
  • How can prediction support planning?

For my research, collaborative 4D occupancy world modeling is a natural extension of collaborative 3D occupancy prediction.
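
As a point of reference, the simplest occupancy forecast is a constant-velocity baseline, sketched below. It is not a learned world model; it assumes each occupied cell comes with a known velocity in cells per step, and `forecast_occupancy` is a hypothetical name:

```python
def forecast_occupancy(cells, horizon):
    """Predict occupied cells for steps t+1 ... t+H under constant velocity.

    cells: dict mapping an occupied cell (x, y) to its velocity (vx, vy).
    Returns a list of H sets of occupied cells.
    """
    futures = []
    for h in range(1, horizon + 1):
        futures.append(
            {(x + h * vx, y + h * vy) for (x, y), (vx, vy) in cells.items()}
        )
    return futures


current = {(0, 0): (1, 0), (5, 5): (0, 0)}   # one moving cell, one static cell
futures = forecast_occupancy(current, horizon=3)
```

A learned occupancy world model should at least beat this kind of baseline, especially for turning objects and newly revealed regions, where constant velocity fails.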


4.5 Motion-Aware Memory

Motion-aware memory stores not only what was observed, but also how the scene changes over time.

A motion-aware memory should capture:

  • object motion;
  • ego-motion;
  • temporal consistency;
  • dynamic occupancy changes;
  • uncertainty in future states.

This is important for dynamic environments such as autonomous driving, where agents must understand vehicles, pedestrians, and other moving objects.


5. Planning and Decision Making

Planning converts perception and world models into actions.

Even if my research focuses mainly on perception, understanding planning helps clarify why structured 3D representations matter.


5.1 Classical Planning

Classical planning searches for an action sequence that reaches a goal.

A plan can be written as:

\[a_{0:T} = (a_0, a_1, \ldots, a_T).\]

The goal is to find a sequence that satisfies task constraints and avoids unsafe states.

In robotics and autonomous driving, planning often uses:

  • graph search;
  • sampling-based planning;
  • trajectory optimization;
  • model predictive control;
  • rule-based decision systems.
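
Graph search is the easiest of these to show on an occupancy grid. The sketch below uses breadth-first search on a 2D grid with 4-connected moves, assuming 0 marks free cells and 1 marks occupied cells:

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search over free cells; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through the parents to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None   # the goal is unreachable through free space


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = shortest_path(grid, (0, 0), (2, 0))
```

The planner only ever steps through cells marked free, so a false "free" prediction from perception translates directly into a path through an obstacle.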

5.2 Reinforcement Learning

Reinforcement learning learns policies through interaction.

The agent maximizes expected return:

\[J(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right].\]

RL is useful for learning complex behaviors, but real-world deployment is challenging due to:

  • poor sample efficiency;
  • safety;
  • exploration risk;
  • sim-to-real transfer;
  • reward design.
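
For a finite episode, the sum inside the expectation can be computed directly. A small sketch, where truncating the infinite sum at the end of the reward list is an assumption:

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite list of rewards."""
    total = 0.0
    # Work backwards so that each step needs only one multiplication:
    # G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode_return = discounted_return([1, 1, 1], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```

With a smaller \(\gamma\), later rewards contribute less, so the agent becomes more short-sighted.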

5.3 Model-Based Planning

A world model can support model-based planning by simulating future outcomes.

Given a candidate action sequence, the model predicts future states:

\[(z_t, a_t) \mapsto z_{t+1}, \quad (z_{t+1}, a_{t+1}) \mapsto z_{t+2}, \quad \ldots\]

The planner selects actions that lead to desirable predicted states.

This is why future occupancy prediction is useful: it provides a structured prediction of future space occupancy for downstream planning.
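
A minimal sketch of this idea, using exhaustive search over short action sequences. The deterministic grid dynamics in `predict` stand in for a learned world model, and the `unsafe` set plays the role of predicted occupied cells; both are assumptions of this example:

```python
from itertools import product

ACTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]   # stay, right, left, up, down


def predict(state, action):
    """Stand-in for the world model: move one cell on a 2D grid."""
    return (state[0] + action[0], state[1] + action[1])


def simulate(state, actions):
    """Predict the states visited by a candidate action sequence."""
    states = []
    for action in actions:
        state = predict(state, action)
        states.append(state)
    return states


def plan(start, goal, unsafe, horizon):
    """Return the action sequence whose predicted final state is closest to the goal."""
    best_actions, best_cost = None, float("inf")
    for actions in product(ACTIONS, repeat=horizon):
        states = simulate(start, actions)
        if any(s in unsafe for s in states):
            continue   # reject sequences that pass through predicted occupied cells
        final = states[-1]
        cost = abs(final[0] - goal[0]) + abs(final[1] - goal[1])
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


best_actions, best_cost = plan(start=(0, 0), goal=(2, 0), unsafe={(1, 0)}, horizon=4)
```

Exhaustive enumeration only works for tiny action spaces and horizons; sampling-based or gradient-based optimizers replace it in practice, but the structure (simulate, score, select) stays the same.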


5.4 Hierarchical Decision Making

Complex agents often use hierarchical decision making.

For example, an autonomous driving system may contain:

  • route planning;
  • behavior planning;
  • motion planning;
  • low-level control.

Similarly, an embodied robot may contain:

  • task planning;
  • navigation planning;
  • manipulation planning;
  • motor control.

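Hierarchical design improves interpretability and modularity, because each level works at its own time scale and level of abstraction and can be inspected or replaced separately.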


6. Multi-Agent and Collaborative Intelligence

Many real-world intelligent systems involve multiple agents.

Examples:

  • connected vehicles;
  • robot teams;
  • drone swarms;
  • multi-agent embodied environments;
  • collaborative mapping systems.

Multi-agent intelligence introduces cooperation, communication, coordination, and shared world modeling.


6.1 Multi-Agent Perception

Multi-agent perception allows agents to share observations or features to improve scene understanding.

Benefits:

  • reduces occlusion;
  • extends field of view;
  • improves long-range perception;
  • increases robustness;
  • provides complementary viewpoints.

Challenges:

  • communication bandwidth;
  • pose alignment;
  • time synchronization;
  • noisy messages;
  • heterogeneous sensors;
  • agent selection.

This is directly related to my current research in collaborative occupancy prediction.
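
The occlusion benefit can be sketched with a late-fusion rule on two occupancy maps. The sketch assumes both maps are already aligned to a shared world frame and time-synchronized, which sidesteps exactly the pose and timing challenges listed above; the conservative "occupied wins" rule is one possible choice, not a standard:

```python
UNKNOWN, FREE, OCCUPIED = -1, 0, 1


def fuse(ego, neighbor):
    """Fuse two maps that assign a label to world-frame cells.

    Known labels override UNKNOWN, and OCCUPIED overrides FREE on conflict.
    max() implements both rules because UNKNOWN < FREE < OCCUPIED.
    """
    fused = {}
    for cell in set(ego) | set(neighbor):
        fused[cell] = max(ego.get(cell, UNKNOWN), neighbor.get(cell, UNKNOWN))
    return fused


ego_map = {(0, 0): FREE, (1, 0): UNKNOWN}          # cell (1, 0) is occluded for the ego agent
neighbor_map = {(1, 0): OCCUPIED, (2, 0): FREE}    # the neighbor sees it clearly

fused_map = fuse(ego_map, neighbor_map)
```

Feature-level fusion methods exchange intermediate features instead of final labels, but the goal is the same: fill in what a single viewpoint cannot see.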


6.2 Communication in Multi-Agent Systems

Communication is a key problem in multi-agent intelligence.

Important questions:

  • who should communicate?
  • what should be communicated?
  • when should communication happen?
  • how much information should be transmitted?
  • how should messages be fused?

In collaborative perception, communication should be compact, task-relevant, and robust.

Token-based communication is attractive because tokens can represent spatial regions, semantic content, memory, or future predictions.
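
A simple way to respect a communication budget is to send only the highest-scoring tokens. In the sketch below, the token names and importance scores are made up for illustration, and top-k selection is just one possible strategy:

```python
def select_tokens(tokens, budget):
    """Keep the `budget` most important tokens.

    tokens: list of (token_id, importance) pairs.
    """
    ranked = sorted(tokens, key=lambda item: item[1], reverse=True)
    return [token_id for token_id, _ in ranked[:budget]]


tokens = [
    ("road", 0.1),
    ("occluded_crossing", 0.9),
    ("parked_car", 0.4),
    ("sky", 0.0),
]
selected = select_tokens(tokens, budget=2)   # ["occluded_crossing", "parked_car"]
```

How the importance score is defined is the interesting research question: it could come from the sender's confidence, the receiver's uncertainty, or an explicit request from the receiver.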


6.3 Collaborative World Models

A collaborative world model uses information from multiple agents to build a better predictive model of the environment.

Compared with a single-agent world model, it can:

  • observe more regions;
  • reduce uncertainty;
  • improve prediction in occluded areas;
  • provide more complete temporal context;
  • support cooperative decision making.

For autonomous driving, collaborative 4D occupancy world models could help agents anticipate dynamic scenes beyond their individual field of view.


7. Connections to Autonomous Driving

Autonomous vehicles can be viewed as embodied agents operating in a dynamic, safety-critical environment.

They must:

  • perceive the environment;
  • build a state representation;
  • predict future changes;
  • plan safe trajectories;
  • execute control commands.

7.1 Autonomous Driving as Embodied Intelligence

Although autonomous driving is often studied separately from robotics, it has the core properties of embodied intelligence:

  • the agent has sensors;
  • the agent moves in the world;
  • actions affect future observations;
  • perception must support planning;
  • safety depends on spatial and temporal reasoning.

Therefore, autonomous driving perception can be viewed as a perception layer for embodied agents.


7.2 Semantic Occupancy as Agent State

Semantic occupancy prediction provides a structured state representation.

It tells the agent:

  • where objects are;
  • what semantic classes they belong to;
  • where free space exists;
  • which areas are occluded;
  • how the scene is structured in 3D.

This makes occupancy suitable as an intermediate representation between perception and planning.


7.3 Collaborative Perception for Driving Agents

Connected vehicles can exchange information to improve perception.

A vehicle may not see an occluded pedestrian, but another nearby vehicle may have a clear view. Collaborative perception allows agents to share complementary information.

However, communication must be efficient and selective. This motivates research on:

  • request-based communication;
  • token selection;
  • token merging;
  • adaptive communication budgets;
  • communication-aware fusion.

7.4 From Occupancy Prediction to World Models

Current-frame occupancy prediction answers:

What is the 3D state of the scene now?

Occupancy world modeling asks:

How will the 3D state of the scene evolve in the future?

This shift is important because intelligent agents need prediction, not just reconstruction.

For autonomous driving, future occupancy can support:

  • risk assessment;
  • trajectory planning;
  • collision avoidance;
  • behavior prediction;
  • cooperative decision making.

8. Personal Study Plan

My AI agent and embodied intelligence study plan has three layers.

8.1 Agent Foundations Layer

Main topics:

  • agent architectures;
  • perception-action loops;
  • memory;
  • planning;
  • reinforcement learning;
  • tool use and task decomposition.

Goal:

  • understand what makes a system an agent;
  • learn how perception, memory, and action interact;
  • connect agent design with decision making.

8.2 Embodied Perception Layer

Main topics:

  • visual navigation;
  • embodied perception;
  • 3D scene understanding;
  • semantic mapping;
  • spatial memory;
  • active perception.

Goal:

  • understand perception for acting agents;
  • study how agents build and update spatial representations;
  • connect 3D perception with embodied tasks.

8.3 World Model and Multi-Agent Layer

Main topics:

  • world models;
  • occupancy forecasting;
  • motion-aware memory;
  • collaborative perception;
  • multi-agent communication;
  • cooperative world modeling.

Goal:

  • connect my research direction to embodied intelligence;
  • develop ideas for predictive 3D scene understanding;
  • study how multiple agents can build better world models together.

Closing Remarks

AI agents and embodied intelligence provide the broader context for my research in 3D perception and autonomous driving.

The key insight is that perception is not an isolated task. For intelligent agents, perception must support memory, prediction, planning, and action.

The most important connections are:

  • AI agents require perception, memory, reasoning, planning, and action;
  • embodied intelligence grounds AI in physical interaction;
  • semantic occupancy provides structured 3D scene state;
  • world models predict how the scene evolves;
  • collaborative perception allows agents to share complementary information;
  • autonomous driving can be viewed as embodied intelligence in a safety-critical environment.

This foundation will help me position my research in collaborative 3D perception, occupancy prediction, and 4D occupancy world models within the broader direction of AI agents and embodied intelligence.



