When I wrote the first version of my PhD knowledge base, it was intentionally broad.

I wanted to organize the foundations I need: mathematics, machine learning, deep learning, computer vision, graphics, reinforcement learning, autonomous driving, and embodied intelligence.

After several months of reading and research, my direction has become more focused.

The central theme is:

Efficient and predictive 3D scene understanding for autonomous and embodied agents.

This note is a checkpoint for clarifying that direction.

1. What I Want to Study

My research interests can be summarized by several connected topics:

3D perception;
semantic occupancy prediction;
collaborative perception;
communication-efficient multi-agent systems;
token-based scene representation;
temporal memory;
occupancy world models.

These topics may look separate, but I see them as parts of one problem:

How can an intelligent agent build a useful representation of the 3D world under limited observation, limited communication, and limited computation?

This question is important for autonomous driving, robotics, and embodied AI.

2. Why 3D Scene Understanding

Intelligent agents act in physical space.

For this reason, 3D scene understanding is more than a perception benchmark. It is the foundation for planning, control, navigation, and interaction.

A 2D image tells the agent what the camera sees. A 3D representation tells the agent where things are in the world.

Important 3D representations include:

point clouds;
BEV features;
voxel grids;
semantic occupancy;
implicit fields;
object-centric representations;
tokenized scene memories.

Among them, semantic occupancy is especially attractive because it combines geometry and semantics in a planning-friendly format.
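
To make "planning-friendly" concrete for myself, here is a minimal sketch of a semantic occupancy grid stored as a sparse mapping from voxel index to class label. The class labels, the 0.5 m voxel size, and the helper names are all my own illustrative assumptions, not taken from any particular benchmark or library.

```python
import math

# Hypothetical class labels and grid resolution, chosen only for illustration.
FREE, ROAD, VEHICLE, PEDESTRIAN = 0, 1, 2, 3
VOXEL_SIZE = 0.5  # metres per voxel edge (assumed)


def voxel_index(point, voxel_size=VOXEL_SIZE):
    """Map a metric (x, y, z) point to integer voxel coordinates."""
    return tuple(math.floor(c / voxel_size) for c in point)


# Sparse grid: only non-free voxels are stored (geometry + semantics together).
occupancy = {
    voxel_index((4.2, 0.1, 0.3)): VEHICLE,
    voxel_index((2.0, -1.6, 0.9)): PEDESTRIAN,
}


def label_at(point):
    """Return the semantic label of the voxel containing the point."""
    return occupancy.get(voxel_index(point), FREE)


def is_blocked(point):
    """A planner-style query: is this location occupied by an obstacle class?"""
    return label_at(point) in (VEHICLE, PEDESTRIAN)
```

A planner can ask `is_blocked((4.4, 0.2, 0.1))` and get an answer that is both geometric (where) and semantic (what). Note that this sketch treats every unstored voxel as free, which is a simplification: a real system would also need to distinguish unobserved space from free space.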

3. Why Collaboration

No single agent can observe everything.

Collaboration allows agents to share information and reduce partial observability.

This is particularly valuable in autonomous driving:

vehicles can help each other see around occlusions;
infrastructure can provide complementary views;
multiple agents can improve robustness in complex scenes;
shared perception can support safer decision making.

But collaboration is not free.

Communication bandwidth is limited. Pose alignment is imperfect. Messages may be delayed. Some transmitted information may be redundant.

Therefore, the research problem is not only collaborative perception. It is communication-efficient collaborative perception.

4. Why Tokens

Tokens provide a compact and flexible representation.

Dense feature maps are difficult to communicate efficiently. Tokens can be selected, ranked, merged, or stored in memory.
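
A rough back-of-the-envelope calculation shows why. The grid size, channel count, token count, and float32 encoding below are assumed values I picked for illustration, not measurements from a real system.

```python
def dense_bev_bytes(height, width, channels, bytes_per_value=4):
    """Payload size of a full dense BEV feature map."""
    return height * width * channels * bytes_per_value


def token_bytes(num_tokens, channels, bytes_per_value=4, index_bytes=4):
    """Payload size of a token set: one feature vector plus one grid index per token."""
    return num_tokens * (channels * bytes_per_value + index_bytes)


# Assumed setup: 200 x 200 BEV grid, 64 channels, float32 values.
dense = dense_bev_bytes(200, 200, 64)

# Assumed setup: 256 selected tokens with the same channel width.
sparse = token_bytes(256, 64)

print(f"dense map: {dense / 1e6:.2f} MB per frame")
print(f"256 tokens: {sparse / 1e3:.2f} KB per frame")
print(f"reduction: about {dense / sparse:.0f}x")
```

Under these assumptions the dense map costs about 10 MB per frame, while the token set costs under 70 KB. The exact ratio depends entirely on the chosen sizes; the point is only that the payload scales with the number of tokens rather than with the grid area.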

For multi-agent 3D perception, tokens may represent:

spatial BEV regions;
semantic areas;
uncertain zones;
object-like entities;
temporal memory elements;
request-relevant information.

This makes token communication a promising direction for bandwidth-aware collaboration.

The key is to make token communication task-aware. Agents should not simply send the most visually salient tokens. They should send tokens that improve the ego agent’s final occupancy prediction.
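
One simple way to picture this is as budgeted selection: each candidate token has an estimated utility for the ego agent's occupancy prediction and a transmission cost, and the sender picks tokens greedily by utility per byte. This is only a sketch under my own assumptions. The `Token` fields and the `utility` score are hypothetical, and how to estimate that utility is exactly the open research question.

```python
from dataclasses import dataclass


@dataclass
class Token:
    token_id: int
    utility: float   # hypothetical estimate of gain for the ego occupancy prediction
    size_bytes: int  # transmission cost, assumed to be positive


def select_tokens(tokens, budget_bytes):
    """Greedily pick tokens by utility per byte until the budget is used up."""
    ranked = sorted(tokens, key=lambda t: t.utility / t.size_bytes, reverse=True)
    chosen, used = [], 0
    for token in ranked:
        if used + token.size_bytes <= budget_bytes:
            chosen.append(token)
            used += token.size_bytes
    return chosen, used


candidates = [
    Token(0, utility=0.90, size_bytes=400),
    Token(1, utility=0.20, size_bytes=100),
    Token(2, utility=0.50, size_bytes=200),
    Token(3, utility=0.05, size_bytes=300),
]
chosen, used = select_tokens(candidates, budget_bytes=600)
print([t.token_id for t in chosen], used)
```

With a 600-byte budget, this picks tokens 2 and 0 and uses the whole budget. Greedy selection by ratio is a heuristic rather than an optimal knapsack solution, but it is cheap enough to run per frame.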

5. Why World Models

Perceiving only the present scene is not enough for autonomous agents.

Agents must reason about the future.

Occupancy world models extend semantic occupancy prediction from current 3D reconstruction to future 4D scene prediction.

This connects perception with:

motion forecasting;
temporal reasoning;
uncertainty;
planning;
embodied intelligence.

For me, occupancy world models are a natural next step after collaborative occupancy prediction. If multiple agents can collaborate to understand the current scene, they may also collaborate to predict how the scene will evolve.

6. A Coherent Research Thread

I can now describe my research direction as a sequence:

Build strong 3D scene representations.
Use semantic occupancy as a structured output.
Introduce collaboration to overcome single-agent limitations.
Use token communication to reduce bandwidth cost.
Add temporal memory to stabilize scene understanding.
Extend current occupancy to future occupancy world models.

This sequence gives me a clearer way to explain my work in PhD applications and in conversations with potential advisors.

It also helps me judge whether a new idea fits my direction. If it makes 3D scene understanding more efficient, more collaborative, or more predictive, it is probably relevant.

7. What I Need to Strengthen

To support this direction, I still need to strengthen several foundations.

7.1 Geometry

Multi-view geometry, coordinate transformations, pose alignment, and 3D projection are essential for collaborative perception.
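
As a reminder to myself of the basic operation behind pose alignment, here is a minimal 2D sketch that moves a point from a sender's frame into the ego frame. It assumes each agent's pose is known as `(x, y, yaw)` in a shared world frame and ignores height, pose noise, and time delay. The function name and pose format are my own choices for illustration.

```python
import math


def sender_to_ego(point, sender_pose, ego_pose):
    """Transform a 2D point from the sender's frame into the ego frame.

    Poses are (x, y, yaw) in a shared world frame, with yaw in radians.
    """
    px, py = point
    sx, sy, syaw = sender_pose
    ex, ey, eyaw = ego_pose

    # Sender frame -> world frame: rotate by the sender yaw, then translate.
    wx = sx + math.cos(syaw) * px - math.sin(syaw) * py
    wy = sy + math.sin(syaw) * px + math.cos(syaw) * py

    # World frame -> ego frame: translate, then rotate by the inverse ego yaw.
    dx, dy = wx - ex, wy - ey
    return (
        math.cos(eyaw) * dx + math.sin(eyaw) * dy,
        -math.sin(eyaw) * dx + math.cos(eyaw) * dy,
    )


# A point 1 m ahead of a sender located at (10, 0) and facing +y,
# seen from an ego agent at the origin facing +x.
point_in_ego = sender_to_ego((1.0, 0.0), (10.0, 0.0, math.pi / 2), (0.0, 0.0, 0.0))
print(point_in_ego)
```

Any error in either pose passes straight through this transform, which is why imperfect pose alignment shows up directly as misaligned features in collaborative perception.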

7.2 Optimization

Training multi-module perception systems requires stable optimization and careful loss design.

7.3 Representation Learning

Token communication depends on learning compact and meaningful representations.

7.4 Uncertainty

Occupancy prediction and future forecasting both require uncertainty-aware reasoning.

7.5 Systems Thinking

Communication, latency, memory, and computation all affect real-world perception systems.

Research ideas should be evaluated not only by accuracy, but also by their cost in bandwidth, latency, memory, and computation.

8. Application Positioning

For PhD applications, I want my website and research statement to communicate a clear identity:

I am interested in efficient 3D perception and world modeling for autonomous and embodied agents, especially under multi-agent communication constraints.

This identity connects my current projects and future goals.

It also gives potential advisors a concrete picture of what kinds of problems I want to work on.

I do not want my profile to look like a list of unrelated topics. I want it to show a research trajectory.

9. Closing Thoughts

The purpose of the knowledge base is not only to collect notes. It is to help me think.

The broad roadmap from March helped me see the full landscape. The recent notes help me narrow the direction.

For now, the path is becoming clearer:

3D perception → semantic occupancy → collaborative communication → temporal memory → occupancy world models.

This will be the thread I continue to develop in my research and PhD preparation.

Refining My PhD Research Direction Around 3D Perception