Learning to Merge Tokens for Communication-Efficient Collaborative Occupancy Prediction

Communication-efficient multi-agent 3D occupancy prediction with token-based representation and adaptive communication

Overview

This project focuses on communication-efficient collaborative 3D semantic occupancy prediction for autonomous driving.

The goal is to enable multiple agents (vehicles) to collaboratively perceive the environment while operating under strict communication bandwidth constraints.

The work explores how to design compact scene representations and adaptive communication mechanisms to balance perception performance and communication cost.

This project is currently ongoing and planned for submission to CVPR 2026.


Problem Setting

In collaborative perception, each agent observes only a partial view of the environment.

To achieve global scene understanding, agents need to exchange information. However:

  • communication bandwidth is limited
  • redundant information transfer is common
  • naively sharing dense feature maps is inefficient

This project studies how to select, compress, and exchange only the most informative content across agents.
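
To make the bandwidth concern concrete, the following back-of-the-envelope sketch compares the payload of a dense feature map with that of a small token set. Every number here (grid size, channel count, token count, 16-bit values) is a hypothetical placeholder chosen for illustration, not a configuration or result from this project.

```python
def payload_bytes(num_items, channels, bytes_per_value=2):
    """Raw payload size for num_items feature vectors of length channels."""
    return num_items * channels * bytes_per_value


# Hypothetical dense feature map: 200 x 200 cells, 64 channels, 16-bit values.
dense = payload_bytes(200 * 200, 64)

# Hypothetical token set: 256 tokens with the same channel width.
sparse = payload_bytes(256, 64)

print(f"dense:  {dense / 1e6:.2f} MB per frame")
print(f"tokens: {sparse / 1024:.0f} KiB per frame")
print(f"ratio:  {dense / sparse:.2f}x")
```

Under these assumed sizes the dense map costs about 5 MB per frame per agent, while the token set fits in tens of kilobytes.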


Key Ideas

The system is built around three core ideas:

1. Tokenized Scene Representation

Instead of dense feature maps, the scene is represented as a set of compact tokens, enabling efficient information exchange.
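
As a rough illustration of the idea, the sketch below turns a small dense grid into one token per patch by average pooling. The pooling step is only a simple stand-in: the project's actual tokenizer is not described here, and `patch_tokens`, the grid layout, and the patch size are all assumptions made for this example.

```python
from statistics import fmean


def patch_tokens(grid, patch):
    """Average-pool a 2D grid of feature vectors into one token per patch.

    Returns a list of ((row, col), vector) pairs, where (row, col) is the
    top-left cell of the patch the token summarizes.
    """
    rows, cols = len(grid), len(grid[0])
    tokens = []
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            cells = [
                grid[r][c]
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            # Element-wise mean over all feature vectors in the patch.
            token = [fmean(channel) for channel in zip(*cells)]
            tokens.append(((r0, c0), token))
    return tokens


# Hypothetical 4 x 4 grid with 2-channel features.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
tokens = patch_tokens(grid, patch=2)

dense_values = 4 * 4 * 2
token_values = sum(len(vec) for _, vec in tokens)
print(len(tokens), "tokens,", token_values, "values instead of", dense_values)
```

Each token keeps its patch position, so a receiver can still place it in the scene after transmission.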


2. Spatio-Temporal Information Reuse

A memory mechanism is introduced to reuse information across:

  • time (historical frames)
  • agents (collaborative context)

This reduces redundant communication.
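
One simple way to picture temporal reuse is a sender-side cache of what the receiver already holds, so that only new or noticeably changed tokens are transmitted. The sketch below illustrates that general idea only. The project's memory mechanism is not described here, and `select_updates`, the token IDs, and the change threshold are hypothetical.

```python
def select_updates(current, sent_cache, threshold):
    """Return the tokens that are new or have changed by more than threshold.

    current and sent_cache map token IDs to feature vectors. sent_cache
    mirrors what the receiver is assumed to have stored from earlier frames.
    """
    updates = {}
    for token_id, vec in current.items():
        cached = sent_cache.get(token_id)
        if cached is None:
            updates[token_id] = vec  # never sent before
            continue
        change = max(abs(a - b) for a, b in zip(vec, cached))
        if change > threshold:
            updates[token_id] = vec  # receiver's copy is stale
    return updates


sent_cache = {}
frame_t0 = {0: [0.1, 0.2], 1: [0.5, 0.5]}
frame_t1 = {0: [0.1, 0.21], 1: [0.9, 0.5], 2: [0.3, 0.3]}

first = select_updates(frame_t0, sent_cache, threshold=0.05)
sent_cache.update(first)

second = select_updates(frame_t1, sent_cache, threshold=0.05)
sent_cache.update(second)

print(sorted(first), sorted(second))
```

In the second frame, token 0 has barely changed and is skipped, while the changed token 1 and the new token 2 are sent.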


3. Adaptive Communication

Communication is designed to be task-aware and selective:

  • only relevant information is exchanged
  • redundant or low-value content is filtered

Filtering before transmission reduces the amount of data each agent has to send.
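
A minimal sketch of selective transmission, assuming each token already carries an importance score: tokens are ranked and kept greedily until a byte budget is exhausted. How scores are obtained is outside this sketch, and `select_under_budget`, the scores, and the budget are illustrative assumptions rather than the project's method.

```python
def select_under_budget(tokens, scores, budget_bytes, bytes_per_value=2):
    """Greedily keep the highest-scoring tokens that fit in budget_bytes.

    Returns the chosen token indices (ascending) and the bytes used.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        cost = len(tokens[i]) * bytes_per_value
        if used + cost > budget_bytes:
            continue  # too large for the remaining budget; try the next one
        chosen.append(i)
        used += cost
    return sorted(chosen), used


# Four hypothetical tokens of 4 values each (8 bytes apiece at 16 bits).
tokens = [[0.0] * 4 for _ in range(4)]
scores = [0.9, 0.1, 0.7, 0.4]

chosen, used = select_under_budget(tokens, scores, budget_bytes=16)
print(chosen, used)
```

With a 16-byte budget only the two highest-scoring tokens (indices 0 and 2) are sent. A fixed budget makes the cost predictable, while the scores decide what is worth sending.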


System Overview

The system follows a collaborative perception pipeline:

Perception → Representation → Communication → Fusion → Prediction

Key components include:

  • token generation from multi-view inputs
  • inter-agent communication mechanism
  • feature fusion across agents
  • occupancy prediction head
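
For the fusion stage of the pipeline above, a deliberately simple baseline is to average tokens that refer to the same scene cell and keep all others unchanged. This sketch assumes that received tokens have already been transformed into the ego agent's coordinate frame and are keyed by a shared cell index. The `fuse` function and the averaging rule are illustrative and are not the project's fusion module.

```python
def fuse(ego_tokens, received):
    """Average tokens that share a cell key; pass the rest through.

    ego_tokens and every message in received map a cell key to a
    feature vector.
    """
    buckets = {cell: [vec] for cell, vec in ego_tokens.items()}
    for message in received:
        for cell, vec in message.items():
            buckets.setdefault(cell, []).append(vec)
    return {
        cell: [sum(channel) / len(vecs) for channel in zip(*vecs)]
        for cell, vecs in buckets.items()
    }


ego = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
other = {(0, 1): [1.0, 1.0], (5, 5): [0.5, 0.5]}

fused = fuse(ego, [other])
print(fused)
```

The cell seen by both agents is averaged, and the cell visible only to the other agent extends the ego agent's view.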

Experimental Findings

Preliminary experiments show that:

  • the system achieves strong perception performance
  • communication cost can be reduced significantly, down to the KB level
  • a favorable trade-off between perception performance and bandwidth is achieved

Research Significance

This project contributes toward:

  • communication-efficient multi-agent perception
  • scalable autonomous driving systems
  • connections to world models and structured scene representations

Note

Details of the method are intentionally omitted because of the ongoing submission.