Transformers changed the way we think about representations.

Images can be represented as patches. Language can be represented as word or subword tokens. Bird's-eye-view (BEV) feature maps can also be converted into spatial tokens. For collaborative 3D perception, this raises an interesting possibility:

Can agents communicate compact token sets instead of dense feature maps?

This note summarizes my current thinking on token communication for multi-agent 3D perception.

1. Why Tokens Are Useful

A dense feature map is expensive to communicate. If every agent sends every feature at every location, the message size grows with the full grid resolution times the channel count, for every agent and every frame.
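
To make the cost concrete, here is a small back-of-the-envelope sketch. The grid size, channel count, float32 storage, and the one-index-per-token overhead are all illustrative assumptions, not numbers from a specific model or dataset.

```python
def dense_bev_bytes(h: int, w: int, c: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of one uncompressed H x W x C feature map."""
    return h * w * c * bytes_per_value


def token_message_bytes(num_tokens: int, c: int,
                        bytes_per_value: int = 4, index_bytes: int = 4) -> int:
    """Size of a sparse message: C features plus one position index per token."""
    return num_tokens * (c * bytes_per_value + index_bytes)


# Assumed sizes, chosen only for illustration.
H, W, C = 200, 200, 64
dense = dense_bev_bytes(H, W, C)          # full map, float32
sparse = token_message_bytes(1000, C)     # 1000 selected tokens
print(f"dense: {dense} bytes, sparse: {sparse} bytes")
```

Under these assumptions the dense map is about 10 MB per frame per agent, while 1000 tokens fit in well under 1 MB.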

Tokens provide a more flexible representation.

A token can represent:

a BEV grid cell;
a local spatial region;
a semantic object;
a memory element;
an uncertain region;
a compressed scene component.

Because tokens are discrete elements, they can be selected, ranked, merged, pruned, or transmitted under a budget.

This makes tokens useful for communication-constrained perception.

2. Tokenized BEV Representations

In autonomous driving perception, BEV is a common representation because it aligns with the ground plane and downstream planning.

A BEV feature map can be written as:

\[F \in \mathbb{R}^{H \times W \times C}.\]

It can be flattened into tokens:

\[T = \{t_i\}_{i=1}^{N}, \quad t_i \in \mathbb{R}^{C},\]

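where each token corresponds to a spatial region. If every BEV cell becomes one token, then \(N = HW\); grouping cells into larger patches gives a smaller \(N\).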
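
A minimal NumPy sketch of the per-cell case, where each of the H x W cells becomes one token. The function name `bev_to_tokens` and the choice to return grid coordinates alongside the tokens are my own assumptions for illustration.

```python
import numpy as np


def bev_to_tokens(feature_map: np.ndarray):
    """Flatten an (H, W, C) BEV map into (H*W, C) tokens plus (row, col) coords."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)           # row-major flattening
    rows, cols = np.divmod(np.arange(h * w), w)      # recover the grid position
    coords = np.stack([rows, cols], axis=1)
    return tokens, coords


F = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)   # toy map: H=2, W=3, C=4
tokens, coords = bev_to_tokens(F)
print(tokens.shape, coords[4])
```

Keeping the coordinates matters for communication: once tokens are selected or merged, the receiver needs to know where each one belongs.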

This tokenization makes it possible to use attention-based fusion:

\[\text{Attn}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V.\]

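Here \(d\) is the dimension of the query and key vectors. For collaborative perception, the ego agent's tokens can act as queries that attend to tokens from neighboring agents once those tokens have been spatially aligned.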
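
The attention formula can be sketched in a few lines. This is a simplified version under stated assumptions: queries are the ego tokens, keys and values are both the (already aligned) neighbor tokens, and there are no learned projection matrices, which a real model would normally include.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(ego_tokens: np.ndarray, neighbor_tokens: np.ndarray):
    """Ego tokens (N_e, C) attend to neighbor tokens (N_n, C); no learned weights."""
    d = ego_tokens.shape[-1]
    scores = ego_tokens @ neighbor_tokens.T / np.sqrt(d)   # (N_e, N_n)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    fused = weights @ neighbor_tokens                      # (N_e, C)
    return fused, weights


rng = np.random.default_rng(0)
ego = rng.normal(size=(5, 8))
neighbor = rng.normal(size=(7, 8))
fused, weights = cross_attention(ego, neighbor)
print(fused.shape, weights.shape)
```

Note that the neighbor can send any number of tokens: the output always has one fused vector per ego token, which is what makes attention a natural fit for variable-size messages.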

3. Token Selection

The simplest way to reduce communication is token selection.

Given a set of tokens, the model selects a subset:

\[T' \subseteq T, \quad |T'| \leq B,\]

where \(B\) is the communication budget.
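
As a sketch of budgeted selection, the snippet below keeps the \(B\) tokens with the largest L2 norm. Feature magnitude is only one of many possible scores, and the function name `select_top_b` is hypothetical.

```python
import numpy as np


def select_top_b(tokens: np.ndarray, budget: int):
    """Keep at most `budget` tokens, ranked by L2 feature magnitude."""
    scores = np.linalg.norm(tokens, axis=1)
    b = min(budget, len(tokens))
    keep = np.argsort(scores)[::-1][:b]   # indices of the b largest scores
    keep = np.sort(keep)                  # restore original spatial order
    return tokens[keep], keep


tokens = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
selected, kept_idx = select_top_b(tokens, budget=2)
print(kept_idx)
```

Returning the kept indices together with the features lets the receiver place each token back at its grid position.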

Selection can be based on:

feature magnitude;
semantic importance;
uncertainty;
attention scores;
spatial location;
ego requests;
predicted contribution to occupancy accuracy.

However, token selection has a limitation: unselected tokens are discarded. If the selection policy is wrong, useful information may be lost.

This is why token merging is also important.

4. Token Merging

Token merging reduces the number of tokens by combining similar or redundant tokens.

Instead of dropping all low-priority tokens, the model can merge them into compact summaries.

A simple formulation is:

\[\tilde{t}_j = \sum_{i \in \mathcal{G}_j} \alpha_i t_i,\]

where \(\mathcal{G}_j\) is a group of tokens and \(\alpha_i\) are merging weights.
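
A minimal sketch of this formulation, assuming the groups are given as an integer group id per token and that the weights \(\alpha_i\) are normalized to sum to one within each group (the text above does not fix a normalization, so that part is my assumption).

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, group_ids: np.ndarray, weights=None) -> np.ndarray:
    """Weighted average of tokens within each group.

    Assumes group ids are 0..G-1, every group is non-empty,
    and all weights are positive.
    """
    n, c = tokens.shape
    if weights is None:
        weights = np.ones(n)                      # uniform weights -> plain mean
    num_groups = int(group_ids.max()) + 1
    weighted_sum = np.zeros((num_groups, c))
    weight_total = np.zeros(num_groups)
    np.add.at(weighted_sum, group_ids, weights[:, None] * tokens)
    np.add.at(weight_total, group_ids, weights)
    return weighted_sum / weight_total[:, None]   # alpha_i = w_i / sum of w in group


tokens = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]])
group_ids = np.array([0, 0, 1])
print(merge_tokens(tokens, group_ids))
```

With uniform weights this reduces to average pooling over each group; learned or importance-based weights would bias the summary toward the more informative tokens.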

Merging is useful because many BEV regions are redundant:

large empty roads;
static background;
repeated structures;
low-uncertainty free space.

At the same time, important regions should be preserved with higher resolution:

occluded areas;
moving objects;
intersections;
regions near the ego trajectory;
areas requested by the ego agent.

This suggests a content-aware merging strategy.
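
One simple way to sketch such a strategy: keep the most important tokens at full resolution and average the remaining ones inside coarse spatial blocks. The importance scores, the block size, and the function name are all assumptions for illustration; how to compute importance is exactly the open design question.

```python
import numpy as np


def content_aware_compress(tokens, coords, importance, keep_count, block=4):
    """Keep the top `keep_count` tokens intact; average the rest per coarse block."""
    order = np.argsort(importance)[::-1]           # most important first
    keep_idx = np.sort(order[:keep_count])
    rest_idx = np.sort(order[keep_count:])

    # Assign every remaining token to a (block x block) spatial cell.
    block_rc = coords[rest_idx] // block
    blocks_per_row = int(coords[:, 1].max()) // block + 1
    block_key = block_rc[:, 0] * blocks_per_row + block_rc[:, 1]
    _, group_ids = np.unique(block_key, return_inverse=True)

    num_groups = int(group_ids.max()) + 1 if group_ids.size else 0
    summed = np.zeros((num_groups, tokens.shape[1]))
    np.add.at(summed, group_ids, tokens[rest_idx])
    counts = np.bincount(group_ids, minlength=num_groups)
    merged = summed / counts[:, None]               # uniform mean per block
    return tokens[keep_idx], keep_idx, merged


# Toy 4x4 grid with one feature channel; two cells are marked as important.
rows, cols = np.divmod(np.arange(16), 4)
coords = np.stack([rows, cols], axis=1)
tokens = np.arange(16, dtype=float).reshape(16, 1)
importance = np.zeros(16)
importance[[0, 15]] = [1.0, 2.0]
kept, kept_idx, merged = content_aware_compress(tokens, coords, importance, 2, block=2)
print(kept_idx, merged.shape)
```

In this toy case 16 tokens become 2 protected tokens plus 4 block summaries, so redundant regions cost little while important cells keep their detail.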

5. Request-Aware Communication

In collaborative perception, the ego agent does not need all information from all neighbors.

It needs information that improves its own scene understanding.

A request-aware communication system can work as follows:

The ego agent predicts an initial occupancy map.
It identifies uncertain or important regions.
It sends requests to neighboring agents.
Neighboring agents protect request-relevant tokens.
Other tokens are compressed or merged.
The ego agent fuses received tokens and refines occupancy.
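
The middle steps of this loop can be sketched as follows. Several things here are assumptions: uncertainty is measured as binary entropy of the ego occupancy probabilities, neighbor tokens are already aligned to the ego grid with the same flat indexing, and all unrequested tokens are collapsed into a single mean summary. The initial occupancy prediction and the final fusion are left out.

```python
import numpy as np


def request_cells(occ_prob: np.ndarray, num_requests: int) -> np.ndarray:
    """Ego side: return flat indices of the most uncertain cells (binary entropy)."""
    p = np.clip(occ_prob, 1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    top = np.argsort(entropy.ravel())[::-1][:num_requests]
    return np.sort(top)


def respond(neighbor_tokens: np.ndarray, requested: np.ndarray):
    """Neighbor side: protect requested tokens, summarize everything else."""
    mask = np.zeros(len(neighbor_tokens), dtype=bool)
    mask[requested] = True
    protected = neighbor_tokens[mask]
    rest = neighbor_tokens[~mask]
    if len(rest):
        summary = rest.mean(axis=0, keepdims=True)
    else:
        summary = np.zeros((0, neighbor_tokens.shape[1]))
    return protected, summary


occ_prob = np.array([[0.5, 0.9], [0.01, 0.6]])         # toy 2x2 ego occupancy
requested = request_cells(occ_prob, num_requests=2)
neighbor_tokens = np.arange(8, dtype=float).reshape(4, 2)
protected, summary = respond(neighbor_tokens, requested)
print(requested, protected, summary)
```

The request itself is tiny (a list of indices), so the extra round trip is cheap compared with the feature payload it helps to shrink.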

This design is appealing because communication is guided by the ego agent’s needs.

The message is not only sender-centric. It becomes receiver-aware.

6. Communication Budget as a Research Variable

A good collaborative perception method should be evaluated under different bandwidth budgets.

If a method only works when communication is unlimited, it may not be practical.

Important evaluation questions include:

How does accuracy change as bandwidth decreases?
Which regions benefit most from communication?
Does the method preserve performance under severe compression?
Is the communication strategy adaptive to scene complexity?
Does it remain robust under pose noise and latency?
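
The first question can be turned into a small evaluation harness. In this sketch, `evaluate` is a hypothetical callable that runs a method at a given token budget and returns a score; the toy scores below are made-up placeholders, not results.

```python
def sweep_budgets(evaluate, budgets):
    """Evaluate a method at several budgets and report score retention.

    Retention is the score at each budget divided by the score at the
    largest budget, which is assumed to be non-zero.
    """
    results = {b: evaluate(b) for b in sorted(budgets)}
    reference = results[max(results)]
    retention = {b: score / reference for b, score in results.items()}
    return results, retention


# Placeholder scores for illustration only.
toy_scores = {64: 0.30, 256: 0.36, 1024: 0.40}
results, retention = sweep_budgets(toy_scores.get, toy_scores.keys())
for budget in results:
    print(budget, results[budget], round(retention[budget], 3))
```

Reporting the whole curve, rather than a single operating point, makes it easier to compare methods that trade accuracy for bandwidth differently.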

For occupancy prediction, I care about both overall mean intersection-over-union (mIoU) and the quality of important regions such as dynamic objects and occluded areas.
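
A sketch of how both numbers could be computed with one function: mIoU over all voxels, and mIoU restricted to a region mask. The mask (for example, occluded or dynamic voxels) is assumed to be given, and classes absent from both prediction and ground truth are skipped, which is one common convention among several.

```python
import numpy as np


def miou(pred: np.ndarray, target: np.ndarray, num_classes: int, mask=None) -> float:
    """Mean IoU over classes, optionally restricted to a boolean region mask."""
    if mask is not None:
        pred, target = pred[mask], target[mask]
    ious = []
    for k in range(num_classes):
        inter = np.sum((pred == k) & (target == k))
        union = np.sum((pred == k) | (target == k))
        if union > 0:                      # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


pred = np.array([0, 1, 1, 0])
target = np.array([0, 1, 0, 0])
region = np.array([True, True, False, False])   # assumed "important" voxels
print(miou(pred, target, num_classes=2))
print(miou(pred, target, num_classes=2, mask=region))
```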

7. Connections to My Current Projects

Token communication is central to my current research.

The ideas I want to develop include:

tokenized collaborative occupancy prediction;
ego-driven token requests;
content-aware token merging;
bandwidth-aware token allocation;
temporal token memory;
collaborative 4D occupancy world models.

The long-term goal is to build a system where agents communicate compact but useful scene representations.

This is not just a compression problem. It is a perception, communication, and reasoning problem.

8. Open Questions

Several questions remain important:

How should token importance be measured?
Should tokens represent fixed spatial cells or adaptive regions?
How can merging preserve semantic boundaries?
How should temporal memory interact with communication?
Can communication be learned end-to-end with occupancy supervision?
How can the method generalize across bandwidth conditions?

These questions will guide my next stage of research.

Token Communication for Multi-Agent 3D Perception