Building My PhD Knowledge Base for Computer Vision
As I prepare for PhD applications in computer vision and autonomous driving perception, I have realized that strong research ideas must rest on a solid theoretical foundation.
To organize my learning, I decided to build a structured knowledge base covering the core topics. Rather than studying randomly, I want a clear roadmap of concepts that connect mathematics, machine learning, computer vision, and collaborative perception.
This post outlines that roadmap.
1. Mathematical Foundations
Mathematics forms the language of modern machine learning and computer vision. In particular, linear algebra, probability, and optimization are essential.
1.1 Linear Algebra
Beyond solving equations, linear algebra explains how modern models operate.
Key topics:
- Vector spaces: span, basis, dimension, subspaces, null space, rank–nullity theorem
- Linear transformations: matrices as linear maps, change of basis
- Orthogonality and projections: least squares, Gram–Schmidt
- Eigen decomposition: eigenvalues, eigenvectors, spectral theorem, Rayleigh quotient
- Singular Value Decomposition (SVD): low-rank approximation, Eckart–Young theorem
- Matrix norms: Frobenius norm, spectral norm, condition number
- Structured matrices: positive semidefinite matrices, block matrices, Schur complement
- (Optional) Random matrix intuition: concentration phenomena and spectral behavior
These ideas appear everywhere in deep learning—for example in attention mechanisms, low-rank approximations, and token merging strategies.
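As a quick sanity check on the Eckart–Young theorem, here is a minimal NumPy sketch (with made-up random data) showing that truncating the SVD yields a rank-k approximation whose Frobenius error equals the energy in the discarded singular values:

```python
import numpy as np

# Toy illustration of low-rank approximation via truncated SVD
# (Eckart-Young: the rank-k truncation minimizes the Frobenius error).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(k):
    """Rank-k approximation of A from the top k singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# The approximation error equals the energy in the dropped singular values.
for k in range(1, 5):
    err = np.linalg.norm(A - truncate(k), "fro")
    tail = np.sqrt(np.sum(s[k:] ** 2))
    assert np.isclose(err, tail)
```

This is the same mechanism behind low-rank tricks in deep learning: keep the few directions that carry most of the energy, discard the rest.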
1.2 Probability and Statistics
Occupancy prediction and sensor fusion rely heavily on probabilistic reasoning.
Important concepts:
- Random variables: PMF, PDF, CDF, expectation, variance, covariance
- Common distributions: Gaussian, Bernoulli, Categorical, Poisson, Dirichlet
- Multivariate Gaussian distributions and covariance structures
- Conditional probability and Bayes’ rule
- Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)
- KL divergence, JS divergence, and cross entropy
- Hypothesis testing and confidence intervals
- Monte Carlo estimation and importance sampling
- Uncertainty estimation:
  - aleatoric uncertainty
  - epistemic uncertainty
  - calibration and reliability diagrams
Understanding these concepts helps interpret model confidence and uncertainty in perception systems.
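To make the information-theoretic quantities above concrete, here is a small stdlib-only sketch (with hypothetical distributions) verifying the identity that cross entropy decomposes into entropy plus KL divergence:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): expected code length under q for data drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q): extra cost of using q instead of the true p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" class distribution (hypothetical)
q = [0.5, 0.3, 0.2]   # model prediction (hypothetical)

# Identity: H(p, q) = H(p) + KL(p || q)
assert math.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))
# KL is non-negative, and zero when the distributions match.
assert kl(p, q) >= 0 and math.isclose(kl(p, p), 0.0)
```

The same decomposition explains why minimizing cross-entropy loss is equivalent to minimizing the KL divergence to the label distribution.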
1.3 Optimization
Training deep neural networks is fundamentally an optimization problem.
Core topics include:
- Gradient computation and the chain rule (backpropagation)
- First-order methods: SGD, Momentum, Nesterov acceleration
- Adaptive optimizers: Adam and AdamW
- Learning rate schedules: warmup, cosine decay, step schedules
- Regularization: weight decay, dropout
- Constrained optimization: Lagrangian methods and KKT conditions
- Second-order ideas: Hessian, curvature, saddle points
- Sharp vs flat minima and their relation to generalization
- Numerical stability techniques such as the log-sum-exp trick
These principles explain why some training setups converge reliably while others fail.
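The log-sum-exp trick from the list above is easy to demonstrate: naive `exp` overflows for large logits, while subtracting the maximum first gives the same answer stably. A minimal stdlib sketch:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))): shift by the max first."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    """Stable softmax built on the same trick."""
    z = logsumexp(xs)
    return [math.exp(x - z) for x in xs]

# Naive math.exp(1000.0) would overflow; the shifted form does not.
logits = [1000.0, 1000.5, 999.0]
stable = logsumexp(logits)
probs = softmax(logits)

assert math.isclose(sum(probs), 1.0)
assert 1000.0 < stable < 1002.0
```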
2. Machine Learning Theory
A PhD student needs to understand why machine learning models work.
2.1 Statistical Learning Theory
Topics I aim to master:
- Empirical risk vs expected risk
- Bias–variance tradeoff
- Overfitting and regularization
- VC dimension (conceptual understanding)
- Rademacher complexity and model capacity
- Generalization bounds
- Distribution shift:
  - covariate shift
  - label shift
  - out-of-distribution detection
These ideas are especially relevant for autonomous driving systems deployed in changing environments.
2.2 Representation Learning
Modern deep learning focuses heavily on learning useful representations.
Key concepts:
- Invariance and equivariance in feature representations
- Contrastive learning (InfoNCE loss)
- Self-supervised learning (MAE, DINO, etc.)
- Information Bottleneck theory
- Inductive biases in CNNs and Transformers
Understanding these principles helps explain why certain architectures work better for perception tasks.
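As a concrete anchor for contrastive learning, here is a rough NumPy sketch of the InfoNCE objective: each anchor's matching row is its positive and the other rows in the batch serve as negatives. The data and temperature are made up for illustration:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch: row i of `positives` is the positive
    for row i of `anchors`; all other rows act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # -log p(positive)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))

# Slightly perturbed copies (aligned pairs) vs unrelated random vectors.
aligned = info_nce(x, x + 0.01 * rng.standard_normal((8, 16)))
unrelated = info_nce(x, rng.standard_normal((8, 16)))
assert aligned < unrelated  # matched pairs give lower loss
```

The loss is just cross entropy over similarity scores, which is why the log-sum-exp machinery from the optimization section reappears here.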
2.3 Generative Modeling
Generative models are increasingly used for scene representation and future prediction.
Important ideas:
- Variational inference and ELBO
- Variational Autoencoders (VAEs)
- Diffusion models and score matching
- Energy-based models (EBMs)
- Bayesian deep learning for uncertainty estimation
These methods are becoming relevant for generative occupancy and world modeling.
3. Deep Learning Architectures
3.1 Neural Network Fundamentals
Topics to revisit:
- Multilayer perceptrons and activation functions (ReLU, GELU)
- Weight initialization (Xavier, He)
- Normalization methods: BatchNorm, LayerNorm, RMSNorm
- Residual connections and deep network training
- Dropout and stochastic depth
3.2 Transformers
Transformers have become the dominant architecture in computer vision.
Key topics include:
- Scaled dot-product attention
- Multi-head attention and representation diversity
- Positional encodings (absolute, relative, RoPE)
- Encoder–decoder structures and cross-attention
- Efficient attention methods (FlashAttention, linear attention)
- Tokenization and patch embeddings in Vision Transformers
These concepts are directly related to my research on token-based scene representations.
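Since scaled dot-product attention underpins everything in this section, here is a minimal single-head NumPy sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with made-up token counts and dimensions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))   # 3 query tokens, d_k = 8
K = rng.standard_normal((5, 8))   # 5 key tokens
V = rng.standard_normal((5, 16))  # 5 value tokens, d_v = 16

out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (3, 16)                  # one output row per query
assert np.allclose(w.sum(axis=-1), 1.0)      # weights sum to 1 per query
```

Note that queries and keys need not come from the same sequence, which is exactly how cross-attention works in encoder–decoder models.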
4. Computer Vision Foundations
4.1 2D Vision
Core topics include:
- Convolution and receptive fields
- Feature pyramid networks (FPN)
- Object detection paradigms
- Semantic, instance, and panoptic segmentation
- Evaluation metrics such as IoU and mIoU
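Since IoU shows up in every detection and segmentation metric, here is a minimal stdlib sketch for axis-aligned boxes in `(x1, y1, x2, y2)` form:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0                 # identical
assert iou((0, 0, 1, 1), (2, 2, 3, 3)) == 0.0                 # disjoint
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1 / 7) < 1e-9    # partial
```

mIoU is then just this quantity computed per class over pixel sets and averaged.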
4.2 Multi-view Geometry
For autonomous driving perception, geometric reasoning is crucial.
Important concepts:
- Camera models and projection geometry
- Intrinsic and extrinsic parameters
- Coordinate transformations and SE(3)
- Epipolar geometry and fundamental matrices
- Triangulation and bundle adjustment
- Perspective-n-Point (PnP) pose estimation
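The camera model at the top of this list can be captured in a few lines. This sketch projects a 3D point in camera coordinates to pixels with the standard pinhole equations; the intrinsics are hypothetical values for illustration:

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates:
    u = fx * X/Z + cx,  v = fy * Y/Z + cy  (requires Z > 0)."""
    X, Y, Z = point_cam
    assert Z > 0, "point must be in front of the camera"
    return fx * X / Z + cx, fy * Y / Z + cy

# Hypothetical intrinsics for a 1280x720 camera.
fx = fy = 800.0
cx, cy = 640.0, 360.0

u, v = project((1.0, 0.5, 4.0), fx, fy, cx, cy)
assert (u, v) == (840.0, 460.0)
# A point on the optical axis lands at the principal point.
assert project((0.0, 0.0, 2.0), fx, fy, cx, cy) == (cx, cy)
```

Extrinsics (an SE(3) transform) would map world coordinates into this camera frame before projection.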
4.3 3D Scene Representations
Common representations include:
- Point clouds (PointNet family)
- Voxels and sparse convolution
- Bird’s-Eye-View (BEV) representations
- Implicit representations such as occupancy fields and signed distance functions
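To connect point clouds and voxel grids, here is a rough NumPy sketch of binary voxelization (the grid bounds and points are made up; a real pipeline would handle offsets and per-voxel features):

```python
import numpy as np

def voxelize(points, voxel_size, grid_shape):
    """Binary occupancy grid: mark each voxel containing at least one point.
    Assumes points lie in [0, grid_shape * voxel_size) along each axis."""
    grid = np.zeros(grid_shape, dtype=bool)
    idx = np.floor(points / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    gi = idx[inside]
    grid[gi[:, 0], gi[:, 1], gi[:, 2]] = True
    return grid

pts = np.array([[0.1, 0.1, 0.1],
                [1.2, 1.2, 1.2],
                [1.8, 1.8, 1.8],   # same voxel as the point above
                [3.5, 0.2, 0.2]])
grid = voxelize(pts, voxel_size=1.0, grid_shape=(4, 4, 4))
assert grid[0, 0, 0] and grid[1, 1, 1] and grid[3, 0, 0]
assert grid.sum() == 3   # two points share one voxel
```

Sparse convolution exploits exactly this structure: most of the grid is empty, so computation is restricted to occupied cells.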
5. Autonomous Driving Perception
My main research direction lies here.
5.1 BEV Representation and Sensor Fusion
Key paradigms include:
- Early, middle, and late fusion
- View transformation methods such as lift-splat
- Cross-attention based lifting
- Temporal alignment and ego-motion compensation
5.2 Occupancy Prediction
Important topics:
- Binary vs semantic occupancy grids
- Voxel resolution trade-offs
- Visibility reasoning and occlusion handling
- Extreme class imbalance in dense prediction tasks
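One standard response to the class imbalance mentioned above is the focal loss of Lin et al., which down-weights easy examples so the sparse occupied voxels are not swamped by the empty majority. A minimal stdlib sketch of the binary form, with hypothetical probabilities:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    gamma focuses training on hard examples; alpha rebalances classes."""
    pt = p if y == 1 else 1 - p
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)

# An easy, confident negative contributes far less than a hard positive.
easy_neg = focal_loss(0.01, 0)   # model correctly predicts "empty"
hard_pos = focal_loss(0.10, 1)   # model misses an occupied voxel
assert easy_neg < hard_pos
```

With `gamma = 0` and `alpha = 0.5` this reduces (up to scale) to ordinary binary cross entropy, which makes the hyperparameters easy to reason about.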
5.3 Temporal Modeling
Dynamic environments require temporal reasoning.
Topics include:
- Memory banks and temporal aggregation
- Temporal consistency across frames
- Motion modeling and scene flow
- Online vs offline perception constraints
6. Collaborative Perception and Communication
Multi-agent perception introduces new challenges.
6.1 Collaboration Paradigms
Important research questions include:
- What to communicate
- When to communicate
- Who to communicate with
Common fusion methods include feature concatenation, attention, and graph-based aggregation.
6.2 Communication Efficiency
Bandwidth constraints require efficient message design.
Key ideas:
- Communication compression
- Token selection and token merging
- Quantization and pruning
- Task-aware communication strategies
These ideas directly connect to my current research project LiteTokenOcc.
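As a toy illustration of the "what to communicate" question, here is a stdlib sketch of budgeted token selection: keep only the top-k tokens by an importance score, preserving their original order for the receiver. The tokens, scores, and budget are all hypothetical:

```python
def select_tokens(tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens (e.g. by predicted
    informativeness), preserving original order for the receiver."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])   # top-k indices, back in sequence order
    return [tokens[i] for i in keep], keep

tokens = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.1, 0.9, 0.4, 0.8, 0.2]   # hypothetical importance scores
sent, kept_idx = select_tokens(tokens, scores, budget=3)
assert sent == ["t1", "t2", "t3"] and kept_idx == [1, 2, 3]
```

Real systems would learn the scores and combine selection with merging and quantization, but the bandwidth trade-off is already visible in this skeleton.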
Closing Thoughts
This roadmap represents the theoretical foundation I want to master.
For each topic, my goal is to be able to:
- Define the concept clearly
- Explain why it matters
- Connect it to my research
Research ultimately connects ideas across fields. This knowledge map is my attempt to build those connections systematically.