Advanced Topics · research-grade
Vision-Language-Action Models: the math, the proofs, the tricks.
A VLA is a single neural network that looks, reads an instruction, and acts: a policy \(\pi_\theta(a \mid \text{image}, \text{language})\) that maps pixels and words directly to robot motor commands. This module builds the idea from scratch: why naive imitation provably drifts, and how action chunking, diffusion, and flow matching mitigate it. It then walks the literature paper by paper (ACT, Diffusion Policy, RT-1/RT-2, Open X-Embodiment, OpenVLA, π0, FAST, Gemini Robotics, GR00T N1, π0.5, and more), with real math, step-through proofs you unfold yourself, advanced visualizations, and a citation for every paper.
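As a preview of what "provably drifts" means, here is the standard compounding-error bound for plain behavioral cloning, as restated in Ross, Gordon & Bagnell (2011). The symbols are introduced here for illustration and are not defined elsewhere on this page: \(T\) is the task horizon, \(J(\pi)\) is the expected total cost of policy \(\pi\) over \(T\) steps with per-step cost in \([0,1]\), \(\pi^*\) is the expert, \(d_{\pi^*}\) is the distribution of states the expert visits, and \(\epsilon\) is the learner's expected error measured on those expert states.

\[
\mathbb{E}_{s \sim d_{\pi^*}}\big[\ell(s, \hat{\pi})\big] = \epsilon
\quad\Longrightarrow\quad
J(\hat{\pi}) \;\le\; J(\pi^*) + T^2 \epsilon .
\]

Read it as: a small per-step error, measured only where the expert goes, can cost up to quadratically in the horizon, because one mistake moves the robot into states the expert never demonstrated, where further mistakes become more likely. Interactive methods such as DAgger, which train on the learner's own state distribution, bring the dependence down to roughly linear in \(T\) under the assumptions stated in that paper.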
How to read this
Every model is dissected the same way.
- Guided lesson: one idea per screen, plain words first, the picture beside it.
- The math: each equation with a plain-English reading and a symbol legend.
- Proof, step by step: unfold the derivation one move at a time; each step has a "why this move?" with the assumption or trick it uses.
- Advanced visualization: a computational animation of the mechanism, plus a worked numerical example.
- Assumptions · tricks · why this math · exercises · open problems, then the citation.
The deck
Pick a model.
Grouped by era. Start with Foundations if imitation learning is new.
How the ideas connect
One family tree.
Two rivers feed every modern VLA: internet-scale vision-language pretraining (the brain) and robot imitation data (the body). They are joined by a choice of action representation: discrete tokens, diffusion, or flow matching. Open any model to see what it builds on and what it leads to.
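To make the "discrete tokens" option concrete, here is a minimal sketch of uniform per-dimension binning, the general idea behind token-based action heads. It is an illustration under stated assumptions, not any particular model's tokenizer: the function names, the 256-bin default, and the action bounds are all hypothetical choices made for this example.

```python
import numpy as np

N_BINS = 256  # assumed bin count for this sketch


def actions_to_tokens(actions, low, high, n_bins=N_BINS):
    """Map continuous actions to integer bin indices in [0, n_bins - 1]."""
    actions = np.clip(actions, low, high)
    scaled = (actions - low) / (high - low)        # now in [0, 1]
    tokens = (scaled * n_bins).astype(np.int64)    # floor to a bin index
    return np.minimum(tokens, n_bins - 1)          # put the upper bound in the last bin


def tokens_to_actions(tokens, low, high, n_bins=N_BINS):
    """Decode bin indices back to the centre of each bin."""
    centers = (tokens.astype(np.float64) + 0.5) / n_bins
    return low + centers * (high - low)


# Hypothetical 3-dimensional action with bounds [-1, 1] per dimension.
low = np.full(3, -1.0)
high = np.full(3, 1.0)
action = np.array([-1.0, 0.0, 1.0])

tokens = actions_to_tokens(action, low, high)
decoded = tokens_to_actions(tokens, low, high)
max_error = np.max(np.abs(decoded - action))
```

The round trip is lossy: the decoded action can differ from the original by up to half a bin width per dimension. That quantization error is one reason the continuous alternatives (diffusion and flow matching) appear in the same family tree.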
References
Every paper discussed.
Click a title to open it on arXiv. This module is a teaching companion to the primary sources. Read them.
Foundational imitation-learning theory: Ross, Gordon & Bagnell, A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (AISTATS 2011), arXiv:1011.0686.
Related rooms