Advanced Topics · research-grade

Vision-Language-Action Models: the math, the proofs, the tricks.

A VLA is a single neural network that looks, reads an instruction, and acts: a policy \(\pi_\theta(a \mid \text{image}, \text{language})\) that maps pixels and words directly to robot motor commands. This module builds the idea from scratch: why naive imitation provably drifts, and how action chunking, diffusion, and flow matching address it. It then walks the literature paper by paper (ACT, Diffusion Policy, RT-1/RT-2, Open X-Embodiment, OpenVLA, π0, FAST, Gemini Robotics, GR00T N1, π0.5, and more) with real math, step-through proofs you unfold yourself, advanced visualizations, and a citation for every paper.
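
One standard way to make "provably drifts" precise is the behavior-cloning bound restated in Ross, Gordon & Bagnell (2011), listed under References. It is given here as a sketch in that paper's notation, not as this module's own derivation. The symbols are assumptions introduced for illustration: \(T\) is the task horizon, \(\pi^*\) is the expert, \(J(\pi)\) is the expected total cost of running \(\pi\) for \(T\) steps with per-step cost in \([0, 1]\), and \(\epsilon\) is the probability that the learned policy disagrees with the expert on a state drawn from the expert's own state distribution. Under those assumptions:

\[
J(\pi_\theta) \;\le\; J(\pi^*) + T^2 \epsilon .
\]

The quadratic factor is the drift. A single mistake can carry the robot into states the expert never visited, where nothing constrains the policy, so one error can cost up to the remaining \(T\) steps, and there are \(T\) steps on which it can happen. The same paper shows that training on states the learner itself visits brings the dependence on \(T\) down to linear, under its own assumptions.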

Real math · interactive proofs · advanced visualizations · cited papers


How to read this

Every model is dissected the same way.

Guided lesson: one idea per screen, plain words first, the picture beside it.
The math: each equation with a plain-English reading and a symbol legend.
Proof, step by step: unfold the derivation one move at a time; each step has a "why this move?" with the assumption or trick it uses.
Advanced visualization: a computational animation of the mechanism, plus a worked numerical example.
Assumptions · tricks · why this math · exercises · open problems, then the citation.

The deck

Pick a model.

Grouped by era. Start with Foundations if imitation learning is new to you.

How the ideas connect

One family tree.

Two rivers feed every modern VLA: internet-scale vision-language pretraining (the brain) and robot imitation data (the body). They are joined by a choice of action representation: discrete tokens, diffusion, or flow matching. Open any model to see what it builds on and what it leads to.
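
To make the "discrete tokens" option concrete, here is a minimal sketch of uniform per-dimension binning, the general style of action tokenization used by token-based VLAs. The function names, the 256-bin default, and the example action bounds are illustrative assumptions, not the exact scheme of any model in the deck.

```python
def action_to_tokens(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin index.

    `low` and `high` are assumed per-dimension action bounds (hypothetical here;
    in practice they would come from dataset statistics).
    """
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)                  # clip into the valid range
        frac = (a - lo) / (hi - lo)              # rescale to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))  # top edge goes in the last bin
    return tokens


def tokens_to_action(tokens, low, high, n_bins=256):
    """Invert the binning by returning each bin's center."""
    return [lo + (t + 0.5) / n_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


if __name__ == "__main__":
    low, high = [-1.0] * 4, [1.0] * 4            # assumed normalized action range
    action = [-1.0, 0.0, 1.0, 0.25]
    tokens = action_to_tokens(action, low, high)
    print(tokens)
    print(tokens_to_action(tokens, low, high))
```

The round trip is lossy: each dimension is recovered only to within half a bin width, which is the price of predicting actions with the same next-token machinery as text. Diffusion and flow matching avoid that quantization by generating continuous actions directly.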

References

Every paper discussed.

Each title links to the paper on arXiv. This module is a teaching companion to the primary sources. Read them.

Foundational imitation-learning theory: Ross, Gordon & Bagnell, A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (AISTATS 2011), arXiv:1011.0686.

Related rooms

Built on the foundations.

World Models · Transformers · Diffusion Models · Reinforcement Learning & Control · All missions