Reinforcement Learning & Control
Every robot — a vacuum, a self-driving car, a Mars rover — runs the same tiny heartbeat over and over: sense the world, think about what to do, act through its motors, then learn from how it went. Round and round, many times a second.
Level 1 · Beginner
Plain version: a robot is just a fast, tireless version of you crossing a street. You look both ways (sense), you decide it is safe (think), you walk (act), and if a car honks you remember to look more carefully next time (learn).
1 · Sense
Robot: cameras, distance sensors, bump switches gather raw data.
You: your eyes and ears notice the cars and the light.
2 · Think
Robot: a program turns "what I see" into "what I should do".
You: "no cars, light is green, so I can go".
3 · Act
Robot: motors spin the wheels and the body moves.
You: your legs actually carry you across.
4 · Learn
Robot: compares result to goal and nudges its behavior.
You: a near-miss teaches you to wait an extra beat.
Watch the robot run its heartbeat. Step moves one phase at a time; Auto-run cycles forever.
The whole cycle takes a fraction of a second — and then it repeats.
Level 2 · Intermediate
Now we look inside the loop. Sensors produce observations. A policy is the rule-book that maps "what I observe" to "what I do". Controllers and motors execute the chosen move, and feedback corrects the next decision.
The robot wants the green goal but must dodge the rose wall. Its reactive policy mixes two urges — head toward the goal and push away from the obstacle. Tweak the weights, then run.
A different kind of policy: instead of reacting, the robot figures out how good every square is. Each sweep updates a square's value from its best neighbour — goodness flows outward from the green dock and away from the rose pit. The arrows show the resulting greedy policy: always step toward the most valuable neighbour.
Press “Run one sweep” to spread value out from the dock.
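The sweep described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the demo's actual code: the 3×4 grid, the positions of the dock and pit, deterministic moves, and terminal squares whose values stay fixed are all hypothetical. The rewards (+10 dock, −20 pit, −1 per move) and \(\gamma=0.9\) are borrowed from the worked example in Level 3.

```python
GAMMA = 0.9                      # discount, as in the Level 3 worked example
STEP_COST = -1.0                 # cost of every move
ROWS, COLS = 3, 4                # hypothetical grid size
DOCK, PIT = (0, 3), (1, 3)       # hypothetical terminal squares
TERMINAL = {DOCK: 10.0, PIT: -20.0}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def neighbours(cell):
    """Yield (move_name, next_cell) for every move that stays on the grid."""
    r, c = cell
    for name, (dr, dc) in MOVES.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            yield name, (nr, nc)


def sweep(values):
    """One sweep: each non-terminal square is updated from its best neighbour."""
    new_values = dict(values)
    for cell in values:
        if cell in TERMINAL:
            continue  # dock and pit keep their fixed values
        new_values[cell] = max(
            STEP_COST + GAMMA * values[nxt] for _, nxt in neighbours(cell)
        )
    return new_values


def greedy_policy(values):
    """For each non-terminal square, the move toward the most valuable neighbour."""
    return {
        cell: max(neighbours(cell), key=lambda mv: values[mv[1]])[0]
        for cell in values
        if cell not in TERMINAL
    }


values = {(r, c): TERMINAL.get((r, c), 0.0) for r in range(ROWS) for c in range(COLS)}
for _ in range(20):              # more sweeps than this small grid needs
    values = sweep(values)
policy = greedy_policy(values)
```

In this toy grid the square left of the dock ends up with value \(-1 + 0.9 \times 10 = 8\), and the square left of the pit learns to step up toward the dock rather than right into the pit.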
Say the goal is to the right (urge = →, strength 1.0) and a wall sits just to the upper-right (push = ↓←, strength 1.2 × closeness). The policy adds the two arrows like vectors. The result points right-and-slightly-down, so the robot curves under the wall instead of into it. Stronger avoid weight bends the path more.
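The arrow addition above can be checked numerically. The sketch below is illustrative only: the function name `reactive_policy`, the closeness value of 0.3, and the convention that the y axis points up are assumptions, while the weights 1.0 and 1.2 come from the example.

```python
import math


def reactive_policy(seek_dir, avoid_dir, w_seek=1.0, w_avoid=1.2, closeness=0.3):
    """Add a seek arrow and an avoid arrow as vectors; return the unit heading."""
    sx, sy = seek_dir
    ax, ay = avoid_dir
    x = w_seek * sx + w_avoid * closeness * ax
    y = w_seek * sy + w_avoid * closeness * ay
    norm = math.hypot(x, y)
    if norm == 0.0:
        return (0.0, 0.0)  # the two urges cancel exactly: no preferred direction
    return (x / norm, y / norm)


RIGHT = (1.0, 0.0)                                  # goal is to the right
DOWN_LEFT = (-math.sqrt(0.5), -math.sqrt(0.5))      # unit push away from an upper-right wall

heading = reactive_policy(RIGHT, DOWN_LEFT)
stronger = reactive_policy(RIGHT, DOWN_LEFT, w_avoid=2.4)
```

With these numbers `heading` points right and slightly down, and doubling the avoid weight tilts `stronger` further downward, which is the "bends the path more" effect.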
Level 3 · Advanced
The same heartbeat has a precise notation used across robotics and reinforcement learning. Each formula below comes with a plain reading, a symbol legend, and a tiny visual.
In words: the policy is the brain's rule-book — feed it what you observe right now and it hands back the action to take (either one fixed action, or a probability over actions you then sample from).
oₜ → π → aₜ
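Written as an equation, the rule-book reads as follows. The deterministic form matches the \(a_t=\pi(o_t)\) used in the loop code later on this page; the stochastic form uses standard notation that is assumed here rather than taken from the page.

\[
a_t = \pi(o_t) \quad \text{(one fixed action)}, \qquad a_t \sim \pi(\,\cdot \mid o_t) \quad \text{(sample from a distribution over actions)}
\]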
In words: the world has its own physics. Take the current state, apply the robot's action, and the world rolls forward into the next state.
Note: the robot usually can't see \(s_t\) directly — it only gets \(o_t\), a partial, noisy view.
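In symbols, the world's update is the \(s_{t+1}=f(s_t,a_t)\) that also appears in the loop code later on this page. The second equation is one common way to write a partial, noisy view; the sensor map \(h\) and the noise term \(\varepsilon_t\) are hypothetical names introduced only for illustration.

\[
s_{t+1} = f(s_t, a_t), \qquad o_t = h(s_t) + \varepsilon_t
\]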
In words: add up all future rewards, but trust the far-off ones less. The discount \(\gamma\) shrinks each step further out, so "a treat now" beats "a treat much later".
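Written out, the discounted sum looks like this. The indexing is an assumption chosen to match the rest of the page, where \(r_t\) is the reward observed right after action \(a_t\), so a reward arriving \(k\) ticks from now is scaled by \(\gamma^k\).

\[
G_t = r_t + \gamma\,r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad 0 \le \gamma < 1
\]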
In words: the value of a state is the average return you expect if you start there and keep following policy \(\pi\). The clever part on the right: that long sum \(G_t\) folds into one short step — the reward you get now, plus the (discounted) value of wherever you land next. A robot never has to add up an infinite future; it only compares “here” to “one step ahead”.
This recursive form is the Bellman expectation equation. Replacing the average with a “pick the best action” \(\max_a\) turns it into the Bellman optimality equation — the fixed point the Frontier tier solves by value iteration.
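In the notation of this section, the two equations named above can be stated as follows. The first follows the plain reading given earlier; the second is the form quoted in the quiz at the end of this page. Here \(\mathbb{E}_\pi\) denotes an average over what happens when following \(\pi\), and \(P(s'\mid s,a)\) is the probabilistic counterpart of \(f\); both are standard notation assumed for this sketch.

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\big[\,G_t \mid s_t = s\,\big] = \mathbb{E}_{\pi}\big[\,r_t + \gamma\,V^{\pi}(s_{t+1}) \mid s_t = s\,\big]
\]

\[
V^{\star}(s) = \max_a\Big[\,r + \gamma \sum_{s'} P(s'\mid s,a)\,V^{\star}(s')\,\Big]
\]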
Reward +10 for reaching the dock, −1 for every tick spent driving, −20 for a crash. With \(\gamma=0.9\), a +10 reward arriving 5 ticks from now is only worth \(0.9^{5}\times 10 \approx 5.9\) today. That math is exactly why a discounted robot prefers the short safe route over a long scenic one.
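The arithmetic above is easy to reproduce. In this small sketch the helper name `discounted_return` and the two route lengths (3 and 9 ticks of driving) are made up for illustration; the rewards and \(\gamma\) come from the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * rewards[k], where rewards[0] arrives right now."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


# A +10 reward that arrives 5 ticks from now: 0.9**5 * 10
delayed_treat = discounted_return([0, 0, 0, 0, 0, 10])

# Hypothetical routes: pay -1 per tick of driving, then collect +10 at the dock.
short_route = discounted_return([-1] * 3 + [10])
scenic_route = discounted_return([-1] * 9 + [10])
```

Here `delayed_treat` comes out at about 5.9, and `short_route` scores higher than `scenic_route` for two reasons: it pays fewer −1 ticks, and its +10 is discounted less.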
Level 4 · Expert
The "learn" arrow has two flavors. Feedback control fixes errors right now (a thermostat, a self-balancing scooter). Reinforcement learning slowly reshapes the whole policy to earn more reward over time.
In words: measure how far off you are (the error), then push back in proportion to it. Far off → push hard; nearly there → ease off. That single idea is the heart of a P-controller (the "P" in PID).
Too small a \(K_p\): sluggish, never quite arrives. Too big: it overshoots and oscillates. The demo below lets you feel it.
In words: a P-controller only reacts to the error right now. Real robots add two more terms. The I (integral) term sums up past error, so a small, stubborn offset that never quite closes eventually gets stamped out. The D (derivative) term watches how fast the error is changing and pushes against it — that braking is what stops the ringing the Gain Tuner shows at high \(K_p\).
In code the integral is a running sum and the derivative is “this error minus last error” per tick. Tuning is a balance: more \(K_d\) settles the oscillation you can trigger in the Gain Tuner, while too much \(K_i\) causes overshoot and slow recovery, because the running sum keeps growing for as long as the error persists (“integral windup”).
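A minimal sketch of that description, assuming the common discrete form \(u_t = K_p e_t + K_i \sum_{k \le t} e_k\,\Delta t + K_d\,(e_t - e_{t-1})/\Delta t\). The class name `PID`, the toy plant in `run`, and the constant `drag` disturbance are hypothetical and chosen only to show the integral term removing a stubborn offset. This plant has no momentum, so the D term is defined but not exercised here.

```python
class PID:
    """Discrete PID controller: running-sum integral, difference derivative."""

    def __init__(self, kp, ki=0.0, kd=0.0, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0       # running sum of past error
        self.prev_error = None    # no derivative on the very first tick

    def step(self, error):
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt  # this error minus last error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def run(controller, target=1.0, drag=-0.2, steps=200):
    """Toy plant (an assumption): position moves by u each tick plus a constant drag."""
    x = 0.0
    for _ in range(steps):
        x += controller.step(target - x) + drag
    return x


p_only = run(PID(kp=0.5))            # settles short of the target
with_i = run(PID(kp=0.5, ki=0.1))    # integral term closes the gap
```

With P alone the toy plant settles where \(K_p e = 0.2\), that is at 0.6 instead of 1.0. Adding a modest \(K_i\) lets the running sum grow until it cancels the drag, and the position reaches the target.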
The dot must reach the dashed target line using only \(u_t = K_p\,e_t\). Drag the gain. Watch it crawl, settle smoothly, ring like a bell, or blow up.
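The four behaviors named above can be reproduced with a few lines of simulation. This is a sketch under assumptions, not the Gain Tuner's own code: the error is taken as \(e_t = \text{target} - x_t\), and the plant is the simplest possible one, where the command moves the dot directly (\(x_{t+1} = x_t + u_t\)). The specific gains are illustrative.

```python
def simulate_p(kp, target=1.0, x0=0.0, steps=30):
    """Positions of the dot under u_t = kp * e_t with x_{t+1} = x_t + u_t."""
    x = x0
    history = [x]
    for _ in range(steps):
        error = target - x     # e_t: how far off we are
        u = kp * error         # u_t = K_p * e_t
        x = x + u              # assumed plant: the command moves the dot directly
        history.append(x)
    return history


crawl = simulate_p(0.05)     # too small: still short of the target after 30 ticks
smooth = simulate_p(0.5)     # settles without overshoot
ring = simulate_p(1.8)       # overshoots, then rings down
blow_up = simulate_p(2.2)    # each correction is larger than the last
```

On this plant the error is multiplied by \((1 - K_p)\) every tick, so it shrinks smoothly for \(0 < K_p < 1\), alternates sign while shrinking for \(1 < K_p < 2\), and grows without bound for \(K_p > 2\). A plant with momentum would have different thresholds.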
while running:
    obs = sensors.read()                  # SENSE -> o_t
    action = policy(obs)                  # THINK -> a_t = pi(o_t)
    motors.execute(action)                # ACT   -> world: s_{t+1} = f(s_t, a_t)
    reward = environment.feedback()       # observe r_t
    policy.update(obs, action, reward)    # LEARN -> nudge pi toward more reward
sensors.read() — the SENSE phase; produces the observation \(o_t\).
policy(obs) — the THINK phase; the rule-book chooses \(a_t=\pi(o_t)\).
motors.execute(action) — the ACT phase; the world advances to \(s_{t+1}=f(s_t,a_t)\).
environment.feedback() — the reward \(r_t\) tells us how that went.
policy.update(...) — the LEARN phase; control fixes the error now, RL reshapes \(\pi\) for next time.
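To see the five steps run end to end, here is a self-contained toy version. Everything in it is an assumption made for illustration: the five-cell `Corridor` world, the reward of +10 at the dock and −1 per tick, and the learning rule, which is tabular Q-learning with epsilon-greedy exploration standing in for `policy.update`. It is one possible way to fill in the loop, not the method this page prescribes.

```python
import random

N_CELLS, DOCK = 5, 4                      # hypothetical corridor: cells 0..4, dock at the right end
ACTIONS = {"left": -1, "right": +1}
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2     # assumed learning rate, discount, exploration rate


class Corridor:
    """Toy world: s_{t+1} = f(s_t, a_t) keeps the robot inside the corridor."""

    def __init__(self):
        self.state = 0

    def read(self):
        return self.state                 # here o_t is simply the true cell

    def execute(self, action):
        self.state = min(max(self.state + ACTIONS[action], 0), N_CELLS - 1)
        return 10.0 if self.state == DOCK else -1.0   # reward r_t


def greedy(q, obs):
    """Pick the action with the highest learned value for this observation."""
    return max(ACTIONS, key=lambda a: q[(obs, a)])


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        world = Corridor()
        for _ in range(100):              # cap on ticks per episode
            obs = world.read()                                    # SENSE
            if rng.random() < EPSILON:                            # THINK (sometimes explore)
                action = rng.choice(list(ACTIONS))
            else:
                action = greedy(q, obs)
            reward = world.execute(action)                        # ACT, then observe r_t
            nxt = world.read()
            future = 0.0 if nxt == DOCK else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + GAMMA * future
            q[(obs, action)] += ALPHA * (target - q[(obs, action)])   # LEARN
            if nxt == DOCK:
                break
    return q


q_table = train()
learned = [greedy(q_table, s) for s in range(DOCK)]
```

After training, `learned` lists the greedy action for each non-dock cell. In this corridor the robot learns to head right from every cell, because each step toward the dock shortens the run of −1 ticks before the +10.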
Q1. A robot can only read \(o_t\), never the true \(s_t\) directly. What does that mean?
Q2. You raise \(K_p\) far too high in the Gain Tuner. What happens?
Q3. With discount \(\gamma=0.9\), why might a rover skip a far-away big reward?
Q4. In the sense–think–act–learn loop, which phase chooses the action?
Q5. A proportional controller drives the error \(e_t\) toward zero with \(u_t=K_p e_t\). What is \(e_t\)?
Q6. In the Grid-World Navigator, raising the obstacle-avoid weight relative to the seek weight makes the robot…
Q7. Pure proportional control often leaves a steady-state error or rings around the target. Which PID term mainly damps that ringing?
Q8. The Bellman optimality equation \(V^\star(s)=\max_a\!\big[r+\gamma\sum_{s'}P(s'\mid s,a)\,V^\star(s')\big]\) says the value of a state is…
Q9. Q-learning is called off-policy because…
Q10. PPO multiplies the advantage by the probability ratio \(r_t(\theta)=\tfrac{\pi_\theta(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\) but then clips it to \([1-\epsilon,\,1+\epsilon]\). Why?
Here the error \(e_t\) is the distance of the line from center. With small \(K_p\) the robot under-corrects and drifts off on sharp turns. Crank \(K_p\) up and it overshoots, weaving left and right across the line. Real robots add a damping term (the "D" in PID) to calm that ringing, the same lesson the Gain Tuner teaches by feel.
Recap
Frontier · research-grade
The levels above build the sense-think-act loop and proportional control. This tier is the theory of learning to act: the Bellman optimality operator and value iteration, the policy-gradient theorem, Q-learning, PPO, and the LQR optimal controller. Each topic is a guided lesson with step-through proofs, a worked example, a visualization, and citations.