Reinforcement Learning & Control

A robot is a loop with wheels, sensors, and opinions.

Every robot — a vacuum, a self-driving car, a Mars rover — runs the same tiny heartbeat over and over: sense the world, think about what to do, act through its motors, then learn from how it went. Round and round, many times a second.

Level 1 · Beginner

The robot heartbeat: Sense → Think → Act → Learn

Plain version: a robot is just a fast, tireless version of you crossing a street. You look both ways (sense), you decide it is safe (think), you walk (act), and if a car honks you remember to look more carefully next time (learn).

👁

1 · Sense

Robot: cameras, distance sensors, bump switches gather raw data.

You: your eyes and ears notice the cars and the light.

🧠

2 · Think

Robot: a program turns "what I see" into "what I should do".

You: "no cars, light is green — I can go".

⚙️

3 · Act

Robot: motors spin the wheels and the body moves.

You: your legs actually carry you across.

🎯

4 · Learn

Robot: compares result to goal and nudges its behavior.

You: a near-miss teaches you to wait an extra beat.

Try it · The Reflex Loop

Watch the robot run its heartbeat. Step moves one phase at a time; Auto-run cycles forever.

Worked example — a robot vacuum hits a chair leg

Sense — the front bump sensor fires: "something solid ahead".
Think — rule says "if bumped, back up and turn right 90°".
Act — wheels reverse, then one wheel spins to rotate.
Learn — it maps that spot as "obstacle here" so it avoids it sooner next pass.

The whole cycle takes a fraction of a second — and then it repeats.
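The four steps of the vacuum example can be sketched in a few lines of Python. This is only an illustration: `think`, `run_tick`, and the set-based `obstacle_map` are hypothetical names, not a real vacuum's firmware.

```python
# Hypothetical sketch of one tick of the vacuum's reflex loop.

def think(bumped):
    """The rule from the example: if bumped, back up and turn right 90 degrees."""
    return ["reverse", "turn_right_90"] if bumped else ["forward"]


def run_tick(position, bumped, obstacle_map):
    """SENSE is the `bumped` flag; returns the commands ACT would send to the wheels."""
    commands = think(bumped)            # THINK: pick a response from the rule
    if bumped:
        obstacle_map.add(position)      # LEARN: remember "obstacle here"
    return commands


obstacle_map = set()
commands = run_tick((2, 3), bumped=True, obstacle_map=obstacle_map)
```

After this tick, `commands` holds the back-up-and-turn response and `(2, 3)` is recorded in `obstacle_map`, so a later pass could steer around it.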

Level 2 · Intermediate

Open up each phase

Now we look inside the loop. Sensors produce observations. A policy is the rule-book that maps "what I observe" to "what I do". Controllers and motors execute the chosen move, and feedback corrects the next decision.

Sensors → observation

Camera (pixels), lidar (distances), touch (on/off). Together they form a snapshot of the world.

Policy: observation → action

A function that scores possible moves. Could be a hand-written rule or a learned network.

Controller / motors

Turn a desired action ("turn left") into voltages that spin real wheels.

Feedback

Did the action move us closer to the goal? Adjust before the next tick.

Sensors fuse into one observation; the policy reads it and commands the motors.

Try it · Grid-World Navigator

The robot wants the green goal but must dodge the rose wall. Its reactive policy mixes two urges — head toward the goal and push away from the obstacle. Tweak the weights, then run.

Try it · Value-Map Builder

A different kind of policy: instead of reacting, the robot figures out how good every square is. Each sweep updates a square's value from its best neighbour — goodness flows outward from the green dock and away from the rose pit. The arrows show the resulting greedy policy: always step toward the most valuable neighbour.

Press “Run one sweep” to spread value out from the dock.
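One way to picture a sweep is the sketch below. It is a simplified stand-in for the demo, not its actual code: the 3×4 grid, the dock and pit positions, the fixed values of +1 and −1, and the discount of 0.9 are all assumptions made for illustration.

```python
GAMMA = 0.9                      # assumed discount per step
ROWS, COLS = 3, 4                # hypothetical grid size
DOCK, PIT = (0, 3), (1, 3)       # hypothetical positions of the dock and the pit


def neighbours(cell):
    """Cells reachable in one step (up, down, left, right) that stay on the grid."""
    r, c = cell
    steps = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(nr, nc) for nr, nc in steps if 0 <= nr < ROWS and 0 <= nc < COLS]


def initial_values():
    values = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    values[DOCK] = 1.0           # goodness starts at the dock
    values[PIT] = -1.0           # and badness at the pit
    return values


def sweep(values):
    """Update every ordinary square from its best neighbour, discounted once."""
    new = dict(values)
    for cell in values:
        if cell not in (DOCK, PIT):
            new[cell] = GAMMA * max(values[n] for n in neighbours(cell))
    return new


def greedy_step(values, cell):
    """The arrow on a square: step toward the most valuable neighbour."""
    return max(neighbours(cell), key=lambda n: values[n])


values = initial_values()
for _ in range(10):
    values = sweep(values)
```

In this sketch the square next to the dock settles at 0.9, a square three steps away at 0.9³ ≈ 0.729, and the greedy arrow from the square below the pit points sideways rather than into it.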

Worked example — reading the policy

Say the goal is to the right (urge = →, strength 1.0) and a wall sits just to the upper-right (push = ↙, strength 1.2 × closeness). The policy adds the two arrows like vectors. The result points right and slightly down, so the robot curves under the wall instead of into it. A stronger avoid weight bends the path more.
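The vector sum in this example can be checked with a short sketch. The function names are hypothetical, `closeness` is assumed to run from 0 (far away) to 1 (touching), and the axes are assumed to be x to the right and y up.

```python
import math


def unit(v):
    """Scale a 2D vector to length 1 (the zero vector stays zero)."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n > 0 else (0.0, 0.0)


def reactive_heading(goal_dir, push_dir, closeness, seek_weight=1.0, avoid_weight=1.2):
    """Add the seek urge and the avoid urge as vectors; return the unit heading."""
    gx, gy = unit(goal_dir)
    px, py = unit(push_dir)
    push_strength = avoid_weight * closeness
    return unit((seek_weight * gx + push_strength * px,
                 seek_weight * gy + push_strength * py))


# Goal to the right, wall to the upper-right pushing down-left, medium closeness.
hx, hy = reactive_heading((1, 0), (-1, -1), closeness=0.5)
# Same scene with a stronger avoid weight.
hx2, hy2 = reactive_heading((1, 0), (-1, -1), closeness=0.5, avoid_weight=2.0)
```

With these assumed numbers the first heading has a positive x and a smaller negative y (right and slightly down), and the second heading dips further down, matching the "bends the path more" behavior.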

Level 3 · Advanced

The loop, written in math

The same heartbeat has a precise notation used across robotics and reinforcement learning. Each formula below comes with a plain reading, a symbol legend, and a tiny visual.

One control tick \(t\): the world's hidden state is observed, the policy thinks, the action changes the state and yields a reward — then it loops.

\[ a_t = \pi(o_t) \qquad\text{or, stochastically,}\qquad a_t \sim \pi(\cdot \mid o_t) \]

In words: the policy is the brain's rule-book — feed it what you observe right now and it hands back the action to take (either one fixed action, or a probability over actions you then sample from).

\(o_t\) — observation at tick \(t\) (the sensor snapshot)
\(a_t\) — the action taken at tick \(t\)
\(\pi\) — the policy (deterministic map, or a distribution)

oₜ → π → aₜ

\[ s_{t+1} = f(s_t,\, a_t) \]

In words: the world has its own physics. Take the current state, apply the robot's action, and the world rolls forward into the next state.

\(s_t\) — the true (often hidden) state of the world now
\(a_t\) — the action just taken
\(f\) — the environment dynamics (the physics)
\(s_{t+1}\) — the state one tick later

Note: the robot usually can't see \(s_t\) directly — it only gets \(o_t\), a partial, noisy view.

\[ G_t = \sum_{k\ge 0} \gamma^{k}\, r_{t+k} \;=\; r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \]

In words: add up all future rewards, but trust the far-off ones less. The discount \(\gamma\) shrinks each step further out, so "a treat now" beats "a treat much later".

\(r_t\) — reward received at tick \(t\) (good things +, bad things −)
\(G_t\) — the return: total discounted reward from \(t\) onward
\(\gamma\in[0,1)\) — discount factor (how much the future counts)
\(k\) — steps into the future
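As a quick check of the sum, here is a minimal sketch that computes \(G_t\) from a finite list of rewards. The function name `discounted_return` is made up for this example; it works backward using \(G_t = r_t + \gamma G_{t+1}\), which is the same sum written recursively.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):      # fold from the far future back to now
        g = r + gamma * g
    return g


three_treats = discounted_return([1.0, 1.0, 1.0])             # 1 + 0.9 + 0.81
late_treat = discounted_return([0.0, 0.0, 0.0, 0.0, 0.0, 10.0])  # 0.9**5 * 10
```

Three rewards of 1 add up to 2.71 rather than 3, and a reward of 10 that arrives five ticks out is worth about 5.9 now.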

\[ V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\,G_t \mid s_t = s\,\right] \;=\; \mathbb{E}_{\pi}\!\left[\,r_t + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s\,\right] \]

In words: the value of a state is the average return you expect if you start there and keep following policy \(\pi\). The clever part on the right: that long sum \(G_t\) folds into one short step — the reward you get now, plus the (discounted) value of wherever you land next. A robot never has to add up an infinite future; it only compares “here” to “one step ahead”.

\(V^{\pi}(s)\) — the value of state \(s\) under policy \(\pi\) (expected discounted return)
\(\mathbb{E}_{\pi}[\cdot]\) — average over the rewards and next states \(\pi\) actually produces
\(r_t + \gamma V^{\pi}(s_{t+1})\) — one-step reward plus the discounted value of the next state
\(\gamma\) — the same discount as in \(G_t\) above, reused here

This recursive form is the Bellman expectation equation. Replacing the average with a “pick the best action” \(\max_a\) turns it into the Bellman optimality equation — the fixed point the Frontier tier solves by value iteration.
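To see the one-step form at work, here is a small sketch on a made-up three-cell corridor. Everything about the setup is an assumption chosen for illustration: the policy always drives right, each step costs −1, the step into the dock pays +10 instead, and \(\gamma = 0.9\). Because both the policy and the world are deterministic here, the expectation is just a single term.

```python
GAMMA = 0.9
DOCK = 2          # hypothetical corridor: cells 0 and 1, then the dock at 2


def step(state):
    """Follow the fixed 'always drive right' policy for one tick."""
    nxt = state + 1
    reward = 10.0 if nxt == DOCK else -1.0
    return reward, nxt


def evaluate(sweeps=20):
    """Repeat V(s) <- r + gamma * V(s') until the values stop changing."""
    values = [0.0, 0.0, 0.0]          # the dock is terminal, so V(dock) stays 0
    for _ in range(sweeps):
        for s in range(DOCK):
            reward, nxt = step(s)
            values[s] = reward + GAMMA * values[nxt]
    return values


values = evaluate()
```

The values settle at \(V(1) = 10\) and \(V(0) = -1 + 0.9 \times 10 = 8\), the same number you get by summing the discounted rewards along the whole path.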

Worked example — a delivery rover

Reward +10 for reaching the dock, −1 for every tick spent driving, −20 for a crash. With \(\gamma=0.9\), a +10 reward arriving 5 ticks from now is only worth \(0.9^{5}\times 10 \approx 5.9\) today. That shrinkage, together with the −1 per-tick cost, is why a discounted robot prefers the short route over a long scenic one; the −20 crash penalty is what keeps it on the safe one.
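The route comparison can be worked out directly. The sketch below assumes the rover pays −1 on each driving tick and collects the +10 exactly `ticks_to_dock` ticks from now; the function name and the two route lengths are illustrative choices, not part of the original example.

```python
GAMMA = 0.9


def route_return(ticks_to_dock, gamma=GAMMA):
    """Discounted return of a crash-free route that takes `ticks_to_dock` ticks."""
    driving = sum(-1.0 * gamma**k for k in range(ticks_to_dock))   # -1 per tick
    arrival = 10.0 * gamma**ticks_to_dock                          # +10 at the dock
    return driving + arrival


short_route = route_return(2)    # -(1 + 0.9) + 8.1
long_route = route_return(5)     # -(4.0951) + 5.9049
```

Under these assumptions the two-tick route is worth 6.2 and the five-tick route only about 1.81, so the shorter route wins even though both end at the same dock.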

Level 4 · Expert

Closing the loop: control & learning

The "learn" arrow has two flavors. Feedback control fixes errors right now (a thermostat, a self-balancing scooter). Reinforcement learning slowly reshapes the whole policy to earn more reward over time.

\[ e_t = \text{target} - \text{measured},\qquad u_t = K_p\, e_t \]

In words: measure how far off you are (the error), then push back in proportion to it. Far off → push hard; nearly there → ease off. That single idea is the heart of a P-controller (the "P" in PID).

\(e_t\) — error: target minus what was measured
\(u_t\) — control output (the correction sent to the motor)
\(K_p\) — proportional gain (how aggressively you react)

Too small a \(K_p\): sluggish, never quite arrives. Too big: it overshoots and oscillates. The demo below lets you feel it.
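A few lines of simulation show the same behavior. This is a sketch under a strong simplifying assumption: the control output is treated as a velocity, so the position updates as `x += u * dt`. The Gain Tuner's actual dynamics are not given here and may differ.

```python
def simulate_p(kp, target=1.0, x0=0.0, dt=0.1, steps=50):
    """Drive x toward target with u = kp * error; return the position history."""
    x = x0
    history = [x]
    for _ in range(steps):
        error = target - x        # e_t = target - measured
        u = kp * error            # u_t = K_p * e_t
        x += u * dt               # assumed plant: u acts as a velocity
        history.append(x)
    return history


sluggish = simulate_p(kp=0.1)     # crawls: still far from the target after 50 ticks
smooth = simulate_p(kp=1.0)       # settles without overshoot
ringing = simulate_p(kp=15.0)     # overshoots, then rings down
unstable = simulate_p(kp=25.0)    # each correction is bigger than the last
```

In this toy model the error is multiplied by \(1 - K_p\,dt\) every tick, which is why a gain between 10 and 20 (with `dt = 0.1`) overshoots but still settles, while anything above 20 grows without bound.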

\[ u_t = K_p\, e_t \;+\; K_i \!\int_0^{t} e_\tau \, d\tau \;+\; K_d\, \frac{d e_t}{dt} \]

In words: a P-controller only reacts to the error right now. Real robots add two more terms. The I (integral) term sums up past error, so a small, stubborn offset that never quite closes eventually gets stamped out. The D (derivative) term watches how fast the error is changing and pushes against it — that braking is what stops the ringing the Gain Tuner shows at high \(K_p\).

\(K_p\, e_t\) — present: react in proportion to the current error
\(K_i \int e\,d\tau\) — past: accumulate lingering error to erase steady-state offset
\(K_d\, \tfrac{de}{dt}\) — future: anticipate by damping fast changes (calms overshoot)

In code the integral is a running sum and the derivative is “this error minus last error” per tick. Tuning is a balance: more \(K_d\) settles the oscillation you can trigger in the Gain Tuner at high \(K_p\), while too much \(K_i\) causes overshoot and slow oscillation of its own. A related hazard is “integral windup”: when the motor is already at its limit, the running sum keeps growing and takes a long time to unwind.
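Written out, the discrete version might look like the sketch below. The class name `PID` and the optional `integral_limit` clamp (one simple guard against windup) are choices made for this example, not something the lesson prescribes.

```python
class PID:
    """Minimal discrete PID: running-sum integral, finite-difference derivative."""

    def __init__(self, kp, ki, kd, integral_limit=None):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral_limit = integral_limit   # assumed anti-windup clamp
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        # I term: accumulate error over time
        self.integral += error * dt
        if self.integral_limit is not None:
            lim = self.integral_limit
            self.integral = max(-lim, min(lim, self.integral))
        # D term: this error minus last error, per unit time (zero on the first tick)
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PID(kp=2.0, ki=0.5, kd=0.1)
u1 = pid.update(error=1.0, dt=0.1)   # 2.0 + 0.5*0.1 + 0
u2 = pid.update(error=0.5, dt=0.1)   # 1.0 + 0.5*0.15 + 0.1*(-5)
```

On the second tick the error is shrinking fast, so the D term subtracts 0.5 from the output: that is the braking described above.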

Try it · Gain Tuner (1D seek & balance)

The dot must reach the dashed target line using only \(u_t = K_p\,e_t\). Drag the gain. Watch it crawl, settle smoothly, ring like a bell, or blow up.

The sim-to-real gap

A policy perfect in simulation can fail on hardware: real sensors are noisy, motors lag, friction differs. Tick the noise box above to see a smooth controller start to jitter.

RL improves the policy

Instead of hand-tuning rules, reinforcement learning adjusts \(\pi\) to maximize the expected return \(\mathbb{E}[G_t]\): actions that led to reward become more likely, ones that led to crashes become less likely.
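The "more likely, less likely" idea fits in a tiny sketch. It uses a softmax policy over two actions and the standard REINFORCE-style update (preference change = learning rate × reward × gradient of the log-probability). The two-action setup, the rewards of +1 and −1, and all names are assumptions made for illustration.

```python
import math
import random

REWARDS = [1.0, -1.0]   # hypothetical: action 0 reaches the goal, action 1 crashes


def softmax(prefs):
    """Turn preferences into action probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]                      # start with no opinion
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices([0, 1], weights=probs)[0]   # sample from the policy
        reward = REWARDS[action]
        for a in range(2):
            # d/d_pref[a] of log pi(action) = 1[a == action] - pi(a)
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * reward * grad
    return softmax(prefs)


final_probs = train()
```

Whichever action gets sampled, the update raises the preference for the rewarded action and lowers it for the crashing one, so the probability of action 0 climbs toward 1 over training.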

Latest (2024–2026)

Modern robots increasingly cross the sim-to-real gap with domain randomization: training in simulation across thousands of randomized frictions, masses, and sensor delays so the learned policy treats the messy real world as just one more variation. Locomotion controllers for legged robots are now routinely trained this way and deployed zero-shot, and policies are often fine-tuned with PPO-style updates on the physical robot to close the last of the gap.
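The sampling half of domain randomization is simple to sketch: draw a fresh set of physics parameters for every training episode. The parameter names and ranges below are invented for illustration and do not come from any particular simulator.

```python
import random

# Hypothetical ranges, chosen only for illustration.
RANGES = {"friction": (0.5, 1.5), "mass_kg": (0.8, 1.2)}
MAX_DELAY_TICKS = 3


def sample_sim_params(rng):
    """Draw one randomized variant of the simulated world."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    params["sensor_delay_ticks"] = rng.randint(0, MAX_DELAY_TICKS)
    return params


def randomized_episodes(n, seed=0):
    """One parameter set per training episode."""
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(n)]


episodes = randomized_episodes(100)
```

A training loop would rebuild the simulated world from each entry in `episodes` before running the policy, so no single friction or delay value gets memorized.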

The loop in pseudocode

while running:
    obs    = sensors.read()             # SENSE -> o_t
    action = policy(obs)                # THINK -> a_t = pi(o_t)
    motors.execute(action)              # ACT   -> world: s_{t+1} = f(s_t, a_t)
    reward = environment.feedback()     # observe r_t
    policy.update(obs, action, reward)  # LEARN -> nudge pi toward more reward
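For a version you can actually run, here is a toy sketch of the same loop. Everything in it is a stand-in: `ToyWorld` plays the sensors, motors, and environment at once on a hypothetical five-cell corridor, and `TablePolicy` learns by keeping the average reward seen for each (observation, action) pair.

```python
GOAL = 4   # hypothetical corridor with cells 0..4; the goal is the last cell


class ToyWorld:
    """Stand-in for sensors, motors, and environment in one object."""

    def __init__(self):
        self.position = 0
        self.last_reward = 0.0

    def read(self):                       # SENSE: observation is the position
        return self.position

    def execute(self, action):            # ACT: move one cell left (-1) or right (+1)
        before = abs(GOAL - self.position)
        self.position = min(GOAL, max(0, self.position + action))
        after = abs(GOAL - self.position)
        self.last_reward = 1.0 if after < before else -1.0

    def feedback(self):                   # reward: +1 if we got closer, else -1
        return self.last_reward


class TablePolicy:
    ACTIONS = (-1, +1)

    def __init__(self):
        self.mean = {}     # (obs, action) -> average reward seen so far
        self.count = {}

    def __call__(self, obs):              # THINK
        for action in self.ACTIONS:       # try each action once, then be greedy
            if (obs, action) not in self.mean:
                return action
        return max(self.ACTIONS, key=lambda a: self.mean[(obs, a)])

    def update(self, obs, action, reward):   # LEARN: running average of reward
        key = (obs, action)
        n = self.count.get(key, 0) + 1
        old = self.mean.get(key, 0.0)
        self.count[key] = n
        self.mean[key] = old + (reward - old) / n


def run_loop(max_ticks=100):
    world = ToyWorld()
    sensors = motors = environment = world
    policy = TablePolicy()
    ticks = 0
    while world.position != GOAL and ticks < max_ticks:
        obs = sensors.read()
        action = policy(obs)
        motors.execute(action)
        reward = environment.feedback()
        policy.update(obs, action, reward)
        ticks += 1
    return world, policy, ticks


world, policy, ticks = run_loop()
```

The robot wastes a tick in each cell trying the wrong direction once, learns that stepping right pays off, and reaches the goal; afterwards the policy picks +1 in every cell it visited.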

Challenge · check your reflexes

Q1. A robot can only read \(o_t\), never the true \(s_t\) directly. What does that mean?

Q2. You raise \(K_p\) far too high in the Gain Tuner. What happens?

Q3. With discount \(\gamma=0.9\), why might a rover skip a far-away big reward?

Q4. In the sense–think–act–learn loop, which phase chooses the action?

Q5. A proportional controller drives the error \(e_t\) toward zero with \(u_t=K_p e_t\). What is \(e_t\)?

Q6. In the Grid-World Navigator, raising the obstacle-avoid weight relative to the seek weight makes the robot…

Q7. Pure proportional control often leaves a steady-state error or rings around the target. Which PID term mainly damps that ringing?

Q8. The Bellman optimality equation \(V^\star(s)=\max_a\!\big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,V^\star(s')\big]\) says the value of a state is…

Q9. Q-learning is called off-policy because…

Q10. PPO multiplies the advantage by the probability ratio \(r_t(\theta)=\tfrac{\pi_\theta(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\) but then clips it to \([1-\epsilon,\,1+\epsilon]\). Why?

Worked example — tuning a line-following robot

Error \(e_t\) is the distance of the line from the robot's center. With a small \(K_p\) the robot drifts off on sharp turns (under-correction). Crank \(K_p\) up and it weaves left and right across the line (overshoot and oscillation). Real robots add a damping term (the "D" in PID) to calm that ringing, the same lesson the Gain Tuner teaches by feel.

Recap

You can now run the loop

Beginner

Sense → Think → Act → Learn is one repeating heartbeat.

Intermediate

Sensors make observations; a policy maps them to actions; motors execute; feedback corrects.

Advanced

\(a_t=\pi(o_t)\), \(s_{t+1}=f(s_t,a_t)\), and the discounted return \(G_t\).

Expert

Proportional control \(u_t=K_p e_t\), the sim-to-real gap, and RL improving \(\pi\).

Frontier · research-grade

From the control loop to reinforcement learning.

The levels above build the sense-think-act loop and proportional control. This tier is the theory of learning to act: the Bellman optimality operator and value iteration, the policy-gradient theorem, Q-learning, PPO, and the LQR optimal controller. Each topic is a guided lesson with step-through proofs, a worked example, a visualization, and citations.