Don't write the rules. Let the machine learn them from examples.
Old-school software is a stack of hand-written if/else rules.
Machine learning flips that: you show the computer lots of examples
and it discovers the pattern by itself. This is the Data Workshop:
a bench where you feed in examples and watch a model take shape.
By the end you will be able to:
Tell rules from learning
Read a dataset
Fit a line & a boundary
Pick the right loss
Spot overfitting
Cluster & classify
Level 1
Beginner: what is "learning from data"?
Plain words first. No math, no jargon. Just examples, and a machine that notices patterns.
The big idea
Rules are written by hand. Learning is grown from examples.
Suppose you want to tell cats from dogs in photos. With rules you'd try to
write them yourself: "if pointy ears and whiskers then cat...", and you'd fail fast,
because there are endless exceptions. With machine learning you instead show
the computer thousands of labelled photos and let it figure out the pattern.
Recipe vs. tasting analogy
A rules program is a fixed recipe. A learner is a chef who tastes
many dishes and learns what "good" means, then can judge a brand-new dish it has
never seen.
When to use which
Rules win: "Is this number even?" A one-line rule is perfect and exact.
Learning wins: "Is this email spam?" Too many fuzzy patterns to hand-write.
Rules win: converting Celsius to Fahrenheit, a known formula.
Learning wins: recognising a face, or predicting tomorrow's sales.
Rule of thumb: if you can write the exact rule, do. If the pattern is fuzzy and you have lots of examples, learn it.
Game · "Rules or Learn?"
Read each scenario and decide: would you solve it with hand-written rules, or by
learning from examples? Click your answer for instant feedback.
The raw material
Examples, features, and labels: the words of the workshop.
An example is one thing you know about (one house, one email). Its
features are the measurable clues (size, bedrooms, location). The
label is the answer you want to predict (the price). A pile of examples,
each with features and a label, is a dataset.
Good features carry signal about the label. Useless features are noise;
they only distract the model.
A tiny house dataset
size (m²) | beds | shoe size of owner | price (k)
60        | 2    | 42                 | 180
95        | 3    | 38                 | 270
120       | 4    | 44                 | 340
Two columns are features that predict price; one is pure noise. Can you spot the useless one? Play the game below.
Game · "Build the dataset"
You're predicting house price. Toggle the features you think actually
predict the price. Useful ones boost your signal; noise drags it down.
Aim for the strongest predictive set.
Pick the features that carry real signal about price.
Three flavours of learning
Supervised, unsupervised, reinforcement.
Supervised: you have labelled examples (questions and answers).
The model learns to predict the answer.
Unsupervised: no labels, just data. The model finds structure on its own,
like grouping similar things together.
Reinforcement: an agent acts in a world and learns from rewards and
penalties, trying to score as much as possible over time.
Everyday picture
Supervised: flashcards with the answer on the back.
Unsupervised: sorting your laundry into piles by colour, with nobody telling you the "right" piles.
Reinforcement: training a dog with treats. Good move, treat; bad move, no treat.
Game · "Which kind of learning?"
Sort each real scenario into the right family. Instant feedback explains the why.
Worked example: a spam filter as ML
We collect 10,000 emails, each marked spam or not spam (labels). Features
might be: count of the word "free", number of links, ALL-CAPS ratio. The model reads all
10,000 and learns weights for each feature. On a brand-new email it adds up the evidence
and outputs a probability of spam.
10k labelled emails → extract features → train model → judge new email
This is supervised classification, exactly what we build hands-on in the next level.
How learning actually happens
The learning loop: guess, check the error, nudge, repeat.
A model doesn't get it right in one shot. It runs a little loop: take some
data, let the model make a prediction, compare that to the right
answer to get an error, then update the model a tiny bit to shrink that
error. Round and round, the error gets smaller and the model gets smarter.
That's the whole secret of "learning from examples": there's no magic, just thousands
of tiny corrections.
One trip around the loop
Data: a house that really sold for $300k.
Prediction: the model guesses $250k.
Error: it was $50k too low.
Update: nudge the weights up a touch, and the next guess is closer.
Animation · the learning loop
Watch a packet of evidence travel the loop: data → model → prediction → error →
update, then back to the start. Each lap, the error bar on the right shrinks a little:
that's the model improving. Press pause if you want to study a single stage.
Every full lap = one round of training. The error never quite hits zero, and that's fine.
Level 2
Intermediate: supervised learning, hands on
Now we actually fit models: a line for numbers, a boundary for classes.
Two jobs
Regression predicts a number. Classification predicts a category.
Regression: the answer is a quantity on a scale, like a price, a temperature, a
height. "How much?"
Classification: the answer is one of a few buckets: spam/not-spam, cat/dog,
yes/no. "Which one?"
Spot the job
Regression: predict a house's price in dollars.
Regression: predict tomorrow's temperature.
Classification: is this tumour benign or malignant?
Classification: which of 10 digits is this handwritten number?
Game · "Fit the line" (linear regression)
The model is the straight line \(\hat y = wx + b\). Drag the slope and
intercept sliders (or drag the line's endpoints on the chart) to push the
line through the cloud of points and minimise the error. Lower MSE is better.
Then hit Auto-fit to watch gradient descent solve it.
Game · "Draw the boundary" (classification)
Two classes of points sit on the plane. Move and rotate a straight
dividing line so that one class lands on each side. Your score is the share of points
classified correctly. Can you beat 95%?
\[ \hat y \;=\; \sigma(z) \;=\; \frac{1}{1 + e^{-z}}, \qquad z = w^\top x + b \]
In words: the line score \(z = w^\top x + b\) is a raw number from \(-\infty\) to \(+\infty\). The sigmoid \(\sigma\) squashes it into a probability between 0 and 1. We then call it class 1 when that probability passes a threshold (0.5 by default). This is logistic regression, the exact rule the "Auto-fit (logistic)" button above is solving for.
\(z = w^\top x + b\) - signed distance-ish score; \(z=0\) is the boundary line itself
threshold 0.5 - move it up to favour precision, down to favour recall (see the threshold demo in Level 3)
training - fit \(w,b\) by minimising the binary cross-entropy loss defined in Level 3
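The score-squash-threshold rule can be sketched in a few lines of plain Python. The weights, bias, and feature values here are made up for illustration; a trained model would learn them from data.

```python
import math

def sigmoid(z):
    """Squash a raw score z into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return (probability of class 1, predicted class) for one feature vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b   # z = w^T x + b
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights for a 2-feature problem (illustration only).
w, b = [2.0, -1.0], 0.5
p, label = predict([1.0, 0.5], w, b)
print(round(p, 3), label)

# A stricter threshold can flip the same point to class 0.
_, strict_label = predict([1.0, 0.5], w, b, threshold=0.9)
print(strict_label)
```

Raising the threshold makes the classifier pickier without retraining anything: only the final comparison changes.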
Don't fool yourself
Train on some data, test on data it has never seen.
If you let a model memorise the answers, it'll ace the questions it studied and flop
on new ones. So we hide a slice of data (the test set), train on the rest, and
judge honestly on the hidden slice. That measures generalization.
Train vs test accuracy
Toggle "memorise hard" and watch train accuracy stay high while test accuracy collapses: the classic overfit gap.
Low flexibility = simple model. High = memoriser.
How "Auto-fit" really works
Gradient descent: roll downhill on the error landscape.
Every choice of slope \(w\) and intercept \(b\) gives some error. Plot that error for
all choices and you get a bowl-shaped loss surface. Training just means
standing somewhere on that bowl and repeatedly stepping downhill (in the
direction the slope drops fastest) until you reach the bottom: the best line.
The step size is the learning rate. Too small and you crawl; too big and you
bounce around or even fly out of the bowl. Watch both failure modes below.
The downhill rule
Each step nudges the weights opposite the gradient (the uphill direction):
\(\;w \leftarrow w - \eta\,\dfrac{\partial L}{\partial w}\). Big gradient → big step;
near the bottom the gradient flattens and the steps shrink automatically.
η too small: tiny steps, takes forever.
η just right: smooth glide to the minimum.
η too big: overshoots, zig-zags, can diverge.
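A minimal sketch of the downhill rule on the line-fitting problem, using made-up points that lie exactly on \(y = 2x + 1\). The three learning rates are chosen by hand for this particular data; other data would need other values.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y_hat = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, lr, steps=300):
    """Plain batch gradient descent on MSE, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))  # dL/dw
        grad_b = (2 / n) * sum(errors)                             # dL/db
        w -= lr * grad_w   # step opposite the gradient
        b -= lr * grad_b
    return w, b

# Made-up points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

final_loss = {}
for lr in (0.0001, 0.1, 0.2):   # too small, about right, too big (for this data)
    w, b = gradient_descent(xs, ys, lr)
    final_loss[lr] = mse(w, b, xs, ys)
    print(f"lr={lr}: w={w:.3g}, b={b:.3g}, MSE={final_loss[lr]:.3g}")
```

With the smallest rate the loss has barely moved after 300 steps; with the largest, each step overshoots further than the last and the loss ends up far above where it started.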
Animation · gradient descent on the loss surface
The heat-map is the error for every \((w,b)\): dark valley = low error, bright rim =
high. Click anywhere to drop the ball there, then watch it roll downhill in real
time. Change the learning rate and re-run to see it crawl, glide, or overshoot.
Click the map to choose where the ball starts.
Worked example: predicting a tip
We learn \(\hat y = wx + b\) where \(x\) is the bill and \(y\) the tip. After training we
get \(w=0.18,\; b=0.5\). For a \(\$50\) bill: \(\hat y = 0.18(50) + 0.5 = \$9.5\). For a
\(\$20\) bill: \(\hat y = 0.18(20) + 0.5 = \$4.1\). The slope \(w\) is "tip per dollar",
the intercept \(b\) is a base tip.
Training is just finding the \(w\) and \(b\) that make these predictions as close to real tips as possible, which is exactly the MSE you minimised in the game.
Level 3
Advanced: the real math
Loss functions, the bias-variance trade-off, regularization, and metrics, typeset properly.
How we measure "wrong"
Two loss functions you'll meet everywhere.
Training is minimising a loss. Regression usually uses Mean Squared Error; classification uses cross-entropy.
\[ L \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat y_i - y_i\bigr)^2 \]
In words: for each example, measure how far the prediction missed,
square it (so big misses count extra and opposite signs can't cancel), then average over all examples.
This is Mean Squared Error (MSE).
\(n\) - number of examples
\(\hat y_i\) - the model's prediction for example \(i\)
\(y_i\) - the true target for example \(i\)
\((\hat y_i - y_i)^2\) - squared error of one example
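As a quick check of the formula, here is MSE written out directly. The prices are made up (in $k) purely to have numbers to plug in.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared prediction errors, matching the formula above."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n

y_true = [300.0, 180.0, 270.0]   # hypothetical true prices
y_pred = [250.0, 190.0, 270.0]   # hypothetical predictions
loss = mean_squared_error(y_true, y_pred)
print(loss)   # (50^2 + 10^2 + 0^2) / 3
```

The single $50k miss contributes 2500 of the 2600 total, which is the "big misses dominate" behaviour squaring is meant to produce.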
\[ L \;=\; -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i)\,\Bigr] \]
In words: when the true label is 1 it punishes a low predicted
probability; when the true label is 0 it punishes a high one. Confident-and-wrong is
penalised harshly. This is binary cross-entropy (log loss).
\(\hat y_i\) - predicted probability the label is 1 (between 0 and 1)
\(y_i\) - the true label, 0 or 1
\(\log\) - natural log; \(\log\hat y_i\) is very negative when \(\hat y_i\) is tiny
the minus sign - flips it so smaller loss = better
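A small sketch of binary cross-entropy, showing how sharply it separates confident-and-right from confident-and-wrong. The clipping constant `eps` is an implementation convenience assumed here to keep the logarithm finite; it is not part of the formula.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean log loss for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# True label is 1 in both cases; only the predicted probability differs.
confident_right = binary_cross_entropy([1], [0.99])
confident_wrong = binary_cross_entropy([1], [0.01])
print(confident_right, confident_wrong)
```

The second value is several hundred times the first, even though both predictions are equally "confident".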
See the loss penalties
Left: MSE grows like a bowl as a single prediction drifts from its true value. Right:
cross-entropy shoots toward infinity as a confident probability heads the wrong way.
The central trade-off
Underfit, overfit, and the sweet spot.
A model too simple underfits (high bias: misses the real pattern). A model too
flexible overfits (high variance: memorises noise). The art is the middle.
Game · "Pick the model"
Slide the polynomial degree from 1 (a straight line) to 12 (a wild squiggle).
Watch the fit on the data, plus train vs validation error. Pick the degree that
generalizes best (lowest validation error) and lock it in to score.
Underfitting: the line is too stiff.
\[ L_{\text{reg}} \;=\; L \;+\; \lambda \lVert w \rVert_2^2 \;=\; L \;+\; \lambda \sum_j w_j^2 \]
In words: add a penalty for big weights to the loss. The model now
has to justify every large weight with a real drop in error, so it stays smoother and
overfits less. This is L2 (ridge) regularization.
\(L\) - the original data loss (e.g. MSE)
\(\lambda\) - regularization strength: bigger = simpler, smoother model
\(\lVert w \rVert_2^2 = \sum_j w_j^2\) - squared size of the weight vector (L2)
L1 variant - use \(\lambda\sum_j |w_j|\) to push weights to exactly 0 (sparsity)
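To see the penalty shrink a weight, consider the simplest possible case: one weight, no intercept, and MSE as the data loss. Under those assumptions the regularized loss \(\tfrac{1}{n}\sum_i (w x_i - y_i)^2 + \lambda w^2\) has a closed-form minimiser, sketched below on made-up data.

```python
def ridge_slope(xs, ys, lam):
    """Minimise (1/n) * sum((w*x - y)^2) + lam * w^2 over a single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + n*lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # made-up data lying exactly on y = 2x

slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0)]
print(slopes)   # the slope shrinks toward 0 as lam grows
```

With \(\lambda = 0\) the fit recovers the true slope; larger \(\lambda\) trades some data fit for a smaller weight.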
Interactive · regularization smooths the fit
A degree-9 curve is fitted to noisy data. Raise \(\lambda\) and watch the wild wiggles relax into a calm, general curve.
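\[ \text{precision} = \frac{TP}{TP+FP}, \qquad \text{recall} = \frac{TP}{TP+FN}, \qquad F_1 = \frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} \]
In words: precision asks "of the things I flagged positive, how many
were right?"; recall asks "of all the real positives, how many did I catch?"; F1 is their
harmonic mean, high only when both are high.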
\(FP\) - false positives (flagged positive but actually negative)
\(FN\) - false negatives (missed a real positive)
\(F_1\) - balances precision and recall in one number
Interactive · drag the threshold
Each dot is an example with a predicted score; reds are truly positive, greys truly
negative. Slide the decision threshold: everything to the right is called
"positive". Watch the confusion matrix and the metrics update live.
Low threshold = catch everything (high recall, low precision). High = be picky.
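The threshold demo can be imitated with a handful of scored examples. The scores and labels below are invented; the sketch only assumes that every score at or above the threshold is called positive.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, FN, TN when every score >= threshold is called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Metrics from confusion counts; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # hypothetical model scores
labels = [0, 0, 1, 0, 1, 1]               # 1 = truly positive

for threshold in (0.2, 0.5, 0.8):
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    print(threshold, (tp, fp, fn, tn), round(precision, 2), round(recall, 2), round(f1, 2))
```

At 0.2 every real positive is caught but two negatives are flagged too; at 0.8 nothing is wrongly flagged but two positives are missed.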
Every threshold at once
The ROC curve & AUC: judge a classifier across all thresholds.
A single threshold gives one precision/recall pair. But which threshold? The ROC
curve sweeps the threshold from high to low and plots the true-positive rate
against the false-positive rate at each step. The area under that curve (AUC)
is one tidy number: the chance the model ranks a random positive above a random negative.
0.5 is a coin flip; 1.0 is perfect.
\[ \text{TPR} = \frac{TP}{TP+FN}, \qquad \text{FPR} = \frac{FP}{FP+TN} \]
In words: the true-positive rate is the share of real positives we
catch; the false-positive rate is the share of real negatives we wrongly flag. As we lower
the threshold both rise; the ROC curve traces that trade-off, and AUC is the total area
beneath it.
AUC - area under the ROC curve, ranging from 0 to 1: 0.5 is random, 1 is perfect, and below 0.5 means the ranking is worse than chance
the diagonal - a model with no skill (AUC = 0.5)
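The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. This pairwise version is a sketch for small data (it is quadratic in the number of examples); the scores are invented.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # hypothetical model scores
labels = [0, 0, 1, 0, 1, 1]
auc = auc_by_ranking(scores, labels)
print(round(auc, 3))   # 8 of the 9 pairs are ranked correctly
```

Only one pair is out of order here (the positive at 0.4 sits below the negative at 0.6), so the model is close to, but not at, a perfect ranking.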
Animation · sweep the threshold, trace the ROC
Left: the same scored examples, with a threshold sliding from right (catch nothing) to
left (catch everything). Right: each threshold plots one point of the ROC curve,
and the area underneath fills in as the curve is drawn. Drag the threshold yourself, or
press Sweep to animate the whole curve. A harder dataset (more overlap)
pulls the curve toward the diagonal and drops AUC.
Bias is being consistently off-target (your shots cluster, but in the wrong place).
Variance is being scattered (your shots spray everywhere, even if they average out
near the centre). The four dartboards below animate every combination, and the goal is the
bottom-left: low bias and low variance.
Animation · the bias-variance dartboard
Each board is a different model. Watch the darts (repeated trainings on fresh data) land:
tight-and-centred is ideal; tight-but-off is high bias; spread-but-centred is high
variance; spread-and-off is the worst of both.
More flexible models usually cut bias but raise variance: the central tension of ML.
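\[ \mathbb{E}\bigl[(y - \hat f(x))^2\bigr] \;=\; \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\bigr]}_{\text{variance}} \;+\; \sigma^2 \]
In words: the dartboards above are this formula made visible. Expected squared error splits into exactly three pieces: how far off-centre the model is on average (bias), how much it jitters from one training set to the next (variance), and irreducible noise in the data you can never fit away. Lowering bias usually raises variance and vice versa, while the noise term stays fixed; that is the trade-off.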
\(f(x)\) - the true function; \(\hat f(x)\) is what your model learned
\(\text{bias}^2\) - squared gap between the average prediction and the truth (too-simple models)
variance - how much \(\hat f\) swings as the training data changes (too-flexible models)
\(\sigma^2\) - noise floor; even a perfect model cannot beat it
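A small simulation makes the bias and variance pieces concrete. Everything here is invented for illustration: the true value at one fixed \(x\), the noise level, and two toy "models" that average five noisy observations and optionally shrink the result toward zero. Error is measured against the true value, so the \(\sigma^2\) term does not appear; scoring against fresh noisy observations would add it.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 5.0   # f(x) at one fixed x (made up)
NOISE_SD = 2.0     # sigma of the observation noise (made up)

def train_once(shrink, m=5):
    """One 'training run': average m noisy observations, then shrink toward 0."""
    sample = [TRUE_VALUE + random.gauss(0, NOISE_SD) for _ in range(m)]
    return shrink * statistics.mean(sample)

def decompose(shrink, runs=2000):
    """Estimate bias^2, variance, and squared error against the truth."""
    preds = [train_once(shrink) for _ in range(runs)]
    bias_sq = (statistics.mean(preds) - TRUE_VALUE) ** 2
    variance = statistics.pvariance(preds)
    error = statistics.mean([(p - TRUE_VALUE) ** 2 for p in preds])
    return bias_sq, variance, error

stiff = decompose(shrink=0.5)      # biased toward 0, but steadier
flexible = decompose(shrink=1.0)   # nearly unbiased, but jumpier
for name, (bias_sq, variance, error) in (("stiff", stiff), ("flexible", flexible)):
    print(name, round(bias_sq, 3), round(variance, 3), round(error, 3))
```

In each row the error equals bias² plus variance, and the two toy models sit at opposite ends of the trade-off.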
Diagnose with data size
Learning curves: does more data help?
Plot training and validation error as the training set grows. Two telltale shapes
appear. High bias: both curves flatten at a high error and hug each other, so more
data won't help; you need a richer model. High variance: a big gap remains between low
training error and high validation error, and more data will help close it.
Animation · learning curves grow with the dataset
Drag the model from high bias (too simple) to high variance (too flexible)
and watch the two error curves redraw as the dataset grows from a handful of points to
many. Read the gap: a stubborn gap means more data helps; converged-but-high means you've
hit the model's ceiling.
Low capacity = high bias; high capacity = high variance.
Worked example: precision vs recall by hand
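A filter flags 100 emails as spam. 80 really are spam (\(TP=80\)); 20 were innocent
(\(FP=20\)). It also missed 40 real spams (\(FN=40\)).
\[ \text{precision} = \frac{80}{80+20} = 0.80, \qquad \text{recall} = \frac{80}{80+40} \approx 0.67, \qquad F_1 = \frac{2(0.80)(0.67)}{0.80+0.67} \approx 0.73 \]
It's precise (few false alarms) but misses a third of real spam (lower recall). Lowering the threshold would catch more spam at the cost of more false alarms.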
Level 4
Expert: beyond the basics
Clustering, neighbours, trees, scaling, cross-validation, real code, and a final challenge.
Unsupervised · clustering
k-means: find groups with no labels at all.
You only have points, no answers. k-means drops \(k\) centre points
(centroids), then repeats two steps: assign each point to its nearest centroid,
and move each centroid to the average of its points. Repeat until nothing moves.
Inertia = how tight
We score a clustering by inertia: the total squared distance from every point to
its centroid. Lower inertia = tighter, better-placed clusters.
Game · "Place the centroids" (k-means)
Click on the canvas to drop \(k\) centroids where you think the cluster centres
are. Then press Run iteration to do one assign-and-update step, or Auto-run
to converge. Try to reach the lowest inertia, and beat your best.
Supervised · neighbours
k-NN: you are who your neighbours are.
To classify a new point, k-nearest-neighbours looks at the \(k\) closest known
points and takes a majority vote. No training at all, just remember the data and
measure distances at prediction time.
Choosing k
k = 1: follows every point exactly, so it is jumpy and sensitive to noise.
large k: smoother boundary, but can ignore small real clusters.
odd k: avoids ties in two-class votes.
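A minimal k-NN sketch, assuming Euclidean distance and a plain majority vote. The points are invented, with one stray red point planted inside blue territory to show why \(k=1\) is jumpy.

```python
from collections import Counter
import math

def knn_predict(query, points, labels, k):
    """Majority vote among the k points closest to the query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up data: a blue cluster near the origin, a red cluster far away,
# and one stray red point sitting close to the blue cluster.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (1.5, 1.5)]
labels = ["blue", "blue", "blue", "red", "red", "red"]
query = (1.4, 1.4)

for k in (1, 3, 5):
    print(k, knn_predict(query, points, labels, k))
```

With \(k=1\) the stray red neighbour decides alone; with \(k=3\) or \(k=5\) the surrounding blue points outvote it.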
Game · "Vote of the neighbours" (k-NN)
The big hollow dot is a query with unknown class. Drag it around. The \(k\)
nearest known points are circled and they vote on its colour. Change \(k\) and
watch the prediction flip near the boundary.
Supervised · the widest street
Support vector machines: find the boundary with the widest margin.
Many lines can separate two classes, but which is best? An SVM picks the one
that leaves the widest empty street between the classes. The points that touch
the edges of that street are the support vectors; they alone define the boundary.
A fatter margin generalises better to new points.
The margin
For a line \(w^\top x + b = 0\), the margin width is \(\dfrac{2}{\lVert w\rVert}\). SVM
maximises that while keeping every point on its correct side: a "maximum-margin"
classifier.
\[ \min_{w,\,b}\;\; \tfrac{1}{2}\lVert w\rVert^2 \;+\; C\sum_{i=1}^{n}\max\bigl(0,\; 1 - y_i(w^\top x_i + b)\bigr) \]
In words: real data rarely splits cleanly, so the soft-margin SVM lets points violate the street for a price. The first term widens the margin (small \(\lVert w\rVert\)); the hinge loss term charges for every point inside the margin or on the wrong side. The knob \(C\) sets the exchange rate between the two.
\(\max(0,\,1 - y_i(w^\top x_i + b))\) - hinge loss: zero once a point is safely past its margin, growing linearly as it crosses
\(y_i \in \{-1,+1\}\) - the label, so a correct, confident point makes \(y_i(w^\top x_i+b) \ge 1\)
small \(C\) - tolerate violations: wider, smoother margin, more bias, less variance
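The hinge loss and the soft-margin objective are easy to evaluate by hand. The sketch below assumes the common form \(\tfrac12\lVert w\rVert^2 + C\sum_i \text{hinge}_i\) and only scores a fixed, hand-picked line on made-up points; it does not optimise anything.

```python
def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw score w^T x + b."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, xs, ys, C):
    """0.5 * ||w||^2 plus C times the total hinge loss (assumed standard form)."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    total_hinge = 0.0
    for x, y in zip(xs, ys):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        total_hinge += hinge(y, score)
    return margin_term + C * total_hinge

# Made-up points: two safely classified, one inside the margin, one on the wrong side.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 1.0)]
ys = [+1, +1, -1, -1]
w, b = (1.0, 0.0), 0.0   # hand-picked line x1 = 0, not a fitted one

for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, b, xs, ys, C))
```

The same violations cost little when \(C\) is small and dominate the objective when \(C\) is large, which is why a small \(C\) tolerates a wider, sloppier street.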
Game · "Widen the street" (max-margin SVM)
Two classes sit on the plane. Rotate and shift a separating line (or drag on
the chart) so it splits them and leaves the widest possible gap. The shaded band
is your margin; the circled points are the nearest on each side (the support vectors).
Score the widest correct margin, then hit Optimal to see the SVM's answer.
Unsupervised · dimensionality reduction
PCA: squash 2D into 1D, keeping as much spread as possible.
Sometimes two features are really telling you one thing. Principal component
analysis finds the single direction along which the data varies most (the
principal axis) and projects every point onto it. You trade a dimension for
simplicity while keeping the maximum possible variance (information).
Variance captured
Project onto a direction \(u\); the captured variance is \(u^\top \Sigma\, u\). PCA picks
the \(u\) that maximises it: the top eigenvector of the covariance \(\Sigma\). Spin the
line below and watch the captured fraction peak exactly on that axis.
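In two dimensions the top eigenvector has a closed form, so the "spin the line" experiment can be checked numerically. The sketch assumes a direction \(u = (\cos\theta, \sin\theta)\) and uses made-up points stretched along the diagonal.

```python
import math

def covariance_2d(points):
    """Return (sxx, sxy, syy), the entries of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def captured_variance(angle, sxx, sxy, syy):
    """u^T Sigma u for the unit direction u = (cos(angle), sin(angle))."""
    ux, uy = math.cos(angle), math.sin(angle)
    return ux * ux * sxx + 2 * ux * uy * sxy + uy * uy * syy

def principal_angle(sxx, sxy, syy):
    """Angle of the top eigenvector of a 2x2 covariance matrix."""
    return 0.5 * math.atan2(2 * sxy, sxx - syy)

points = [(-2, -1.9), (-1, -1.1), (0, 0.1), (1, 0.9), (2, 2.0)]   # made-up blob
sxx, sxy, syy = covariance_2d(points)
best_angle = principal_angle(sxx, sxy, syy)
best_fraction = captured_variance(best_angle, sxx, sxy, syy) / (sxx + syy)
print(round(math.degrees(best_angle), 1), round(best_fraction, 4))
```

Sweeping `captured_variance` over other angles never beats `best_angle`, which is the peak the interactive readout is hunting for.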
Interactive · project onto the principal axis
The cloud is a stretched 2D blob. Rotate the projection line; each point drops a
foot onto it (its 1D coordinate). The readout shows what fraction of the total spread that
single line captures. Hit Snap to PC1 to jump to the optimal direction PCA would
choose.
Scree chart (right): PCA's two axes ranked by the spread they hold, and PC1 always captures the most. The marker shows what fraction of PC1's variance your current line is reaching.
The best line maximises the spread of the projected feet.
Wisdom of the crowd
Ensembles & bagging: many weak models vote into a strong one.
One shallow tree is jagged and unreliable. But train many of them, each on a
slightly different random sample of the data, and let them vote: the wobbles
cancel out and a smooth, stable boundary emerges. That's bagging (a random forest
is bagged trees plus a random subset of features at each split). The crowd beats any single member.
Why it works
Averaging independent, noisy guesses keeps the signal and cancels the noise; it cuts
variance without adding much bias. More voters → a smoother, steadier decision
boundary.
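The variance-cutting claim can be checked with a toy simulation. Each "voter" here is just the truth plus independent Gaussian noise, which is an idealisation: real bagged models are trained on overlapping resamples, so their errors are correlated and the reduction is smaller than this sketch suggests.

```python
import random
import statistics

random.seed(1)
TRUTH = 10.0   # the quantity every voter is trying to guess (made up)

def noisy_guess():
    """One weak voter: the truth plus independent noise."""
    return TRUTH + random.gauss(0, 3.0)

def crowd_guess(voters):
    """Average the guesses of several independent voters."""
    return statistics.mean([noisy_guess() for _ in range(voters)])

single = [crowd_guess(1) for _ in range(2000)]
crowd = [crowd_guess(25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_crowd = statistics.pvariance(crowd)
print(round(var_single, 2), round(var_crowd, 2))
print(round(statistics.mean(single), 2), round(statistics.mean(crowd), 2))
```

Both averages stay near the truth (no extra bias), while the spread of the 25-voter crowd is a small fraction of a single voter's.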
Animation · weak models vote into a smooth boundary
Each faint line is one weak classifier trained on a random resample of the data. Add more
voters and watch their jagged individual guesses average into the bold, smooth
ensemble boundary. The background tint shows how confidently the crowd votes each region.
One voter is jagged; the crowd is smooth.
Expert · how a tree carves space
Decision-tree splitting: slicing the plane into pure boxes.
A decision tree classifies by asking yes/no questions like "is \(x > 0.4\)?" Each
question is an axis-aligned cut that splits a region in two. Deeper trees keep
cutting until each box is (nearly) one colour. The depth slider animates the cuts
appearing, and shows how a deep tree can carve tiny boxes around single points
(overfitting).
Greedy & axis-aligned
Each split picks the cut that best separates the classes (purest children).
Shallow tree: few big boxes, simple, may underfit.
Deep tree: many tiny boxes, can memorise noise.
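One greedy split can be sketched on a single feature. Gini impurity is assumed here as the purity measure (the text does not name one), and the data is made up so that a clean cut exists.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0 means perfectly pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(xs, labels):
    """Try a cut midway between each pair of neighbours; keep the purest children."""
    pairs = sorted(zip(xs, labels))
    best_score, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue   # no cut fits between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score, best_cut = score, cut
    return best_score, best_cut

xs = [0.1, 0.2, 0.3, 0.5, 0.6, 0.9]   # made-up feature values
labels = [0, 0, 0, 1, 1, 1]
score, cut = best_split(xs, labels)
print(score, round(cut, 2))
```

A full tree repeats this search inside each child region, over every feature, until a depth limit or pure leaves stop it.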
Interactive · grow the tree, watch the regions split
Two coloured classes sit on the plane. Raise the depth and the tree recursively
cuts the space into axis-aligned regions, each shaded by its majority class. Watch purity
climb toward 100%, and notice the boxes getting suspiciously small at high depth.
Each level can at most double the number of regions.
Decision trees & complexity
A tree asks yes/no questions, splitting the data until each leaf is pure. Deeper trees fit more, and overfit more. Drag the depth.
Feature scaling
If one feature spans 0 to 1000 and another 0 to 1, distance-based models obsess over the big one. Standardize: \(x' = \dfrac{x-\mu}{\sigma}\). Toggle scaling.
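Standardization is a one-liner per column. The sketch below applies it to the size and beds columns from the tiny house table; it uses the population standard deviation and assumes the column is not constant (a constant column would divide by zero).

```python
import statistics

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1: x' = (x - mu) / sigma."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)   # population std; must be non-zero
    return [(x - mu) / sigma for x in column]

sizes = [60.0, 95.0, 120.0]   # size (m²) from the tiny house table
beds = [2.0, 3.0, 4.0]        # beds from the same table

scaled_sizes = standardize(sizes)
scaled_beds = standardize(beds)
print([round(v, 2) for v in scaled_sizes])
print([round(v, 2) for v in scaled_beds])
```

After scaling, both columns live on the same scale, so a distance-based model no longer lets square metres drown out bedrooms.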
k-fold cross-validation
Instead of one test set, split data into \(k\) folds. Train on \(k-1\), test on the held-out fold, rotate, and average. A more honest score.
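The fold rotation can be sketched as an index generator. This version keeps examples in their original order for readability; in practice the data is usually shuffled first. The function name is illustrative, not a library API.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n=10, k=3))
for train_idx, test_idx in folds:
    print("train", train_idx, "test", test_idx)
```

To score a model, fit it on each `train_idx`, evaluate on the matching `test_idx`, and average the \(k\) scores.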
The same ideas in real code
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# X is the feature matrix and y the targets (assumed to be loaded already)
# split: keep 20% hidden to test generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# regression: predict a number (y_hat = w·x + b); y must be numeric
reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))        # R^2 on unseen data

# classification: predict a class via the sigmoid; here y must hold class labels
clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])    # probabilities, not just labels
import torch
import torch.nn as nn

# assumes X_train is a float tensor of shape (n, n_features) and y_train of shape (n, 1)
model = nn.Linear(n_features, 1)                     # y_hat = w·x + b
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()                               # regression loss

for epoch in range(200):
    opt.zero_grad()                                  # clear old gradients
    y_hat = model(X_train)                           # forward pass
    loss = loss_fn(y_hat, y_train)                   # how wrong?
    loss.backward()                                  # gradients of loss w.r.t. w, b
    opt.step()                                       # w <- w - lr * grad (gradient descent)

# for classification: swap to nn.BCEWithLogitsLoss, which applies the sigmoid itself,
# so keep the raw linear output and add no separate sigmoid layer
train_test_split - hides part of the data so the score reflects unseen examples, not memorisation.
LinearRegression / nn.Linear - both fit \(\hat y = wx + b\); sklearn solves it directly, PyTorch via gradient descent.
LogisticRegression - squashes the line through a sigmoid into a probability, then thresholds for a class.
the training loop - forward, measure loss, backpropagate gradients, take a descent step, repeat.
Worked example: one k-means step by hand
Six points on a line: \(\{1,\,2,\,3,\,9,\,10,\,11\}\), and \(k=2\). Start the centroids badly
at \(c_1 = 2\) and \(c_2 = 4\).
Assign: each point joins its nearest centroid. Points \(1,2\) are closer to \(c_1=2\);
point \(3\) is equidistant (\(|3-2| = |3-4| = 1\)), and we break the tie in favour of \(c_1\);
points \(9,10,11\) are closer to \(c_2=4\) (e.g. \(|9-4|=5 < |9-2|=7\)). So
\(A_1=\{1,2,3\}\), \(A_2=\{9,10,11\}\).
Move: each centroid jumps to the mean of its members:
\(c_1 = \tfrac{1+2+3}{3} = 2\), \(c_2 = \tfrac{9+10+11}{3} = 10\).
Check: re-assigning with \(c_1=2,\,c_2=10\) changes nothing, so the algorithm has
converged in one update. Final inertia
\(= (1{-}2)^2+(2{-}2)^2+(3{-}2)^2 + (9{-}10)^2+(10{-}10)^2+(11{-}10)^2 = 4\).
Each k-means step can only lower inertia or leave it unchanged, and there are finitely many ways to assign points to clusters, so the algorithm always converges, but only to a local minimum that depends on the start. Bad seeds give bad clusters; that is why practitioners use k-means++ seeding (spread the initial centroids apart) and several restarts, keeping the lowest-inertia run.
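The worked example can be replayed in code. This is a one-dimensional sketch with hypothetical helper names; ties go to the lower-numbered centroid, and an empty cluster simply keeps its old centroid.

```python
def kmeans_step(points, centroids):
    """One assign-and-move step of k-means on 1D points."""
    clusters = [[] for _ in centroids]
    for p in points:
        # min() returns the first index on a tie, i.e. the lower-numbered centroid.
        nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        clusters[nearest].append(p)
    new_centroids = [
        sum(cluster) / len(cluster) if cluster else centroids[j]
        for j, cluster in enumerate(clusters)
    ]
    return clusters, new_centroids

def inertia(clusters, centroids):
    """Total squared distance from every point to its cluster's centroid."""
    return sum((p - c) ** 2 for cluster, c in zip(clusters, centroids) for p in cluster)

points = [1, 2, 3, 9, 10, 11]
clusters, centroids = kmeans_step(points, [2, 4])
print(clusters, centroids, inertia(clusters, centroids))

# A second step changes nothing, so the algorithm has converged.
clusters_again, centroids_again = kmeans_step(points, centroids)
print(centroids_again == centroids)
```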
Challenge · earn the Machine Learning badge
Frontier · research-grade
The theory under classical ML.
The levels above are the practitioner's toolkit. This tier is the why: when does fitting
the data generalize, what error is reducible, how kernels buy nonlinearity for free, how boosting
turns weak rules strong, and why k-means always converges. Each topic is a guided lesson with
step-through proofs, a worked example, a visualization, and citations.