Machine Learning

Don't write the rules. Let the machine learn them from examples.

Old-school software is a stack of hand-written if/else rules. Machine learning flips that: you show the computer lots of examples and it discovers the pattern by itself. This is the Data Workshop: a bench where you feed in examples and watch a model take shape.

By the end you will be able to:
Tell rules from learning
Read a dataset
Fit a line & a boundary
Pick the right loss
Spot overfitting
Cluster & classify

Level 1

Beginner: what is "learning from data"?

Plain words first. No math, no jargon. Just examples, and a machine that notices patterns.

The big idea

Rules are written by hand. Learning is grown from examples.

Suppose you want to tell cats from dogs in photos. With rules you'd try to write them yourself: "if pointy ears and whiskers then cat...", and you'd fail fast, because there are endless exceptions. With machine learning you instead show the computer thousands of labelled photos and let it figure out the pattern.

Recipe vs. tasting analogy

A rules program is a fixed recipe. A learner is a chef who tastes many dishes and learns what "good" means, then can judge a brand-new dish it has never seen.

When to use which

Rules win: "Is this number even?" A one-line rule is perfect and exact.
Learning wins: "Is this email spam?" Too many fuzzy patterns to hand-write.
Rules win: converting Celsius to Fahrenheit, a known formula.
Learning wins: recognising a face, or predicting tomorrow's sales.

Rule of thumb: if you can write the exact rule, do. If the pattern is fuzzy and you have lots of examples, learn it.

Game · "Rules or Learn?"

Read each scenario and decide: would you solve it with hand-written rules, or by learning from examples? Click your answer for instant feedback.

The raw material

Examples, features, and labels: the words of the workshop.

An example is one thing you know about (one house, one email). Its features are the measurable clues (size, bedrooms, location). The label is the answer you want to predict (the price). A pile of examples, each with features and a label, is a dataset.

Good features carry signal about the label. Useless features are noise; they only distract the model.

A tiny house dataset

size (m²)	beds	shoe size of owner	price (k)
60	2	42	180
95	3	38	270
120	4	44	340

Two columns are features that predict price; one is pure noise. Can you spot the useless one? Play the game below.

Game · "Build the dataset"

You're predicting house price. Toggle the features you think actually predict the price. Useful ones boost your signal; noise drags it down. Aim for the strongest predictive set.

Pick the features that carry real signal about price.

Three flavours of learning

Supervised, unsupervised, reinforcement.

Supervised: you have labelled examples (questions and answers). The model learns to predict the answer.

Unsupervised: no labels, just data. The model finds structure on its own, like grouping similar things together.

Reinforcement: an agent acts in a world and learns from rewards and penalties, trying to score as much as possible over time.

Everyday picture

Supervised: flashcards with the answer on the back.
Unsupervised: sorting your laundry into piles by colour, with nobody telling you the "right" piles.
Reinforcement: training a dog with treats. Good move, treat; bad move, no treat.

Game · "Which kind of learning?"

Sort each real scenario into the right family. Instant feedback explains the why.

Worked example: a spam filter as ML

We collect 10,000 emails, each marked spam or not spam (labels). Features might be: count of the word "free", number of links, ALL-CAPS ratio. The model reads all 10,000 and learns weights for each feature. On a brand-new email it adds up the evidence and outputs a probability of spam.

This is supervised classification, exactly what we build hands-on in the next level.

How learning actually happens

The learning loop: guess, check the error, nudge, repeat.

A model doesn't get it right in one shot. It runs a little loop: take some data, let the model make a prediction, compare that to the right answer to get an error, then update the model a tiny bit to shrink that error. Round and round, the error gets smaller and the model gets smarter.

That's the whole secret of "learning from examples": there's no magic, just thousands of tiny corrections.

One trip around the loop

Data: a house that really sold for $300k.
Prediction: the model guesses $250k.
Error: it was $50k too low.
Update: nudge the weights up a touch, and the next guess is closer.

Animation · the learning loop

Watch a packet of evidence travel the loop: data → model → prediction → error → update, then back to the start. Each lap, the error bar on the right shrinks a little: that's the model improving. Press pause if you want to study a single stage.

Every full lap = one round of training. The error never quite hits zero, and that's fine.

Level 2

Intermediate: supervised learning, hands on

Now we actually fit models: a line for numbers, a boundary for classes.

Two jobs

Regression predicts a number. Classification predicts a category.

Regression: the answer is a quantity on a scale, like a price, a temperature, a height. "How much?"

Classification: the answer is one of a few buckets: spam/not-spam, cat/dog, yes/no. "Which one?"

Spot the job

Regression: predict a house's price in dollars.
Regression: predict tomorrow's temperature.
Classification: is this tumour benign or malignant?
Classification: which of 10 digits is this handwritten number?

Game · "Fit the line" (linear regression)

The model is the straight line $\hat y = wx + b$. Drag the slope and intercept sliders (or drag the line's endpoints on the chart) to push the line through the cloud of points and minimise the error. Lower MSE is better. Then hit Auto-fit to watch gradient descent solve it.

Game · "Draw the boundary" (classification)

Two classes of points sit on the plane. Move and rotate a straight dividing line so that one class lands on each side. Your score is the share of points classified correctly. Can you beat 95%?

Tip: drag on the chart to move the line directly.

\[ z \;=\; w^\top x + b, \qquad \sigma(z) \;=\; \frac{1}{1 + e^{-z}}, \qquad \hat y \;=\; \begin{cases} 1 & \sigma(z) \ge 0.5 \\ 0 & \sigma(z) < 0.5 \end{cases} \]

In words: the line score $z = w^\top x + b$ is a raw number from $-\infty$ to $+\infty$. The sigmoid $\sigma$ squashes it into a probability between 0 and 1. We then call it class 1 when that probability passes a threshold (0.5 by default). This is logistic regression, the exact rule the "Auto-fit (logistic)" button above is solving for.

$z = w^\top x + b$ - raw score, proportional to the signed distance from the boundary; $z=0$ is the boundary line itself
$\sigma(z)$ - sigmoid: $\sigma(0)=0.5$, $\sigma(+\infty)\to 1$, $\sigma(-\infty)\to 0$
threshold 0.5 - move it up to favour precision, down to favour recall (see the threshold demo in Level 3)
training - fit $w,b$ by minimising the binary cross-entropy loss defined in Level 3
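
The score, sigmoid, and threshold steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the page's own implementation; the weights `w` and `b` are hypothetical values picked so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    """Squash a raw score z into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return (probability, class) for one feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # line score z = w^T x + b
    p = sigmoid(z)
    return p, (1 if p >= threshold else 0)

# Hypothetical weights, chosen only for illustration.
w, b = [2.0, -1.0], 0.5

p_on_boundary, _ = predict([0.0, 0.5], w, b)    # z = 0, so the probability is 0.5
p_far, label_far = predict([3.0, 0.0], w, b)    # z = 6.5, probability close to 1
```

A point sitting on the line gets probability 0.5; the further it moves to the positive side, the closer the probability gets to 1.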

Don't fool yourself

Train on some data, test on data it has never seen.

If you let a model memorise the answers, it'll ace the questions it studied and flop on new ones. So we hide a slice of data (the test set), train on the rest, and judge honestly on the hidden slice. That measures generalization.

Train vs test accuracy

Toggle "memorise hard" and watch train accuracy stay high while test accuracy collapses: the classic overfit gap.

Model flexibility: low = simple model; high = memoriser.

How "Auto-fit" really works

Gradient descent: roll downhill on the error landscape.

Every choice of slope $w$ and intercept $b$ gives some error. Plot that error for all choices and you get a bowl-shaped loss surface. Training just means standing somewhere on that bowl and repeatedly stepping downhill (in the direction the slope drops fastest) until you reach the bottom: the best line.

The step size is the learning rate. Too small and you crawl; too big and you bounce around or even fly out of the bowl. Watch both failure modes below.

The downhill rule

Each step nudges the weights opposite the gradient (the uphill direction): $\;w \leftarrow w - \eta\,\dfrac{\partial L}{\partial w}$. Big gradient → big step; near the bottom the gradient flattens and the steps shrink automatically.

η too small: tiny steps, takes forever.
η just right: smooth glide to the minimum.
η too big: overshoots, zig-zags, can diverge.
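
The three learning-rate regimes can be reproduced on a toy problem. The sketch below assumes a one-parameter loss $L(w) = (w-3)^2$ (chosen for illustration, not the loss used in the animation) and runs the update rule with three values of η.

```python
def descend(eta, steps=20, w=0.0):
    """Gradient descent on the toy loss L(w) = (w - 3)^2, minimum at w = 3."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # dL/dw
        w = w - eta * grad       # step opposite the gradient
    return w

crawl = descend(eta=0.01)    # tiny steps: still far from 3 after 20 steps
glide = descend(eta=0.4)     # lands essentially on the minimum
diverge = descend(eta=1.1)   # every step overshoots further than the last
```

On this loss each step multiplies the distance to the minimum by $(1 - 2\eta)$, so any η above 1 makes that factor larger than 1 in size and the iterates fly out of the bowl.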

Animation · gradient descent on the loss surface

The heat-map is the error for every $(w,b)$: dark valley = low error, bright rim = high. Click anywhere to drop the ball there, then watch it roll downhill in real time. Change the learning rate and re-run to see it crawl, glide, or overshoot.

Worked example: predicting a tip

We learn $\hat y = wx + b$ where $x$ is the bill and $y$ the tip. After training we get $w=0.18,\; b=0.5$. For a $\$50$ bill: $\hat y = 0.18(50) + 0.5 = \$9.5$. For a $\$20$ bill: $\hat y = 0.18(20) + 0.5 = \$4.1$. The slope $w$ is "tip per dollar", the intercept $b$ is a base tip.

Training is just finding the $w$ and $b$ that make these predictions as close to real tips as possible, which is exactly the MSE you minimised in the game.
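
The tip arithmetic above, as a short sketch. The fitted values $w=0.18$ and $b=0.5$ come from the example; the two "actual" tips are made-up numbers used only to show how the MSE is computed.

```python
def predict_tip(bill, w=0.18, b=0.5):
    """Linear model from the worked example: tip = w * bill + b."""
    return w * bill + b

def mse(preds, targets):
    """Mean squared error between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

bills = [50.0, 20.0]
preds = [predict_tip(x) for x in bills]   # about 9.5 and 4.1

# Hypothetical observed tips, for illustration only.
actual_tips = [10.0, 4.0]
error = mse(preds, actual_tips)           # (0.5^2 + 0.1^2) / 2 = 0.13
```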

Level 3

Advanced: the real math

Loss functions, the bias-variance trade-off, regularization, and metrics, typeset properly.

How we measure "wrong"

Two loss functions you'll meet everywhere.

Training is minimising a loss. Regression usually uses Mean Squared Error; classification uses cross-entropy.

\[ L \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat y_i - y_i\bigr)^2 \]

In words: for each example, measure how far the prediction missed, square it (so big misses count extra and signs can't cancel), then average over all examples. This is Mean Squared Error (MSE).

$n$ - number of examples
$\hat y_i$ - the model's prediction for example $i$
$y_i$ - the true target for example $i$
$(\hat y_i - y_i)^2$ - squared error of one example

\[ L \;=\; -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i)\,\Bigr] \]

In words: when the true label is 1 it punishes a low predicted probability; when the true label is 0 it punishes a high one. Confident-and-wrong is penalised harshly. This is binary cross-entropy (log loss).

$\hat y_i$ - predicted probability the label is 1 (between 0 and 1)
$y_i$ - the true label, 0 or 1
$\log$ - natural log; $\log\hat y_i$ is very negative when $\hat y_i$ is tiny
the minus sign - flips it so smaller loss = better
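
Both losses fit in a few lines of plain Python. This is a sketch of the two formulas above; the `eps` clipping is an added assumption to keep $\log(0)$ from blowing up, and is not part of the definitions.

```python
import math

def mse(y_pred, y_true):
    """Mean squared error: average of squared misses."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    """Log loss for predicted probabilities p_pred and 0/1 labels y_true."""
    total = 0.0
    for p, y in zip(p_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)   # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

mse_example = mse([2.0, 4.0], [1.0, 6.0])              # (1 + 4) / 2 = 2.5
confident_right = binary_cross_entropy([0.99], [1])    # small loss
confident_wrong = binary_cross_entropy([0.01], [1])    # large loss
```

The last two lines show the "confident-and-wrong" penalty: the same true label costs far more when the model puts 0.01 on it than when it puts 0.99.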

See the loss penalties

Left: MSE grows like a bowl as a single prediction drifts from its true value. Right: cross-entropy shoots toward infinity as a confident probability heads the wrong way.

Slider: prediction $\hat y$; the true label is fixed at 1.

The central trade-off

Underfit, overfit, and the sweet spot.

A model too simple underfits (high bias: misses the real pattern). A model too flexible overfits (high variance: memorises noise). The art is the middle.

Game · "Pick the model"

Slide the polynomial degree from 1 (a straight line) to 12 (a wild squiggle). Watch the fit on the data, plus train vs validation error. Pick the degree that generalizes best (lowest validation error) and lock it in to score.

At polynomial degree 1 the model underfits: the line is too stiff.

\[ L_{\text{reg}} \;=\; L \;+\; \lambda \lVert w \rVert_2^2 \;=\; L \;+\; \lambda \sum_j w_j^2 \]

In words: add a penalty for big weights to the loss. The model now has to justify every large weight with a real drop in error, so it stays smoother and overfits less. This is L2 (ridge) regularization.

$L$ - the original data loss (e.g. MSE)
$\lambda$ - regularization strength: bigger = simpler, smoother model
$\lVert w \rVert_2^2 = \sum_j w_j^2$ - squared size of the weight vector (L2)
L1 variant - use $\lambda\sum_j |w_j|$ to push weights to exactly 0 (sparsity)
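
To see the penalty shrink a weight, here is a minimal sketch for the simplest possible case: one feature, no intercept, MSE as the data loss. Under those assumptions, setting the derivative of $L_{\text{reg}}$ to zero gives the closed form $w = \sum x_i y_i \,/\, (\sum x_i^2 + n\lambda)$. The data points are invented for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Minimiser of (1/n) * sum((w*x - y)^2) + lam * w^2 for a single weight w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

# Toy data lying exactly on y = 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = ridge_slope(xs, ys, lam=0.0)    # 2.0: no penalty, exact fit
w_mild = ridge_slope(xs, ys, lam=1.0)     # 28 / 17, pulled toward 0
w_strong = ridge_slope(xs, ys, lam=10.0)  # 28 / 44, pulled much further
```

Raising $\lambda$ never flips the sign of the weight here; it only drags it toward zero, which is the "simpler, smoother model" effect in one dimension.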

Interactive · regularization smooths the fit

A degree-9 curve is fitted to noisy data. Raise $\lambda$ and watch the wild wiggles relax into a calm, general curve.

λ = 0 lets the curve chase every point (overfit).

Judging a classifier

Accuracy isn't enough. Meet precision, recall & F1.

\[ \text{Precision}=\frac{TP}{TP+FP},\quad \text{Recall}=\frac{TP}{TP+FN},\quad F_1=\frac{2\,\text{P}\cdot\text{R}}{\text{P}+\text{R}} \]

In words: precision asks "of the things I flagged positive, how many were right?"; recall asks "of all the real positives, how many did I catch?"; F1 is their harmonic mean, high only when both are high.

$TP$ - true positives (correctly flagged positive)
$FP$ - false positives (flagged positive but actually negative)
$FN$ - false negatives (missed a real positive)
$F_1$ - balances precision ($\text{P}$) and recall ($\text{R}$) in one number

Interactive · drag the threshold

Each dot is an example with a predicted score; reds are truly positive, greys truly negative. Slide the decision threshold: everything to the right is called "positive". Watch the confusion matrix and the metrics update live.

Low threshold = catch everything (high recall, low precision). High threshold = be picky (high precision, lower recall).

Every threshold at once

The ROC curve & AUC: judge a classifier across all thresholds.

A single threshold gives one precision/recall pair. But which threshold? The ROC curve sweeps the threshold from high to low and plots the true-positive rate against the false-positive rate at each step. The area under that curve (AUC) is one tidy number: the chance the model ranks a random positive above a random negative. 0.5 is a coin flip; 1.0 is perfect.

\[ \text{TPR}=\frac{TP}{TP+FN},\qquad \text{FPR}=\frac{FP}{FP+TN},\qquad \text{AUC}=\int_0^1 \text{TPR}\,\mathrm{d(FPR)} \]

In words: the true-positive rate is the share of real positives we catch; the false-positive rate is the share of real negatives we wrongly flag. As we lower the threshold both rise; the ROC curve traces that trade-off, and AUC is the total area beneath it.

TPR - true-positive rate (a.k.a. recall / sensitivity)
FPR - false-positive rate (1 − specificity)
AUC - area under the ROC curve: 0.5 is random, 1 is perfect (values below 0.5 mean the ranking is worse than chance)
the diagonal - a model with no skill (AUC = 0.5)
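
The "random positive ranked above a random negative" reading of AUC can be computed directly by checking every positive/negative pair. The sketch below does that on a handful of made-up scores; counting ties as half a win is the usual convention and is an assumption here.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Illustrative scores: one positive (0.4) is ranked below one negative (0.7).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_by_ranking(scores, labels)                              # 8 of 9 pairs
perfect = auc_by_ranking(scores, [1, 1, 1, 0, 0, 0])              # 1.0
```

This pairwise count is quadratic in the number of examples, so it suits small demos; it gives the same number as the area under the ROC curve.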

Animation · sweep the threshold, trace the ROC

Left: the same scored examples, with a threshold sliding from right (catch nothing) to left (catch everything). Right: each threshold plots one point of the ROC curve, and the area underneath fills in as the curve is drawn. Drag the threshold yourself, or press Sweep to animate the whole curve. A harder dataset (more overlap) pulls the curve toward the diagonal and drops AUC.

Lower separation = more class overlap = AUC drops toward 0.5.

A picture of the trade-off

Bias vs variance: the dartboard.

Bias is being consistently off-target (your shots cluster, but in the wrong place). Variance is being scattered (your shots spray everywhere, even if they average out near the centre). The four dartboards below animate every combination, and the goal is the bottom-left: low bias and low variance.

Animation · the bias-variance dartboard

Each board is a different model. Watch the darts (repeated trainings on fresh data) land: tight-and-centred is ideal; tight-but-off is high bias; spread-but-centred is high variance; spread-and-off is the worst of both.

More flexible models usually cut bias but raise variance: the central tension of ML.

\[ \mathbb{E}\bigl[(y - \hat f(x))^2\bigr] \;=\; \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\bigl[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\bigr]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}} \]

In words: the dartboards above are this formula made visible. Expected squared error splits into exactly three pieces: how far off-centre the model is on average (bias), how much it jitters from one training set to the next (variance), and irreducible noise in the data you can never fit away. Lowering one usually raises another; that is the trade-off.

$f(x)$ - the true function; $\hat f(x)$ is what your model learned
$\text{bias}^2$ - squared gap between the average prediction and the truth (too-simple models)
variance - how much $\hat f$ swings as the training data changes (too-flexible models)
$\sigma^2$ - noise floor; even a perfect model cannot beat it

Diagnose with data size

Learning curves: does more data help?

Plot training and validation error as the training set grows. Two telltale shapes appear. High bias: both curves flatten at a high error and hug each other, so more data won't help, you need a richer model. High variance: a big gap between low training error and high validation error, and more data will help close it.

Animation · learning curves grow with the dataset

Drag the model from high bias (too simple) to high variance (too flexible) and watch the two error curves redraw as the dataset grows from a handful of points to many. Read the gap: a stubborn gap means more data helps; converged-but-high means you've hit the model's ceiling.

Low capacity = high bias; high capacity = high variance.

Worked example: precision vs recall by hand

A test flags 100 emails as spam. 80 really are spam ($TP=80$); 20 were innocent ($FP=20$). It also missed 40 real spams ($FN=40$).

Precision $=\dfrac{80}{80+20}=0.80$. Recall $=\dfrac{80}{80+40}=0.667$. $F_1=\dfrac{2(0.8)(0.667)}{0.8+0.667}\approx 0.727$.

It's precise (few false alarms) but misses a third of real spam (lower recall). Lowering the threshold would catch more spam at the cost of more false alarms.
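
The same hand calculation as a small function, using the counts from the example. It is a sketch: it does not guard against the zero-denominator case where nothing is flagged or no positives exist.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the worked example: TP = 80, FP = 20, FN = 40.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```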

Level 4

Expert: beyond the basics

Clustering, neighbours, trees, scaling, cross-validation, real code, and a final challenge.

Unsupervised · clustering

k-means: find groups with no labels at all.

You only have points, no answers. k-means drops $k$ centre points (centroids), then repeats two steps: assign each point to its nearest centroid, and move each centroid to the average of its points. Repeat until nothing moves.

Inertia = how tight

We score a clustering by inertia: the total squared distance from every point to its centroid. Lower inertia = tighter, better-placed clusters.

Game · "Place the centroids" (k-means)

Click on the canvas to drop $k$ centroids where you think the cluster centres are. Then press Run iteration to do one assign-and-update step, or Auto-run to converge. Try to reach the lowest inertia, and beat your best.

Supervised · neighbours

k-NN: you are who your neighbours are.

To classify a new point, k-nearest-neighbours looks at the $k$ closest known points and takes a majority vote. No training at all, just remember the data and measure distances at prediction time.

Choosing k

k = 1: follows every point exactly, so it is jumpy and sensitive to noise.
large k: smoother boundary, but can ignore small real clusters.
odd k: avoids ties in two-class votes.
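
A minimal k-NN sketch in plain Python, assuming Euclidean distance and a simple majority vote. The points and colour labels are invented for illustration.

```python
from collections import Counter

def knn_predict(query, points, labels, k=3):
    """Classify query by majority vote among its k nearest known points."""
    def sq_dist(i):
        # Squared Euclidean distance; the ordering matches true distance.
        return sum((a - b) ** 2 for a, b in zip(points[i], query))
    nearest = sorted(range(len(points)), key=sq_dist)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

near_red = knn_predict((1.2, 1.4), points, labels, k=3)
near_blue = knn_predict((6.4, 6.2), points, labels, k=3)
```

There is no fitting step anywhere: all the work happens inside `knn_predict`, at prediction time.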

Game · "Vote of the neighbours" (k-NN)

The big hollow dot is a query with unknown class. Drag it around. The $k$ nearest known points are circled and they vote on its colour. Change $k$ and watch the prediction flip near the boundary.

Supervised · the widest street

Support vector machines: find the boundary with the widest margin.

Many lines can separate two classes, but which is best? An SVM picks the one that leaves the widest empty street between the classes. The points that touch the edges of that street are the support vectors; they alone define the boundary. A fatter margin generalises better to new points.

The margin

For a line $w^\top x + b = 0$, the margin width is $\dfrac{2}{\lVert w\rVert}$. SVM maximises that while keeping every point on its correct side: a "maximum-margin" classifier.

\[ \min_{w,\,b}\;\; \tfrac{1}{2}\lVert w\rVert^2 \;+\; C\sum_{i=1}^{n}\max\!\bigl(0,\; 1 - y_i(w^\top x_i + b)\bigr) \]

In words: real data rarely splits cleanly, so the soft-margin SVM lets points violate the street for a price. The first term widens the margin (small $\lVert w\rVert$); the hinge loss term charges for every point inside the margin or on the wrong side. The knob $C$ sets the exchange rate between the two.

$\max(0,\,1 - y_i(w^\top x_i + b))$ - hinge loss: zero once a point is safely past its margin, growing linearly as it crosses
$y_i \in \{-1,+1\}$ - the label, so a correct, confident point makes $y_i(w^\top x_i+b) \ge 1$
large $C$ - punish violations hard: narrow margin, risk overfitting (toward hard-margin)
small $C$ - tolerate violations: wider, smoother margin, more bias, less variance
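
To make the two terms concrete, this sketch evaluates the soft-margin objective for a fixed, hand-picked line rather than solving the optimisation. The points, `w`, and `b` are illustrative assumptions.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, for labels in {-1, +1}."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge_total += max(0.0, 1.0 - yi * score)   # zero once safely past the margin
    return margin_term + C * hinge_total

# Illustrative data: the middle point sits inside the margin (y * score = 0.5).
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [+1, +1, -1]
w, b = (1.0, 0.0), 0.0

lenient = soft_margin_objective(w, b, X, y, C=1.0)    # 0.5 + 1 * 0.5 = 1.0
strict = soft_margin_objective(w, b, X, y, C=10.0)    # 0.5 + 10 * 0.5 = 5.5
```

The same margin violation costs ten times more under the larger $C$, which is why a large $C$ pushes the optimiser to shrink the street instead of tolerating the intruder.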

Game · "Widen the street" (max-margin SVM)

Two classes sit on the plane. Rotate and shift a separating line (or drag on the chart) so it splits them and leaves the widest possible gap. The shaded band is your margin; the circled points are the nearest on each side (the support vectors). Score the widest correct margin, then hit Optimal to see the SVM's answer.

Unsupervised · dimensionality reduction

PCA: squash 2D into 1D, keeping as much spread as possible.

Sometimes two features are really telling you one thing. Principal component analysis finds the single direction along which the data varies most (the principal axis) and projects every point onto it. You trade a dimension for simplicity while keeping the maximum possible variance (information).

Variance captured

Project onto a unit direction $u$ (with $\lVert u\rVert = 1$); the captured variance is $u^\top \Sigma\, u$. PCA picks the $u$ that maximises it: the top eigenvector of the covariance $\Sigma$. Spin the line below and watch the captured fraction peak exactly on that axis.
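
In 2D the whole computation fits in plain Python. The sketch below assumes the population covariance (dividing by $n$) and uses the closed-form angle of the top eigenvector of a 2×2 symmetric matrix, $\theta = \tfrac{1}{2}\,\mathrm{atan2}(2\sigma_{xy},\, \sigma_{xx}-\sigma_{yy})$. The five points are invented for illustration.

```python
import math

def covariance_2d(points):
    """Return (sxx, sxy, syy) of a list of 2D points, dividing by n."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def captured_variance(theta, cov):
    """u^T Sigma u for the unit direction u = (cos theta, sin theta)."""
    sxx, sxy, syy = cov
    ux, uy = math.cos(theta), math.sin(theta)
    return ux * ux * sxx + 2.0 * ux * uy * sxy + uy * uy * syy

def principal_angle(cov):
    """Angle of the top eigenvector of a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

# Illustrative stretched cloud.
points = [(-2.0, -1.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
cov = covariance_2d(points)
best = captured_variance(principal_angle(cov), cov)
fraction = best / (cov[0] + cov[2])   # share of total variance kept by PC1
```

Sweeping `theta` over a half-turn and calling `captured_variance` at each angle reproduces the "spin the line" experiment numerically.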

Interactive · project onto the principal axis

The cloud is a stretched 2D blob. Rotate the projection line; each point drops a foot onto it (its 1D coordinate). The readout shows what fraction of the total spread that single line captures. Hit Snap to PC1 to jump to the optimal direction PCA would choose.

Scree chart (right): PCA's two axes ranked by the spread they hold, and PC1 always captures the most. The marker shows what fraction of PC1's variance your current line is reaching.

The best line maximises the spread of the projected feet.

Wisdom of the crowd

Ensembles & bagging: many weak models vote into a strong one.

One shallow tree is jagged and unreliable. But train many of them, each on a slightly different random sample of the data, and let them vote: the wobbles cancel out and a smooth, stable boundary emerges. That's bagging; a random forest is bagged decision trees that also consider only a random subset of features at each split. The crowd beats any single member.

Why it works

Averaging independent, noisy guesses keeps the signal and cancels the noise; it cuts variance without adding much bias. More voters → a smoother, steadier decision boundary.

Animation · weak models vote into a smooth boundary

Each faint line is one weak classifier trained on a random resample of the data. Add more voters and watch their jagged individual guesses average into the bold, smooth ensemble boundary. The background tint shows how confidently the crowd votes each region.

One voter is jagged; the crowd is smooth.

Expert · how a tree carves space

Decision-tree splitting: slicing the plane into pure boxes.

A decision tree classifies by asking yes/no questions like "is $x > 0.4$?" Each question is an axis-aligned cut that splits a region in two. Deeper trees keep cutting until each box is (nearly) one colour. The depth slider animates the cuts appearing, and shows how a deep tree can carve tiny boxes around single points (overfitting).

Greedy & axis-aligned

Each split picks the cut that best separates the classes (purest children).
Shallow tree: few big boxes, simple, may underfit.
Deep tree: many tiny boxes, can memorise noise.
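
One greedy split can be sketched in a few lines. "Purest children" needs a purity score; the sketch below assumes Gini impurity, which is a common choice but not one the text above commits to. It searches a single feature, and the data is invented for illustration.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a mixed one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, labels):
    """Try a cut between each pair of neighbouring values; keep the purest."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [lab for x, lab in zip(xs, labels) if x <= t]
        right = [lab for x, lab in zip(xs, labels) if x > t]
        # Weighted impurity of the two children.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Illustrative 1D data with a clean gap between the classes.
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(xs, labels)
```

A full tree would call `best_split` on every feature, take the winner, and then recurse into each child box until a depth limit or pure leaves are reached.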

Interactive · grow the tree, watch the regions split

Two coloured classes sit on the plane. Raise the depth and the tree recursively cuts the space into axis-aligned regions, each shaded by its majority class. Watch purity climb toward 100%, and notice the boxes getting suspiciously small at high depth.

Each extra level can at most double the number of regions.

Decision trees & complexity

A tree asks yes/no questions, splitting the data until each leaf is pure. Deeper trees fit more, and overfit more. Drag the depth.

Feature scaling

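If one feature spans 0 to 1000 and another 0 to 1, distance-based models obsess over the big one. Standardize each feature: $x' = \dfrac{x-\mu}{\sigma}$, where $\mu$ is the feature's mean and $\sigma$ its standard deviation, so every feature ends up with mean 0 and spread 1.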
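
Standardizing one feature column takes only the standard library. This sketch assumes the population standard deviation and does not handle a constant column (where the standard deviation is 0).

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)   # population standard deviation
    return [(x - mu) / sigma for x in column]

# Two features on very different scales (illustrative values).
big = standardize([0.0, 500.0, 1000.0])
small = standardize([0.0, 0.5, 1.0])
```

After scaling, both columns hold the same three values, so neither one dominates a distance calculation.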

k-fold cross-validation

Instead of one test set, split data into $k$ folds. Train on $k-1$, test on the held-out fold, rotate, and average. A more honest score.
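
The fold rotation can be sketched as a function that returns index lists. This is a minimal version that assumes the data is already shuffled and splits it into contiguous blocks; in practice you would usually shuffle first.

```python
def k_fold_indices(n, k):
    """Return k (train_indices, test_indices) pairs covering range(n)."""
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(n=10, k=5)
# Each of the 10 examples lands in exactly one test fold.
```

The cross-validated score is then the average of the $k$ test scores, one per `(train, test)` pair.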

The same ideas in real code

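from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data; swap in your own X (features) and y (numeric targets)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# split: keep 20% hidden to test generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# regression: predict a number  (y_hat = w·x + b)
reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))     # R^2 on unseen data

# classification: predict a class via the sigmoid (targets must be class labels)
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(Xc_train, yc_train)
print(clf.predict_proba(Xc_test)[:5]) # probabilities, not just labels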

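import torch, torch.nn as nn

# convert the numeric training split to float32 tensors; targets get shape (n, 1)
X_t = torch.tensor(X_train, dtype=torch.float32)
y_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
n_features = X_t.shape[1]

model = nn.Linear(n_features, 1)          # y_hat = w·x + b
opt   = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()                    # regression loss

for epoch in range(200):
    opt.zero_grad()                       # clear old gradients
    y_hat = model(X_t)                    # forward pass
    loss  = loss_fn(y_hat, y_t)           # how wrong?
    loss.backward()                       # gradients of loss w.r.t w,b
    opt.step()                            # w <- w - lr * grad  (gradient descent)
# for classification: swap to nn.BCEWithLogitsLoss and pass it the raw outputs;
# it applies the sigmoid internally, so do not add a sigmoid layer to the model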

train_test_split - hides part of the data so the score reflects unseen examples, not memorisation.

LinearRegression / nn.Linear - both fit $\hat y = wx + b$; sklearn solves it directly, PyTorch via gradient descent.

LogisticRegression - squashes the line through a sigmoid into a probability, then thresholds for a class.

the training loop - forward, measure loss, backpropagate gradients, take a descent step, repeat.

Worked example: one k-means step by hand

Six points on a line: $\{1,\,2,\,3,\,9,\,10,\,11\}$, and $k=2$. Start the centroids badly at $c_1 = 2$ and $c_2 = 4$.

Assign: each point joins its nearest centroid. Points $1,2$ are closer to $c_1=2$; point $3$ is exactly tied ($|3-2| = |3-4| = 1$), and we break the tie toward $c_1$; points $9,10,11$ are closer to $c_2=4$ (e.g. $|9-4|=5 < |9-2|=7$). So $A_1=\{1,2,3\}$, $A_2=\{9,10,11\}$.

Move: each centroid jumps to the mean of its members: $c_1 = \tfrac{1+2+3}{3} = 2$, $c_2 = \tfrac{9+10+11}{3} = 10$.

Check: re-assigning with $c_1=2,\,c_2=10$ changes nothing, so the algorithm has converged in one update. Final inertia $= (1{-}2)^2+(2{-}2)^2+(3{-}2)^2 + (9{-}10)^2+(10{-}10)^2+(11{-}10)^2 = 4$.

k-means only ever lowers inertia, so it always converges, but to a local minimum that depends on the start. Bad seeds give bad clusters; that is why practice uses k-means++ seeding (spread the initial centroids apart) and several restarts, keeping the lowest-inertia run.
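
The hand calculation can be checked with a short 1D k-means sketch. It assumes ties go to the lower-numbered centroid and that an empty cluster keeps its old centroid; both are illustrative choices, not rules fixed by the algorithm.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Plain k-means on numbers: assign, move, repeat until nothing changes."""
    centroids = list(centroids)
    for _ in range(max_iters):
        # Assign: nearest centroid wins (min() sends ties to the lower index).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Move: each centroid jumps to the mean of its members.
        new_centroids = [
            sum(members) / len(members) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Same points and starting centroids as the worked example.
centroids, inertia = kmeans_1d([1, 2, 3, 9, 10, 11], [2, 4])
```

Running it from the bad start $c_1=2,\,c_2=4$ lands on the centroids $2$ and $10$ with inertia $4$, matching the steps above.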

Challenge · earn the Machine Learning badge

Frontier · research-grade

The theory under classical ML.

The levels above are the practitioner's toolkit. This tier is the why: when does fitting the data generalize, what error is reducible, how kernels buy nonlinearity for free, how boosting turns weak rules strong, and why k-means always converges. Each topic is a guided lesson with step-through proofs, a worked example, a visualization, and citations.