An embedding is a tiny address label in meaning-space: it turns a word, an
image, or a robot state into an arrow. Arrows that point the same way mean similar things.
Open the vault and learn how machines store and search meaning.
By the end, you will be able to explain: Magnitude + direction · Dot product & cosine similarity · Why embeddings store meaning · How vector search retrieves
Level 1 · Beginner
What even is a vector?
A vector is just an arrow. It has a length (how big) and a
direction (which way it points). That's the whole idea.
The everyday analogy: map coordinates
Think of a treasure map. "Go 3 steps east and 2 steps north" is a vector:
two numbers, (3, 2), that together point you somewhere. Computers describe
words the same way. The word king might live at coordinates like
(0.8, 0.6) in a pretend two-number "meaning-space".
Words that mean similar things get put close together: king and
queen sit near each other, while banana is parked far away. That closeness
is what lets a machine "feel" that two things are related.
One-line version: an embedding is a list of numbers that drops a word onto
a map so similar words become neighbours.
Arrow = length + direction
The arrow's length says how strong, the angle says which way.
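To make "length + direction" concrete, here is a minimal sketch that measures both for the treasure-map vector (3, 2). The variable names are illustrative only, and the angle is measured counter-clockwise from east, which is an assumption about how you read the map.

```python
import math

# The treasure-map vector from the text: 3 steps east, 2 steps north.
v = (3.0, 2.0)

# Length (magnitude): Pythagoras on the two coordinates.
length = math.hypot(v[0], v[1])

# Direction: angle measured counter-clockwise from east, in degrees.
angle_deg = math.degrees(math.atan2(v[1], v[0]))

print(f"length = {length:.3f}, direction = {angle_deg:.1f} degrees")
```

The same two numbers describe the arrow either way: as coordinates (3, 2) or as a length plus an angle.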
Try it: drag the arrows in meaning-space
Grab the dot at the tip of vector A or
vector B and drag. Watch the live readout update.
Worked example
Where do these words land?
Pretend meaning-space has two axes: royalty (→) and cuteness (↑). Then:
king and queen are close, so the machine treats them as related. kitten
points a totally different way, so it's "far" in meaning even though all three are short words.
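A rough sketch of this worked example in code. The coordinates below are made up for illustration (they are not taken from any real model); only the two axis names come from the text.

```python
import math

# Hypothetical 2-D coordinates: (royalty, cuteness). Invented for illustration.
words = {
    "king": (0.90, 0.10),
    "queen": (0.85, 0.20),
    "kitten": (0.10, 0.90),
}

def distance(a, b):
    """Straight-line distance between the tips of two arrows."""
    return math.dist(a, b)

def nearest(word):
    """Return the other word whose point lies closest to `word`."""
    others = [w for w in words if w != word]
    return min(others, key=lambda w: distance(words[word], words[w]))

print("nearest to king:", nearest("king"))
print("king-queen gap:", round(distance(words["king"], words["queen"]), 3))
print("king-kitten gap:", round(distance(words["king"], words["kitten"]), 3))
```

With these stand-in numbers, king and queen sit a short hop apart while kitten is far from both, which is all "related" means to the machine.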
Level 2 · Intermediate
Measuring how alike two arrows are
If similar things are close, we need a number for "close". Three tools do the job:
the dot product, cosine similarity, and Euclidean distance.
The dot product, in plain words
The dot product asks: "how much do these two arrows agree?" Multiply their
matching numbers and add up the results. If both point the same way it's big and positive.
If they point opposite ways it's negative. If they're at right angles it's exactly zero.
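The three cases above can be checked with a few lines of plain Python. This is a sketch; the helper name `dot` and the sample arrows are chosen here for illustration.

```python
def dot(a, b):
    """Multiply matching coordinates and add up the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of coordinates")
    return sum(x * y for x, y in zip(a, b))

a = (3, 2)

same_way = dot(a, (6, 4))       # points the same way as a
opposite = dot(a, (-3, -2))     # points the opposite way
right_angle = dot(a, (-2, 3))   # a rotated by 90 degrees

print(same_way, opposite, right_angle)
```

The sign alone already tells you whether two arrows roughly agree, disagree, or are unrelated.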
Cosine vs. Euclidean
Cosine similarity only cares about the angle between arrows; it
ignores length. Euclidean distance is the straight-line gap between the two
tips, so it cares about length too. Normalizing means shrinking every arrow
to length 1 so only direction is left; after that, cosine similarity and Euclidean distance always rank neighbours in the same order.
Rule of thumb: for embeddings we usually compare direction, so
cosine similarity is the favourite.
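A small sketch of the difference, using two arrows that point the same way but have different lengths. The helper names are illustrative, and the code assumes no vector is all zeros (a zero arrow has no direction, so it cannot be normalized).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Length of the arrow."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Shrink or stretch the arrow to length 1, keeping its direction."""
    length = norm(a)
    return tuple(x / length for x in a)

short_arrow = (1.0, 1.0)
long_arrow = (5.0, 5.0)   # same direction, five times longer

cos_raw = cosine(short_arrow, long_arrow)
dist_raw = math.dist(short_arrow, long_arrow)
dist_unit = math.dist(normalize(short_arrow), normalize(long_arrow))

print(f"cosine = {cos_raw:.3f}")
print(f"distance before normalizing = {dist_raw:.3f}")
print(f"distance after normalizing = {dist_unit:.3f}")
```

Cosine calls the two arrows identical; raw distance sees a large gap until both are normalized, at which point the gap vanishes.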
Projection: the shadow of A on B
The dot product is the length of A's shadow on B, scaled by |B|.
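As a sketch of the shadow picture: the signed shadow length is the dot product divided by |B|, so multiplying it back by |B| recovers the dot product. The function name `scalar_projection` and the sample vectors are chosen here for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scalar_projection(a, b):
    """Signed length of the shadow that a casts on the line through b."""
    return dot(a, b) / math.sqrt(dot(b, b))

a = (3.0, 4.0)
b = (2.0, 0.0)   # lies along the first axis

shadow = scalar_projection(a, b)
length_b = math.sqrt(dot(b, b))

print("shadow of A on B:", shadow)
print("shadow * |B|:", shadow * length_b)
print("dot product:", dot(a, b))
```

Because B lies along the first axis, A's shadow is simply its first coordinate, and scaling by |B| gives the dot product.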
\[ a\cdot b \;=\; \sum_{i=1}^{n} a_i b_i \;=\; \lVert a\rVert\,\lVert b\rVert\cos\theta \]
In words: multiply each pair of matching coordinates and add them up; this
also equals the two lengths multiplied together, times the cosine of the angle between them.
\(a_i, b_i\): the \(i\)-th coordinate of each vector
\(n\): number of dimensions (2 here, 768 for a real embedding)
\(\lVert a\rVert\): the length (magnitude) of \(a\)
\[ \lVert a-b\rVert \;=\; \sqrt{\sum_{i=1}^{n} (a_i-b_i)^{2}} \]
In words: subtract the vectors coordinate by coordinate, square each gap,
add them, and take the square root: the straight-line distance between the two tips.
\[ \lVert a-b\rVert^{2} \;=\; \lVert a\rVert^{2} + \lVert b\rVert^{2} - 2\lVert a\rVert\lVert b\rVert\cos\theta \]
The bridge between the two metrics. Expand the squared distance with the
dot-product identity above. The cross-term carries a \(\cos\theta\), so distance and cosine are
two views of the same geometry. When both vectors are normalized to length 1, the lengths
drop out and squared distance becomes exactly \(2(1-\cos\theta)\).
Why this matters: on unit vectors, ranking by smallest Euclidean distance is
identical to ranking by largest cosine similarity; they never disagree. That is why vector
databases normalize once at index time and then treat "nearest" and "most similar" as the same query.
\(\lVert a-b\rVert^{2}\): squared Euclidean distance (no square root needed for ranking)
\(-2\lVert a\rVert\lVert b\rVert\cos\theta\): the cross-term, the only place the angle enters
\(2(1-\cos\theta)\): the unit-vector case: ranges from \(0\) (same direction) to \(4\) (opposite)
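The unit-vector identity and the "rankings never disagree" claim can be checked numerically. This is a sketch on random stand-in vectors; the dimension, the number of documents, and the seed are arbitrary choices made here.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    """Squared Euclidean distance (no square root needed for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(a):
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]

rng = random.Random(0)  # fixed seed so the run is repeatable

def random_unit(dim):
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

query = random_unit(8)
docs = [random_unit(8) for _ in range(20)]

# On unit vectors the dot product is cos(theta), so compare both sides directly.
max_gap = max(abs(sq_dist(query, d) - 2 * (1 - dot(query, d))) for d in docs)

# Rank the same documents two ways.
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_distance = sorted(range(len(docs)), key=lambda i: sq_dist(query, docs[i]))

print("largest gap between the two sides:", max_gap)
print("rankings agree:", by_cosine == by_distance)
```

The gap is only floating-point noise, and the two orderings come out the same, which is why normalizing once lets an index treat "nearest" and "most similar" as one query.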
\[ \operatorname{Attention}(Q,K,V) \;=\; \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \]
One dot product, stacked into a matrix. Comparing one query against many stored
vectors is just many dot products at once, a matrix multiply. Row \(i\), column \(j\) of \(QK^{\top}\)
is the dot product \(q_i\!\cdot\!k_j\): exactly the similarity score from the top of this tier. This is
how the Transformers room computes attention: every token's query scores every token's key.
The \(\sqrt{d}\) and the softmax: in high dimensions the typical size of a raw dot product
grows like \(\sqrt{d}\) (when each coordinate has roughly unit variance), so dividing by
\(\sqrt{d}\) keeps the scores in a sane range (the same high-dimensional intuition as the
curse-of-dimensionality callout in this room). softmax then turns those
scores into weights that sum to 1: a temperature-scaled, normalized version of the cosine ranking
you already know.
\(Q,K,V\): matrices whose rows are query / key / value vectors
\(QK^{\top}\): all pairwise dot products in one shot (an \(n\times n\) score grid)
\(\sqrt{d}\): scale factor that tames score growth in high dimensions
\(\operatorname{softmax}\): converts scores to a probability-weighted blend
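Here is a minimal pure-Python sketch of the pieces in the legend above: score every query against every key, divide by \(\sqrt{d}\), softmax each row, then blend the value rows. The tiny matrices are made-up stand-ins, and real implementations use batched tensor operations rather than loops.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d = len(K[0])                          # dimension of each query/key vector
    outputs, all_weights = [], []
    for q in Q:
        # One row of Q K^T, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted blend of the value rows.
        blended = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up toy inputs: 2 queries, 3 keys, 3 values, all 2-dimensional.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

The first query agrees equally with keys 0 and 2 and not at all with key 1, so its weights put equal mass on values 0 and 2 and less on value 1.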
Does this still work in 768 dimensions?
Yes, and that's the magic. A real model like BERT-base stores each token as a vector with
768 numbers (GPT-style models go to thousands). You can't picture a 768-dimensional arrow,
but every formula above just sums over more terms. The geometry is identical; only the count of
coordinates changes. More dimensions give the model more independent "meaning knobs" to separate
concepts.
The curse of dimensionality: in very high dimensions almost every random pair of
vectors ends up nearly perpendicular, so raw Euclidean distances bunch together and lose contrast.
That's a big reason embeddings lean on cosine (angle) instead of raw distance.
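The "nearly perpendicular" claim is easy to test with a small simulation. This sketch draws random Gaussian vectors (an assumption made here; real embeddings are not random) and reports the average absolute cosine at several dimensions.

```python
import math
import random

def cosine(a, b):
    dot_ab = sum(x * y for x, y in zip(a, b))
    len_a = math.sqrt(sum(x * x for x in a))
    len_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (len_a * len_b)

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine| over random Gaussian vector pairs of a given dimension."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(a, b))
    return total / pairs

results = {dim: mean_abs_cosine(dim) for dim in (2, 16, 128, 1024)}
for dim, value in results.items():
    print(f"dim {dim:5d}: mean |cosine| = {value:.3f}")
```

The average shrinks steadily toward 0 as the dimension grows, meaning a typical random pair gets closer and closer to a right angle.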
Worked example
Cosine in 4 dimensions
Let a = (1, 0, 1, 0) and b = (1, 1, 0, 0).
Dot product: 1·1 + 0·1 + 1·0 + 0·0 = 1.
Lengths: |a| = √2, |b| = √2.
Cosine: 1 / (√2·√2) = 0.5 → angle = 60°.
They share one active dimension out of two each, giving a moderate similarity of 0.5.
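The arithmetic above can be double-checked in a few lines. This is only a verification sketch of the worked example, using the same a and b.

```python
import math

a = (1, 0, 1, 0)
b = (1, 1, 0, 0)

dot_ab = sum(x * y for x, y in zip(a, b))
len_a = math.sqrt(sum(x * x for x in a))
len_b = math.sqrt(sum(y * y for y in b))

cos_theta = dot_ab / (len_a * len_b)
angle_deg = math.degrees(math.acos(cos_theta))

print(f"dot = {dot_ab}, cosine = {cos_theta:.3f}, angle = {angle_deg:.1f} degrees")
```

Nothing in the code cares that there are 4 coordinates rather than 2; the same loop would run unchanged on 768.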
Try it: watch random vectors go perpendicular
Drag the dimension slider. Each bar counts how many random pairs landed at
that cosine similarity. As the dimension climbs, the whole pile slides into the middle: in high
dimensions almost every random pair is nearly perpendicular (cosine ≈ 0).
Why it matters: when every pair looks equally "far", raw distance
stops ranking neighbours well. That is the headline reason embeddings lean on cosine, and the reason the
Johnson-Lindenstrauss result in the Frontier tier is so useful.
Level 4 · Expert
Vector search: finding meaning at scale
This is what powers semantic search, RAG, recommendations, and image lookup:
store millions of embeddings, then fetch the nearest neighbours of a query.
Nearest-neighbour retrieval
Embed everything once. When a query arrives, embed it too, then return the stored vectors with
the highest cosine similarity: the top-k nearest neighbours. Cosine is popular
for text embeddings because models are trained so that direction carries meaning while
length often just reflects word frequency or magnitude noise.
Try it: semantic search over a tiny word map
Pick a query word and a k. The vault highlights the top-k nearest
words by cosine similarity and draws the ranking.
PyTorch idea: cosine similarity & a tiny index
```python
import torch
import torch.nn.functional as F

# A "database" of 5 embeddings, each 768-dim (random stand-ins).
db = torch.randn(5, 768)
query = torch.randn(768)

# Cosine similarity of the query against every row.
sims = F.cosine_similarity(query.unsqueeze(0), db, dim=1)  # shape: (5,)

# Top-3 nearest neighbours (an in-memory "index").
scores, idx = sims.topk(3)
print(idx.tolist(), scores.tolist())

# Tip: pre-normalize once, then a matrix-multiply IS the cosine search.
db_n = F.normalize(db, dim=1)
q_n = F.normalize(query, dim=0)
sims_fast = db_n @ q_n  # same numbers, much faster at scale
```
topk returns the highest scores plus their row indices: the retrieved items.
normalize then matmul is the trick real vector databases use: scale every vector to length 1 once, and a single dot product becomes the cosine score.
What real embedding models look like in 2026
The toy 2D map above scales straight to production. Modern open text embedders
(the families that top the public MTEB leaderboard) emit vectors of
768 to 4096 dimensions and are still compared with the very same cosine score,
almost always after L2-normalization, so cosine and dot product coincide exactly as derived in the
Advanced tier.
Two practical tricks now dominate at scale. Matryoshka Representation Learning (MRL)
trains one model so its leading coordinates are usable on their own: you can truncate a 1024-dim
vector to its first 256 numbers and still search well, trading a little accuracy for a big speed and
storage win (no re-embedding required). Quantization shrinks each number from a
32-bit float to an 8-bit integer or even a single bit (binary embeddings), cutting memory 4 to 32× while
keeping the ranking nearly intact; a fast bit-distance pass shortlists candidates, and a
full-precision pass then re-ranks the survivors.
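The mechanics of both tricks fit in a short sketch. Two loud assumptions: the vector below is a random stand-in, so truncating it only shows the bookkeeping (the accuracy claim holds only for models actually trained with MRL), and the symmetric one-scale-per-vector int8 scheme is just one simple option among several used in practice. All function names are made up for this example.

```python
import math
import random

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def truncate(v, k):
    """MRL-style shortening: keep the first k coordinates, then rescale to length 1."""
    return normalize(v[:k])

def quantize_int8(v):
    """Map floats to integers in [-127, 127] using one shared scale per vector."""
    scale = max(abs(x) for x in v) / 127.0
    return [round(x / scale) for x in v], scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

rng = random.Random(0)
full = normalize([rng.gauss(0.0, 1.0) for _ in range(1024)])  # random stand-in

small = truncate(full, 256)              # 1024 -> 256 dimensions
codes, scale = quantize_int8(small)      # 32-bit floats -> 8-bit integers
restored = dequantize(codes, scale)

max_error = max(abs(x - y) for x, y in zip(small, restored))
print("dimensions:", len(full), "->", len(small))
print("largest rounding error:", max_error)
```

Each coordinate moves by at most half a quantization step, which is why the ranking tends to survive the squeeze.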
Takeaway: the metric never changed; it is still the cosine/dot product from this
room. What changed is engineering: fewer dimensions (MRL) and fewer bits per dimension
(quantization) make billion-vector search cheap. The deeper indexes that exploit this live in the
Frontier tier below (HNSW, product quantization).
Worked example
Why a binary embedding still ranks correctly
Binarize by sign: keep \(+1\) where a coordinate is positive, \(-1\) where it is negative. Take
q = (+1, +1, +1, +1): the binarized query.
doc1 = (+1, +1, +1, −1) agrees on 3 of 4 dimensions.
doc2 = (+1, −1, −1, +1) agrees on only 2 of 4.
For \(\pm 1\) vectors the dot product is (agreements) − (disagreements):
doc1 → 3−1 = +2, doc2 → 2−2 = 0. So doc1 outranks
doc2, using nothing but bit comparisons. Equivalently, Hamming distance (the count of
flipped bits) gives the same order: doc1 differs in 1 bit, doc2 in 2.
That is how 1-bit embeddings retrieve fast and then hand the top candidates back to full-precision
cosine for an exact re-rank.
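The same example as a runnable sketch. The helper names are illustrative, and sending a coordinate of exactly zero to −1 is an assumption made here, since the text only covers positive and negative values.

```python
def binarize(v):
    """Sign-binarize: +1 for positive coordinates, -1 otherwise (zero -> -1 is an assumption)."""
    return tuple(1 if x > 0 else -1 for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hamming(a, b):
    """Count of positions where the two sign vectors differ."""
    return sum(x != y for x, y in zip(a, b))

q = (1, 1, 1, 1)
doc1 = (1, 1, 1, -1)
doc2 = (1, -1, -1, 1)

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    print(name, "dot =", dot(q, doc), "hamming =", hamming(q, doc))

# For +/-1 vectors of length n: dot = n - 2 * hamming, so both give the same order.
n = len(q)
identity_holds = all(dot(q, d) == n - 2 * hamming(q, d) for d in (doc1, doc2))
print("dot == n - 2 * hamming:", identity_holds)
```

Since the dot product is just a relabelling of the Hamming distance, sorting by fewest flipped bits and sorting by highest dot product always give the same order.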
Challenge
Lock it in: 11-question vault check
1. Cosine similarity mainly measures the ___ between two vectors.
2. Two vectors point in exactly opposite directions. Their cosine similarity is:
3. In semantic search, "top-k retrieval" returns the k items that are:
4. "Normalizing" an embedding means rescaling it so that:
5. The dot product \(\mathbf{a}\cdot\mathbf{b}\) of two vectors is computed by:
6. For two unit-length vectors, \(\lVert\mathbf{a}-\mathbf{b}\rVert^2 = 2(1-\cos\theta)\). This means ranking neighbors by smallest Euclidean distance gives:
7. When all vectors are pre-normalized to unit length, cosine similarity reduces exactly to:
8. In a Transformer, the raw attention score between a query \(\mathbf{q}\) and key \(\mathbf{k}\) is essentially:
9. HNSW lets billion-scale vector search run in roughly log-time because it:
10. Product quantization (PQ) shrinks an index mainly by:
11. Contrastive training with the InfoNCE loss shapes the embedding space so that:
Frontier · research-grade
How embeddings are made & searched at scale.
The levels above cover vectors, cosine similarity, and toy search. This tier is the real machinery:
why high-dimensional vectors compress (Johnson-Lindenstrauss), how billion-scale search runs in
log-time (HNSW) and bytes (product quantization), how embeddings are learned (contrastive
InfoNCE), and how they ground LLMs (RAG). Each topic has step-through proofs, a
worked example, a visualization, and citations.