Embeddings & Vector Search

Vectors are arrows with meaning locked inside.

An embedding is a tiny address label in meaning-space: it turns a word, an image, or a robot state into an arrow. Arrows that point the same way mean similar things. Open the vault and learn how machines store and search meaning.

By the end, you will be able to explain:

Magnitude + direction
Dot product & cosine similarity
Why embeddings store meaning
How vector search retrieves

Level 1 · Beginner

What even is a vector?

A vector is just an arrow. It has a length (how big) and a direction (which way it points). That's the whole idea.

The everyday analogy: map coordinates

Think of a treasure map. "Go 3 steps east and 2 steps north" is a vector: two numbers, (3, 2), that together point you somewhere. Computers describe words the same way. The word king might live at coordinates like (0.8, 0.6) in a pretend two-number "meaning-space".

Words that mean similar things get put close together: king and queen sit near each other, while banana is parked far away. That closeness is what lets a machine "feel" that two things are related.

One-line version: an embedding is a list of numbers that drops a word onto a map so similar words become neighbours.

Arrow = length + direction

The arrow's length says how strong; the angle says which way.

Try it: drag the arrows in meaning-space

Grab the dot at the tip of vector A or vector B and drag. Watch the live readout update.

Worked example

Where do these words land?

Pretend meaning-space has two axes: royalty (→) and cuteness (↑). Then:

king = (0.9, 0.2): very royal, not very cute.
kitten = (0.1, 0.95): barely royal, extremely cute.
queen = (0.85, 0.35): sits right next to king.

king and queen are close, so the machine treats them as related. kitten points a totally different way, so it's "far" in meaning even though all three are short words.
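As a rough check of the worked example above, the sketch below measures the straight-line gap between the three toy words. The coordinates come from the example; the helper name `gap` is made up for illustration.

```python
import math

# Toy coordinates from the worked example: (royalty, cuteness).
words = {
    "king": (0.9, 0.2),
    "queen": (0.85, 0.35),
    "kitten": (0.1, 0.95),
}

def gap(w1, w2):
    """Straight-line distance between two words on the toy map (hypothetical helper)."""
    return math.dist(words[w1], words[w2])

king_queen = gap("king", "queen")
king_kitten = gap("king", "kitten")
print(f"king-queen:  {king_queen:.3f}")
print(f"king-kitten: {king_kitten:.3f}")
```

The king-queen gap comes out several times smaller than the king-kitten gap, which is the "close means related" idea in numbers.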

Level 2 · Intermediate

Measuring how alike two arrows are

If similar things are close, we need a number for "close". Three tools do the job: the dot product, cosine similarity, and Euclidean distance.

The dot product, in plain words

The dot product asks: "how much do these two arrows agree?" Multiply their matching numbers and add up the results. If both point the same way it's big and positive. If they point opposite ways it's negative. If they're at right angles it's exactly zero.
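A minimal sketch of those three cases, using made-up 2D arrows (the vectors are illustrative, not taken from the lesson):

```python
def dot(a, b):
    """Multiply matching coordinates and add up the results."""
    return sum(x * y for x, y in zip(a, b))

a = (2, 1)
same_way = dot(a, (4, 2))       # b points the same way as a
opposite = dot(a, (-4, -2))     # b points the opposite way
right_angle = dot(a, (-1, 2))   # b is perpendicular to a
print(same_way, opposite, right_angle)  # 10 -10 0
```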

Cosine vs. Euclidean

Cosine similarity only cares about the angle between arrows; it ignores length. Euclidean distance is the straight-line gap between the two tips, so it cares about length too. Normalizing means rescaling every arrow to length 1 so only direction is left; after that, cosine similarity and Euclidean distance always rank neighbours in the same order.

Rule of thumb: for embeddings we usually compare direction, so cosine similarity is the favourite.

Projection: the shadow of A on B

The dot product is the length of A's shadow on B, scaled by |B|.
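The shadow picture can be sketched in a few lines. The vectors and helper names here are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def length(v):
    return math.sqrt(dot(v, v))

a = (2.0, 3.0)
b = (5.0, 0.0)   # b lies along the first axis, so A's shadow on it is just a[0]

shadow = dot(a, b) / length(b)      # signed length of A's shadow on B
rebuilt_dot = shadow * length(b)    # scaling the shadow by |B| recovers the dot product
print(shadow, rebuilt_dot)
```

The shadow is negative when the arrows point more than 90° apart, which is why the dot product can go below zero.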

Try it: tune two vectors, read every metric live

Sliders: A angle (35°), A length (80), B angle (78°), B length (60).

Worked example

Compute a dot product by hand

Let a = (3, 4) and b = (4, 0).

Dot product: 3·4 + 4·0 = 12.
Lengths: |a| = √(9+16) = 5, |b| = √16 = 4.
Cosine: 12 / (5·4) = 0.6 → angle ≈ 53°.
Euclidean distance: √((3−4)²+(4−0)²) = √17 ≈ 4.12.

Cosine of 0.6 means "fairly similar direction" even though the arrows have different lengths.
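The same hand computation can be reproduced with the standard library. This is only a sketch for checking the arithmetic, not a recommended production routine.

```python
import math

a = (3, 4)
b = (4, 0)

dot = sum(x * y for x, y in zip(a, b))        # 3*4 + 4*0
len_a = math.hypot(*a)                        # |a|
len_b = math.hypot(*b)                        # |b|
cosine = dot / (len_a * len_b)                # direction-only score
angle_deg = math.degrees(math.acos(cosine))   # angle between the arrows
distance = math.dist(a, b)                    # gap between the two tips

print(dot, len_a, len_b)
print(round(cosine, 3), round(angle_deg, 1), round(distance, 2))
```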

Level 3 · Advanced

The real math behind similarity

Same intuition, now written the way papers write it, and it works identically whether the arrow has 2 numbers or 768.

\[ a\cdot b=\sum_{i=1}^{n} a_i b_i=\lVert a\rVert\,\lVert b\rVert\cos\theta \]

In words: multiply each pair of matching coordinates and add them up; this also equals the two lengths multiplied together, times the cosine of the angle between them.

\(a_i, b_i\): the \(i\)-th coordinate of each vector
\(n\): number of dimensions (2 here, 768 for a real embedding)
\(\lVert a\rVert\): the length (magnitude) of \(a\)
\(\theta\): the angle between the two arrows

\[ \cos\theta=\frac{a\cdot b}{\lVert a\rVert\,\lVert b\rVert}=\frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} \]

In words: divide the dot product by both lengths and you cancel out size, leaving a pure direction score between -1 (opposite) and +1 (identical).

\(\cos\theta\): cosine similarity, the headline embedding metric
\(a\cdot b\): the dot product from above
\(\lVert a\rVert\,\lVert b\rVert\): the normaliser that removes length

Drag the slider to feel how the angle θ controls the cosine.

\[ \lVert a-b\rVert=\sqrt{\sum_{i=1}^{n}(a_i-b_i)^2} \]

In words: subtract the vectors coordinate by coordinate, square each gap, add them, and take the square root: the straight-line distance between the two tips.

\(a_i-b_i\): the gap in dimension \(i\)
\(\lVert a-b\rVert\): Euclidean (L2) distance

\[ \lVert a-b\rVert^{2}=\lVert a\rVert^{2}+\lVert b\rVert^{2}-2\,\lVert a\rVert\,\lVert b\rVert\cos\theta \;\xrightarrow[\;\lVert a\rVert=\lVert b\rVert=1\;]{}\; 2\bigl(1-\cos\theta\bigr) \]

The bridge between the two metrics. Expand the squared distance with the dot-product identity above. The cross-term carries a \(\cos\theta\), so distance and cosine are two views of the same geometry. When both vectors are normalized to length 1, the lengths drop out and squared distance becomes exactly \(2(1-\cos\theta)\).

Why this matters: on unit vectors, ranking by smallest Euclidean distance is identical to ranking by largest cosine similarity; they never disagree. That is why many vector search systems normalize once at index time and can then treat "nearest" and "most similar" as the same query.

\(\lVert a-b\rVert^{2}\): squared Euclidean distance (no square root needed for ranking)
\(-2\lVert a\rVert\lVert b\rVert\cos\theta\): the cross-term, the only place the angle enters
\(2(1-\cos\theta)\): the unit-vector case: ranges from \(0\) (same direction) to \(4\) (opposite)
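A small numerical check of the bridge identity, assuming random Gaussian vectors as stand-ins for embeddings (the dimension, document count, and seed are arbitrary choices):

```python
import math
import random

def normalize(v):
    """Rescale v to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)   # fixed seed so the run is repeatable
dim = 8
query = normalize([rng.gauss(0, 1) for _ in range(dim)])
docs = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(20)]

# On unit vectors, squared distance should equal 2 * (1 - cosine).
max_gap = max(abs(sq_dist(query, d) - 2 * (1 - dot(query, d))) for d in docs)

# Ranking by smallest distance vs. largest cosine (dot product on unit vectors).
by_distance = sorted(range(len(docs)), key=lambda i: sq_dist(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))

print(max_gap)
print(by_distance == by_cosine)
```

The gap should be at floating-point noise level, and the two orderings should match.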

\[ S=QK^{\top}\quad\Longrightarrow\quad \mathrm{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \]

One dot product, stacked into a matrix. Comparing one query against many stored vectors is just many dot products at once, a matrix multiply. Row \(i\), column \(j\) of \(QK^{\top}\) is the dot product \(q_i\!\cdot\!k_j\): exactly the similarity score from the top of this tier. This is how the Transformers room computes attention: every token's query scores every token's key.

The \(\sqrt{d}\) and the softmax: in high dimensions the typical size of a raw dot product grows with the dimension \(d\) (roughly like \(\sqrt{d}\) when the coordinates are independent with unit variance), so dividing by \(\sqrt{d}\) keeps the scores in a sane range (the same high-dimension intuition as the curse-of-dimensionality callout below). softmax then turns those scores into weights that sum to 1: a temperature-scaled, normalized version of the dot-product ranking you already know.

\(Q,K,V\): matrices whose rows are query / key / value vectors
\(QK^{\top}\): all pairwise dot products in one shot (a score grid with one row per query and one column per key)
\(\sqrt{d}\): scale factor that tames score growth in high dimensions
\(\operatorname{softmax}\): converts scores to a probability-weighted blend
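The attention formula can be sketched with plain lists. This is an illustrative single-head version with made-up \(Q\), \(K\), \(V\) values; real implementations use batched tensor operations and usually add masking and learned projections, none of which appear here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d = len(K[0])                            # dimension of each query/key vector
    weights, outputs = [], []
    for q in Q:
        # One row of QK^T, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted blend of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Made-up toy inputs: 2 queries, 3 keys/values, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each query puts its largest weight on the key it agrees with most, which is the dot-product ranking passed through softmax.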

Does this still work in 768 dimensions?

Yes, and that's the magic. A real model like BERT stores each token as a vector with 768 numbers (GPT-style models go to thousands). You can't picture a 768-arrow, but every formula above just sums over more terms. The geometry is identical; only the count of coordinates changes. More dimensions give the model more independent "meaning knobs" to separate concepts.

The curse of dimensionality: in very high dimensions almost every random pair of vectors ends up nearly perpendicular, so raw Euclidean distances bunch together and lose contrast. That's a big reason embeddings lean on cosine (angle) instead of raw distance.

Worked example

Cosine in 4 dimensions

Let a = (1, 0, 1, 0) and b = (1, 1, 0, 0).

Dot product: 1·1 + 0·1 + 1·0 + 0·0 = 1.
Lengths: |a| = √2, |b| = √2.
Cosine: 1 / (√2·√2) = 0.5 → angle = 60°.

They share one active dimension out of two each, giving a moderate similarity of 0.5.

Try it: watch random vectors go perpendicular

Drag the dimension slider. Each bar counts how many random pairs landed at that cosine similarity. As the dimension climbs, the whole pile slides into the middle: in high dimensions almost every random pair is nearly perpendicular (cosine ≈ 0).

Why it matters: when every pair looks equally "far", raw distance stops ranking neighbours well. That is the headline reason embeddings lean on cosine, and the reason the Johnson-Lindenstrauss result in the Frontier tier is so useful.
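The "pile slides into the middle" effect can be reproduced with a quick simulation. The sketch assumes "random" means independent Gaussian coordinates, which the lesson does not specify; the pair count and seed are arbitrary.

```python
import math
import random

def random_cosine(dim, rng):
    """Cosine similarity of two random Gaussian vectors (an assumed sampling scheme)."""
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_abs_cosine(dim, pairs=500, seed=0):
    """Average |cosine| over many random pairs: smaller means closer to perpendicular."""
    rng = random.Random(seed)
    return sum(abs(random_cosine(dim, rng)) for _ in range(pairs)) / pairs

spread = {dim: mean_abs_cosine(dim) for dim in (2, 16, 128, 768)}
for dim, value in spread.items():
    print(f"d={dim:4d}  mean |cos| = {value:.3f}")
```

The average absolute cosine shrinks steadily as the dimension grows, matching the histogram narrowing around zero.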

Level 4 · Expert

Vector search: finding meaning at scale

This is what powers semantic search, RAG, recommendations, and image lookup: store millions of embeddings, then fetch the nearest neighbours of a query.

Nearest-neighbour retrieval

Embed everything once. When a query arrives, embed it too, then return the stored vectors with the highest cosine similarity: the top-k nearest neighbours. Cosine is popular for text embeddings because models are trained so that direction carries meaning while length often just reflects word frequency or magnitude noise.
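Before the PyTorch version, here is a dependency-free sketch of brute-force top-k retrieval. The king, queen, and kitten coordinates reuse the Level 1 toy map; "puppy", "throne", and the query vector are hypothetical additions for illustration.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Tiny "index": word -> (royalty, cuteness).
index = {
    "king": (0.9, 0.2),
    "queen": (0.85, 0.35),
    "kitten": (0.1, 0.95),
    "puppy": (0.15, 0.9),    # hypothetical entry
    "throne": (0.95, 0.1),   # hypothetical entry
}

def top_k(query_vec, k):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query_vec, vec), word) for word, vec in index.items())
    return heapq.nlargest(k, scored)

query = (0.9, 0.25)          # hypothetical embedding of a royal-sounding query
results = top_k(query, k=3)
for score, word in results:
    print(f"{word:7s} {score:.3f}")
```

Scoring every stored vector is fine for a handful of items; the approximate indexes mentioned in the Frontier tier exist because this linear scan becomes expensive at millions of vectors.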

Try it: semantic search over a tiny word map

Pick a query word and a k. The vault highlights the top-k nearest words by cosine similarity and draws the ranking.

PyTorch idea: cosine similarity & a tiny index

import torch
import torch.nn.functional as F

# A "database" of 5 embeddings, each 768-dim (random stand-ins).
db = torch.randn(5, 768)
query = torch.randn(768)

# Cosine similarity of the query against every row.
sims = F.cosine_similarity(query.unsqueeze(0), db, dim=1)  # shape: (5,)

# Top-3 nearest neighbours (an in-memory "index").
scores, idx = sims.topk(3)
print(idx.tolist(), scores.tolist())

# Tip: pre-normalize once, then a matrix-multiply IS the cosine search.
db_n = F.normalize(db, dim=1)
q_n  = F.normalize(query, dim=0)
sims_fast = db_n @ q_n          # same numbers, much faster at scale

What real embedding models look like in 2026

The toy 2D map above scales straight to production. Modern open text embedders (the families that top the public MTEB leaderboard) emit vectors of 768 to 4096 dimensions and are still compared with the very same cosine score, almost always after L2-normalization, so cosine and dot product coincide exactly as derived in the Advanced tier.

Two practical tricks now dominate at scale. Matryoshka Representation Learning (MRL) trains one model so its leading coordinates are usable on their own: you can truncate a 1024-dim vector to its first 256 numbers and still search well, trading a little accuracy for a big speed and storage win (no re-embedding required). Quantization shrinks each number from a 32-bit float to an 8-bit integer or even a single bit (binary embeddings), cutting memory 4 to 32× while keeping the ranking nearly intact; a fast bit-distance pass shortlists candidates, and the survivors are then re-ranked with full precision.
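Both tricks can be sketched on a single made-up vector. The truncation step only preserves quality for models actually trained with MRL, and the symmetric int8 scheme below is one common choice assumed for illustration, not the method of any particular library.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def truncate(v, keep):
    """MRL-style shortening: keep the leading coordinates, then re-normalize."""
    return normalize(v[:keep])

def quantize_int8(v):
    """Symmetric int8 quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = max(abs(x) for x in v) / 127
    return [round(x / scale) for x in v], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

# Made-up 8-dim embedding (real ones have hundreds or thousands of coordinates).
vec = normalize([0.8, -0.5, 0.3, 0.1, -0.05, 0.02, 0.01, -0.01])

short = truncate(vec, keep=4)              # fewer dimensions
codes, scale = quantize_int8(vec)          # fewer bits per dimension
restored = dequantize(codes, scale)
max_error = max(abs(x - y) for x, y in zip(vec, restored))

print(len(short), codes)
print(f"max round-trip error: {max_error:.5f}")
```

Each coordinate shrinks from 4 bytes to 1, which is the 4× end of the memory range quoted above; the round-trip error stays within half a quantization step.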

Takeaway: the metric never changed; it is still the cosine/dot product from this room. What changed is engineering: fewer dimensions (MRL) and fewer bits per dimension (quantization) make billion-vector search cheap. The deeper indexes that exploit this live in the Frontier tier below (HNSW, product quantization).

Worked example

Why a binary embedding still ranks correctly

Binarize by sign: keep \(+1\) where a coordinate is positive, \(-1\) where it is negative. Take

q = (+1, +1, +1, +1): the binarized query.
doc1 = (+1, +1, +1, −1): agrees on 3 of 4 dimensions.
doc2 = (+1, −1, −1, +1): agrees on only 2 of 4.

For \(\pm 1\) vectors the dot product is (agreements) − (disagreements): doc1 → 3−1 = +2, doc2 → 2−2 = 0. So doc1 outranks doc2, using nothing but bit comparisons. Equivalently, Hamming distance (the count of flipped bits) gives the same order: doc1 differs in 1 bit, doc2 in 2. That is how 1-bit embeddings retrieve fast and then hand the top candidates back to full-precision cosine for an exact re-rank.
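The bit-comparison argument above can be sketched with Python integers as bit strings. The helper names are made up; production systems typically use packed byte arrays and hardware popcount instead.

```python
def binarize(vec):
    """Pack signs into an int: bit i is 1 where vec[i] is positive."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(x, y):
    """Count of positions where the two bit patterns differ."""
    return bin(x ^ y).count("1")

def dot_from_hamming(x, y, dim):
    """For +/-1 vectors: agreements - disagreements = (dim - h) - h."""
    return dim - 2 * hamming(x, y)

dim = 4
q = binarize((+1, +1, +1, +1))
doc1 = binarize((+1, +1, +1, -1))
doc2 = binarize((+1, -1, -1, +1))

print(hamming(q, doc1), dot_from_hamming(q, doc1, dim))  # 1 2
print(hamming(q, doc2), dot_from_hamming(q, doc2, dim))  # 2 0
```

Because the dot product is a decreasing function of Hamming distance, sorting by fewest flipped bits gives the same order as sorting by largest ±1 dot product.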

Challenge

Lock it in: 11-question vault check

1. Cosine similarity mainly measures the ___ between two vectors.

2. Two vectors point in exactly opposite directions. Their cosine similarity is:

3. In semantic search, "top-k retrieval" returns the k items that are:

4. "Normalizing" an embedding means rescaling it so that:

5. The dot product \(\mathbf{a}\cdot\mathbf{b}\) of two vectors is computed by:

6. For two unit-length vectors, \(\lVert\mathbf{a}-\mathbf{b}\rVert^2 = 2(1-\cos\theta)\). This means ranking neighbors by smallest Euclidean distance gives:

7. When all vectors are pre-normalized to unit length, cosine similarity reduces exactly to:

8. In a Transformer, the raw attention score between a query \(\mathbf{q}\) and key \(\mathbf{k}\) is essentially:

9. HNSW lets billion-scale vector search run in roughly log-time because it:

10. Product quantization (PQ) shrinks an index mainly by:

11. Contrastive training with the InfoNCE loss shapes the embedding space so that:

Frontier · research-grade

How embeddings are made & searched at scale.

The levels above cover vectors, cosine similarity, and toy search. This tier is the real machinery: why high-dimensional vectors compress (Johnson-Lindenstrauss), how billion-scale search runs in log-time (HNSW) and bytes (product quantization), how embeddings are learned (contrastive InfoNCE), and how they ground LLMs (RAG). Each topic has step-through proofs, a worked example, a visualization, and citations.