Math Foundations for ML

The minimal math that makes ML stop being magic: vectors as data, dot products as similarity, gradients as direction, and probability as honesty.

Starting from Zero — A Physical Intuition

Before looking at equations, let's build a physical model of ML mathematics in our minds:

Vectors as Flashlight Beams: Imagine holding a flashlight in a dark room. You can point it up/down, left/right, and forward/backward. The direction you point is a list of three coordinates—a vector. Every data point in ML (a house, a word, a credit transaction) is just a flashlight beam pointing in a specific direction in a multi-dimensional room.
Dot Product as Beam Overlap: If you point your flashlight in the same direction as a friend, your beams merge into a bright overlap. If you point it at a right angle, they don't intersect at all. The dot product is the mathematical measurement of this overlap—representing similarity.
The Gradient as a Mountain Slope: Imagine you are dropped on a dark, foggy mountainside and must walk down to the base. You can't see the valley, but you can feel the slope of the ground under your boots. The gradient points in the direction of steepest ascent, so gradient descent means repeatedly taking a small step the opposite way, downhill.

How much math do you actually need?

Honest calibration first: to use ML (call models, build RAG, fine-tune), you need intuitions, not proofs. To build models or read papers, you need more. This page is the intuition layer — four ideas, each explained until it's obvious, each mapped to where you'll meet it. Nothing here requires more than Level 1 programming and high-school algebra.

The map:

  Linear algebra  →  how data and models are REPRESENTED (vectors, matrices)
  Calculus        →  how models LEARN (gradients = which way to nudge)
  Probability     →  how models express UNCERTAINTY (distributions, likelihood)
  Statistics      →  how you know it WORKED (sampling, significance, baselines)

Linear algebra: data is vectors, models are matrices

Vectors — a point is a list of numbers

A vector is just an ordered list of numbers, and the leap that unlocks all of ML is that anything can be one. A house is [area, bedrooms, age] = [1200, 3, 15]. A word, after embedding, is a list of hundreds or thousands of numbers (1,536 for some popular embedding models). An image is pixel values. Once data is vectors, "similar things" becomes "nearby points," and geometry becomes your query language.

The dot product — the most important operation in ML

Multiply two vectors element-by-element and sum:

a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ

[2, 1, 0] · [3, 1, 4] = 6 + 1 + 0 = 7

Two readings, both essential:

Similarity. When two vectors point the same way, their dot product is large; perpendicular → zero; opposite → negative. Normalize by length and you get cosine similarity, the number your vector database computes a billion times a day. The "attention score" inside transformers is a close cousin: a dot product between a query vector and a key vector, divided by the square root of their dimension rather than by their lengths. Semantic search is dot products.
A weighted sum. features · weights = each feature times how much it matters, summed — which is literally what a linear model predicts and what each neuron in a neural network computes before its activation. A neuron is a dot product with an opinion.
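Both readings fit in a few lines of plain Python. This is a minimal sketch: the helper names `dot` and `cosine_similarity`, and the house-price weights and bias, are made up for illustration.

```python
import math

def dot(a, b):
    """Multiply element-by-element and sum."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both lengths: direction only, magnitude removed."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Reading 1: similarity
print(dot([2, 1, 0], [3, 1, 4]))              # 7, the worked example above
print(cosine_similarity([3, 4], [6, 8]))      # same direction  -> 1.0
print(cosine_similarity([3, 4], [-4, 3]))     # perpendicular   -> 0.0
print(cosine_similarity([3, 4], [-3, -4]))    # opposite        -> -1.0

# Reading 2: weighted sum (hypothetical weights and bias, for illustration only)
features = [1200, 3, 15]               # [area, bedrooms, age]
weights = [150.0, 10000.0, -500.0]     # how much each feature matters
bias = 20000.0
prediction = dot(features, weights) + bias
print(prediction)                      # 222500.0
```

The second half is what a linear model predicts, and also what a single neuron computes before its activation function.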

Matrices — many dot products at once

A matrix is a grid of numbers; matrix-times-vector is "take the dot product of the vector with every row." So one matrix multiply runs every neuron in a layer simultaneously — and that's the whole secret of why GPUs run AI: graphics cards are matrix-multiply machines, and a neural network is a stack of matrix multiplies. "Model weights" — the gigabytes in a model file — are these matrices' entries; a "70B-parameter model" has 70 billion such numbers (fine-tuning nudges them; LoRA adds small matrices beside them).
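To make "one dot product per row" concrete, here is a small sketch of a layer with three inputs and two neurons. The weights are invented, and ReLU is used only as an example activation; real frameworks do the same thing with optimized matrix routines.

```python
def matvec(matrix, vector):
    """Matrix-times-vector: take the dot product of the vector with every row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(values):
    """Example activation: negative values become zero."""
    return [max(0.0, v) for v in values]

# Hypothetical layer: each ROW holds one neuron's weights.
W = [
    [0.5, -2.0, 2.0],   # neuron 0
    [1.0,  1.0, 0.0],   # neuron 1
]
x = [2.0, 1.0, 0.0]

pre_activation = matvec(W, x)
activated = relu(pre_activation)
n_parameters = sum(len(row) for row in W)

print(pre_activation)   # [-1.0, 3.0]
print(activated)        # [0.0, 3.0]
print(n_parameters)     # 6 weights in this tiny layer
```

Counting entries the same way across every matrix in a model is where figures like "70B parameters" come from.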

Think it through like the interview

Don't just plug vectors into equations — derive why we normalise them for semantic matching.

Think it through: Dot Product vs Cosine Similarity (Linear Algebra & Representation)

PROBLEM: Design a similarity metric to match user query embeddings. Explain why raw dot products fail when vector magnitudes differ.

1. Evaluate the raw dot product
“Given query vector q = [1, 0] and two document vectors d1 = [2, 0] and d2 = [0, 100], which vector has a higher raw dot product with q? Does this match semantic similarity?”
2. Find the scaling bug
“What happens to the dot product when a document is extremely long, containing the same words repeated multiple times (increasing its vector length)?”
3. Normalize to Cosine Similarity
“How do we remove the magnitude bias from our search metric to compare only semantic directions?”
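One way to see the magnitude bias is to put a short, perfectly aligned document next to a long, off-direction one. This sketch uses a hypothetical `d_long` vector (not one of the vectors in the problem) chosen so that length alone wins under the raw dot product.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to length 1 so only its direction remains."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

q = [1.0, 0.0]
d_aligned = [2.0, 0.0]     # same direction as q, short
d_long = [50.0, 50.0]      # hypothetical: 45 degrees off q, but very long

# Raw dot product rewards length: the off-direction vector scores higher.
print(dot(q, d_aligned), dot(q, d_long))            # 2.0 50.0

# Cosine similarity is the dot product of unit vectors: the aligned vector wins.
cos_aligned = dot(normalize(q), normalize(d_aligned))
cos_long = dot(normalize(q), normalize(d_long))
print(round(cos_aligned, 3), round(cos_long, 3))    # 1.0 0.707
```

Many vector databases store embeddings already normalized, in which case the raw dot product and cosine similarity give the same ranking.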

Calculus: the gradient is a compass

Forget integrals; ML needs one idea. A model's loss is a single number measuring how wrong it currently is (the error in training). Training is: change the weights to make the loss smaller. With 70 billion weights, which ones, and which way?

The derivative answers it for one weight: "if I nudge this weight up a hair, does the loss go up or down, and how steeply?" The gradient is just that answer for every weight at once — a vector of slopes, pointing in the direction of steepest increase. So:

GRADIENT DESCENT (all of deep learning's training, one line):

    weights ← weights − learning_rate × gradient

"Take a small step DOWNHILL, repeat a few million times."

  - learning_rate = step size: too big → you leap over valleys and
    diverge; too small → training takes geologic time
    (the knob you met in fine-tuning)
  - backpropagation = the bookkeeping algorithm that computes the
    gradient efficiently through stacked layers (chain rule, applied
    relentlessly — the "how" of neural-networks training)

That hiking-downhill-in-fog picture — you can't see the valley, only feel the slope under your feet — is genuinely the mental model researchers use. Loss curves flattening = the terrain leveled off; "stuck in a local minimum" = a valley that isn't the deepest one.
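The "nudge this weight up a hair" question can be asked literally. The sketch below uses a made-up one-weight loss and estimates the slope by finite differences; real training computes the same slopes exactly with backpropagation rather than by nudging.

```python
def loss(w):
    """Toy one-weight loss with its minimum at w = 3 (hypothetical example)."""
    return (w - 3.0) ** 2

def numerical_slope(f, w, eps=1e-6):
    """Nudge w a hair in both directions and measure how the loss changes."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0
learning_rate = 0.1
for step in range(50):
    slope = numerical_slope(loss, w)    # negative slope: loss falls as w rises
    w = w - learning_rate * slope       # step against the slope, i.e. downhill

print(round(w, 3))    # ends very close to 3.0, the bottom of the valley
```

Try `learning_rate = 1.1` to watch the leap-over-the-valley failure: each step overshoots further than the last.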

Probability: models that admit uncertainty

A classifier doesn't just say "spam"; it says P(spam) = 0.93, one entry in a probability distribution over outcomes (numbers ≥ 0 summing to 1). An LLM's entire output is one: given the text so far, a distribution over every possible next token (after "the cat sat on the", something like mat 0.41, floor 0.18, …). LLMs are next-token distributions, sampled. Three working ideas:

Conditional probability — P(A given B): the "given" is what changes everything. P(spam) might be 0.1; P(spam given the word "lottery") might be 0.9. Models are conditional-probability machines: P(label | features), P(next token | context).
Expectation — the probability-weighted average: what you'd get on average over many draws. Loss functions are expectations over the training data; "expected revenue per user" is the same math in a business meeting.
Likelihood & training — "maximize likelihood" = choose weights under which the actual observed data would have been most probable. Cross-entropy loss — the loss of essentially every classifier and LLM — is this, in logarithms. When you read "training minimizes cross-entropy," hear: make the truth look probable.
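Here is a small sketch of "make the truth look probable". It assumes the model's raw scores are turned into a distribution with softmax (the usual choice, though the text above doesn't name it), and the tokens and scores are invented.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into a distribution: non-negative, sums to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log of the probability the model gave to the true outcome."""
    return -math.log(probs[true_index])

tokens = ["mat", "floor", "roof"]        # hypothetical candidates
probs = softmax([2.0, 1.0, 0.1])         # hypothetical raw scores

for token, p in zip(tokens, probs):
    print(f"P({token} | context) = {p:.3f}")

loss_if_mat = cross_entropy(probs, 0)    # truth was the likely token: about 0.417
loss_if_roof = cross_entropy(probs, 2)   # truth was the unlikely token: about 2.317
print(f"{loss_if_mat:.3f} {loss_if_roof:.3f}")
```

Averaging this loss over every training example is the expectation that training tries to minimize.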

And the one distribution to know by name: the normal (Gaussian) — the bell curve that measurement noise and many natural quantities follow, the default assumption behind "mean ± standard deviation."

Statistics: knowing whether it worked

The unglamorous quarter that saves real money:

Sampling bias — your model is only as general as the data was representative (the fine-tuning distribution-mismatch bug). Train on daytime photos, fail at night.
Train/test discipline — accuracy on data the model has seen is memory, not learning; held-out evaluation is the entire epistemology of ML (classical ML makes this concrete).
Baselines — "94% accurate" means nothing until you know the dumbest model's score (predict the majority class: 91%?). Always ask "compared to what?"
Variance & significance — rerun training with a different random seed and accuracy moves a point; a 0.5% "improvement" might be noise. A/B tests (the product version) exist because per-user variance swamps small effects.
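The baseline check is cheap enough to run before anything else. A minimal sketch with invented labels (91% "ham") and an invented model accuracy of 94%:

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of the dumbest model: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

# Hypothetical imbalanced data
train = ["ham"] * 910 + ["spam"] * 90
test = ["ham"] * 91 + ["spam"] * 9

label, baseline_acc = majority_baseline_accuracy(train, test)
model_acc = 0.94    # hypothetical score of the "real" model

print(label, baseline_acc)                                   # ham 0.91
print(f"lift over baseline: {model_acc - baseline_acc:.2f}")  # 0.03
```

Seen this way, "94% accurate" is a three-point improvement over doing nothing, which is the number worth reporting.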

Gradient Descent in Code

Here is a complete, dependency-free Python implementation of gradient descent that fits the line y = wx + b (one input feature, two parameters) to a handful of points:

Python

from typing import List, Tuple

def compute_loss(data: List[Tuple[float, float]], w: float, b: float) -> float:
    # Mean Squared Error (MSE)
    total_error = sum((y - (w * x + b)) ** 2 for x, y in data)
    return total_error / len(data)

def step_gradient(
    data: List[Tuple[float, float]], 
    w_current: float, 
    b_current: float, 
    learning_rate: float
) -> Tuple[float, float]:
    w_gradient = 0.0
    b_gradient = 0.0
    N = len(data)
    
    # Calculate slopes (derivatives of loss with respect to w and b)
    for x, y in data:
        prediction = w_current * x + b_current
        error = y - prediction
        w_gradient += -2 * x * error
        b_gradient += -2 * error
        
    # Nudge parameters opposite to gradient direction
    new_w = w_current - (w_gradient / N) * learning_rate
    new_b = b_current - (b_gradient / N) * learning_rate
    return new_w, new_b

# 1. Target line: y = 2x + 1 (with minor noise)
points = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1)]

# 2. Start from zero-initialized weights and set the learning rate
w, b = 0.0, 0.0
lr = 0.01

print("Training Loop:")
for epoch in range(1, 101):
    w, b = step_gradient(points, w, b, lr)
    if epoch % 20 == 0:
        loss = compute_loss(points, w, b)
        print(f"Epoch {epoch:03d} | Loss: {loss:.4f} | w: {w:.3f} | b: {b:.3f}")
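The two slopes accumulated inside step_gradient come straight from differentiating the mean squared error. As a short derivation, writing η for learning_rate (the symbol is a choice made here, not part of the code):

```latex
\begin{aligned}
L(w, b) &= \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)^2 \\
\frac{\partial L}{\partial w} &= \frac{1}{N} \sum_{i=1}^{N} -2\, x_i \bigl(y_i - (w x_i + b)\bigr) \\
\frac{\partial L}{\partial b} &= \frac{1}{N} \sum_{i=1}^{N} -2 \bigl(y_i - (w x_i + b)\bigr) \\
w &\leftarrow w - \eta\, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta\, \frac{\partial L}{\partial b}
\end{aligned}
```

The term in parentheses is the `error` variable in the loop, so `-2 * x * error` and `-2 * error` are the per-point contributions to the two sums.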

Bayes' Theorem — The Foundation of Probabilistic Reasoning

Bayes' theorem is one of the most important equations in ML and data science. It tells you how to update your belief when you get new evidence.

Bayes' Theorem:

   P(A | B) = P(B | A) × P(A)
              ─────────────────
                    P(B)

Read: "The probability of A given B equals
       the probability of B given A, times the prior probability of A,
       divided by the probability of B."

The four parts explained:

P(A) = Prior probability — what you believed before seeing any evidence
P(B | A) = Likelihood — how probable is the evidence if A is true?
P(B) = Evidence — how probable is the evidence overall, whether or not A is true?
P(A | B) = Posterior — what you now believe after seeing the evidence

Real ML example — Spam Filter:

Prior:              P(spam) = 0.10                    (10% of emails are spam)
Likelihood:         P(word "lottery" | spam) = 0.80   (80% of spam contains "lottery")
                    P(word "lottery" | ham)  = 0.01   (1% of legitimate email contains "lottery")
Evidence:           P("lottery") = 0.80 × 0.10 + 0.01 × 0.90 = 0.089

Posterior:
P(spam | "lottery") = (0.80 × 0.10) / 0.089 = 0.08 / 0.089 ≈ 0.90

→ Seeing "lottery" updates our belief from 10% to about 90% spam probability.

Where it appears in ML:

Naive Bayes classifier — assumes features are conditionally independent given the class
Bayesian optimization — finds the best hyperparameters by modeling the unknown function
Probabilistic models — generative classifiers and Bayesian methods use Bayes' theorem to turn a prior and a likelihood into a posterior
A/B testing — Bayesian A/B testing updates belief about which variant is better

Python

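# Bayes' theorem for a single word: the building block of a Naive Bayes spam filter
def naive_bayes_predict(word_in_email: bool, p_spam: float = 0.10,
                        p_word_given_spam: float = 0.80,
                        p_word_given_ham: float = 0.01) -> float:
    """
    Apply Bayes' theorem to compute P(spam | word present) or P(spam | word absent).
    """
    # Likelihood of what we observed (word present or absent) under each class
    if word_in_email:
        like_spam, like_ham = p_word_given_spam, p_word_given_ham
    else:
        like_spam, like_ham = 1 - p_word_given_spam, 1 - p_word_given_ham
    # Evidence via the law of total probability
    p_observation = like_spam * p_spam + like_ham * (1 - p_spam)
    return like_spam * p_spam / p_observation

print(f"P(spam | 'lottery' in email): {naive_bayes_predict(True):.3f}")   # ~0.899
print(f"P(spam | no 'lottery'):       {naive_bayes_predict(False):.3f}")  # ~0.022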
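The "naive" part of Naive Bayes is treating each word as independent evidence given the class, so the per-word likelihoods simply multiply. A rough sketch, with an invented likelihood table and the simplification that only words present in the email count as evidence:

```python
import math

# Hypothetical per-word likelihoods: (P(word | spam), P(word | ham))
WORD_LIKELIHOODS = {
    "lottery": (0.80, 0.01),
    "winner": (0.40, 0.02),
    "meeting": (0.05, 0.30),
}

def naive_bayes_posterior(words_present, p_spam=0.10):
    """P(spam | words), assuming words are independent given the class."""
    # Work in log space so long products of small numbers don't underflow
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for word in words_present:
        p_w_spam, p_w_ham = WORD_LIKELIHOODS[word]
        log_spam += math.log(p_w_spam)
        log_ham += math.log(p_w_ham)
    # Normalize the two joint scores so they sum to 1
    top = max(log_spam, log_ham)
    spam_score = math.exp(log_spam - top)
    ham_score = math.exp(log_ham - top)
    return spam_score / (spam_score + ham_score)

print(f"{naive_bayes_posterior(['lottery']):.3f}")             # ~0.899
print(f"{naive_bayes_posterior(['lottery', 'winner']):.3f}")   # ~0.994
print(f"{naive_bayes_posterior(['meeting']):.3f}")             # ~0.018
```

Each extra word acts as another Bayes update: the posterior after "lottery" becomes the prior for "winner".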

Information Theory — Entropy and Information Gain

Entropy measures uncertainty / disorder in a dataset. It's the math behind Decision Trees splitting decisions.

Entropy H(S) = -Σ pᵢ × log₂(pᵢ)

Where pᵢ = proportion of class i in the dataset S.

Examples:
  Pure node:  100% cats     → H = -(1.0 × log₂(1.0)) = 0.0   (no uncertainty!)
  Coin flip:  50% cats, 50% dogs → H = -(0.5 × log₂(0.5) + 0.5 × log₂(0.5)) = 1.0
  3 classes equal: H = -(3 × 0.333 × log₂(0.333)) = 1.585

Entropy = 0 means a perfect, pure leaf node in a Decision Tree. Higher entropy = more impurity = worse split.

Information Gain = how much entropy a split reduces:

Information Gain = H(parent) - Σ (|child| / |parent|) × H(child)

Decision trees pick the feature and threshold that maximizes Information Gain.
→ This is the mathematical reason trees split the way they do.

Python

import math

def entropy(labels: list) -> float:
    """Calculate Shannon entropy (in bits) of a list of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Sum -p * log2(p) per class (summing the negated terms keeps pure nodes at 0.0, not -0.0)
    return sum(-(count / total) * math.log2(count / total) for count in counts.values())

def information_gain(parent: list, left_child: list, right_child: list) -> float:
    """How much entropy is reduced by this split?"""
    n = len(parent)
    weighted_child_entropy = (len(left_child)/n) * entropy(left_child) + \
                              (len(right_child)/n) * entropy(right_child)
    return entropy(parent) - weighted_child_entropy

# Example: split customers (churns=1, stays=0)
parent = [1, 1, 0, 0, 0, 0, 1, 0]            # mixed: 3 churn, 5 stay
left   = [1, 1, 1]                             # all churn (pure)
right  = [0, 0, 0, 0, 0]                       # all stay (pure)

print(f"Parent entropy:     {entropy(parent):.4f}")       # 0.9544
print(f"Left entropy:       {entropy(left):.4f}")         # 0.0000
print(f"Right entropy:      {entropy(right):.4f}")        # 0.0000
print(f"Information Gain:   {information_gain(parent, left, right):.4f}")  # 0.9544 (a perfect split)

Key Probability Distributions

Every ML model works with specific probability distributions. Know these:

Distribution	Shape	Parameters	Used for
Bernoulli	0 or 1	p (prob of 1)	Single coin flip, binary classification
Binomial	Count of successes	n, p	Number of spam emails in n messages
Normal / Gaussian	Bell curve	μ (mean), σ (std)	Measurement errors, natural phenomena, weight init
Uniform	Flat	a, b	Random sampling, weight initialization baseline
Exponential	Decaying tail	λ (rate)	Time between events (server requests, clicks)
Categorical	K discrete outcomes	p₁...pₖ	Next-token prediction, class labels

Python

import numpy as np

# --- Normal Distribution ---
mu, sigma = 0, 1
samples = np.random.normal(mu, sigma, 10000)
print(f"Mean: {samples.mean():.3f}, Std: {samples.std():.3f}")  # ~0, ~1

# --- Binomial Distribution ---
# 100 emails, each with P(spam)=0.1: how many are spam?
spam_counts = np.random.binomial(n=100, p=0.1, size=5000)
print(f"Expected spam: {spam_counts.mean():.1f} (should be ≈10)")

# --- Why are weights initialized from a Normal? ---
# He initialization for ReLU networks: mean=0, std=sqrt(2/n_inputs)
n_inputs = 512
weights = np.random.normal(0, np.sqrt(2/n_inputs), (512, 256))
print(f"Weight std: {weights.std():.4f}")  # ~0.0625
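The Categorical row in the table is the distribution an LLM samples from at every step. Here is a minimal sketch using only the standard library; the token list and probabilities are made up, with "other" standing in for the rest of the vocabulary.

```python
import random
from collections import Counter

random.seed(0)

tokens = ["mat", "floor", "other"]     # hypothetical 3-outcome vocabulary
probs = [0.41, 0.18, 0.41]             # must be non-negative and sum to 1

# Draw 10,000 "next tokens" from the categorical distribution
draws = random.choices(tokens, weights=probs, k=10_000)
frequencies = {tok: n / len(draws) for tok, n in Counter(draws).items()}

for tok, p in zip(tokens, probs):
    print(f"{tok:>5}: target {p:.2f}, sampled {frequencies[tok]:.2f}")
```

The sampled frequencies land close to the target probabilities, and any single draw can still be the less likely token. That is why the same prompt can produce different completions.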

Central Limit Theorem (CLT) — Why Statistics Works

The Central Limit Theorem is the reason you can trust sample means, run A/B tests, and use the Normal distribution even when underlying data isn't normal.

CLT states:
  The distribution of SAMPLE MEANS approaches a Normal distribution
  as the sample size increases, whatever the shape of the original
  distribution (as long as it has finite variance and the draws are independent).

  If you take many samples of size n from such a distribution:
    Mean of samples ≈ N(μ, σ²/n)    (Normal with same mean, smaller variance)

Why it matters:
  - You can compute confidence intervals on any metric
  - You can run hypothesis tests even on non-normal data
  - Error bars on ML evaluation metrics are valid

Python

import numpy as np

# Original distribution: wildly skewed exponential (NOT normal)
population = np.random.exponential(scale=2.0, size=100_000)
print(f"Population mean: {population.mean():.3f}, Skew: very right-skewed")

# Take 10,000 samples of size 50 — compute each sample mean
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=50, replace=False)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)
print(f"Sample means mean: {sample_means.mean():.3f}")  # ≈ 2.0 (population mean)
print(f"Sample means std:  {sample_means.std():.4f}")   # ≈ σ/√50
# → The distribution of sample means IS approximately Normal — that's CLT!
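Accuracy is itself a sample mean (of 1s and 0s), so the CLT gives it an error bar almost for free. This is a sketch with simulated per-example results; the 0.85 "true accuracy" and the test-set size of 2,000 are invented, and 1.96 is the usual Normal multiplier for a 95% interval.

```python
import math
import random

random.seed(0)

# Hypothetical per-example results: 1 = correct, 0 = wrong
results = [1 if random.random() < 0.85 else 0 for _ in range(2000)]

n = len(results)
accuracy = sum(results) / n

# Standard error of the mean: sample std / sqrt(n)
variance = sum((r - accuracy) ** 2 for r in results) / (n - 1)
std_error = math.sqrt(variance / n)

low = accuracy - 1.96 * std_error
high = accuracy + 1.96 * std_error
print(f"accuracy = {accuracy:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With 2,000 test examples the interval is roughly ±1.6 points wide, so a 0.5-point difference between two models evaluated this way sits well inside the noise.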

Hypothesis Testing — How ML Experiments Are Validated

Every A/B test, model comparison, and statistical claim uses hypothesis testing. Interviewers ask about this for ML Engineer and Data Scientist roles.

The Process:
  1. Null Hypothesis (H₀): "The new model is no better than the old model"
  2. Alternative (H₁): "The new model IS better"
  3. Collect data: run experiment, collect results
  4. Compute p-value: probability of observing a result at least this extreme IF H₀ is true
  5. Decision:
       p-value < 0.05 → reject H₀ → statistically significant improvement
       p-value ≥ 0.05 → fail to reject H₀ → not enough evidence

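p-value intuition:
  p = 0.03 means: "If the model were actually no better,
  there would be only a 3% chance of seeing a difference this big by luck."
  → Low p → the improvement is unlikely to be noise. It does NOT mean a 97% chance
    the model is better, and it says nothing about whether the gain is big enough to matter.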

Type I and Type II Errors (critical for interviews):

Type I Error (False Positive):  Conclude model is better when it isn't
                                  → Significance level α = 0.05 = allowed Type I rate

Type II Error (False Negative): Conclude model is no better when it actually is
                                  → Power = 1 - β = probability of detecting a real effect

The trade-off:
  Tighter threshold (α=0.01) → fewer Type I errors, more Type II errors
  Loose threshold (α=0.10) → more Type I errors, fewer Type II errors

Python

from scipy import stats
import numpy as np

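# --- T-test: comparing two models' accuracy scores across 5 runs ---
model_a_scores = [0.842, 0.851, 0.838, 0.845, 0.849]
model_b_scores = [0.861, 0.874, 0.869, 0.878, 0.865]

# Welch's t-test (equal_var=False) does not assume both models have the same variance
t_stat, p_value = stats.ttest_ind(model_a_scores, model_b_scores, equal_var=False)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# The test is two-sided, so check the direction before declaring a winner
b_is_higher = np.mean(model_b_scores) > np.mean(model_a_scores)
if p_value < 0.05 and b_is_higher:
    print("✅ Model B is significantly better (p < 0.05)")
elif p_value < 0.05:
    print("⚠️ Model A is significantly better (p < 0.05)")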
else:
    print("❌ No statistically significant difference — don't ship on hype")
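If the p-value still feels abstract, a permutation test spells out what "by luck" means: shuffle the model labels every possible way and count how often the gap is at least as big as the one observed. This is a different technique from the t-test above, sketched here with the same ten scores and only the standard library.

```python
from itertools import combinations

model_a_scores = [0.842, 0.851, 0.838, 0.845, 0.849]
model_b_scores = [0.861, 0.874, 0.869, 0.878, 0.865]

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of means."""
    pooled = a + b
    total = sum(pooled)
    observed = abs(mean(b) - mean(a))
    extreme = 0
    splits = 0
    # Try every way of relabelling len(a) of the pooled scores as "model A"
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs((total - sum_a) / len(b) - sum_a / len(a))
        if diff >= observed - 1e-12:      # tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits

p = permutation_p_value(model_a_scores, model_b_scores)
print(f"p-value: {p:.4f}")    # 0.0079, i.e. 2 of the 252 possible splits
```

Every B score beats every A score, so only the real labelling and its mirror image produce a gap this large. That is about as strong as the evidence can get with five runs per model.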

Interview answer template: When someone says "Model B is 0.5% better," ask:

Is that statistically significant (what's the p-value)?
Was the test set held out and only evaluated once?
Across how many seeds or time periods?

Eigenvalues, SVD, and PCA — The Math Behind Dimensionality Reduction

Eigenvalues and Singular Value Decomposition (SVD) are the math behind PCA, recommendation systems, and embedding compression. The intuition before the math:

Eigenvector intuition:
  Most data "stretches" in some preferred directions.
  For a dataset of heights and weights, the main stretch direction is
  "both height and weight increase together" (they're correlated).

  An eigenvector of the data's covariance matrix is that "preferred stretch direction."
  Its eigenvalue measures HOW MUCH the data varies along that direction.

PCA (Principal Component Analysis):
  Step 1: Center the data (subtract mean)
  Step 2: Compute the covariance matrix
  Step 3: Find eigenvectors (the principal components)
  Step 4: Sort by eigenvalue — larger eigenvalue = more variance explained
  Step 5: Project data onto the top k eigenvectors → compressed representation

Python

import numpy as np

# --- PCA from scratch ---
def pca_from_scratch(X: np.ndarray, n_components: int) -> tuple:
    """
    Principal Component Analysis without sklearn.
    X: (n_samples, n_features)
    Returns: (transformed data, explained_variance_ratio)
    """
    # 1. Center the data
    X_centered = X - X.mean(axis=0)
    
    # 2. Compute covariance matrix
    cov_matrix = np.cov(X_centered, rowvar=False)
    
    # 3. Eigendecomposition — find principal directions
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    
    # 4. Sort by eigenvalue (descending — largest variance first)
    sorted_idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[sorted_idx]
    eigenvectors = eigenvectors[:, sorted_idx]
    
    # 5. Select top k components
    components = eigenvectors[:, :n_components]
    
    # 6. Project data
    X_reduced = X_centered @ components
    
    # 7. Explained variance ratio
    explained_variance_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    
    return X_reduced, explained_variance_ratio

# Create sample data: 200 samples with 4 features, two of them correlated (like height and weight)
np.random.seed(42)
X = np.random.randn(200, 4)  # 200 samples, 4 features
X[:, 1] = X[:, 0] * 0.8 + np.random.randn(200) * 0.2  # make features 0 and 1 correlated

X_reduced, evr = pca_from_scratch(X, n_components=2)
print(f"Original shape: {X.shape}")          # (200, 4)
print(f"Reduced shape:  {X_reduced.shape}")  # (200, 2)
print(f"Variance explained by PC1: {evr[0]:.1%}")
print(f"Variance explained by PC2: {evr[1]:.1%}")
print(f"Total variance explained:  {evr.sum():.1%}")

# Verify using sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_sk = pca.fit_transform(X)
print(f"\nsklearn PCA variance ratios: {pca.explained_variance_ratio_}")
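The heading promises SVD, and the link to PCA is short: the singular values of the centered data carry the same information as the eigenvalues of its covariance matrix. This sketch rebuilds similar data with its own random generator, so the numbers will differ from the run above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)   # correlate features 0 and 1

X_centered = X - X.mean(axis=0)

# Route 1: eigenvalues of the covariance matrix (sorted largest first)
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]

# Route 2: singular values of the centered data itself
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
variance_from_svd = S**2 / (X.shape[0] - 1)

print(np.allclose(eigenvalues, variance_from_svd))   # True

# The rows of Vt are the principal directions; project onto the top 2
X_reduced = X_centered @ Vt[:2].T
print(X_reduced.shape)                                # (200, 2)
```

The SVD route never forms the covariance matrix, which tends to be the more numerically stable choice when there are many features.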

Where SVD/PCA appears in interviews:

"How would you compress a 512-dimensional embedding?" → PCA
"How does collaborative filtering work?" → SVD on user-item matrix
"How do you visualize high-dimensional data?" → PCA to 2D for a linear view; t-SNE/UMAP (often after a PCA pre-reduction) for nonlinear structure
"What's the curse of dimensionality?" → distances lose meaning as dimensions grow; PCA is one mitigation

Practice

Dot products by hand: compute cosine similarity between [1,2,0], [2,4,0] and [0,0,3] — confirm the parallel pair scores 1.0 and the perpendicular pair 0. Then do it in NumPy and feel the vector-DB query you just ran.
Gradient descent parameter test: Run the Python code above with learning rates 0.0001, 0.01, and 1.0. Verify which one underfits, which converges, and which diverges.
Bayes spam filter: Implement a two-word Naive Bayes spam classifier. If P(spam)=0.2, P("free"|spam)=0.9, P("free"|ham)=0.01 — what's P(spam|"free")? Then chain a second word.
Entropy lab: Implement the entropy() function above. Calculate the entropy of a perfectly balanced 3-class dataset vs. one with 80%/10%/10% split. Why does the balanced set have higher entropy?
PCA lab: Run the PCA from-scratch implementation above. Try n_components=3. How much cumulative variance do 3 components explain? Compare with sklearn's output.
Hypothesis testing: Run the t-test code on model scores. Change the scores so they're very close together — observe when the p-value exceeds 0.05. That's when you "can't claim improvement."

Next: Classical ML — the algorithms these four ideas power, and why they still beat deep learning on most tables.