Math Foundations for ML

The minimal math that makes ML stop being magic: vectors as data, dot products as similarity, gradients as direction, and probability as honesty.

Tags: math, linear-algebra, probability, calculus, ml

How much math do you actually need?

Honest calibration first: to use ML (call models, build RAG, fine-tune), you need intuitions, not proofs. To build models or read papers, you need more. This page is the intuition layer — four ideas, each explained until it's obvious, each mapped to where you'll meet it. Nothing here requires more than Level 1 programming and high-school algebra.

The map:

  Linear algebra  →  how data and models are REPRESENTED (vectors, matrices)
  Calculus        →  how models LEARN (gradients = which way to nudge)
  Probability     →  how models express UNCERTAINTY (distributions, likelihood)
  Statistics      →  how you know it WORKED (sampling, significance, baselines)

Linear algebra: data is vectors, models are matrices

Vectors — a point is a list of numbers

A vector is just an ordered list of numbers, and the leap that unlocks all of ML is that anything can be one. A house is [area, bedrooms, age] = [1200, 3, 15]. A word, after embedding, might be 1,536 numbers (the exact count depends on the embedding model). An image is pixel values. Once data is vectors, "similar things" becomes "nearby points," and geometry becomes your query language.
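To make "nearby points" concrete, here is a minimal sketch that treats houses as vectors and finds the closest one by straight-line (Euclidean) distance. Only the first house comes from the text; the other two, and the names `houses`, `distance`, and `nearest`, are made up for illustration.

```python
import math

# Houses as [area, bedrooms, age]. "a" is the example from the text;
# "b" and "c" are hypothetical neighbors invented for this sketch.
houses = {
    "a": [1200, 3, 15],
    "b": [1250, 3, 12],
    "c": [3000, 5, 2],
}

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def nearest(name):
    """Name of the house closest to `name`, excluding itself."""
    others = (k for k in houses if k != name)
    return min(others, key=lambda k: distance(houses[name], houses[k]))

print(nearest("a"))  # b
```

Note that area dominates the distance here because its numbers are the largest; real pipelines usually rescale features first so one column does not drown out the rest.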

The dot product — the most important operation in ML

Multiply two vectors element-by-element and sum:

a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ

[2, 1, 0] · [3, 1, 4] = 6 + 1 + 0 = 7

Two readings, both essential:

  1. Similarity. When two vectors point the same way, their dot product is large; perpendicular → zero; opposite → negative. Normalize by length and you get cosine similarity, the number your vector database computes a billion times a day. The "attention score" inside transformers is the same idea with a different scaling: a dot product divided by the square root of the vector dimension rather than by the vector lengths. Semantic search is dot products.
  2. A weighted sum. features · weights = each feature times how much it matters, summed — which is literally what a linear model predicts and what each neuron in a neural network computes before its activation. A neuron is a dot product with an opinion.
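Both readings fit in a few lines of plain Python. This is a sketch, not a library API: `dot` and `cosine_similarity` are helper names chosen here, and the `weights` vector is a hypothetical set of "how much each feature matters" values.

```python
import math

def dot(a, b):
    """Multiply element-by-element and sum."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both vector lengths; ranges from -1 to 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Reading 0: the worked example from the text.
print(dot([2, 1, 0], [3, 1, 4]))  # 7

# Reading 1: similarity. Same direction, perpendicular, opposite.
same = cosine_similarity([1, 0], [3, 0])
perpendicular = cosine_similarity([1, 0], [0, 2])
opposite = cosine_similarity([1, 0], [-2, 0])

# Reading 2: a weighted sum, i.e. what a linear model predicts.
features = [1200, 3, 15]        # the house from earlier
weights = [0.1, 5.0, -0.5]      # hypothetical weights, for illustration only
prediction = dot(features, weights)
```

The length of a vector is itself a dot product (the square root of a vector dotted with itself), which is why `cosine_similarity` needs nothing beyond `dot`.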

Matrices — many dot products at once

A matrix is a grid of numbers; matrix-times-vector is "take the dot product of the vector with every row." So one matrix multiply runs every neuron in a layer simultaneously — and that's the whole secret of why GPUs run AI: graphics cards are matrix-multiply machines, and a neural network is a stack of matrix multiplies. "Model weights" — the gigabytes in a model file — are these matrices' entries; a "70B-parameter model" has 70 billion such numbers (fine-tuning nudges them; LoRA adds small matrices beside them).
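"Dot product with every row" can be sketched directly. The layer below is hypothetical (two neurons, three inputs each) and skips the activation function; `matvec` is a name chosen for this example. Real code would hand this to NumPy or a GPU rather than loop in Python.

```python
def dot(a, b):
    """Multiply element-by-element and sum."""
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """One dot product per row: every neuron in the layer, before activation."""
    return [dot(row, vector) for row in matrix]

# Hypothetical layer: each row holds one neuron's weights.
W = [
    [1, 0, 2],
    [0, 3, 1],
]
x = [2, 1, 0]

print(matvec(W, x))  # [2, 3]

# "Parameters" are just the entries of the weight matrices.
n_params = sum(len(row) for row in W)
print(n_params)  # 6
```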

Calculus: the gradient is a compass

Forget integrals; ML needs one idea. A model's loss is a single number measuring how wrong it currently is (the error in training). Training is: change the weights to make the loss smaller. With 70 billion weights, which ones, and which way?

The derivative answers it for one weight: "if I nudge this weight up a hair, does the loss go up or down, and how steeply?" The gradient is just that answer for every weight at once: a vector of slopes, pointing in the direction of steepest increase. To make the loss smaller you step the opposite way, hence the minus sign. So:

GRADIENT DESCENT (all of deep learning's training, one line):

    weights ← weights − learning_rate × gradient

"Take a small step DOWNHILL, repeat a few million times."

  - learning_rate = step size: too big → you leap over valleys and
    diverge; too small → training takes geologic time
    (the knob you met in fine-tuning)
  - backpropagation = the bookkeeping algorithm that computes the
    gradient efficiently through stacked layers (the chain rule, applied
    relentlessly; the "how" of neural-network training)

That hiking-downhill-in-fog picture — you can't see the valley, only feel the slope under your feet — is genuinely the mental model researchers use. Loss curves flattening = the terrain leveled off; "stuck in a local minimum" = a valley that isn't the deepest one.
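Here is the update rule on the smallest possible landscape: one weight and a toy loss, (w − 3)², whose valley floor sits at w = 3. The loss, the starting point, and the learning rates are all assumptions chosen for illustration; `numeric_gradient` is the "nudge it a hair" idea written out literally.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an assumption for this sketch)."""
    return (w - 3.0) ** 2

def gradient(w):
    """Exact slope of the toy loss at w."""
    return 2.0 * (w - 3.0)

def numeric_gradient(f, w, eps=1e-6):
    """Nudge w a hair in each direction and measure the slope."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def descend(w, learning_rate, steps):
    """Repeat: weights <- weights - learning_rate * gradient."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

good = descend(0.0, learning_rate=0.1, steps=100)         # settles near 3
too_small = descend(0.0, learning_rate=0.0001, steps=100)  # has barely moved
too_big = descend(0.0, learning_rate=1.1, steps=50)        # overshoots further each step
print(good, too_small, too_big)
```

On this particular loss each step multiplies the distance to the minimum by (1 − 2 × learning_rate), so any rate above 1.0 makes that factor larger than 1 in size and the steps grow instead of shrink. Real loss surfaces are not this tidy, but the three behaviors look the same.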

Probability: models that admit uncertainty

A classifier doesn't say "spam"; it says P(spam) = 0.93, one entry in a probability distribution over outcomes (numbers ≥ 0 summing to 1). An LLM's entire output is one: given the text so far, a distribution over every possible next token (after "the cat sat on the": mat 0.41, floor 0.18, …). LLMs are next-token distributions, sampled. Three working ideas:

  • Conditional probability — P(A given B): the "given" is what changes everything. P(spam) might be 0.1; P(spam given the word "lottery") might be 0.9. Models are conditional-probability machines: P(label | features), P(next token | context).
  • Expectation — the probability-weighted average: what you'd get on average over many draws. Loss functions are expectations over the training data; "expected revenue per user" is the same math in a business meeting.
  • Likelihood & training — "maximize likelihood" = choose weights under which the actual observed data would have been most probable. Cross-entropy loss — the loss of essentially every classifier and LLM — is this, in logarithms. When you read "training minimizes cross-entropy," hear: make the truth look probable.
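The three ideas above, sketched with numbers small enough to check by hand. The email counts are hypothetical, picked so they reproduce the 0.1 and 0.9 figures from the first bullet; `expectation` and `cross_entropy` are helper names for this example, and the cross-entropy here is the simple case where each example has one true label.

```python
import math

# Hypothetical counts of 100 emails, keyed by (contains "lottery", is spam).
counts = {
    (True, True): 9,
    (True, False): 1,
    (False, True): 1,
    (False, False): 89,
}
total = sum(counts.values())

# Conditional probability: restrict to the "given" rows, then count.
p_spam = sum(n for (_, spam), n in counts.items() if spam) / total
lottery_total = counts[(True, True)] + counts[(True, False)]
p_spam_given_lottery = counts[(True, True)] / lottery_total
print(p_spam, p_spam_given_lottery)  # 0.1 0.9

def expectation(values, probs):
    """Probability-weighted average."""
    return sum(v * p for v, p in zip(values, probs))

def cross_entropy(probs_of_truth):
    """Average negative log of the probability the model gave the true answer."""
    return -sum(math.log(p) for p in probs_of_truth) / len(probs_of_truth)

# A model that makes the truth look probable gets the lower loss.
confident_and_right = cross_entropy([0.9, 0.8, 0.95])
hedging = cross_entropy([0.6, 0.5, 0.55])
```

Minimizing `cross_entropy` and maximizing likelihood are the same goal: the log turns a product of probabilities into a sum, and the minus sign turns "maximize" into "minimize".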

And the one distribution to know by name: the normal (Gaussian) — the bell curve that measurement noise and many natural quantities follow, the default assumption behind "mean ± standard deviation."
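A quick way to see what "mean ± standard deviation" summarizes: draw noisy measurements and count how many land within one standard deviation of the mean. The true value (50) and noise level (2) are assumptions for this sketch, and the seed is fixed only so reruns match.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical measurement: true value 50, Gaussian noise with standard deviation 2.
samples = [rng.gauss(50.0, 2.0) for _ in range(10_000)]

mean = statistics.mean(samples)
std = statistics.stdev(samples)

# For a normal distribution, roughly 68% of draws fall within one std of the mean.
within_one_std = sum(abs(x - mean) <= std for x in samples) / len(samples)

print(f"{mean:.2f} ± {std:.2f}, within one std: {within_one_std:.1%}")
```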

Statistics: knowing whether it worked

The unglamorous quarter that saves real money:

  • Sampling bias — your model is only as general as the data was representative (the fine-tuning distribution-mismatch bug). Train on daytime photos, fail at night.
  • Train/test discipline — accuracy on data the model has seen is memory, not learning; held-out evaluation is the entire epistemology of ML (classical ML makes this concrete).
  • Baselines — "94% accurate" means nothing until you know the dumbest model's score (predict the majority class: 91%?). Always ask "compared to what?"
  • Variance & significance — rerun training with a different random seed and accuracy moves a point; a 0.5% "improvement" might be noise. A/B tests (the product version) exist because per-user variance swamps small effects.
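Two of these checks take only a few lines. The labels, the 0.94 model score, and the per-seed accuracies below are all hypothetical numbers for illustration, and `looks_like_noise` is a crude rule of thumb invented for this sketch, not a proper significance test.

```python
import statistics
from collections import Counter

def majority_baseline(labels):
    """Accuracy of the dumbest model: always predict the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical test set: 91 "ham", 9 "spam".
labels = ["ham"] * 91 + ["spam"] * 9
baseline = majority_baseline(labels)
model_accuracy = 0.94  # hypothetical reported score
print(f"baseline {baseline:.2f}, model {model_accuracy:.2f}")

# Hypothetical accuracies from retraining the same model with five seeds.
seed_accuracies = [0.938, 0.944, 0.941, 0.935, 0.947]
spread = statistics.stdev(seed_accuracies)

def looks_like_noise(improvement, spread):
    """Rough heuristic: distrust gains smaller than about two seed-to-seed stds."""
    return improvement < 2 * spread

print(looks_like_noise(0.005, spread))
```

With these numbers the seed-to-seed spread is close to half a point, so a 0.5% "improvement" from a single run is indistinguishable from rerunning the old model.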

Practice

  1. Dot products by hand: compute cosine similarity for each pair among [1,2,0], [2,4,0] and [0,0,3], and confirm the parallel pair scores 1.0 and both perpendicular pairs score 0. Then do it in NumPy and feel the vector-DB query you just ran.
  2. Gradient descent in 15 lines: fit y = wx + b to ten points with the update rule above, no libraries. Print the loss each step; try learning rates 0.0001, 0.01, and 1.0, and watch for three behaviors: too small crawls, a good one converges, too big diverges.
  3. Distribution feel: simulate 10,000 coin-flip batches of 100 in Python; histogram the heads-counts and meet the bell curve. Then compute how often a fair coin gives ≥60 heads — your first significance intuition.
  4. Baseline audit: for any dataset you've touched, compute the majority-class baseline before any model. Make it a permanent habit.

Next: Classical ML — the algorithms these four ideas power, and why they still beat deep learning on most tables.