How much math do you actually need?
Honest calibration first: to use ML (call models, build RAG, fine-tune), you need intuitions, not proofs. To build models or read papers, you need more. This page is the intuition layer — four ideas, each explained until it's obvious, each mapped to where you'll meet it. Nothing here requires more than Level 1 programming and high-school algebra.
The map:
Linear algebra → how data and models are REPRESENTED (vectors, matrices)
Calculus → how models LEARN (gradients = which way to nudge)
Probability → how models express UNCERTAINTY (distributions, likelihood)
Statistics → how you know it WORKED (sampling, significance, baselines)
Linear algebra: data is vectors, models are matrices
Vectors — a point is a list of numbers
A vector is just an ordered list of numbers — and the leap that
unlocks all of ML: anything can be one. A house is
[area, bedrooms, age] = [1200, 3, 15]. A word, after
embedding, is 1,536 numbers. An image is pixel values.
Once data is vectors, "similar things" becomes "nearby points," and
geometry becomes your query language.
The dot product — the most important operation in ML
Multiply two vectors element-by-element and sum:
a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ
[2, 1, 0] · [3, 1, 4] = 6 + 1 + 0 = 7
Two readings, both essential:
- Similarity. When two vectors point the same way, their dot product is large; perpendicular → zero; opposite → negative. Normalize by length and you get cosine similarity, the number your vector database computes a billion times a day. The "attention scores" inside transformers are dot products too (scaled rather than length-normalized). Semantic search is dot products.
- A weighted sum. features · weights = each feature times how much it matters, summed. That is literally what a linear model predicts (plus a constant bias term) and what each neuron in a neural network computes before its activation. A neuron is a dot product with an opinion.
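Both readings fit in a few lines of plain Python. This is a minimal sketch rather than how a vector database implements it; the function names and the house weights are made up for illustration.

```python
import math

def dot(a, b):
    """Multiply element by element, then sum."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both lengths, so only direction matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# The worked example from the text.
print(dot([2, 1, 0], [3, 1, 4]))               # 7

# Reading 1: similarity.
print(cosine_similarity([3, 4], [6, 8]))       # 1.0  (same direction)
print(cosine_similarity([3, 4], [-4, 3]))      # 0.0  (perpendicular)
print(cosine_similarity([3, 4], [-3, -4]))     # -1.0 (opposite)

# Reading 2: a weighted sum. The weights are hypothetical, not fitted to anything.
house = [1200, 3, 15]            # [area, bedrooms, age]
weights = [100, 5000, -300]      # assumed "how much each feature matters"
print(dot(house, weights))       # 130500
```

Note that the cosine score ignores length: [3, 4] and [6, 8] score 1.0 even though one is twice as long as the other.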
Matrices — many dot products at once
A matrix is a grid of numbers; matrix-times-vector is "take the dot product of the vector with every row." So one matrix multiply runs every neuron in a layer simultaneously — and that's the whole secret of why GPUs run AI: graphics cards are matrix-multiply machines, and a neural network is a stack of matrix multiplies. "Model weights" — the gigabytes in a model file — are these matrices' entries; a "70B-parameter model" has 70 billion such numbers (fine-tuning nudges them; LoRA adds small matrices beside them).
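To make "many dot products at once" concrete, here is a small sketch of one layer in plain Python. The matrix W, the input x, and the choice of ReLU as the activation are all assumptions for illustration; real frameworks do the same arithmetic on optimized hardware.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """One dot product per row: each row holds one neuron's weights."""
    return [dot(row, vector) for row in matrix]

def relu(values):
    """An assumed activation: negative values become zero."""
    return [max(0, v) for v in values]

# Three neurons, each looking at the same 3-number input.
W = [
    [1, 0, -1],   # neuron 1's weights
    [2, 1, 0],    # neuron 2's weights
    [0, -3, 1],   # neuron 3's weights
]
x = [2, 1, 0]

pre_activation = matvec(W, x)
print(pre_activation)          # [2, 5, -3]
print(relu(pre_activation))    # [2, 5, 0]
```

The nine numbers in W are this toy layer's "model weights"; a real model file stores billions of them.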
Calculus: the gradient is a compass
Forget integrals; ML needs one idea. A model's loss is a single number measuring how wrong it currently is (the error in training). Training is: change the weights to make the loss smaller. With 70 billion weights, which ones, and which way?
The derivative answers it for one weight: "if I nudge this weight up a hair, does the loss go up or down, and how steeply?" The gradient is just that answer for every weight at once — a vector of slopes, pointing in the direction of steepest increase. So:
GRADIENT DESCENT (all of deep learning's training, one line):
weights ← weights − learning_rate × gradient
"Take a small step DOWNHILL, repeat a few million times."
- learning_rate = step size: too big → you leap over valleys and diverge; too small → training takes geologic time (the knob you met in fine-tuning)
- backpropagation = the bookkeeping algorithm that computes the gradient efficiently through stacked layers (the chain rule, applied relentlessly: the "how" of neural-network training)
That hiking-downhill-in-fog picture — you can't see the valley, only feel the slope under your feet — is genuinely the mental model researchers use. Loss curves flattening = the terrain leveled off; "stuck in a local minimum" = a valley that isn't the deepest one.
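The update rule is easiest to see on a toy problem with a single weight. The sketch below assumes a made-up loss, (w − 3)², whose lowest point is at w = 3 and whose slope is 2(w − 3). The three learning rates are picked to show the three behaviors and are not recommendations.

```python
def loss(w):
    """Toy loss: zero at w = 3, growing as w moves away."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the toy loss at w (the derivative of (w - 3)^2)."""
    return 2.0 * (w - 3.0)

def descend(learning_rate, steps=50, w=0.0):
    """Apply: weights <- weights - learning_rate * gradient, repeatedly."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

for lr in (0.001, 0.1, 1.1):
    w = descend(lr)
    print(f"learning_rate={lr}: w={w:.4f}  loss={loss(w):.4g}")
```

After 50 steps, the smallest rate has barely left the starting point, the middle one sits very close to w = 3, and the largest overshoots further on every step until the loss ends up far larger than where it started.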
Probability: models that admit uncertainty
A classifier doesn't say "spam"; it says P(spam) = 0.93 (and so P(not spam) = 0.07), a probability distribution over outcomes (numbers ≥ 0 summing to 1). An LLM's entire output is one: given the text so far, a distribution over every possible next token (after "the cat sat on the": mat 0.41, floor 0.18, …). LLMs are next-token distributions, sampled. Three working ideas:
- Conditional probability — P(A given B): the "given" is what changes everything. P(spam) might be 0.1; P(spam given the word "lottery") might be 0.9. Models are conditional-probability machines: P(label | features), P(next token | context).
- Expectation — the probability-weighted average: what you'd get on average over many draws. Loss functions are expectations over the training data; "expected revenue per user" is the same math in a business meeting.
- Likelihood & training — "maximize likelihood" = choose weights under which the actual observed data would have been most probable. Cross-entropy loss — the loss of essentially every classifier and LLM — is this, in logarithms. When you read "training minimizes cross-entropy," hear: make the truth look probable.
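"Make the truth look probable" can be checked numerically. In the sketch below, each prediction is a small distribution over two labels, and the loss for one example is the negative log of the probability given to the label that actually occurred. All the probabilities are invented for illustration, and averaging over examples is the expectation from the second bullet.

```python
import math

def cross_entropy(predicted, true_label):
    """Negative log of the probability the model gave to what really happened."""
    return -math.log(predicted[true_label])

def mean_loss(predictions, labels):
    """Average the per-example loss: an expectation over the data."""
    total = sum(cross_entropy(p, y) for p, y in zip(predictions, labels))
    return total / len(labels)

labels = ["spam", "ham", "spam"]   # what actually happened

# Three hypothetical models, each giving P(label | features) per example.
mostly_right = [{"spam": 0.9, "ham": 0.1}, {"spam": 0.2, "ham": 0.8}, {"spam": 0.7, "ham": 0.3}]
coin_flip    = [{"spam": 0.5, "ham": 0.5}] * 3
mostly_wrong = [{"spam": 0.1, "ham": 0.9}, {"spam": 0.8, "ham": 0.2}, {"spam": 0.3, "ham": 0.7}]

for name, preds in [("mostly_right", mostly_right),
                    ("coin_flip", coin_flip),
                    ("mostly_wrong", mostly_wrong)]:
    print(f"{name}: {mean_loss(preds, labels):.3f}")
```

The more probability a model puts on the true labels, the lower its loss; the coin-flip model lands at ln 2 ≈ 0.693, and the confidently wrong model is penalized most.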
And the one distribution to know by name: the normal (Gaussian) — the bell curve that measurement noise and many natural quantities follow, the default assumption behind "mean ± standard deviation."
Statistics: knowing whether it worked
The unglamorous quarter that saves real money:
- Sampling bias — your model is only as general as the data was representative (the fine-tuning distribution-mismatch bug). Train on daytime photos, fail at night.
- Train/test discipline — accuracy on data the model has seen is memory, not learning; held-out evaluation is the entire epistemology of ML (classical ML makes this concrete).
- Baselines — "94% accurate" means nothing until you know the dumbest model's score (predict the majority class: 91%?). Always ask "compared to what?"
- Variance & significance — rerun training with a different random seed and accuracy moves a point; a 0.5% "improvement" might be noise. A/B tests (the product version) exist because per-user variance swamps small effects.
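The baseline check takes only a few lines. This sketch assumes labels arrive as a plain Python list; the 91/9 split and the 0.94 model accuracy are the hypothetical figures from the baseline bullet, not measurements.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of the dumbest model: always predict the most common label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical dataset: 91 legitimate messages, 9 spam.
labels = ["ham"] * 91 + ["spam"] * 9

label, baseline_accuracy = majority_baseline(labels)
model_accuracy = 0.94   # assumed score of the model under evaluation

print(f"always predict {label!r}: {baseline_accuracy:.2f}")       # 0.91
print(f"model gain over baseline: {model_accuracy - baseline_accuracy:.2f}")  # 0.03
```

Seen this way, "94% accurate" is a three-point gain over doing nothing, which is the number to weigh against seed-to-seed noise.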
Practice
- Dot products by hand: compute cosine similarity for each pair among [1,2,0], [2,4,0] and [0,0,3]; confirm the parallel pair scores 1.0 and the perpendicular pairs score 0. Then do it in NumPy and feel the vector-DB query you just ran.
- Gradient descent in 15 lines: fit y = wx + b to ten points with the update rule above, no libraries. Print the loss each step; try learning rates 0.0001, 0.01, and 1.0, and watch the behaviors described above: a crawl, steady convergence, and (for most data) divergence.
- Distribution feel: simulate 10,000 coin-flip batches of 100 in Python; histogram the heads-counts and meet the bell curve. Then compute how often a fair coin gives ≥60 heads: your first significance intuition.
- Baseline audit: for any dataset you've touched, compute the majority-class baseline before any model. Make it a permanent habit.
Next: Classical ML — the algorithms these four ideas power, and why they still beat deep learning on most tables.