Classical ML

Regression, trees and clustering — the algorithms that still win on tabular data, plus the evaluation discipline (precision, recall, overfitting) that carries to all of ML.

Tags: machine-learning, regression, classification, clustering

Why learn 'old' ML in the LLM era

Three reasons, all practical:

  • It still wins. On tabular business data (churn, fraud, pricing, credit), gradient-boosted trees beat neural networks routinely, train in seconds, and explain themselves; most production "AI" at most companies is this.
  • It's where the concepts live. Overfitting, train/test splits, and precision/recall, the evaluation discipline that fine-tuning and every LLM eval inherits, are learnable here in five minutes per experiment instead of five hours.
  • It's interview material. ML roles screen on exactly this page.

The three problem shapes:

SUPERVISED (you have labeled examples):
  Regression       → predict a NUMBER        (price, demand, ETA)
  Classification   → predict a CATEGORY      (spam/ham, churn/stay, fraud)

UNSUPERVISED (no labels — find structure):
  Clustering       → discover natural GROUPS (user segments, anomalies)

Linear regression — the hello world that won't die

Predict a number as a weighted sum of features — the dot product with a name:

Python
# The model: price = w₁·area + w₂·bedrooms + w₃·age + b

# scikit-learn (the classical-ML standard library):
from sklearn.linear_model import LinearRegression

# X_train, y_train, X_test come from your own train/test split
model = LinearRegression().fit(X_train, y_train)   # finds the best w's and b
preds = model.predict(X_test)
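The snippet above needs data to run, so here is a minimal end-to-end sketch on synthetic housing rows. The feature ranges, the "true" weights, and the noise level are invented for illustration. Fitting here means least squares, so NumPy's direct solver and LinearRegression should land on the same weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic housing data (illustrative numbers, not real prices).
n = 200
area = rng.uniform(40, 200, n)       # square metres
bedrooms = rng.integers(1, 5, n)     # 1 to 4
age = rng.uniform(0, 30, n)          # years
X = np.column_stack([area, bedrooms, age])

true_w = np.array([3.0, 10.0, -1.5])  # assumed weights used to generate y
true_b = 50.0
y = X @ true_w + true_b + rng.normal(0, 5.0, n)   # noisy target

# Direct solution: least squares on X plus a column of ones for the bias b.
X1 = np.column_stack([X, np.ones(n)])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_direct, b_direct = solution[:3], solution[3]

# scikit-learn solves the same minimisation.
model = LinearRegression().fit(X, y)
mse = np.mean((model.predict(X) - y) ** 2)

print("direct :", w_direct.round(3), round(b_direct, 3))
print("sklearn:", model.coef_.round(3), round(model.intercept_, 3))
print("training MSE:", round(mse, 2))
```

Each entry of model.coef_ reads as "price change per unit of that feature, holding the others fixed", which is the interpretability the lesson keeps praising.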

"Fit" means: choose weights minimizing the mean squared error between predictions and truth — solvable directly or by gradient descent. Don't let the simplicity fool you: linear regression is interpretable (each weight says how much a bedroom is worth), fast at any scale, and the baseline every fancier model must beat to justify itself (baselines, always).

Logistic regression is its classification twin: the weighted sum is squashed through a sigmoid into a probability — P(churn) = σ(w·x + b) — and despite the name and its age, it's still the default first classifier in industry: calibrated probabilities, millisecond inference, coefficients you can show a regulator.
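To see that the "weighted sum squashed through a sigmoid" description is literal, this sketch fits a binary LogisticRegression on a synthetic dataset (make_classification, chosen only for convenience) and recomputes the probabilities by hand from the learned weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem; stands in for churn / no-churn.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Weighted sum, then squash: P(y=1) = sigmoid(w.x + b)
z = X @ clf.coef_[0] + clf.intercept_[0]
p_manual = sigmoid(z)

# The library's own probability for the positive class.
p_sklearn = clf.predict_proba(X)[:, 1]

print("hand-computed matches predict_proba:", np.allclose(p_manual, p_sklearn))
print("weights:", clf.coef_[0].round(3), "bias:", clf.intercept_[0].round(3))
```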

Trees and forests — the tabular champions

A decision tree asks a sequence of threshold questions, learned from data:

                is balance < ₹1,000?
               /                    \
            yes                      no
            /                          \
   logins last week < 2?          [stays: 94%]
       /          \
    yes            no
     |              |
 [churns: 81%]  [stays: 72%]

Trees are readable (that diagram is the model), handle mixed feature types without scaling, and capture non-linear interactions linear models can't ("low balance matters only when logins are rare"). Alone, one tree overfits — grown deep enough it memorizes the training set. The fixes became the industry's workhorses:

  • Random forest — train hundreds of trees on random subsets of rows and features; average their votes. Variance cancels; accuracy jumps. Nearly tuning-free.
  • Gradient boosting (XGBoost/LightGBM) — train trees sequentially, each correcting the previous ensemble's errors (gradient descent, but each "step" is a whole tree). The Kaggle-and-industry king of tabular data — LandAI's price model is exactly this.
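A quick way to watch the overfitting story play out is to compare one unlimited-depth tree with the two ensembles on the same split. This is a sketch on synthetic data; it uses scikit-learn's own GradientBoostingClassifier as a stand-in for XGBoost/LightGBM so it runs without extra installs, and the dataset parameters are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic tabular data with 10% label noise (flip_y) so memorising hurts.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "single deep tree": DecisionTreeClassifier(random_state=0),  # no depth limit
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    scores[name] = (train_acc, test_acc)
    print(f"{name:18s} train {train_acc:.2f}  test {test_acc:.2f}")
```

The number to read is the gap between the train and test columns: the single tree memorizes the training rows, and the ensembles should give up little or nothing on train while doing better on test.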

When does deep learning beat them? When features must be learned from raw perception — pixels, audio, text (neural networks onward). When a human can name the columns, trees usually win. That sentence is the model-selection interview answer.

K-means — clustering in one idea

No labels; find k natural groups: place k centers randomly, then repeat — assign every point to its nearest center (distance in vector space), move each center to its points' average — until stable. Segmenting users by behavior, grouping similar support tickets, compressing colors — k-means is the first tool. Its honest limits: you choose k (the elbow heuristic helps), it assumes roughly-round clusters, and results vary with initialization. The conceptual bridge: clustering in embedding space — k-means over text embeddings — is how you find themes in a million documents today.
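The assign-then-move loop fits in a few lines of NumPy. This is a bare-bones sketch, not a replacement for sklearn.cluster.KMeans: it initializes from k randomly chosen data points (one common choice), leaves an empty cluster's center where it was, and runs on two synthetic blobs invented for the demo:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign step: distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # stable: stop
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = kmeans(points, k=2)
print(centers.round(2))
```

Swapping points for a matrix of text embeddings gives the "themes in a million documents" use, though at that scale a library implementation is the sensible choice.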

Evaluation — the discipline that is the actual lesson

Overfitting and the split

A model can score 100% by memorizing training data and still fail on new data — overfitting, the central failure mode of all ML (fine-tuning's enemy, met here first). The defense is procedural, not clever:

data → TRAIN (fit the model) / VALIDATION (tune choices) / TEST (touch ONCE)

The test set answers one question, one time: "will this generalize?"
Tune against it repeatedly and it silently becomes a second training
set — the most common self-deception in applied ML.
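In scikit-learn the three-way split is usually two calls to train_test_split: carve off the test set first, then split what remains. The 60/20/20 proportions and the stratify option (which keeps the class ratio the same in every part) are choices made for this sketch, and the data is a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, 10% positives.
X = np.arange(1000).reshape(-1, 1)
y = (np.arange(1000) % 10 == 0).astype(int)

# Step 1: set the test rows aside and do not look at them again.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 2: split the remainder into train (fit) and validation (tune).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```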

Watch the gap: training accuracy 99%, validation 78% → overfitting (simplify the model, get more data, regularize — regularization penalizes large weights, mathematically preferring simpler explanations). Both low → underfitting (model too simple for the pattern).
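The same gap-watching works for regression, with R² in place of accuracy. This sketch sets up a deliberately awkward problem (50 features, only 60 training rows, noisy target; all parameters arbitrary) and compares an unpenalized linear model with Ridge, which adds a penalty on large weights. The alpha value is a guess for illustration, not a tuned setting:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Few rows, many features, lots of noise: an invitation to overfit.
X, y = make_regression(n_samples=120, n_features=50, n_informative=5,
                       noise=50.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
for name, model in [("no penalty", LinearRegression()),
                    ("ridge, alpha=10", Ridge(alpha=10.0))]:
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)   # R² on rows the model saw
    val_r2 = model.score(X_val, y_val)         # R² on held-out rows
    results[name] = (train_r2, val_r2)
    print(f"{name:16s} train R² {train_r2:.2f}   validation R² {val_r2:.2f}")
```

The penalized model can never score higher on the training rows than the unpenalized one; the question the validation column answers is whether that sacrifice bought better generalization.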

Precision and recall (the interview centerpiece)

Accuracy lies on imbalanced data: 99% of transactions are legitimate, so "never flag anything" is 99% accurate and 100% useless. The honest pair:

                       PREDICTED
                    fraud      legit
ACTUAL   fraud       TP         FN      ← recall = TP/(TP+FN)
         legit       FP         TN          "of real frauds, how many caught?"

precision = TP/(TP+FP)  →  "of flagged, how many were real?"
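The two formulas are short enough to compute by hand. The sketch below uses made-up labels (1,000 transactions, 10 frauds) to reproduce the "never flag anything" trap and then scores a hypothetical detector; returning 0.0 when a denominator is zero is a convention picked for this example:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # nothing flagged: report 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1,000 transactions, 10 of them fraud.
y_true = [1] * 10 + [0] * 990

# "Never flag anything": high accuracy, catches nothing.
never_flag = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, never_flag)) / len(y_true)
print(accuracy)                                # 0.99
print(precision_recall(y_true, never_flag))    # (0.0, 0.0)

# Hypothetical detector: catches 8 of 10 frauds, raises 12 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978
print(precision_recall(y_true, detector))      # (0.4, 0.8)
```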

They trade off through the decision threshold: flag at P(fraud) > 0.3 and recall rises while precision falls (more catches, more false alarms); at > 0.9, the reverse. The threshold is a business decision: cancer screening wants recall (missing a case is catastrophic, a follow-up test is cheap); spam filtering wants precision (deleting a real email is worse than a stray spam). F1, the harmonic mean of the two, gives a single number when you must rank models by one; an ROC or PR curve shows the whole menu. "Which matters more here, and what does each error cost?" is the question that makes you sound like you've shipped one.
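A threshold sweep makes the trade-off visible. The scores below are simulated (fraud scores drawn from one Beta distribution, legitimate ones from another, both invented for the demo), standing in for a real model's predicted probabilities; F1 is computed as the harmonic mean of precision and recall:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated model output: 100 frauds that tend to score high,
# 900 legitimate transactions that tend to score low.
y_true = np.concatenate([np.ones(100, dtype=int), np.zeros(900, dtype=int)])
scores = np.concatenate([rng.beta(5, 2, 100), rng.beta(2, 5, 900)])

def metrics_at(threshold):
    """Precision, recall and F1 when flagging every score above threshold."""
    flagged = scores > threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    fn = np.sum(~flagged & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

for t in (0.3, 0.5, 0.7, 0.9):
    p, r, f1 = metrics_at(t)
    print(f"threshold {t:.1f}: precision {p:.2f}  recall {r:.2f}  F1 {f1:.2f}")
```

Picking the operating point then means attaching a cost to each false alarm and each miss and choosing the row with the lowest total.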

Common mistakes

  • Skipping the baseline — majority class for classification, mean for regression, last-value for time series. No baseline, no claim.
  • Data leakage — a feature that encodes the answer ("account_closed_date" predicting churn), or normalizing/tuning using test data. Symptom: results too good; outcome: production collapse. The time-series version — random splits letting the model train on the future — is the classic.
  • Accuracy on imbalanced classes — the 99%-useless trap above.
  • Jumping to deep learning on 5,000 tabular rows — XGBoost in seconds beats an underfed network in hours (pattern fever, ML edition).
  • Ignoring calibration: a model saying "90%" should be right ~90% of those times if you act on the probabilities themselves (pricing, expected-cost decisions), not just the argmax or the rank order.
  • One metric, no slices — fine on average, broken for one region or segment (the observability instinct: percentiles and slices, not means).
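Three of these habits fit in one short script: a majority-class baseline, preprocessing that only ever sees training rows, and a look beyond accuracy on imbalanced classes. This is a sketch on synthetic data with roughly 10% positives; wrapping the scaler in a pipeline is one common way to keep test rows out of the normalization statistics:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem: about 90% class 0, 10% class 1.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The pipeline fits the scaler on the training rows only, then the model.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("baseline accuracy:", round(baseline.score(X_test, y_test), 3))
print("model accuracy:   ", round(model.score(X_test, y_test), 3))
print("baseline recall:  ", recall_score(y_test, baseline.predict(X_test)))
print("model recall:     ", round(recall_score(y_test, model.predict(X_test)), 3))
```

If the model's accuracy barely clears the baseline's, the recall lines are where any real difference shows up.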

Practice

  1. The full loop in 30 lines: take the Titanic dataset (loadable through scikit-learn's fetch_openml) or any public churn dataset; build a baseline, a logistic regression, and a random forest; use a train/val/test split; print a confusion matrix; write one paragraph on which model ships and why.
  2. Overfit on purpose: fit a decision tree with no depth limit; compare train vs test accuracy; then sweep max_depth 1→20 and plot both curves. Training accuracy keeps climbing while test accuracy peaks and then sags (a U-shape if you plot error instead of accuracy); that is the bias-variance trade-off, and you'll never need it explained again.
  3. Threshold lab: for your churn model, sweep the threshold 0.1→0.9; plot precision and recall; mark where you'd operate if a retention call costs ₹100 and a lost customer ₹2,000.
  4. Cluster something real: k-means over your word-frequency vectors from Level 1 or embeddings of 100 news headlines; inspect the groups; change k and watch meaning shift.

Next: Neural Networks — what happens when the features must be learned too.