Classical ML · PrepDeck

Regression, trees and clustering — the algorithms that still win on tabular data, plus the evaluation discipline (precision, recall, overfitting) that carries to all of ML.

Starting from Zero — A Physical Intuition

Before looking at equations, let's understand classical ML through three physical analogies:

Decision Trees as a Flow-Chart Card Deck: Imagine you have a physical deck of cards, where each card represents a customer. To predict if they will churn, you create a checklist: "Is their balance less than $100?" If yes, move them to the left pile; otherwise, move them to the right. Next, for the left pile, ask: "Have they logged in this week?" This splitting process continues until you have homogeneous piles. The algorithm simply finds the best yes/no questions to divide the deck automatically.
Ensembles (Random Forests) as a Council of Wizards: Instead of trusting a single decision tree (which might be skewed by quirks of the particular cards it was shown), you hire a committee of 100 wizards. You give each wizard a slightly different, randomized sample of cards and features. Each wizard constructs their own flow-chart tree and votes on the final outcome. The majority vote is your prediction. Because the wizards' individual errors are largely independent, they tend to cancel out, giving a more robust prediction.
Gradient Boosting (XGBoost) as a Sequential Teacher-Student Chain: Instead of training wizards in parallel, you train them sequentially. The first wizard makes predictions and gets some wrong. The second wizard is hired specifically to correct the mistakes of the first wizard. The third wizard corrects the remaining errors of the first two. They form a sequential chain where each tree builds on the failures of its predecessors.

Why learn 'old' ML in the LLM era

Three reasons, all practical. It still wins: on tabular business data (churn, fraud, pricing, credit), gradient-boosted trees routinely beat neural networks, train in seconds, and can be explained with feature importances; most production "AI" at most companies is this. It's where the concepts live: overfitting, train/test splits, and precision/recall (the evaluation discipline that fine-tuning and every LLM eval inherits) are learnable here in five minutes per experiment instead of five hours. It's interview material: ML roles screen on exactly this page.

The three problem shapes:

SUPERVISED (you have labeled examples):
  Regression       → predict a NUMBER        (price, demand, ETA)
  Classification   → predict a CATEGORY      (spam/ham, churn/stay, fraud)

UNSUPERVISED (no labels — find structure):
  Clustering       → discover natural GROUPS (user segments, anomalies)

Linear regression — the hello world that won't die

Predict a number as a weighted sum of features — the dot product with a name:

Python

# Model: price = w₁·area + w₂·bedrooms + w₃·age + b
# (X_train, y_train and X_test are assumed to be arrays you have already split)

# scikit-learn (the classical-ML standard library):
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)   # finds the best w's and b
preds = model.predict(X_test)

"Fit" means: choose weights minimizing the mean squared error between predictions and truth — solvable directly or by gradient descent. Don't let the simplicity fool you: linear regression is interpretable (each weight says how much a bedroom is worth), fast at any scale, and the baseline every fancier model must beat to justify itself (baselines, always).
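
To make "solvable directly" concrete, here is a minimal sketch on synthetic data. The feature ranges, "true" weights and noise level are invented for illustration. It fits the weights once with NumPy's least-squares solver and once with LinearRegression, which should agree because both minimize the same squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic housing data (hypothetical): columns are area, bedrooms, age
X = rng.uniform([50, 1, 0], [200, 5, 40], size=(200, 3))
true_w = np.array([3.0, 10.0, -0.5])   # made-up "true" weights
true_b = 20.0                          # made-up "true" bias
y = X @ true_w + true_b + rng.normal(0, 1.0, size=200)

# Direct solution: append a column of ones so the bias is learned as a weight
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_direct, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# scikit-learn minimizes the same mean squared error
model = LinearRegression().fit(X, y)

print("direct :", w_direct.round(3))
print("sklearn:", np.append(model.coef_, model.intercept_).round(3))
```

The recovered weights should land close to the made-up true ones; the small gap is the injected noise.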

Logistic regression is its classification twin: the weighted sum is squashed through a sigmoid into a probability — P(churn) = σ(w·x + b) — and despite the name and its age, it's still the default first classifier in industry: calibrated probabilities, millisecond inference, coefficients you can show a regulator.
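
The formula P = σ(w·x + b) can be checked by hand. The sketch below assumes a default binary LogisticRegression fitted on synthetic data; the `sigmoid` helper is defined here and is not part of scikit-learn. It recomputes the probabilities from the learned coefficients and compares them with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Manual probability: weighted sum w·x + b, then the sigmoid
z = X @ clf.coef_[0] + clf.intercept_[0]
p_manual = sigmoid(z)

# predict_proba returns [P(class 0), P(class 1)] for every row
p_sklearn = clf.predict_proba(X)[:, 1]

print("Manual and library probabilities match:", np.allclose(p_manual, p_sklearn))
```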

Trees and forests — the tabular champions

A decision tree asks a sequence of threshold questions, learned from data:

                is balance < ₹1,000?
               /                    \
            yes                      no
            /                          \
   logins last week < 2?          [stays: 94%]
       /          \
    yes            no
     |              |
 [churns: 81%]  [stays: 72%]

Here is a complete, library-free Python helper that determines the optimal threshold to split customers on a feature (e.g. account balance) using Gini Impurity:

Python

from typing import List, Tuple

def calculate_gini(labels: List[int]) -> float:
    # Gini Impurity: 1 - sum(p_i^2)
    if not labels:
        return 0.0
    total = len(labels)
    # Binary classification labels: 0 (Stay) or 1 (Churn)
    p_churn = sum(labels) / total
    p_stay = 1.0 - p_churn
    return 1.0 - (p_churn**2 + p_stay**2)

def find_best_split(
    data: List[Tuple[float, int]]
) -> Tuple[float, float, List[int], List[int]]:
    # data: list of (feature_value, label) tuples in any order (sorted below)
    data_sorted = sorted(data, key=lambda x: x[0])
    best_gini = float("inf")   # any real split will score lower than this
    best_threshold = 0.0
    best_left, best_right = [], []
    
    # Try midpoints between successive values as thresholds
    for i in range(len(data_sorted) - 1):
        threshold = (data_sorted[i][0] + data_sorted[i+1][0]) / 2.0
        left = [label for val, label in data_sorted if val <= threshold]
        right = [label for val, label in data_sorted if val > threshold]
        
        # Weighted Gini of the split
        weight_left = len(left) / len(data_sorted)
        weight_right = len(right) / len(data_sorted)
        split_gini = weight_left * calculate_gini(left) + weight_right * calculate_gini(right)
        
        if split_gini < best_gini:
            best_gini = split_gini
            best_threshold = threshold
            best_left, best_right = left, right
            
    return best_threshold, best_gini, best_left, best_right

# Example dataset: (account_balance, churn_label)
# 1 = Churn, 0 = Stay
customers = [(500.0, 1), (800.0, 1), (1200.0, 0), (1500.0, 0), (2000.0, 0)]
threshold, gini, left, right = find_best_split(customers)
print(f"Best split threshold: {threshold:.1f}")
print(f"Split Gini Impurity: {gini:.4f}")
print(f"Left partition labels: {left} | Right partition labels: {right}")

Trees are readable (that diagram is the model), handle mixed feature types without scaling, and capture non-linear interactions linear models can't ("low balance matters only when logins are rare"). Alone, one tree overfits — grown deep enough it memorizes the training set. The fixes became the industry's workhorses:

Random forest — train hundreds of trees on random subsets of rows and features; average their votes. Variance cancels; accuracy jumps. Nearly tuning-free.
Gradient boosting (XGBoost/LightGBM) — train trees sequentially, each correcting the previous ensemble's errors (gradient descent, but each "step" is a whole tree). The Kaggle-and-industry king of tabular data — LandAI's price model is exactly this.
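
The "each tree corrects the previous ensemble's errors" idea can be sketched in a few lines for squared-error regression. This is a toy illustration, not XGBoost's or LightGBM's actual algorithm (those add regularization, second-order information and many engineering tricks). The learning rate, tree depth, number of rounds and the sine-wave data are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())        # round 0: predict the mean everywhere
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # take a small corrective step
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE after  0 trees: {mse_history[0]:.4f}")
print(f"Training MSE after 20 trees: {mse_history[-1]:.4f}")
```

Training error can only go down here, which is exactly why boosting needs a validation set and early stopping in practice.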

When does deep learning beat them? When features must be learned from raw perception — pixels, audio, text (neural networks onward). When a human can name the columns, trees usually win. That sentence is the model-selection interview answer.

K-means — clustering in one idea

No labels; find k natural groups: place k centers randomly, then repeat — assign every point to its nearest center (distance in vector space), move each center to its points' average — until stable. Segmenting users by behavior, grouping similar support tickets, compressing colors — k-means is the first tool. Its honest limits: you choose k (the elbow heuristic helps), it assumes roughly-round clusters, and results vary with initialization. The conceptual bridge: clustering in embedding space — k-means over text embeddings — is how you find themes in a million documents today.
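
The assign-then-move loop described above fits in a short NumPy function. This is a from-scratch sketch for intuition, assuming Euclidean distance and initial centers drawn from the data points; the `kmeans` function and the two synthetic blobs are made up for the example, and scikit-learn's `KMeans` is what you would use in practice.

```python
import numpy as np
from typing import Tuple

def kmeans(X: np.ndarray, k: int, n_iters: int = 100, seed: int = 0) -> Tuple[np.ndarray, np.ndarray]:
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct, randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:              # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                             # stable: centers stopped moving
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centers, labels = kmeans(X, k=2)
print("Centers:\n", centers.round(2))
```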

KNN — K-Nearest Neighbors: The Simplest Classifier

KNN is the purest beginner algorithm: "find the K most similar training examples and take a vote." No model is fitted; prediction is the algorithm itself.

Predicting the class of a new point:
  1. Compute distance to every training point (usually Euclidean distance)
  2. Find the K nearest neighbors
  3. Return the majority class (classification) or mean value (regression)

Example — K=3:
  New customer: [age=28, salary=55000]
  Neighbor 1:   [age=27, salary=52000] → label: "stays"
  Neighbor 2:   [age=30, salary=58000] → label: "stays"
  Neighbor 3:   [age=26, salary=48000] → label: "churns"
  Vote: 2 "stays" > 1 "churns" → predict "stays"

Python

import numpy as np
from collections import Counter
from typing import List

class KNNClassifier:
    """K-Nearest Neighbors from scratch."""

    def __init__(self, k: int = 3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # KNN has no training — just memorize the data
        self.X_train = X
        self.y_train = y

    def _euclidean_distance(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.sqrt(np.sum((a - b) ** 2))

    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.array([self._predict_single(x) for x in X])

    def _predict_single(self, x: np.ndarray) -> int:
        # Compute distance to every training point
        distances = [self._euclidean_distance(x, x_train) for x_train in self.X_train]
        # Get indices of K nearest neighbors
        k_nearest_indices = np.argsort(distances)[:self.k]
        # Get their labels
        k_nearest_labels = self.y_train[k_nearest_indices]
        # Return majority vote
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

# Quick demo
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed: reproducible split

# CRITICAL: KNN needs feature scaling — distances are meaningless otherwise!
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

knn = KNNClassifier(k=5)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"KNN (k=5) accuracy: {accuracy:.3f}")

KNN's key properties:

✅ No training time — lazy learner
✅ Naturally handles multi-class
✅ Good baseline; easy to explain
❌ Slow at prediction — O(n×d) per query
❌ Fails in high dimensions (curse of dimensionality)
❌ Must scale features — salary (₹50,000) dominates age (30) otherwise

When to use: small datasets, quick baselines, anomaly detection, recommendation fallback.

SVM — Support Vector Machines: Maximum Margin Classifier

SVM finds the decision boundary that maximizes the margin — the gap between the two classes. It doesn't just find a line that separates classes; it finds the best line.

Intuition:

  Class A: ●  ●  ●        Class B: ○  ○  ○

  Bad boundary:  passes close to some points → fragile
  SVM boundary:  maximum distance from the nearest points of each class

  The "support vectors" are the data points closest to the boundary.
  Only these points define the boundary — all others are irrelevant.

The Kernel Trick — handling non-linear data:

If data isn't linearly separable in the current space:
  → project it into a higher-dimensional space where it IS separable
  → the kernel function computes this projection implicitly (without
     actually computing the high-dimensional coordinates)

Common kernels:
  Linear:   good for text, high-dimensional data
  RBF/Gaussian: good for non-linear tabular data (most common default)
  Polynomial: for polynomial decision boundaries
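
To make "implicitly" concrete, here is a small worked sketch for one specific kernel, the homogeneous degree-2 polynomial kernel on 2-D inputs. The helper names `phi` and `poly_kernel` and the two sample points are made up for the example. The kernel value computed in the original 2-D space equals the dot product after an explicit mapping into 3-D.

```python
import numpy as np

def phi(v: np.ndarray) -> np.ndarray:
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(a: np.ndarray, b: np.ndarray) -> float:
    """Homogeneous polynomial kernel of degree 2: (a·b)^2."""
    return float(np.dot(a, b) ** 2)

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = float(np.dot(phi(a), phi(b)))   # dot product in the 3-D feature space
implicit = poly_kernel(a, b)               # same quantity, never leaving 2-D

print(f"explicit: {explicit:.6f}  implicit: {implicit:.6f}")
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.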

Python

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Non-linearly separable data (two interleaved crescents)
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed: reproducible split

# SVM REQUIRES scaling (it's distance-based internally)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Linear SVM — for linearly separable data
svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_train, y_train)
print("Linear SVM:", svm_linear.score(X_test, y_test))

# RBF SVM — for non-linear data (default, usually wins)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_train, y_train)
print("RBF SVM:", svm_rbf.score(X_test, y_test))
print(classification_report(y_test, svm_rbf.predict(X_test)))

# Key hyperparameters:
# C: regularization — small C = wide margin (more errors allowed)
#                     large C = narrow margin (fewer training errors, may overfit)
# gamma (RBF only): how far each training point's influence reaches
#                   small gamma = smooth boundary, large gamma = wiggly (overfit)

SVM interview summary:

Best for: small/medium datasets, text classification (linear kernel), non-linear patterns (RBF)
Requires: feature scaling always
Weakness: slow on large datasets (>100K rows), not probabilistic by default
The "C" hyperparameter is the bias-variance dial: low C = high bias, high C = high variance

Naive Bayes — Probabilistic Text Classification

Naive Bayes applies Bayes' Theorem with one bold assumption: features are independent given the class. It's "naive" because that's almost never true — yet it works surprisingly well for text.

For spam detection:
  P(spam | "lottery", "free", "click") ∝ P(spam) × P("lottery"|spam) × P("free"|spam) × P("click"|spam)

  The "naive" independence assumption means:
  P("lottery" AND "free" | spam) = P("lottery"|spam) × P("free"|spam)
  (ignoring that these words co-occur together in real spam)

  Despite this, Naive Bayes has long been a staple of email spam filters because:
  - Extremely fast (no iteration, just counting word frequencies)
  - Works well with few training examples
  - Handles high-dimensional text (50,000 vocab) naturally

Python

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# --- Text classification example ---
emails = [
    "Win a free lottery prize now click here",    # spam
    "Get free money fast no effort required",      # spam
    "Hi John, can we meet tomorrow for the project review?",  # ham
    "The quarterly report is attached please review",  # ham
    "Congratulations! You have been selected for a prize",    # spam
]
labels = [1, 1, 0, 0, 1]  # 1=spam, 0=ham

# Pipeline: text → bag-of-words counts → Naive Bayes
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),       # convert text to word counts
    ('classifier', MultinomialNB(alpha=1.0)) # alpha=1 is Laplace smoothing
])

pipeline.fit(emails, labels)

# Test on new emails
test_emails = ["You won a free prize!", "Meeting at 3pm tomorrow"]
predictions = pipeline.predict(test_emails)
print(f"'free prize' → {'spam' if predictions[0] else 'ham'}")  # spam
print(f"'meeting'   → {'spam' if predictions[1] else 'ham'}")   # ham

# --- Gaussian NB for continuous features ---
# When features are continuous (not word counts), use GaussianNB
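from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# Iris rows are ordered by class, so slicing X[:120] / X[120:] would test on a
# single class only. Shuffle and stratify instead.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
gnb = GaussianNB()
gnb.fit(X_tr, y_tr)
print(f"Gaussian NB accuracy: {gnb.score(X_te, y_te):.3f}")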

When Naive Bayes wins:

Text classification (spam, sentiment, topic detection)
Very small training sets (with few features)
When you need a fast probabilistic baseline
Real-time filtering (microsecond predictions)

Feature Scaling — When and Why to Scale

Not all algorithms need scaling. Getting this wrong costs you accuracy or wastes compute.

Algorithms that NEED scaling:      Algorithms that DON'T need scaling:
  KNN (distance-based)               Decision Trees (threshold-based splits)
  SVM (margin maximization)          Random Forest
  Logistic Regression                XGBoost / LightGBM
  Neural Networks                    Naive Bayes (probabilistic)
  PCA (variance-based)               Gradient Boosting

Rule of thumb: any algorithm that computes distances or uses gradient
               descent needs scaling. Tree-based models don't.

Python

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = [[1000, 1], [2000, 2], [3000, 3], [4000, 4]]  # salary (large) + experience (small)

# StandardScaler: (x - mean) / std → mean=0, std=1
# Best for: neural networks, logistic regression, SVM
std_scaler = StandardScaler()
X_std = std_scaler.fit_transform(X)
print("StandardScaler:\n", X_std)

# MinMaxScaler: (x - min) / (max - min) → range [0, 1]
# Best for: when you need bounded features, image pixels (0-255 → 0-1)
mm_scaler = MinMaxScaler()
X_mm = mm_scaler.fit_transform(X)
print("\nMinMaxScaler:\n", X_mm)

# RobustScaler: (x - median) / IQR → robust to outliers
# Best for: data with extreme outliers (salaries, prices, click counts)
rb_scaler = RobustScaler()
X_rb = rb_scaler.fit_transform(X)
print("\nRobustScaler:\n", X_rb)

# GOLDEN RULE: Always fit on training data only, transform both!
# scaler.fit_transform(X_train)  ← fit AND transform
# scaler.transform(X_test)       ← ONLY transform (no fit)
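
One way to enforce the golden rule automatically is to put the scaler inside a scikit-learn pipeline, so cross-validation re-fits it on the training part of every fold. The sketch below is an illustration on the bundled wine dataset; the choice of KNN with 5 neighbors is arbitrary. It also shows how much an unscaled distance-based model can lose.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)   # features live on very different scales

# Unscaled KNN: large-valued features dominate the distance computation
raw_knn = KNeighborsClassifier(n_neighbors=5)
raw_scores = cross_val_score(raw_knn, X, y, cv=5)

# Pipeline: the scaler is re-fit on the training part of every fold,
# so held-out rows never influence the mean/std used for scaling
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_scores = cross_val_score(scaled_knn, X, y, cv=5)

print(f"KNN without scaling: {raw_scores.mean():.3f}")
print(f"KNN with scaling:    {scaled_scores.mean():.3f}")
```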

k-Fold Cross-Validation — Reliable Model Evaluation

A single train/test split gives you one evaluation — maybe you got lucky or unlucky with the split. k-Fold CV gives you k evaluations on k different splits and averages them.

k=5 Fold CV:

  Full dataset: [████████████████████████████████████████]

  Fold 1: [TEST |  TRAIN  |  TRAIN  |  TRAIN  |  TRAIN  ]  → score_1
  Fold 2: [ TRAIN  | TEST |  TRAIN  |  TRAIN  |  TRAIN  ]  → score_2
  Fold 3: [ TRAIN  |  TRAIN  | TEST |  TRAIN  |  TRAIN  ]  → score_3
  Fold 4: [ TRAIN  |  TRAIN  |  TRAIN  | TEST |  TRAIN  ]  → score_4
  Fold 5: [ TRAIN  |  TRAIN  |  TRAIN  |  TRAIN  | TEST ]  → score_5

  Final score = mean(score_1 ... score_5)
  Spread      = std(score_1 ... score_5)  ← shows stability!

Python

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-Fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# StratifiedKFold ensures class proportions are preserved in each fold
# (critical for imbalanced datasets!)

scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

print(f"F1 scores per fold: {scores.round(3)}")
print(f"Mean F1:  {scores.mean():.4f}")
print(f"Std F1:   {scores.std():.4f}")  # low std = stable model

# Rule of thumb:
# std < 0.02 → very stable, trust the mean
# std > 0.05 → high variance, model may be sensitive to data splits

Hyperparameter Tuning — GridSearchCV and RandomizedSearchCV

Model hyperparameters (like max_depth, n_estimators, C) are settings you choose before training. Tuning them systematically beats guessing.

Python

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# --- Grid Search: exhaustive (try every combination) ---
# Good when search space is small
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}
# 3 × 3 × 2 = 18 combinations × 5 folds = 90 model fits

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,    # use all CPU cores
    verbose=1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1:     {grid_search.best_score_:.4f}")

# --- Randomized Search: sample randomly (good for large spaces) ---
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 3, 5, 10, 20],
    'min_samples_split': randint(2, 20),
    'max_features': uniform(0.3, 0.7)   # scipy's uniform(loc, scale) samples from [0.3, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=30,       # try 30 random combinations (not all)
    cv=5,
    scoring='f1',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X, y)
print(f"\nBest params (random): {random_search.best_params_}")
print(f"Best F1 (random):     {random_search.best_score_:.4f}")

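# NOTE: both searches above were fit on ALL of X, y, so no held-out test set
# exists in this demo. In a real project, split off a test set first, run the
# search on the training portion only, then evaluate best_estimator_ once.
best_model = random_search.best_estimator_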
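
A minimal sketch of what tuning with a genuinely held-out test set could look like. The small grid and the 80/20 split are illustrative choices; the key point is that the search only ever sees the training rows.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Set the test rows aside BEFORE any tuning happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    {"max_depth": [None, 5], "min_samples_split": [2, 5]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)          # cross-validation uses training rows only

# best_estimator_ is refit on all of X_train (refit=True is the default)
test_f1 = f1_score(y_test, search.best_estimator_.predict(X_test))
print(f"CV F1 (training rows only): {search.best_score_:.4f}")
print(f"Held-out test F1:           {test_f1:.4f}")
```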

Algorithm Selection Flowchart — The Interview Decision Tree

When asked "which model would you use?", walk this decision tree:

START: What kind of problem is it?
│
├── SUPERVISED (have labels)
│   │
│   ├── CLASSIFICATION (predict category)
│   │   │
│   │   ├── Is the data TABULAR (named columns)?
│   │   │   ├── Small dataset (<10K rows)?  → Logistic Regression or SVM
│   │   │   └── Large dataset?
│   │   │       ├── Need explainability?   → Random Forest (+ SHAP values)
│   │   │       └── Max accuracy?          → XGBoost / LightGBM
│   │   │
│   │   ├── Is the data TEXT?              → Naive Bayes (baseline) or Transformer
│   │   └── Is the data IMAGE/AUDIO?       → CNN or Vision Transformer
│   │
│   └── REGRESSION (predict number)
│       ├── Simple, need explainability?   → Linear Regression
│       ├── Non-linear tabular?            → XGBoost or Random Forest
│       └── Complex patterns?             → Neural Network
│
├── UNSUPERVISED (no labels)
│   ├── Find groups?                       → K-means (round clusters) or DBSCAN
│   ├── Reduce dimensions?                 → PCA (linear) or UMAP (non-linear)
│   └── Detect anomalies?                 → Isolation Forest or Autoencoder
│
└── Always: start with a BASELINE
    → majority-class / mean / last-value
    → then simplest model (Logistic Regression / Linear Regression)
    → complexity only if baseline clearly fails

The interview one-liner: "I'd start with the simplest model that fits the data type and problem shape, measure it against a baseline, then add complexity only when the gap justifies the cost in training time and explainability."
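
The "baseline first" step is a few lines with scikit-learn's DummyClassifier. The sketch below is one possible setup on the bundled breast-cancer dataset, with an arbitrary 80/20 split. It compares a majority-class baseline with the simplest real model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Simplest real model: scaled logistic regression
simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
simple.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
simple_acc = simple.score(X_test, y_test)
print(f"Majority-class baseline: {baseline_acc:.3f}")
print(f"Logistic regression:     {simple_acc:.3f}")
```

Any fancier model now has two numbers to beat, and the size of the gap tells you whether the added complexity is worth it.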

Evaluation — the discipline that is the actual lesson

Overfitting and the split

A model can score 100% by memorizing training data and still fail on new data — overfitting, the central failure mode of all ML (fine-tuning's enemy, met here first). The defense is procedural, not clever:

data → TRAIN (fit the model) / VALIDATION (tune choices) / TEST (touch ONCE)

The test set answers one question, one time: "will this generalize?"
Tune against it repeatedly and it silently becomes a second training
set — the most common self-deception in applied ML.

Watch the gap: training accuracy 99%, validation 78% → overfitting (simplify the model, get more data, regularize — regularization penalizes large weights, mathematically preferring simpler explanations). Both low → underfitting (model too simple for the pattern).
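
The train/validation gap is easy to reproduce. The sketch below assumes synthetic data with 10% deliberately flipped labels and an arbitrary set of tree depths. It grows decision trees of increasing depth and prints both accuracies, so you can watch the gap open as the tree starts memorizing noise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.1 mislabels about 10% of rows so memorization hurts
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in [1, 3, 5, 10, None]:       # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_val, y_val))

for depth, (train_acc, val_acc) in results.items():
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  val={val_acc:.3f}  "
          f"gap={train_acc - val_acc:+.3f}")
```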

Precision and recall (the interview centerpiece)

Accuracy lies on imbalanced data: 99% of transactions are legitimate, so "never flag anything" is 99% accurate and 100% useless. The honest pair:

                       PREDICTED
                    fraud      legit
ACTUAL   fraud       TP         FN      ← recall = TP/(TP+FN)
         legit       FP         TN          "of real frauds, how many caught?"

precision = TP/(TP+FP)  →  "of flagged, how many were real?"

They trade off through the decision threshold: flag at P(fraud) > 0.3 and recall rises while precision falls (more catches, more false alarms); at > 0.9, the reverse. The threshold is a business decision: cancer screening wants recall (missing a case is catastrophic, a follow-up test is cheap); spam filtering wants precision (deleting a real email is worse than a stray spam). F1, their harmonic mean, collapses them into one number when you must rank models by a single score; an ROC/PR curve shows the whole menu. "Which matters more here, and what does each error cost?" is the question that makes you sound like you've shipped one.
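
The threshold trade-off can be observed directly. The sketch below uses synthetic, imbalanced data in which roughly 5% positives stand in for "fraud"; the model choice and the thresholds are illustrative. It flags at several thresholds and prints precision and recall for each.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: about 5% of rows are the positive ("fraud") class
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_fraud = model.predict_proba(X_test)[:, 1]

metrics = {}
for threshold in [0.1, 0.3, 0.5, 0.9]:
    flagged = (p_fraud > threshold).astype(int)
    precision = precision_score(y_test, flagged, zero_division=0)
    recall = recall_score(y_test, flagged)
    metrics[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall can only fall or stay flat as the threshold rises, because a higher bar never flags more rows; precision usually moves the other way but is not guaranteed to.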

Think it through like the interview

Don't just define precision and recall — derive how to configure them for real business value.

Think it through: The Precision-Recall Trade-off (Operational Metric Tuning)

PROBLEM: Tune a transaction fraud detection classifier's threshold. You are given a model that predicts the probability of fraud P(fraud).

1. Establish the error profiles
“What are the two types of prediction errors in classification, and what do they cost in our credit card fraud context?”
2. Trace the threshold movement
“If we decrease the classification threshold from 0.5 to 0.1, what happens to the number of FPs and FNs? How do precision and recall shift?”
3. Optimize for expected financial cost
“If each FN costs ₹5,000 in fraud loss, and each FP costs ₹50 in support calls, where do we set the threshold?”
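
One way to approach the cost question is sketched below. The two costs come from the problem statement; everything else (the synthetic data, the logistic model, the sweep grid) is assumed for the example. If the probabilities are well calibrated, flagging pays off whenever p × 5,000 > (1 − p) × 50, which gives a break-even threshold of 50 / 5,050. The sweep then checks empirically which threshold is cheapest on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COST_FN = 5000   # rupees lost per missed fraud (from the problem statement)
COST_FP = 50     # rupees per false alarm (from the problem statement)

# With calibrated probabilities, flag when p * COST_FN > (1 - p) * COST_FP
break_even = COST_FP / (COST_FP + COST_FN)
print(f"Break-even threshold: {break_even:.4f}")

# Synthetic stand-in for transaction data: about 5% positives
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_fraud = model.predict_proba(X_test)[:, 1]

def total_cost(threshold: float) -> int:
    """Total rupee cost of all errors made at a given threshold."""
    flagged = p_fraud > threshold
    false_negatives = int(np.sum(~flagged & (y_test == 1)))
    false_positives = int(np.sum(flagged & (y_test == 0)))
    return false_negatives * COST_FN + false_positives * COST_FP

thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]

print(f"Cheapest threshold in sweep: {best:.2f}")
print(f"Cost at 0.50: {total_cost(0.5)}  |  cost at best: {min(costs)}")
```

Because a missed fraud is 100 times more expensive than a false alarm, the cheapest threshold sits far below 0.5; how far depends on how well calibrated the model is.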

Common mistakes

Skipping the baseline — majority class for classification, mean for regression, last-value for time series. No baseline, no claim.
Data leakage — a feature that encodes the answer ("account_closed_date" predicting churn), or normalizing/tuning using test data. Symptom: results too good; outcome: production collapse. The time-series version — random splits letting the model train on the future — is the classic.
Accuracy on imbalanced classes — the 99%-useless trap above.
Jumping to deep learning on 5,000 tabular rows — XGBoost in seconds beats an underfed network in hours (pattern fever, ML edition).
Ignoring calibration — a model saying "90%" should be right ~90% of those times if you're using the probabilities (for pricing or ranking), not just the argmax.
One metric, no slices — fine on average, broken for one region or segment (the observability instinct: percentiles and slices, not means).

Practice

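The full loop in 30 lines: take the Titanic dataset (not bundled with scikit-learn, but available from OpenML via fetch_openml) or a churn dataset, then build a baseline, a logistic regression and a random forest; use a train/val/test split; produce a confusion matrix; write one paragraph on which model ships and why.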
Gini Split Lab: Run the custom Python Gini split solver code provided above. Modify the dataset to have a feature threshold that splits churn labels perfectly and verify that split Gini impurity reaches 0.0.
Threshold lab: for your churn model, sweep the threshold 0.1→0.9; plot precision and recall; mark where you'd operate if a retention call costs ₹100 and a lost customer ₹2,000.
Cluster something real: k-means over your word-frequency vectors from Level 1 or embeddings of 100 news headlines; inspect the groups; change k and watch meaning shift.

Tooling prerequisite: NumPy & Pandas for ML — master DataFrame preprocessing and NumPy array operations before implementing any of the above exercises.

Next: Neural Networks — what happens when the features must be learned too.