Neural Networks · PrepDeck

From single perceptrons to deep networks — how neural networks learn, backpropagation, activation functions, CNNs for images, and building your first network in Python.

What is a Neural Network?

A neural network is a computational system loosely inspired by biological brains — interconnected layers of simple units (neurons) that together can learn extremely complex patterns.

Analogy: A neural network is like a factory assembly line with workers.
Layer 1 workers: "Is there an edge here?" (detect simple features)
Layer 2 workers: "Are these edges forming a shape?" (combine features)
Layer 3 workers: "Does this shape look like a cat?" (classify)

Each worker learns their job by seeing millions of examples
and being corrected when they get it wrong.

The Perceptron — Building Block

The simplest neuron: takes inputs, multiplies each by a weight, sums the results together with a bias, and applies an activation function:

Inputs:   x₁ = 0.5  (pixel value)
          x₂ = 0.8  (another pixel)
          x₃ = 0.2

Weights:  w₁ = 0.3
          w₂ = -0.5
          w₃ = 0.8

Bias:     b = 0.1

Weighted sum: z = (0.5×0.3) + (0.8×-0.5) + (0.2×0.8) + 0.1
                = 0.15 - 0.40 + 0.16 + 0.10
                = 0.01

Activation: output = ReLU(0.01) = max(0, 0.01) = 0.01
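The arithmetic above can be checked in a few lines of NumPy. This is a minimal sketch that reuses the same inputs, weights, and bias; the variable names are illustrative.

```python
import numpy as np

x = np.array([0.5, 0.8, 0.2])    # inputs from the worked example
w = np.array([0.3, -0.5, 0.8])   # weights
b = 0.1                          # bias

z = float(np.dot(x, w) + b)      # weighted sum plus bias
output = max(0.0, z)             # ReLU activation

print(f"z = {z:.2f}, output = {output:.2f}")   # z = 0.01, output = 0.01
```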

Python

import numpy as np
from typing import Tuple

class Neuron:
    def __init__(self, n_inputs: int):
        # Weights initialized randomly (small values)
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        
        # Cache inputs and net input for backprop
        self.last_inputs = None
        self.last_z = 0.0

    def forward(self, inputs: np.ndarray) -> float:
        self.last_inputs = np.array(inputs)
        self.last_z = np.dot(self.last_inputs, self.weights) + self.bias
        return max(0.0, self.last_z)  # ReLU activation: f(z) = max(0, z)

    def backward(self, d_out: float) -> Tuple[np.ndarray, float]:
        # Derivative of ReLU: 1 if z > 0 else 0
        d_relu = 1.0 if self.last_z > 0 else 0.0
        d_z = d_out * d_relu
        
        # Gradients: ∂Loss/∂w = d_z * inputs, ∂Loss/∂b = d_z
        d_weights = d_z * self.last_inputs
        d_bias = d_z
        return d_weights, d_bias

neuron = Neuron(n_inputs=3)
output = neuron.forward([0.5, 0.8, 0.2])
d_w, d_b = neuron.backward(d_out=1.0)

Activation functions introduce non-linearity — without them, stacking layers would just be matrix multiplication (equivalent to one layer). Non-linearity is what lets networks learn complex patterns.
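The claim that stacked linear layers collapse into one can be verified numerically. The sketch below uses random weights (the shapes and seed are arbitrary choices for illustration) and shows that two linear layers equal one merged layer, while inserting ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked layers with no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single layer with merged parameters
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = x @ W_merged + b_merged
print(np.allclose(two_layers, one_layer))        # True: the stack is still linear

# With ReLU in between, the merged layer no longer reproduces the output
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```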

Python

import numpy as np

# ReLU (Rectified Linear Unit) — most common for hidden layers
# Simple: output = max(0, x)
# Fast; its gradient is 1 for positive inputs, which mitigates (but does not eliminate) vanishing gradients
def relu(x): return np.maximum(0, x)
# relu(-1) = 0, relu(0.5) = 0.5, relu(3) = 3

# Leaky ReLU — allows small negative values (prevents "dead ReLU" problem)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)

# Sigmoid — squashes output to (0, 1)
# Used for binary classification output
# Has vanishing gradient problem at extremes
def sigmoid(x): return 1 / (1 + np.exp(-x))
# sigmoid(-10) ≈ 0, sigmoid(0) = 0.5, sigmoid(10) ≈ 1

# Tanh — squashes to (-1, 1)
# Better than sigmoid for hidden layers (zero-centered)
def tanh(x): return np.tanh(x)

# Softmax — converts a vector of logits to probabilities that sum to 1
# Used for multi-class classification output
def softmax(x):
    e_x = np.exp(x - np.max(x))  # numerical stability
    return e_x / e_x.sum()
# softmax([2, 1, 0.1]) ≈ [0.659, 0.242, 0.099]

When to use what:

Hidden layers: ReLU (default) or variants (LeakyReLU, GELU for transformers)
Binary output: Sigmoid (outputs probability 0-1)
Multi-class output: Softmax (outputs class probabilities summing to 1)
Regression output: Linear (no activation — any real number)

Multi-Layer Network (Forward Pass)

Python

import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes):
        # layer_sizes = [784, 128, 64, 10] for MNIST digit classification
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            # He initialization (good for ReLU)
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2/layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(W)
            self.biases.append(b)

    def relu(self, x):
        return np.maximum(0, x)

    def softmax(self, x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def forward(self, X):
        self.activations = [X]
        current = X
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            z = current @ W + b
            current = self.relu(z)
            self.activations.append(current)
        # Output layer: softmax for classification
        z = current @ self.weights[-1] + self.biases[-1]
        output = self.softmax(z)
        self.activations.append(output)
        return output

nn = NeuralNetwork([784, 128, 64, 10])
X = np.random.randn(32, 784)  # batch of 32 images, 784 pixels each
output = nn.forward(X)        # shape: (32, 10) — probability for each digit

How Networks Learn — Backpropagation

Training a network = finding weights that minimize prediction error.

Loss Functions

Python

import numpy as np

# Cross-entropy loss for classification (measures how wrong predictions are)
def cross_entropy_loss(y_pred, y_true):
    # y_pred: predicted probabilities (after softmax), shape (n, num_classes)
    # y_true: true labels as integer class indices, shape (n,)
    #         (for one-hot labels, convert first with y_true.argmax(axis=1))
    eps = 1e-8  # prevent log(0)
    n = y_true.shape[0]
    log_likelihood = -np.log(y_pred[range(n), y_true] + eps)
    return np.mean(log_likelihood)

# MSE loss for regression
def mse_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)
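As a worked example of the cross-entropy computation, the sketch below uses two made-up predictions over three classes. The indexing `y_pred[range(n), y_true]` expects integer class indices, so one-hot labels are converted with `argmax` first.

```python
import numpy as np

y_pred = np.array([[0.7, 0.2, 0.1],     # sample 0: 70% on class 0
                   [0.1, 0.8, 0.1]])    # sample 1: 80% on class 1
y_true = np.array([0, 1])               # integer class indices

n = y_true.shape[0]
picked = y_pred[np.arange(n), y_true]   # probability given to the true class: [0.7, 0.8]
loss = float(np.mean(-np.log(picked)))
print(f"{loss:.4f}")                    # 0.2899

# One-hot labels can be turned back into class indices before indexing
one_hot = np.eye(3)[y_true]
indices = one_hot.argmax(axis=1)
print(indices)                          # [0 1]
```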

Gradient Descent

Python

# The training loop
learning_rate = 0.01

for epoch in range(100):
    # 1. Forward pass: compute predictions
    y_pred = model.forward(X_train)

    # 2. Compute loss
    loss = cross_entropy_loss(y_pred, y_train)

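    # 3. Backward pass: compute gradients (calculus chain rule)
    # How much does each weight contribute to the loss?
    # NOTE: backward() is schematic. The NeuralNetwork class above defines only
    # forward(); backward() is assumed to take the labels and return one gradient
    # array per weight matrix.
    gradients = model.backward(y_train)

    # 4. Update weights: move in direction that reduces loss
    # (biases are updated the same way with their own gradients)
    for W, dW in zip(model.weights, gradients):
        W -= learning_rate * dW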

Backpropagation (backprop) computes gradients using the chain rule of calculus — it propagates the error signal backward through each layer:

Forward:  input → layer1 → layer2 → output → loss
Backward: loss → ∂loss/∂layer2 → ∂loss/∂layer1 → ∂loss/∂weights

In practice, you rarely implement backprop by hand: frameworks such as PyTorch compute the gradients automatically.
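To make the chain rule concrete, here is a minimal NumPy sketch of backprop for a tiny two-layer network (ReLU hidden layer, softmax output, cross-entropy loss). The sizes, seed, and variable names are assumptions chosen for illustration. The analytic gradient is compared with a finite-difference estimate for one weight.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                  # 5 samples, 4 features
y = rng.integers(0, 3, size=5)               # class indices in {0, 1, 2}
W1, b1 = rng.normal(size=(4, 6)) * 0.5, np.zeros(6)
W2, b2 = rng.normal(size=(6, 3)) * 0.5, np.zeros(3)

def forward(W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                   # ReLU
    z2 = a1 @ W2 + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)     # softmax
    loss = -np.mean(np.log(p[np.arange(len(y)), y]))
    return loss, (z1, a1, p)

loss, (z1, a1, p) = forward(W1, b1, W2, b2)

# Backward pass: chain rule from the loss back to each parameter
dz2 = p.copy()
dz2[np.arange(len(y)), y] -= 1               # gradient of softmax + cross-entropy
dz2 /= len(y)                                # the loss is a mean over samples
dW2 = a1.T @ dz2
db2 = dz2.sum(axis=0)
da1 = dz2 @ W2.T
dz1 = da1 * (z1 > 0)                         # ReLU derivative: 1 where z1 > 0
dW1 = X.T @ dz1
db1 = dz1.sum(axis=0)

# Finite-difference check on a single weight
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[0] - forward(W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(f"analytic={dW1[0, 0]:.6f}, numeric={numeric:.6f}")
```

The two printed values should agree to several decimal places; a mismatch usually points at a sign or transpose error in the backward pass.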

PyTorch — Building Networks in Practice

Python

import torch
import torch.nn as nn
import torch.optim as optim

# Define the network
class MNISTClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(784, 256),   # input layer → hidden layer
            nn.ReLU(),
            nn.Dropout(0.3),       # randomly zero out 30% of neurons during training
            nn.Linear(256, 128),   # hidden → hidden
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10),    # hidden → output (10 classes)
        )

    def forward(self, x):
        return self.network(x)

# Initialize
model = MNISTClassifier()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop
model.train()
for epoch in range(20):
    total_loss = 0
    correct = 0

    for X_batch, y_batch in train_loader:
        # Forward pass
        logits = model(X_batch)           # (batch_size, 10)
        loss = criterion(logits, y_batch)

        # Backward pass
        optimizer.zero_grad()             # clear previous gradients
        loss.backward()                   # compute gradients
        optimizer.step()                  # update weights

        total_loss += loss.item()
        correct += (logits.argmax(1) == y_batch).sum().item()

    print(f"Epoch {epoch}: loss={total_loss/len(train_loader):.4f}, "
          f"acc={correct/len(train_dataset):.4f}")

# Evaluation
model.eval()
with torch.no_grad():                     # no gradient computation needed
    test_logits = model(X_test)
    predictions = test_logits.argmax(1)
    accuracy = (predictions == y_test).float().mean()
    print(f"Test accuracy: {accuracy:.4f}")

Convolutional Neural Networks (CNNs)

CNNs are specialized for grid-like data — images, audio spectrograms. Instead of connecting every pixel to every neuron, they use convolutional filters that slide across the image:

Image: 28×28 pixels
Filter (kernel): 3×3 weights
Convolution: slide the filter across the image, compute dot product at each position
Output: feature map (detects edges, textures, etc.)
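The sliding-filter operation can be sketched directly in NumPy. This is an unoptimized illustration, not how frameworks implement it. Like deep-learning libraries, it computes a cross-correlation (the kernel is not flipped). The toy image and the hand-picked vertical-edge kernel are assumptions for the example; in a real CNN the filter values are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out

# Toy 28×28 image: dark left half, bright right half
image = np.zeros((28, 28))
image[:, 14:] = 1.0

# Hand-picked vertical-edge filter
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)         # (26, 26)
print(feature_map[0, 10:16])     # [0. 0. 3. 3. 0. 0.]
```

The feature map is nonzero only where the 3×3 window straddles the edge at column 14, which is what "detecting an edge" means here.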

Python

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            # (batch, 1, 28, 28) → (batch, 32, 26, 26)
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3),
            nn.ReLU(),
            # Pooling: reduce spatial dimensions by taking max in 2×2 windows
            # (batch, 32, 26, 26) → (batch, 32, 13, 13)
            nn.MaxPool2d(kernel_size=2),

            # (batch, 32, 13, 13) → (batch, 64, 11, 11)
            nn.Conv2d(32, 64, kernel_size=3),
            nn.ReLU(),
            # (batch, 64, 11, 11) → (batch, 64, 5, 5)
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),              # (batch, 64, 5, 5) → (batch, 1600)
            nn.Linear(64 * 5 * 5, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.conv_layers(x)
        return self.classifier(x)
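The shape comments in the class above follow from the standard output-size formula for convolution and pooling layers. A small sketch (the helper name `conv_out` is hypothetical) traces one spatial dimension through the network:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

s = 28
s = conv_out(s, kernel=3)               # Conv2d(kernel_size=3) -> 26
s = conv_out(s, kernel=2, stride=2)     # MaxPool2d(2)          -> 13
s = conv_out(s, kernel=3)               # Conv2d(kernel_size=3) -> 11
s = conv_out(s, kernel=2, stride=2)     # MaxPool2d(2)          -> 5
flat = 64 * s * s                       # 64 channels after the second conv
print(s, flat)                          # 5 1600
```

`MaxPool2d(2)` uses a stride equal to its kernel size by default, which is why the pooling steps pass `stride=2`.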

Why CNNs work for images:

Parameter sharing: one filter learns one feature type (edges in any location)
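Translation robustness: the same filter responds wherever a feature appears (convolution is translation-equivariant), and pooling adds approximate invariance to small shifts, so a cat detector works whether the cat is on the left or the right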
Hierarchical features: early layers detect edges → middle layers detect shapes → late layers detect objects
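The effect of parameter sharing shows up in a simple parameter count. The sketch below compares the first convolution of the SimpleCNN above with a hypothetical fully connected layer that maps the same 784 input pixels to the same number of output values.

```python
# Conv2d(1, 32, kernel_size=3): each of the 32 filters has 3*3*1 weights plus one bias
conv_params = 32 * (3 * 3 * 1) + 32

# Hypothetical dense layer producing the same outputs (32 feature maps of 26×26)
dense_outputs = 32 * 26 * 26
dense_params = 784 * dense_outputs + dense_outputs   # weights plus biases

print(conv_params)     # 320
print(dense_params)    # 16981120
```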

Recurrent Neural Networks (RNNs)

RNNs are specialized for sequential data such as text, time series, and audio, where the order of elements matters. Unlike feedforward networks (MLPs), which process each input independently, RNNs pass a hidden state from one step to the next, acting as a memory:

Unrolled RNN:
x₁ ──→ [ RNN Cell ] ──→ h₁
             ↓ (passes hidden state)
x₂ ──→ [ RNN Cell ] ──→ h₂
             ↓
x₃ ──→ [ RNN Cell ] ──→ h₃
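The unrolled diagram corresponds to a short loop. This NumPy sketch uses the common tanh recurrence h_t = tanh(x_t·W_xh + h_{t-1}·W_hh + b_h); the sizes, seed, and weight names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

W_xh = rng.normal(size=(input_size, hidden_size)) * 0.1    # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1   # hidden -> hidden
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))   # one sequence of 5 steps
h = np.zeros(hidden_size)                     # initial hidden state
hidden_states = []
for x_t in xs:
    # The same weights are reused at every step; h carries the memory forward
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)                    # (5, 4)
```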

Python

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Recurrent layer
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # Fully connected output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        # out shape: (batch, seq_len, hidden_size)
        # h_n shape: (1, batch, hidden_size) - final hidden state
        out, h_n = self.rnn(x)
        
        # Classify based on the final token's hidden state
        return self.fc(out[:, -1, :])

The Vanishing Gradient Problem in RNNs

Standard RNNs struggle with long sequences. During backpropagation, gradients are multiplied back through every time step (Backpropagation Through Time). If the sequence is long, multiplying by weights repeatedly causes gradients to either vanish to zero (losing long-term context) or explode to infinity.
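A scalar caricature makes the effect visible: if each time step scales the gradient by a constant factor, 100 steps either shrink it to almost nothing or blow it up. The factors 0.9 and 1.1 are arbitrary illustrative values.

```python
steps = 100
results = {}
for factor in (0.9, 1.1):
    grad = 1.0
    for _ in range(steps):
        grad *= factor            # one multiplication per time step
    results[factor] = grad
    print(f"factor={factor}: gradient after {steps} steps = {grad:.2e}")
# factor=0.9 gives about 2.66e-05, factor=1.1 gives about 1.38e+04
```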

The Fix: LSTMs & GRUs

To solve this, specialized architectures introduce gates to control how information is stored, updated, and forgotten:

LSTM (Long Short-Term Memory): Uses a dedicated Cell State (c_t) as an internal highway, protected by three gates:
1. Forget Gate: Decides what to throw away from the past.
2. Input Gate: Decides what new information to write.
3. Output Gate: Decides what to output as the next hidden state.
GRU (Gated Recurrent Unit): A simpler, faster version combining cell and hidden states into one, using only two gates (Reset and Update).
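The three LSTM gates can be sketched as one time step in NumPy. This is an illustrative sketch, not PyTorch's implementation: the parameter layout (four transforms stacked in the order forget, input, output, candidate) and the names `W`, `U`, `b` are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the four stacked transforms."""
    H = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b      # shape (4 * H,)
    f = sigmoid(z[0:H])               # forget gate: what to keep from c_prev
    i = sigmoid(z[H:2 * H])           # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H])       # output gate: what to expose as h_t
    g = np.tanh(z[3 * H:4 * H])       # candidate values to write
    c_t = f * c_prev + i * g          # updated cell state
    h_t = o * np.tanh(c_t)            # next hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(size=(input_size, 4 * hidden_size)) * 0.1
U = rng.normal(size=(hidden_size, 4 * hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # a sequence of 6 steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)                        # (4,) (4,)
```

Because the cell state is updated additively (`f * c_prev + i * g`) rather than through repeated matrix multiplication, gradients can flow through it over many steps with less decay.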

Python

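import torch.nn as nn

# In PyTorch, swap nn.RNN for nn.LSTM or nn.GRU (SimpleLSTM is an illustrative name):
class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # nn.LSTM returns per-step outputs plus the final hidden (h_n) and cell (c_n) states
        out, (h_n, c_n) = self.lstm(x)
        return self.fc(h_n[-1])  # h_n[-1]: final hidden state of the last layer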

Think it through like the interview

Don't just state that sigmoid gradients vanish; walk through the chain rule mechanics that cause it.

Think it through: The Vanishing Gradient Problem (Deep Network Optimization)

Problem: Diagnose why a 10-layer network utilizing Sigmoid activation functions stops learning after the first few training steps.

1. Trace the activation derivatives
“What is the derivative of the Sigmoid function σ(x) = 1 / (1 + e⁻ˣ), and what is its maximum possible value?”
2. Apply the Chain Rule back through layers
“Using the chain rule, how is the gradient of the loss with respect to the weights in the 1st layer calculated in a 10-layer network?”
3. Evaluate the product decay
“If each activation derivative term is at most 0.25, what happens to the product of 10 of these derivatives as gradients flow from the output back to layer 1?”
4. Nudge to ReLU and Skip Connections
“How do ReLU activations and residual (skip) connections structurally resolve this decay?”
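The numbers in stages 1 and 3 can be checked numerically. The sketch below evaluates the sigmoid derivative σ(x)(1 − σ(x)) on a grid and bounds the product of ten such terms. It covers only the activation-derivative factors; the weight matrices in the chain also contribute and are ignored here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-10, 10, 2001)
max_grad = float(sigmoid_grad(xs).max())   # the peak is at x = 0
bound = 0.25 ** 10                         # best case for 10 stacked sigmoid layers

print(f"{max_grad:.4f}")   # 0.2500
print(f"{bound:.2e}")      # 9.54e-07
```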

Model Architectures Compared

Architecture	Key Concept	Input Type	Best For
ANN / MLP	Fully-connected layers	Fixed-size vectors	Tabular data, simple classification
CNN	Convolutional filters, weight sharing	Grid-like data (2D/3D)	Images, video, audio spectrograms
RNN / LSTM	Recurrent hidden state loop	Sequential data (1D)	Time-series, text, translation, audio

Hyperparameters

Key settings to tune:

Python

# Learning rate: how big the gradient descent steps are
# Too high: oscillates, doesn't converge
# Too low: trains very slowly
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Batch size: how many samples per weight update
# Large: more stable gradients, faster on GPU; requires more memory
# Small: noisier gradients, can generalize better
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Epochs: how many times to see the full training set
# Use early stopping: monitor validation loss and stop when it stops improving

# Learning rate scheduling: reduce LR when training plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)

# Dropout: randomly zero out activations during training (regularization)
nn.Dropout(p=0.5)  # p = probability of zeroing out

# Weight decay (L2 regularization): penalize large weights
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
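Early stopping, mentioned in the comments above, can be sketched as a small framework-independent helper. The class name `EarlyStopping`, its parameters, and the validation losses below are all hypothetical and only illustrate the bookkeeping.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)   # 5 0.6
```

In a real training loop you would typically also save the model weights whenever `best` improves, so the best checkpoint can be restored after stopping.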

Common Interview Questions

Optimizers — How Gradient Descent Actually Works in Practice

The raw gradient update rule (w = w - lr × gradient) is just vanilla SGD. Modern networks use smarter optimizers:

SGD (Stochastic Gradient Descent):
  w ← w - α × ∂L/∂w
  Simple but slow — takes many steps, sensitive to learning rate.
  Works well with learning rate schedules and momentum.

SGD with Momentum:
  v ← β × v + ∂L/∂w      (accumulate velocity)
  w ← w - α × v
  β=0.9 is common. Like a ball rolling downhill — builds speed in consistent
  directions, dampens oscillations. Much faster than plain SGD.

Adam (Adaptive Moment Estimation):
  m ← β₁ × m + (1-β₁) × ∂L/∂w     (1st moment: mean gradient)
  v ← β₂ × v + (1-β₂) × (∂L/∂w)²  (2nd moment: mean squared gradient)
  m̂ = m / (1-β₁ᵗ),  v̂ = v / (1-β₂ᵗ)   (bias correction at step t)
  w ← w - α × m̂ / (√v̂ + ε)

  β₁=0.9, β₂=0.999, ε=1e-8 by default
  Adapts the learning rate per parameter: rarely updated (sparse) parameters get relatively larger steps.
  A robust default choice for most deep learning tasks.

RMSProp:
  v ← β × v + (1-β) × (∂L/∂w)²
  w ← w - α × ∂L/∂w / (√v + ε)
  Good for RNNs. Precursor to Adam.
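The update rules above can be written out in NumPy and run on a toy quadratic loss. This is a minimal sketch under stated assumptions: the loss, learning rates, and step counts are arbitrary choices, and the Adam version includes the usual bias-correction step that produces m̂ and v̂.

```python
import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = 0.5 * (10 * w[0]**2 + w[1]**2)
    return np.array([10.0, 1.0]) * w

def run_momentum(w, lr=0.01, beta=0.9, steps=500):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)            # accumulate velocity
        w = w - lr * v
    return w

def run_adam(w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g           # 1st moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2      # 2nd moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction (averages start at 0)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
w_momentum = run_momentum(w0)
w_adam = run_adam(w0)
print("momentum:", w_momentum)
print("adam:    ", w_adam)
```

Both runs should end close to the minimum at the origin. Note that Adam's early steps have nearly the same size in both coordinates even though the gradients differ by a factor of 10, which is the per-parameter scaling at work.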

Python

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

# Optimizer choices and when to use each:
optimizer_sgd   = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
optimizer_adam  = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
optimizer_adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# AdamW = Adam with decoupled weight decay (applied directly to the weights, not added to the gradient as an L2 penalty); a common default for transformer fine-tuning

# Learning rate schedule: reduce lr when validation loss plateaus
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer_adam, mode='min', factor=0.5, patience=3)

# Training loop
for epoch in range(10):
    train_loss = train_one_epoch(model, optimizer_adam)     # your training function
    val_loss   = evaluate(model)
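    scheduler.step(val_loss)    # halves lr once val_loss has failed to improve for more than 3 epochs
    current_lr = optimizer_adam.param_groups[0]['lr']   # read the lr from the optimizer; works across PyTorch versions
    print(f"Epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}, lr={current_lr:.6f}")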

Interview quick guide:

Optimizer	When to use
SGD + momentum	Image classifiers (ResNet, VGG). More stable final performance.
Adam	Default for everything else — transformers, MLPs, early experiments.
AdamW	Fine-tuning LLMs, BERT, GPT. Adam + decoupled weight decay.
RMSProp	RNN/LSTM training.

Weight Initialization — Why Starting Values Matter

Poorly scaled random initialization can cause gradients to vanish or explode before training even begins. The goal: keep activation and gradient variance roughly constant across layers.

The problem with random Normal(0, 1) initialization in deep networks:
  Layer 1 output: variance ≈ n × 1.0 (grows with width n)
  → After 10 layers: exponential explosion or collapse of signal

Xavier / Glorot Initialization (for Sigmoid/Tanh):
  std = sqrt(2 / (fan_in + fan_out))
  → designed so signal variance stays ~1.0 through sigmoid/tanh activations

He Initialization (for ReLU):
  std = sqrt(2 / fan_in)
  → designed so variance stays ~1.0 through ReLU (which zeros half the values)
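  → Not PyTorch's default: nn.Linear uses Kaiming-uniform with a=√5 (weights in ±1/√fan_in) regardless of the activation that follows, so apply nn.init.kaiming_normal_ explicitly for ReLU networks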
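The variance argument can be checked with a small NumPy experiment. The sketch below pushes unit-variance inputs through 10 ReLU layers, once with Normal(0, 1) weights and once with He-scaled weights. The width, depth, batch size, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 10
x = rng.normal(size=(512, width))            # unit-variance inputs

def final_std(weight_std):
    """Push x through `depth` ReLU layers whose weights are drawn with the given std."""
    a = x
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * weight_std
        a = np.maximum(0, a @ W)
    return float(a.std())

naive_std = final_std(1.0)                   # Normal(0, 1) weights
he_std = final_std(np.sqrt(2.0 / width))     # He initialization: std = sqrt(2 / fan_in)

print(f"Normal(0, 1): {naive_std:.3e}")
print(f"He:           {he_std:.3e}")
```

With Normal(0, 1) weights the activations grow by many orders of magnitude, while the He-scaled network keeps them at order 1.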

Python

import torch
import torch.nn as nn

# Manual initialization
def initialize_layer(layer: nn.Linear, activation: str = 'relu') -> None:
    if activation == 'relu':
        # He initialization: best for ReLU networks
        nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
    elif activation in ('sigmoid', 'tanh'):
        # Xavier initialization: best for sigmoid/tanh networks
        nn.init.xavier_uniform_(layer.weight)
    nn.init.zeros_(layer.bias)  # biases start at 0

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Initialize all Linear layers with He init
for layer in model:
    if isinstance(layer, nn.Linear):
        initialize_layer(layer, activation='relu')

# Verify: check output variance after initialization
x = torch.randn(100, 512)
with torch.no_grad():
    out = model(x)
    print(f"Output std: {out.std().item():.4f}")  # should be of order 1 (neither near 0 nor huge) with good init

Batch Normalization — Stabilizing Deep Networks

Without Batch Norm, deeper networks (>10 layers) are notoriously hard to train — activations drift and gradients vanish. Batch Norm normalizes activations within each mini-batch.

For each mini-batch B = {x₁, x₂, ..., xₘ}:

  1. Mean:      μ_B = (1/m) Σ xᵢ
  2. Variance:  σ²_B = (1/m) Σ (xᵢ - μ_B)²
  3. Normalize: x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)
  4. Scale:     yᵢ = γ × x̂ᵢ + β   (γ, β are learned per feature)

Benefits:
  → Each layer receives roughly zero-mean, unit-variance activations
  → Allows higher learning rates (training is more stable)
  → Acts as light regularization (noise from batch statistics)
  → Reduces sensitivity to weight initialization

Where to place it:
  Common pattern: Linear → BatchNorm → Activation → Dropout
  (Some papers prefer: Linear → Activation → BatchNorm — both work)
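The four steps above map directly onto NumPy. This sketch covers only the training-mode forward pass for a (batch, features) array; running statistics for eval mode and the backward pass are left out, and the function name is hypothetical.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over axis 0 (the batch axis) of an (m, features) array."""
    mu = x.mean(axis=0)                       # 1. per-feature mean
    var = x.var(axis=0)                       # 2. per-feature (biased, 1/m) variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # 3. normalize
    return gamma * x_hat + beta               # 4. scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))    # activations far from zero mean, unit variance
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(np.abs(y.mean(axis=0)).max())    # very close to 0 for every feature
print(y.std(axis=0).round(3))          # very close to 1 for every feature
```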

Python

import torch
import torch.nn as nn

# MLP with Batch Normalization
class MLPWithBatchNorm(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),   # normalize after linear, before activation
            nn.ReLU(),
            nn.Dropout(0.3),

            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.BatchNorm1d(hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.2),

            nn.Linear(hidden_dim // 2, output_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)

model = MLPWithBatchNorm(input_dim=128, hidden_dim=256, output_dim=10)

# CRITICAL: BatchNorm behaves differently in train vs eval mode!
model.train()   # uses batch statistics (mean, std of current batch)
model.eval()    # uses running statistics accumulated during training
# Always call model.eval() during validation and inference!

Transfer Learning — The Most Practical Skill in Deep Learning

Training from scratch often requires millions of labeled examples. Transfer learning lets you take a model pre-trained on a large dataset (ImageNet, web text) and adapt it to your specific task with far less data.

The Concept:
  A model trained on ImageNet (1.2M images, 1000 classes) has learned:
  - Layer 1: edge detectors
  - Layer 2: texture patterns
  - Layer 3: part detectors (wheels, faces, wings)
  - Layer 4: object recognizers

  For your medical imaging task, layers 1-3 are immediately useful!
  Only the final "head" needs to be learned from your data.

Three transfer learning strategies:
  1. Feature extraction:  Freeze ALL pre-trained weights, train only the new head
                          → fastest, best when your dataset is small (<1000 examples)
  2. Fine-tuning:        Unfreeze top N layers + train the head
                          → better results, needs more data (~5K+ examples)
  3. Full fine-tuning:   Train all layers at a very low learning rate
                          → best results, needs lots of data and compute

Python

import torch
import torch.nn as nn
import torchvision.models as models

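# Load a pre-trained ResNet-50 (trained on ImageNet)
# torchvision >= 0.13 uses the `weights` argument; `pretrained=True` is deprecated
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)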

# Strategy 1: Feature extraction — freeze everything except the final layer
for param in model.parameters():
    param.requires_grad = False    # freeze all layers

# Replace the final layer for your task (e.g., 2-class binary classification)
num_features = model.fc.in_features          # 2048 for ResNet-50
model.fc = nn.Linear(num_features, 2)        # your new head (trainable)

# Only the new head is trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / Total: {total_params:,}")  # ~4K / ~23.5M (the original 1000-class head is gone)

# Strategy 2: Fine-tuning — unfreeze top layers (layer4 + head)
for name, param in model.named_parameters():
    if 'layer4' in name or 'fc' in name:
        param.requires_grad = True     # unfreeze top ResNet block + head

# Use a smaller learning rate for pre-trained layers to preserve what they learned
optimizer = torch.optim.AdamW([
    {'params': model.layer4.parameters(), 'lr': 1e-5},   # smaller lr for pretrained
    {'params': model.fc.parameters(),     'lr': 1e-3},   # larger lr for new head
])

Transfer learning in NLP (identical concept):

Python

from transformers import AutoModel, AutoTokenizer
import torch.nn as nn

# Load BERT pre-trained on 3.3B words
bert = AutoModel.from_pretrained('bert-base-uncased')

# Freeze BERT, add classification head
for param in bert.parameters():
    param.requires_grad = False

class TextClassifier(nn.Module):
    def __init__(self, bert, num_classes):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(768, num_classes)  # 768 = BERT hidden size
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_token = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        return self.classifier(cls_token)

Practice

Beginner: Train a neural network on the MNIST dataset (handwritten digits) in PyTorch. Achieve >98% test accuracy using a two-layer MLP.
Neuron Gradient Test: Run the custom Python Neuron implementation from the top of this page. Pass random inputs and verify that backward returns the correct weight and bias gradients depending on whether the net input z was positive or negative.
CNN: Build a CNN for CIFAR-10 (10-class image classification). Use batch normalization and data augmentation. Target: >80% test accuracy.
RNN: Build a simple LSTM model in PyTorch to classify the sentiment of movie reviews (binary classification) or predict the next number in a sequence. Use an embedding layer and experiment with hidden sizes.
Hyperparameter tuning: Starting from a baseline MNIST model, experiment with: learning rate (1e-2, 1e-3, 1e-4), batch size (32, 256), dropout rates (0, 0.3, 0.5). Plot training curves and explain the results.

Next: Transformers & Attention — the architectural revolution behind modern LLMs.