Large Language Models (LLMs)

How LLMs work — from tokens to transformers to in-context learning. What prompting, temperature, and context windows mean, and how to build production applications with them.

What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained on vast amounts of text to predict the next token. That's it — everything else (reasoning, coding, translation, summarization) emerges from doing this prediction task extremely well at massive scale.

The Fortune Teller's Crystal Ball (Next-Token Probability): Imagine a fortune teller who predicts the future one word at a time. If you write: "The cat sat on the...", she looks in her crystal ball and sees floating words with percentage scores: "mat" (62%), "floor" (18%), "rug" (8%), "computer" (0.1%). She rolls a weighted die to pick one word (the temperature sets how adventurous the roll is), writes it down, and the process starts all over again. The LLM is that fortune teller, scaled up to score tens of thousands of possible subword options (tokens) at every step, around 50,000 in GPT-3.
The Well-Read Scholar (World Knowledge): Think of the model parameters as the memory paths of a scholar who has read all human books, code repositories, websites, and dialogues. When asked a question, she draws on these paths to generate the most coherent, likely completion.

Why "large"? Scale is the key ingredient:

GPT-3: 175 billion parameters
GPT-4: parameter count not officially disclosed (unofficial estimates are around 1 trillion)
Trained on trillions of tokens (words or word pieces)
Requires months on thousands of GPUs

Tokens — The Unit of LLMs

LLMs don't process words or characters; they process tokens. Tokens are common subword units:

"Hello, World!" → ["Hello", ",", " World", "!"]           = 4 tokens
"ChatGPT"       → ["Chat", "G", "PT"]                    = 3 tokens
"tokenization"  → ["token", "ization"]                    = 2 tokens
"đ"            → 2 byte-level pieces                     = 2 tokens (non-ASCII often splits more)

Rule of thumb:
  1 token ≈ 4 characters ≈ 0.75 words (for English)
  100 tokens ≈ 75 words

Why tokens matter practically:

Pricing: APIs charge per token (input + output)
Context window: models have a max token limit (4K to 1M+)
Weird behavior: models often miscount the r's in "strawberry" (there are 3) because they see tokens, not individual letters

Python

# Using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, how many tokens is this?")
print(f"Token count: {len(tokens)}")   # 8
print(f"Tokens: {[enc.decode([t]) for t in tokens]}")
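When an exact tokenizer is not available, the rule of thumb and the per-token pricing point above can be combined into a quick budget estimate. This is a minimal sketch: `estimate_tokens` and `estimate_cost_usd` are hypothetical helpers, and the prices are made-up placeholders rather than any provider's real rates.

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate using ~4 characters per token (English text only)."""
    return max(1, math.ceil(len(text) / 4))

def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Cost of one request, given prices per one million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000

prompt = "Explain binary search in simple terms."
n_in = estimate_tokens(prompt)

# Placeholder prices for illustration only; check your provider's price list.
cost = estimate_cost_usd(n_in, output_tokens=500,
                         input_price_per_million=1.0,
                         output_price_per_million=2.0)
print(f"~{n_in} input tokens, estimated cost ${cost:.6f}")
```

For billing-grade numbers, count tokens with the model's real tokenizer; the heuristic drifts for code and non-English text.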

How LLMs Work (Conceptually)

Pre-training — Learning Language

Objective: predict the next token
Input:  "The cat sat on the"
Target: "mat"

Input:  "def fibonacci(n): if n <= 1: return n return fibonacci("
Target: "n-1"

Repeat for TRILLIONS of examples from the internet, books, code
→ Model learns grammar, facts, reasoning, coding, math

The model has parameters (billions of numbers) that are adjusted during training to make better predictions. After training, these parameters are frozen.
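To make the objective concrete, the sketch below turns one sentence into (context, target) pairs and scores a prediction with cross-entropy, the loss commonly used for next-token prediction. The six-word vocabulary and the logits are toy values chosen for illustration, not output from a real model.

```python
import numpy as np

def next_token_pairs(token_ids):
    """Turn one token sequence into (context, target) training pairs."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Negative log-probability assigned to the correct next token."""
    shifted = logits - np.max(logits)                     # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[target])

# Toy vocabulary and sequence (illustrative only)
vocab = ["The", "cat", "sat", "on", "the", "mat"]
sequence = [0, 1, 2, 3, 4, 5]                             # "The cat sat on the mat"

pairs = next_token_pairs(sequence)
context, target = pairs[-1]
print([vocab[i] for i in context], "->", vocab[target])

# A confident, correct prediction gets a lower loss than a uniform guess
confident = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 5.0])
uniform = np.zeros(6)
print(cross_entropy(confident, target), cross_entropy(uniform, target))
```

Training adjusts the parameters to push this loss down across all pairs in the corpus.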

The Transformer Architecture

Modern LLMs use the Transformer architecture (2017, "Attention Is All You Need"):

┌──────────────────────────────────────────────────────────────┐
│                       Transformer                             │
│                                                               │
│  Input tokens: ["The", "cat", "sat", "on", "the"]            │
│       ↓                                                       │
│  Embedding Layer (token → dense vector)                       │
│       ↓                                                       │
│  Positional Encoding (add position information)               │
│       ↓                                                       │
│  ┌─────────────────────────────────────────┐                 │
│  │         Self-Attention Layer             │  × N layers    │
│  │   (each token attends to all others)    │                 │
│  │         Feed-Forward Layer              │                 │
│  └─────────────────────────────────────────┘                 │
│       ↓                                                       │
│  Output: probability distribution over all tokens             │
│  → most likely next token: "mat"                             │
└──────────────────────────────────────────────────────────────┘

Self-attention is the key innovation: each token can "look at" other tokens to understand context (in GPT-style decoder models, only the tokens that come before it). "bank" in "river bank" vs "national bank": attention lets the model distinguish the two based on surrounding words.
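The attention step can be sketched in a few lines of NumPy. This is a simplified single-head version with a causal mask, as used in GPT-style decoders; the weight matrices are random stand-ins for learned parameters, and real models add multiple heads, residual connections, and layer normalization.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns (output, attention_weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions <= i
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))                   # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, attn = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)           # (5, 4)
print(np.round(attn, 2))   # lower-triangular; each row sums to 1
```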

RLHF — Making Models Helpful

Pre-trained models predict text — they don't naturally follow instructions or avoid harmful content. RLHF (Reinforcement Learning from Human Feedback) fixes this:

Step 1: Supervised Fine-Tuning (SFT)
  Human experts write ideal (prompt, response) pairs
  Fine-tune the base model on these examples

Step 2: Reward Model Training
  Show humans pairs of model responses
  Humans pick which is better
  Train a reward model that predicts human preference

Step 3: RL with PPO
  Use the reward model to score outputs
  Fine-tune the LLM to produce higher-reward responses
  → Model learns to be helpful, harmless, honest

This process, together with related preference-tuning methods, is how ChatGPT, Claude, and Gemini are turned from raw language models into helpful assistants.
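Step 2 can be illustrated with the pairwise preference loss that reward models are commonly trained on. The text above does not name a specific loss, so treat this Bradley-Terry-style formula as an assumption; the reward values are made up.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-margin)), written to stay stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The reward model already prefers the human-chosen answer: small loss
print(preference_loss(reward_chosen=2.0, reward_rejected=0.0))

# The reward model prefers the rejected answer: large loss
print(preference_loss(reward_chosen=0.0, reward_rejected=2.0))
```

Minimizing this loss pushes the reward model to score the human-preferred response higher than the rejected one.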

Key Concepts for Developers

Context Window

The context window is the maximum number of tokens an LLM can process at once (input + output combined).

Model	Context Window
GPT-3.5-turbo	16K tokens (~12K words)
GPT-4o	128K tokens (~96K words)
Claude 3.5 Sonnet	200K tokens (~150K words)
Gemini 1.5 Pro	2M tokens (~1.5M words)

Everything outside the context window is "forgotten." Long document Q&A requires chunking and retrieval (see RAG).
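Chunking for long documents can be sketched as a sliding window over the token list. The helper name `chunk_tokens` and the overlap strategy are illustrative choices, not a prescribed method; the overlap keeps sentences that straddle a boundary visible in both neighboring chunks.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into chunks of at most chunk_size tokens.

    Consecutive chunks share `overlap` tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]

# Integers stand in for token IDs
chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```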

Temperature and Sampling

Temperature controls randomness in the model's output:

Python

# Temperature = 0: always picks the most likely token (greedy, focused; near-deterministic in practice)
# Best for: code, structured data, factual answers

# Temperature = 1: samples proportionally from the distribution (balanced)
# Best for: most conversational use cases

# Temperature > 1: more random, creative, surprising
# Best for: creative writing, brainstorming

# Top-P (nucleus sampling): only sample from the top P% of probability mass
# top_p=0.9: only consider tokens that together make up 90% of probability

import openai

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about databases"}],
    temperature=0.9,    # creative
    top_p=0.95,
    max_tokens=200,
)
print(response.choices[0].message.content)

Here is a complete Python implementation of a temperature-scaled decoding sampler showing greedy, Top-K, and Top-P (nucleus) selection over raw vocabulary logits:

Python

import numpy as np
from typing import Tuple

def decode_next_token(
    logits: np.ndarray, 
    temperature: float = 1.0, 
    top_k: int = 0, 
    top_p: float = 0.0
) -> Tuple[int, np.ndarray]:
    # logits shape: (vocab_size,)
    
    # 1. Handle Greedy Decoding (zero temperature)
    if temperature == 0.0:
        idx = int(np.argmax(logits))
        probs = np.zeros_like(logits)
        probs[idx] = 1.0
        return idx, probs
        
    # Apply temperature scaling
    scaled_logits = logits / temperature
    
    # 2. Apply Top-K filtering
    if top_k > 0:
        k = min(top_k, scaled_logits.size)  # guard against top_k > vocab size
        kth_largest = np.sort(scaled_logits)[-k]
        scaled_logits[scaled_logits < kth_largest] = -np.inf  # mask out
        
    # Convert logits to probabilities via Softmax
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    probs = exp_logits / np.sum(exp_logits)
    
    # 3. Apply Top-P (Nucleus) filtering
    if 0.0 < top_p < 1.0:
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]
        cumulative_probs = np.cumsum(sorted_probs)
        
        # Identify elements exceeding top_p threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift mask to keep the first token exceeding top_p
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].copy()
        sorted_indices_to_remove[0] = False
        
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        probs[indices_to_remove] = 0.0
        probs = probs / np.sum(probs)  # re-normalize
        
    # Sample from the final distribution
    sampled_idx = int(np.random.choice(len(probs), p=probs))
    return sampled_idx, probs

# Example logits for 5 vocabulary tokens: ["the", "cat", "sat", "mat", "door"]
vocab = ["the", "cat", "sat", "mat", "door"]
mock_logits = np.array([2.0, 1.5, 0.5, 3.0, -1.0])

# Test: low temperature (focused), Top-K=3
idx, final_probs = decode_next_token(mock_logits, temperature=0.5, top_k=3)
print("Vocabulary probabilities after filtering:\n", dict(zip(vocab, final_probs)))
print("Sampled next token:", vocab[idx])

Think it through like the interview

Don't just set parameters randomly — derive the correct decoding configurations from product requirements.

Think it through: LLM Decoding Strategies (Hyperparameter Selection)

Problem: Configure temperature, top-p, and top-k for three different tasks: (1) generating strict JSON outputs, (2) writing marketing copy, and (3) coding helper functions.

1. Task 1: Structured JSON output
“For generating a valid, strict JSON string, what should the temperature be, and why?”
2. Task 2: Creative marketing copy
“To write engaging, creative ad copy, how do we balance temperature and top-p to prevent the model from repeating generic, boring phrasing?”
3. Task 3: Code generation
“For code generation, we want accurate logic but might need alternative implementations. Why is Top-K helpful here?”

System Prompts

The system prompt sets the LLM's behavior and persona for the entire conversation:

Python

messages = [
    {
        "role": "system",
        "content": """You are PrepDeck AI, a world-class software engineering mentor.
Your goal is to help engineers crack FAANG interviews.
Always explain concepts from first principles.
When explaining algorithms, always include:
1. Intuition and analogy
2. Step-by-step breakdown
3. Code in Python
4. Time and space complexity
5. Common interview variations

Be encouraging but technically rigorous. Never give wrong information."""
    },
    {
        "role": "user",
        "content": "Explain binary search"
    }
]

Prompt Engineering

Prompt engineering is the art of crafting inputs to get better outputs.

Python

# ❌ Vague prompt
"Explain sorting"

# ✅ Specific, structured prompt
"""
Explain merge sort for a software engineering interview.
Include:
1. The problem it solves (and why quicksort isn't always better)
2. The divide-and-conquer intuition with an example
3. Python implementation
4. Time complexity: O(n log n) — explain why
5. Space complexity and when that matters
6. One trick interview question about merge sort
Keep the response under 500 words."""

# ✅ Chain-of-thought prompting (for complex reasoning)
"""
A train leaves station A at 9am going 60mph toward station B.
Another train leaves station B at 10am going 90mph toward station A.
The stations are 300 miles apart. When do they meet?

Think step by step:"""

# ✅ Few-shot prompting (show examples)
"""
Classify the sentiment:
Input: "This product is amazing!" → Positive
Input: "Terrible quality, broke in a day" → Negative
Input: "It's okay I guess" → Neutral
Input: "Never buying from here again!" → """
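In application code, few-shot prompts like the one above are usually assembled from data rather than hand-written. A minimal sketch, assuming a hypothetical `build_few_shot_prompt` helper that reproduces the same layout:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt from (text, label) examples."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f'Input: "{text}" → {label}')
    # Leave the final label blank so the model completes it
    lines.append(f'Input: "{query}" → ')
    return "\n".join(lines)

examples = [
    ("This product is amazing!", "Positive"),
    ("Terrible quality, broke in a day", "Negative"),
    ("It's okay I guess", "Neutral"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment:", examples, "Never buying from here again!"
)
print(prompt)
```

Keeping examples in a list makes it easy to swap them per task or select the most relevant ones at request time.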

Building with LLMs

Calling the OpenAI API

Python

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Simple completion
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful coding mentor."},
            {"role": "user", "content": question}
        ],
        temperature=0.7,
        max_tokens=1000,
    )
    return response.choices[0].message.content

# Streaming (show text as it's generated — better UX)
def ask_streaming(question: str):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks may carry no choices or no text, so check before printing
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

Multi-Turn Conversations

Python

class ConversationManager:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]
        self.max_history = 20  # limit history to control costs

    def chat(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})

        # Always resend the system prompt; slide the window over the rest
        recent = self.messages[1:][-self.max_history:]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[self.messages[0], *recent],
        )
        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})
        return assistant_message
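Limiting history by message count is simple, but cost and context limits are measured in tokens. The sketch below trims by an approximate token budget instead. `trim_history` and `rough_token_count` are hypothetical helpers; the counter uses the 4-characters-per-token heuristic, and the code assumes the first message is the system prompt.

```python
def rough_token_count(text: str) -> int:
    """Approximate token count (~4 characters per token)."""
    return len(text) // 4 + 1

def trim_history(messages, max_tokens, count_tokens=rough_token_count):
    """Keep the system prompt plus the newest messages that fit the budget."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system["content"])
    kept = []
    for message in reversed(rest):               # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return [system] + kept[::-1]                 # restore chronological order

history = [
    {"role": "system", "content": "You are a helpful coding mentor."},
    {"role": "user", "content": "What is a hash table?"},
    {"role": "assistant", "content": "A hash table maps keys to values using a hash function."},
    {"role": "user", "content": "How are collisions handled?"},
]
for message in trim_history(history, max_tokens=35):
    print(message["role"], "|", message["content"])
```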

Structured Output — JSON Mode

Python

from pydantic import BaseModel
from typing import List

class InterviewQuestion(BaseModel):
    question: str
    difficulty: str   # "easy" | "medium" | "hard"
    topic: str
    hints: List[str]
    time_complexity: str

# OpenAI Structured Outputs: the response is constrained to the InterviewQuestion schema
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Generate a binary search interview question"
    }],
    response_format=InterviewQuestion,
)
question = response.choices[0].message.parsed
print(f"Q: {question.question}")
print(f"Difficulty: {question.difficulty}")
print(f"Hints: {question.hints}")

Function Calling / Tool Use

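LLMs can request calls to external functions/APIs: the model returns a function name and JSON arguments, and your code runs the function and sends the result back.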

Python

import json

# Define tools the LLM can call
tools = [{
    "type": "function",
    "function": {
        "name": "search_documentation",
        "description": "Search the PrepDeck documentation for a topic",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "section": {
                    "type": "string",
                    "enum": ["dsa", "hld", "lld", "backend"],
                    "description": "Which section to search"
                }
            },
            "required": ["query"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find articles about hash tables"}],
    tools=tools,
    tool_choice="auto",
)

# Handle tool calls
if response.choices[0].message.tool_calls:
    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = search_documentation(**args)   # your actual function
        # Add result back to conversation...
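One way to structure the dispatch step is a registry that maps tool names to Python functions and wraps each result as a `tool` message. This sketch runs without network access: `search_documentation` here is a hypothetical stub, and `TOOL_REGISTRY` and `run_tool_call` are illustrative names.

```python
import json

def search_documentation(query: str, section: str = "dsa") -> list:
    """Hypothetical stand-in for a real search backend."""
    return [f"[{section}] article about {query}"]

TOOL_REGISTRY = {"search_documentation": search_documentation}

def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one requested tool and wrap the result as a `tool` message."""
    try:
        args = json.loads(arguments_json)
        result = TOOL_REGISTRY[name](**args)
        content = json.dumps(result)
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report the failure to the model instead of crashing the loop
        content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}

# Simulated tool call; in a real loop these values come from each tool_call object
message = run_tool_call("search_documentation", '{"query": "hash tables"}', "call_123")
print(message)
```

In the full loop, the assistant message containing the tool calls is appended to the conversation first, followed by one `tool` message per call, and the API is then called again so the model can write its final answer.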

Production Considerations

Cost management:
  - Use smaller models for simple tasks (gpt-4o-mini over gpt-4o)
  - Cache repeated responses (semantic caching with vector DB)
  - Limit context window usage

Latency:
  - Streaming responses feel faster (first token appears quickly)
  - Parallel calls when queries are independent

Reliability:
  - Retry with exponential backoff on rate limit errors
  - Fall back to an alternative model or provider on failures
  - Set max_tokens to prevent runaway costs

Safety:
  - Input validation (don't let users inject system prompt overrides)
  - Output filtering (content moderation API)
  - Never expose your system prompt directly to users
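The retry advice under Reliability can be sketched as a small wrapper. `RateLimitError` below is a hypothetical stand-in for whatever exception your client library raises on rate limits, and the delay values are arbitrary defaults.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""

def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise                                   # out of attempts
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))     # jitter spreads out retries

# Demo: a fake request that fails twice, then succeeds
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429: slow down")
    return "ok"

result = with_retries(flaky_request, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, "after", calls["n"], "attempts")
```

The jitter matters in production: without it, many clients that were rate-limited at the same moment retry at the same moment too.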

LLM Model Comparison

When building LLM-powered applications, selecting the right model involves balancing latency, cost, context window, and capability. Below is a comparison of major models used in industry:

Model	Provider	Context Window	Strengths	Cost Class	Ideal Use Cases
GPT-4o	OpenAI	128k	High-reasoning, multimodal, speed, structured output reliability	Premium	Complex reasoning, agent planning, tool usage, multi-step tasks
Claude 3.5 Sonnet	Anthropic	200k	Coding, technical writing, complex analytical reasoning, XML parsing	Premium	Software engineering automation, document analysis, data extraction
Gemini 1.5 Pro	Google	2M	Massive context window, native multimodal (video, audio), needle-in-a-haystack accuracy	Premium	Long document research, full codebase understanding, video analysis
Gemini 1.5 Flash	Google	1M	High speed, large context, low cost, multimodal	Budget	Fast summarization, simple structured data extraction, real-time RAG
Llama 3.1 (70B/405B)	Meta (Open Source)	128k	Self-hostable, custom fine-tuning, competitive reasoning	Variable (Hosting)	Private cloud deployments, domain-specific fine-tuning, offline processing

Open-Source vs. Proprietary LLMs

An essential architectural choice is deciding between proprietary (API-based) and open-source (self-hosted) models:

1. Proprietary LLMs (e.g., OpenAI, Anthropic, Google APIs)

Pros: No infrastructure overhead, state-of-the-art capability, managed safety filters, pay-per-token pricing (cheap for low volume).
Cons: Data privacy concerns (sending sensitive data to third parties), rate limits, model deprecation (APIs change or are retired), vendor lock-in.

2. Open-Source LLMs (e.g., Meta's Llama 3, Mistral, Microsoft's Phi-3)

Pros: Full control over data privacy (runs inside your VPC), downloaded weights stay available under their license terms (no forced deprecation), option to extend the tokenizer vocabulary and fine-tune on domain-specific datasets, more predictable response times (no public API congestion).
Cons: High infrastructure cost (GPU instances like A100/H100 are expensive), operational overhead (setup, scaling, cold starts), capability that often lags slightly behind the best proprietary models.

Embedding Models & Vector Space

While generative models (LLMs) produce text, Embedding Models convert text into numerical representations (dense vectors) in a high-dimensional space. These vectors capture the semantic meaning of the text.

How Embeddings Work

If two sentences are semantically similar (e.g., "How do I reset my password?" and "I forgot my login credentials"), their vectors will be close together in vector space, even if they share zero identical words. Conversely, semantically different sentences (e.g., "How do I reset my password?" and "What is the capital of France?") will point in completely different directions.

Python

# Conceptualizing Vector Distance
import numpy as np

# A 3-dimensional mock vector representation
vector_reset_password = np.array([0.9, 0.1, -0.05])
vector_forgot_login   = np.array([0.85, 0.15, -0.02])
vector_capital_france = np.array([-0.2, 0.7, 0.8])

# Cosine similarity helper
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(f"Similarity (Reset vs Forgot): {cosine_similarity(vector_reset_password, vector_forgot_login):.4f}")
# Outputs close to 1.0 (highly similar)
print(f"Similarity (Reset vs France): {cosine_similarity(vector_reset_password, vector_capital_france):.4f}")
# Outputs close to 0.0 or negative (unrelated)

Common Distance Metrics

To perform search over embeddings (e.g., in RAG), we calculate the distance between the query vector and document vectors:

Cosine Similarity: Measures the cosine of the angle between two vectors. Value ranges from -1 to 1. Ignores vector magnitude (useful when text lengths vary).
Dot Product: Measures projection of one vector onto another. Value depends on magnitude. If vectors are normalized to unit length (magnitude of 1), Dot Product is mathematically identical to Cosine Similarity and is highly optimized on GPUs.
Euclidean Distance (L2): Measures the straight-line distance between two points in space. Smaller values mean higher similarity.
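The relationship between the three metrics can be checked numerically. The vectors below are mock 3-dimensional values chosen for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

query = np.array([0.9, 0.1, -0.05])
doc = np.array([1.7, 0.3, -0.04])     # similar direction, larger magnitude

cos = cosine_similarity(query, doc)
q_hat, d_hat = normalize(query), normalize(doc)

print("cosine:              ", cos)
print("dot (raw):           ", float(np.dot(query, doc)))   # depends on magnitude
print("dot (unit vectors):  ", float(np.dot(q_hat, d_hat))) # equals the cosine
print("L2 (unit vectors):   ", euclidean_distance(q_hat, d_hat))
```

On unit vectors the squared Euclidean distance equals 2 - 2·cos, so all three metrics rank documents in the same order once embeddings are normalized.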

Hallucinations: Causes & Mitigation

Hallucination is when an LLM generates text that is factually incorrect, logically inconsistent, or completely fabricated, yet presented with high confidence.

Root Causes

Objective Function Bias: LLMs are trained to predict the most probable next token, not the most truthful one. They favor fluent, plausible-sounding text over verified facts.
Information Gap / Lack of Source: The model is asked for details it never saw or only weakly learned during training (for example, events after its training cutoff or private data), so it fills the gap with a plausible guess.
Attention Decay / Context Window Saturation: In long inputs, key instructions or facts can receive too little attention, causing the model to fall back on generic patterns from its training data.
Imperfect Tokenization: Numbers or characters can be split in ways that hinder arithmetic and character-level reasoning (e.g., a tokenizer may split 1024 into the two tokens 10 and 24).

Mitigation Techniques

Retrieval-Augmented Generation (RAG): Do not rely on internal memory for facts. Retrieve external documents, inject them into the prompt, and instruct the model to answer only from them and cite its sources.
Temperature Control: Lower the temperature parameter to 0.0 or 0.1 for factual tasks. This makes the model pick the highest-probability tokens, which reduces random variation but does not by itself make the answer correct.
Structured Output (Schema Enforcement): Force the model to output JSON adhering to a strict schema (e.g., using Pydantic in OpenAI's API) to prevent it from outputting generic conversational prose.
Self-Consistency / Chain-of-Thought (CoT): Ask the model to "think step-by-step" before writing the answer. Generate multiple reasoning paths and take a majority vote on the final answer.
Critique and Refine (Self-Correction): Programmatically prompt the LLM to review its own output against constraints and correct errors before returning the response to the user.
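The majority-vote step of self-consistency is easy to sketch once several answers have been sampled. The answers below are made-up stand-ins for model outputs, and `majority_vote` is a hypothetical helper; real pipelines usually need task-specific answer extraction and normalization first.

```python
from collections import Counter

def majority_vote(answers):
    """Return (most common answer, agreement ratio) after light normalization."""
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None, 0.0
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)

# Final answers extracted from four sampled reasoning paths (made up)
sampled_answers = ["11:36 AM", "11:36 am", "11:30 AM", " 11:36 AM "]
answer, agreement = majority_vote(sampled_answers)
print(answer, agreement)
```

A low agreement ratio is itself a useful signal: it can trigger a fallback such as retrieval, a stronger model, or a human review.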

Practice

Basic: Build a command-line chatbot with conversation history using the OpenAI API. Limit history to the last 10 messages.
Decoding Lab: Run the custom Python decoding sampler code provided above. Test changing logits and setting top_p or top_k values. Verify how the vocabulary probability values change after applying nucleus vs top-k thresholds.
Structured Output: Create a tool that takes a job description and returns structured data: required skills, seniority level, company stage, and estimated salary range.
Function Calling: Build an LLM that can answer questions about your local codebase by calling a read_file and search_files function.
Streaming UI: Build a Next.js chat interface that streams the LLM response token by token using Server-Sent Events.

Next: RAG — augmenting LLMs with your own data.
