Case Study — Provenance-Aware Ingestion

A full high-level design (HLD) walkthrough using LandAI: pull data from third-party sources legally and reliably, wrap every record in provenance, and never present a guess as a fact.

Tags: case-study, data-pipeline, compliance, reliability

Prompt

Design a pipeline that ingests data from external sources (some with legal restrictions), enriches it, and serves it — such that consumers always know how fresh and trustworthy each value is, and the system never fabricates data.

1. Requirements

Functional: fetch from registered sources → normalize → enrich (derive scores) → store → serve; expose freshness/health. Non-functional: legal compliance (respect ToS/robots), reliability (rate limits, retries, failover), and honesty (explicit "unavailable" instead of fake values).

2. High-level design

3. Deep dives

Compliance gate. A SOURCE_REGISTRY records each source's license and legal status; a RobotsGate refuses, by design, to run any prohibited source (for example, listing portals that ban scraping). A licensed feed plugs in as a new permitted source, so compliance is structural rather than a code comment.
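A minimal sketch of how such a gate might look, assuming the registry is a plain dictionary. The `SourceSpec` fields, the two example entries, and the `SourceNotPermitted` exception are hypothetical names for illustration; the case study names only SOURCE_REGISTRY and RobotsGate.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SourceSpec:
    """Hypothetical registry entry: what we are allowed to do with a source."""
    name: str
    license: str             # e.g. "open-data", "commercial-feed", "none"
    scraping_allowed: bool   # outcome of a ToS/robots review


# Illustrative entries only; the real registry contents are not shown in the text.
SOURCE_REGISTRY: Dict[str, SourceSpec] = {
    "open_gov_parcels": SourceSpec("open_gov_parcels", "open-data", True),
    "listing_portal_x": SourceSpec("listing_portal_x", "none", False),
}


class SourceNotPermitted(Exception):
    """Raised when a fetch targets an unregistered or prohibited source."""


class RobotsGate:
    def __init__(self, registry: Dict[str, SourceSpec]) -> None:
        self._registry = registry

    def check(self, source_name: str) -> SourceSpec:
        """Return the spec if fetching is permitted, otherwise raise."""
        spec = self._registry.get(source_name)
        if spec is None:
            raise SourceNotPermitted(f"{source_name}: not in registry")
        if not spec.scraping_allowed:
            raise SourceNotPermitted(f"{source_name}: prohibited by ToS/robots")
        return spec
```

Because every fetcher has to call `check` first, adding a licensed feed is a one-line registry change, and an unregistered source fails closed.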

Reliable fetching. Per-host rate limiting, retry with backoff that honors Retry-After, endpoint failover, and an on-disk TTL cache so a flaky upstream doesn't take you down and you don't hammer it.
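The retry and failover part of this can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual client: `send` is a hypothetical stand-in for the HTTP call, responses are modelled as `(status, headers, body)` tuples, only the delta-seconds form of Retry-After is parsed, and per-host rate limiting and the TTL cache are left out for brevity.

```python
import random
import time
from typing import Callable, Dict, Optional, Sequence, Tuple

Response = Tuple[int, Dict[str, str], bytes]   # (status, headers, body)
RETRYABLE = {429, 500, 502, 503, 504, 599}     # 599 = local marker for a network error


def retry_delay(attempt: int, headers: Dict[str, str],
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before the next try; `attempt` is 0-based."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)               # the server's hint wins
    backoff = min(cap, base * (2 ** attempt))   # exponential, capped
    return random.uniform(0, backoff)           # full jitter


def fetch_with_retry(send: Callable[[str], Response],
                     endpoints: Sequence[str],
                     max_attempts: int = 4,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[Response]:
    """Try each endpoint in order; return None if every one is exhausted."""
    for endpoint in endpoints:                  # failover to the next endpoint
        for attempt in range(max_attempts):
            try:
                status, headers, body = send(endpoint)
            except OSError:
                status, headers, body = 599, {}, b""
            if status == 200:
                return status, headers, body
            if status not in RETRYABLE:
                break                           # not worth retrying here
            if attempt < max_attempts - 1:
                sleep(retry_delay(attempt, headers))
    return None
```

A `None` result is the signal for the serving layer to report the value as unavailable rather than substitute a guess.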

The provenance envelope. Every record carries source · license · fetched_at · confidence · freshness_score. If a source is down, the API returns available: false — never a fabricated number.
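One way to model the envelope is a frozen dataclass. The field names follow the text; the `value` field, the helper names, and the linear freshness decay over a per-source TTL are assumptions made for this sketch, since the case study does not say how freshness_score is computed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Envelope:
    available: bool
    value: Optional[Any]
    source: str
    license: str
    fetched_at: Optional[datetime]
    confidence: Optional[float]
    freshness_score: Optional[float]


def freshness(fetched_at: datetime, now: datetime, ttl_seconds: float) -> float:
    """Assumed scoring: 1.0 when just fetched, decaying linearly to 0.0 at the TTL."""
    age = (now - fetched_at).total_seconds()
    return max(0.0, 1.0 - age / ttl_seconds)


def wrap(value: Any, source: str, license: str, fetched_at: datetime,
         confidence: float, now: datetime, ttl_seconds: float) -> Envelope:
    """Wrap a real value with its provenance."""
    score = freshness(fetched_at, now, ttl_seconds)
    return Envelope(True, value, source, license, fetched_at, confidence, score)


def unavailable(source: str, license: str) -> Envelope:
    """The honest failure case: no value, no invented scores."""
    return Envelope(False, None, source, license, None, None, None)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fetched = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
record = wrap(412.5, "open_gov_parcels", "open-data", fetched, 0.9, now, ttl_seconds=7200)
print(record.freshness_score)  # 0.5: one hour old against a two-hour TTL
```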

The principle that sells this design

"Never show a dead value as if it were live." Each datapoint is labeled Real / Curated / Heuristic / Simulated, so the UI can show a freshness/confidence badge. Designing for honesty under failure is a rare and strong signal.
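As a rough illustration of how the labels could drive a UI badge, the sketch below uses the four labels from the text. The `badge` function, the "(stale)" suffix, and the 0.2 staleness threshold are hypothetical choices, not part of the described design.

```python
from enum import Enum
from typing import Optional


class Origin(Enum):
    REAL = "Real"
    CURATED = "Curated"
    HEURISTIC = "Heuristic"
    SIMULATED = "Simulated"


def badge(origin: Origin, available: bool,
          freshness_score: Optional[float], stale_below: float = 0.2) -> str:
    """Text for the UI badge; never presents a dead value as live."""
    if not available or freshness_score is None:
        return "Unavailable"
    if freshness_score < stale_below:           # assumed threshold
        return f"{origin.value} (stale)"
    return origin.value
```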

4. Scaling it up

  • Replace the per-process TTL file cache with Redis; add a CDN for read API responses.
  • Move fetching to a queue + scheduled workers (one job per source), with a DLQ for sources that keep failing.
  • Make enrichment idempotent and keyed by (source, entity, fetched_at) so reruns don't double-count.
  • Add observability: per-source success rate, freshness SLOs, alert when a source goes stale.
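The idempotency point in the list above can be illustrated with a keyed upsert. This is a sketch only: the in-memory dictionary stands in for a database table with a unique constraint on (source, entity, fetched_at), and `EnrichmentStore` is a hypothetical name.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str]   # (source, entity, fetched_at as an ISO-8601 string)


class EnrichmentStore:
    """In-memory stand-in for a table keyed by (source, entity, fetched_at)."""

    def __init__(self) -> None:
        self._rows: Dict[Key, float] = {}

    def upsert(self, source: str, entity: str, fetched_at: str, score: float) -> bool:
        """Write the derived score; return True only if the key was new."""
        key = (source, entity, fetched_at)
        is_new = key not in self._rows
        self._rows[key] = score      # a rerun overwrites the row, it never appends
        return is_new

    def count(self) -> int:
        return len(self._rows)


store = EnrichmentStore()
store.upsert("open_gov_parcels", "parcel-17", "2024-01-01T11:00:00Z", 0.82)
store.upsert("open_gov_parcels", "parcel-17", "2024-01-01T11:00:00Z", 0.82)  # replayed job
print(store.count())  # 1
```

A replayed or retried worker job produces the same key, so aggregate counts stay correct no matter how many times the job runs.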

5. Trade-offs

Honesty over coverage (we'd rather say "unavailable" than guess); compliance over completeness (we won't scrape banned sources even if it'd add data); caching freshness vs. load (TTL tuned per source volatility).

Design drills

Ingestion is where data quality and reliability are won. Drill the seams.

Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers; the gap between your sketch and the checklist is your study list.

Warm-up

Lay out the stages of an ingestion pipeline from external source to queryable store.

Core

Make every stage safe to re-run after a crash or replay.

Core

A source is rate-limited and occasionally returns malformed data. Handle both without poisoning downstream.

Stretch

Backfill two years of history without disrupting live ingestion.