Prompt
Design a pipeline that ingests data from external sources (some with legal restrictions), enriches it, and serves it — such that consumers always know how fresh and trustworthy each value is, and the system never fabricates data.
1. Requirements
Functional: fetch from registered sources → normalize → enrich (derive scores) → store → serve; expose freshness/health. Non-functional: legal compliance (respect ToS/robots), reliability (rate limits, retries, failover), and honesty (explicit "unavailable" instead of fake values).
2. High-level design
3. Deep dives
Compliance gate. A SOURCE_REGISTRY records each source's license and
legality; a RobotsGate makes prohibited sources (scraping-banned listing
portals) refuse to run by design. A licensed feed plugs in as a new
permitted source — compliance is structural, not a code comment.
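A minimal sketch of how such a gate could be wired, assuming a dictionary-backed registry. The `Source` fields, the registry entries, the `SourceNotPermitted` exception, and the `fetch` helper are hypothetical names chosen for illustration; only `SOURCE_REGISTRY` and `RobotsGate` come from the design above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    license: str
    permitted: bool  # False for sources whose terms prohibit automated access


# Hypothetical registry entries, for illustration only.
SOURCE_REGISTRY = {
    "open_data_portal": Source("open_data_portal", "CC-BY-4.0", permitted=True),
    "listing_portal": Source("listing_portal", "proprietary, scraping banned", permitted=False),
}


class SourceNotPermitted(Exception):
    """Raised when a fetch is attempted against a prohibited or unknown source."""


class RobotsGate:
    def __init__(self, registry):
        self._registry = registry

    def check(self, source_name: str) -> Source:
        source = self._registry.get(source_name)
        if source is None:
            raise SourceNotPermitted(f"{source_name!r} is not registered")
        if not source.permitted:
            raise SourceNotPermitted(f"{source_name!r} is prohibited: {source.license}")
        return source


def fetch(source_name: str, gate: RobotsGate, fetcher):
    # The gate runs first, so a prohibited source fails before any network call.
    source = gate.check(source_name)
    return fetcher(source)
```

Because every fetch goes through `gate.check`, onboarding a licensed feed is a registry change rather than a code change, and unregistered sources are refused by default.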
Reliable fetching. Per-host rate limiting, retry with backoff that
honors Retry-After, endpoint failover, and an on-disk TTL cache so a
flaky upstream doesn't take you down and you don't hammer it.
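The retry and failover behaviour could look roughly like the sketch below. It assumes a hypothetical `send(url)` callable that returns `(status, headers, body)` with `headers` as a plain dict, and it only handles the seconds form of `Retry-After` (the header may also carry an HTTP date). Rate limiting and the TTL cache are left out to keep the example short.

```python
import random
import time


def parse_retry_after(headers):
    """Return the Retry-After delay in seconds, or None if absent or not numeric."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch


def fetch_with_retry(send, endpoints, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Try each endpoint in order; retry 429 and 5xx with exponential backoff.

    Returns the body on success, or None when every endpoint is exhausted.
    """
    for url in endpoints:
        for attempt in range(max_attempts):
            status, headers, body = send(url)
            if status == 200:
                return body
            if status != 429 and status < 500:
                break  # non-retryable client error: fail over to the next endpoint
            if attempt == max_attempts - 1:
                break  # out of attempts for this endpoint
            # Exponential backoff with jitter.
            backoff = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            retry_after = parse_retry_after(headers)
            # Never wait less than the server asked for.
            sleep(backoff if retry_after is None else max(backoff, retry_after))
    return None
```

Returning `None` rather than raising or substituting a default keeps the caller responsible for reporting the value as unavailable.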
The provenance envelope. Every record carries
source · license · fetched_at · confidence · freshness_score. If a source is
down, the API returns available: false — never a fabricated number.
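One way to represent the envelope is a small immutable record, sketched here. The field names follow the list above; the `value` field, the linear-decay freshness formula, and the `wrap` and `unavailable` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    value: Optional[float]
    available: bool
    source: str
    license: str
    fetched_at: Optional[datetime]
    confidence: float
    freshness_score: float


def freshness_score(fetched_at: datetime, now: datetime, ttl: timedelta) -> float:
    """Linear decay from 1.0 at fetch time to 0.0 at one TTL of age (assumed formula)."""
    age = (now - fetched_at).total_seconds()
    return max(0.0, 1.0 - age / ttl.total_seconds())


def wrap(value, source, license, fetched_at, now, ttl, confidence) -> Envelope:
    """Attach provenance to a value that was actually fetched."""
    score = freshness_score(fetched_at, now, ttl)
    return Envelope(value, True, source, license, fetched_at, confidence, score)


def unavailable(source, license) -> Envelope:
    """Explicit 'no data' record: no value, no timestamp, zero confidence."""
    return Envelope(None, False, source, license, None, 0.0, 0.0)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
record = wrap(42.0, "open_data_portal", "CC-BY-4.0",
              now - timedelta(minutes=30), now, timedelta(hours=1), confidence=0.9)
missing = unavailable("open_data_portal", "CC-BY-4.0")
```

Here `record.freshness_score` is 0.5 (30 minutes into a 1-hour TTL), while `missing` carries `available=False` and `value=None`, so a consumer cannot mistake it for a real reading.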
"Never show a dead value as if it were live." Each datapoint is labeled Real / Curated / Heuristic / Simulated, so the UI can show a freshness/confidence badge. Designing for honesty under failure is a rare and strong signal.
4. Scaling it up
- Replace the per-process TTL file cache with Redis; add a CDN for read API responses.
- Move fetching to a queue + scheduled workers (one job per source), with a DLQ for sources that keep failing.
- Make enrichment idempotent and keyed by (source, entity, fetched_at) so reruns don't double-count.
- Add observability: per-source success rate, freshness SLOs, alert when a source goes stale.
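The idempotency bullet can be illustrated with an upsert keyed on the composite key. This is a sketch using an in-memory dict as a stand-in for the real store; `EnrichmentStore`, `enrich`, and the record shape are hypothetical.

```python
class EnrichmentStore:
    """In-memory stand-in for a table with a unique key on (source, entity, fetched_at)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, source, entity, fetched_at, score):
        """Write the score for this key; return True only if the key was new."""
        key = (source, entity, fetched_at)
        is_new = key not in self._rows
        self._rows[key] = score  # overwrite, never append
        return is_new

    def count(self):
        return len(self._rows)

    def total(self):
        return sum(self._rows.values())


def enrich(store, records, score_fn):
    """Derive a score per record and upsert it; safe to re-run on the same input."""
    for r in records:
        store.upsert(r["source"], r["entity"], r["fetched_at"], score_fn(r))


records = [
    {"source": "feed_a", "entity": "e1", "fetched_at": "2024-01-01T00:00:00Z", "raw": 3},
    {"source": "feed_a", "entity": "e2", "fetched_at": "2024-01-01T00:00:00Z", "raw": 5},
]
store = EnrichmentStore()
enrich(store, records, lambda r: r["raw"] * 2)
enrich(store, records, lambda r: r["raw"] * 2)  # replay after a crash
```

After the replay the store still holds two rows with a total of 16, because the second run overwrites the same keys instead of adding new ones.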
5. Trade-offs
Honesty over coverage (we'd rather say "unavailable" than guess); compliance over completeness (we won't scrape banned sources even if doing so would add data); freshness vs. upstream load in caching (TTL tuned per source volatility).
Design drills
Ingestion is where data quality and reliability are won. Drill the seams.
Whiteboard each one out loud for 5–10 minutes before you reveal what a strong answer covers; the gap between your sketch and the checklist is your study list.
Lay out the stages of an ingestion pipeline from external source to queryable store.
Make every stage safe to re-run after a crash or replay.
A source is rate-limited and occasionally returns malformed data. Handle both without poisoning downstream.
Backfill two years of history without disrupting live ingestion.