LandAI — Urban Growth Prediction

A full-stack ML platform that predicts where India's land value will rise by matching emerging Tier-3 cities to their historically similar Tier-2 'twins'. FastAPI + React, four analytical engines, transparency-first.

FastAPI · ML · XGBoost · FAISS · PostGIS · React

One-liner

LandAI predicts where India's land value will rise before prices move. It finds an emerging Tier-3 city's more-developed "twin", time-shifts that twin's real trajectory onto the target, and forecasts which zones develop over the next 5–10 years — with every number carrying its provenance.

Problem & motivation

Indian land investing is dominated by anecdote and broker incentives. There's plenty of historical price data but no honest, explainable view of future direction for the long tail of Tier-3 cities. LandAI's bet: a city's growth rhymes with that of a structurally similar city that developed earlier, so similarity plus the twin's observed history makes a more defensible forecast than a black-box regressor alone.

The hard constraint I set: never present curated data as live, a heuristic as a forecast, or rule-based NLP as an LLM. That honesty requirement shaped the whole architecture.

Architecture

A React + Vite SPA talks to a FastAPI backend whose route modules fan out to independent engines (ml, nlp, cv, geo, ingestion, services). Persistence is a curated 116-city dataset backed by Postgres/PostGIS, with an in-memory fallback so the app still runs without a database.
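
The in-memory fallback could be wired up roughly as follows. This is a minimal sketch under assumptions, not the project's actual code: `City`, `CityStore`, `InMemoryCityStore`, `make_store` and the `DATABASE_URL` environment variable are hypothetical names, and the PostGIS-backed store is left out.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class City:
    # Hypothetical record shape; the real dataset will carry more fields.
    name: str
    tier: int
    lat: float
    lon: float


class CityStore(Protocol):
    def get(self, name: str) -> City | None: ...
    def by_tier(self, tier: int) -> list[City]: ...


class InMemoryCityStore:
    """Serves the curated dataset from a dict; needs no database."""

    def __init__(self, cities: list[City]) -> None:
        self._by_name = {c.name.lower(): c for c in cities}

    def get(self, name: str) -> City | None:
        return self._by_name.get(name.lower())

    def by_tier(self, tier: int) -> list[City]:
        return [c for c in self._by_name.values() if c.tier == tier]


def make_store(cities: list[City], database_url: str | None = None) -> CityStore:
    """Pick a backend: in-memory unless a database URL is configured."""
    url = database_url or os.environ.get("DATABASE_URL")
    if not url:
        return InMemoryCityStore(cities)
    # A PostGIS-backed store would be constructed here (omitted in this sketch).
    raise NotImplementedError("database-backed store is not part of this sketch")
```

Route modules that depend only on the `CityStore` protocol do not need to know which backend they were handed, which is what lets the app start without Postgres.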

Tech stack

Layer       Choices
Frontend    React, Vite, interactive map, client-side watchlist (localStorage)
API         FastAPI, Pydantic, Alembic migrations
ML          XGBoost gradient-boosted regressor + TreeSHAP attribution (scikit-learn fallback)
NLP         TF-IDF + rule-based information extraction
CV          scipy.ndimage morphology over urban-footprint rasters
Spatial     shapely geometry, PostGIS via GeoAlchemy2 (in-memory fallback)
Similarity  cosine similarity, FAISS-accelerated at scale (NumPy fallback)
Ingestion   OpenStreetMap Overpass + Nominatim (both ODbL)

The four engines

  • Price model (ML). A trained XGBoost regressor predicts land-price CAGR from infrastructure + demographic features, and emits per-prediction TreeSHAP attributions so each forecast comes with "why" drivers.
  • Infrastructure signals (NLP). TF-IDF + rule-based extraction turns announcements (highways, airports, metros, industrial corridors) into scored leading indicators with an impact score and a lead time.
  • Urban-growth raster (CV). Morphology over per-year footprint rasters yields compactness, fragmentation and a dominant growth direction.
  • Spatial (Geo). shapely growth-ring geometry served as GeoJSON; PostGIS-ready for real spatial queries at scale.
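
As an illustration of the rule-based half of the infrastructure-signal engine, extraction might look something like the sketch below. The keyword patterns, impact scores and lead times are made-up placeholders rather than the project's calibrated values, `RULES`, `Signal` and `extract_signals` are hypothetical names, and the TF-IDF stage is omitted.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical rule table: (pattern, project kind, impact score 0-1, lead time in years).
# All numbers are illustrative placeholders only.
RULES = [
    (re.compile(r"\b(airport|aerodrome)\b", re.I), "airport", 0.9, 6),
    (re.compile(r"\bmetro\b", re.I), "metro", 0.8, 5),
    (re.compile(r"\bindustrial corridor\b", re.I), "industrial_corridor", 0.7, 7),
    (re.compile(r"\b(highway|expressway)\b", re.I), "highway", 0.6, 3),
]


@dataclass(frozen=True)
class Signal:
    kind: str
    impact: float
    lead_time_years: int
    matched_text: str


def extract_signals(announcement: str) -> list[Signal]:
    """Return one Signal per rule that fires, strongest impact first."""
    signals = []
    for pattern, kind, impact, lead in RULES:
        match = pattern.search(announcement)
        if match:
            signals.append(Signal(kind, impact, lead, match.group(0)))
    return sorted(signals, key=lambda s: s.impact, reverse=True)
```

Because every score traces back to a named rule and the matched text, the output stays explainable and is never mistaken for LLM output.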

Notable algorithm — "City DNA" twin matching

Each city is a feature vector (infrastructure, demographics, price history). Finding a twin is a nearest-neighbour search under cosine similarity; at small scale it's a NumPy matmul, and FAISS takes over when the catalogue grows.

Python
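import numpy as np

def most_similar(target: np.ndarray, catalogue: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k catalogue rows most similar to target, best first."""
    # L2-normalise so the dot product equals cosine similarity
    t = target / np.linalg.norm(target)
    C = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    sims = C @ t                              # O(n·d)
    k = min(k, sims.shape[0])                 # guard: argpartition needs kth < n
    idx = np.argpartition(-sims, k - 1)[:k]   # top-k, in arbitrary order
    return idx[np.argsort(-sims[idx])]        # order the k winners by similarity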
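
  • Brute force: O(n·d) per query, which is fine for 116 cities.
  • FAISS: builds an index once (e.g. IVF/HNSW) so queries become sub-linear; this is the upgrade path when the catalogue reaches 100k+ vectors. The NumPy and FAISS paths return identical results in tests, which made FAISS a safe, test-backed optimization rather than a rewrite. That equality is guaranteed only for an exact index such as IndexFlatIP; IVF and HNSW are approximate and trade some recall for speed.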
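
The FAISS path with its NumPy fallback could be sketched like this. It assumes an exact `faiss.IndexFlatIP` index (not IVF/HNSW), which is what makes the two paths agree; `top_k`, `top_k_numpy`, `top_k_faiss` and `_unit_rows` are hypothetical names, and the index is rebuilt per call only to keep the example short.

```python
import numpy as np

try:
    import faiss  # optional dependency
except ImportError:
    faiss = None


def _unit_rows(x: np.ndarray) -> np.ndarray:
    """Return a float32, C-contiguous copy with L2-normalised rows."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top_k_numpy(target: np.ndarray, catalogue: np.ndarray, k: int) -> np.ndarray:
    """Brute-force cosine search: one matrix-vector product, then a sort."""
    sims = _unit_rows(catalogue) @ _unit_rows(target[None, :])[0]
    return np.argsort(-sims, kind="stable")[:k]


def top_k_faiss(target: np.ndarray, catalogue: np.ndarray, k: int) -> np.ndarray:
    # IndexFlatIP is an exact inner-product index; on unit vectors
    # the inner product equals cosine similarity.
    index = faiss.IndexFlatIP(catalogue.shape[1])
    index.add(_unit_rows(catalogue))
    _, ids = index.search(_unit_rows(target[None, :]), k)
    return ids[0]


def top_k(target: np.ndarray, catalogue: np.ndarray, k: int = 1) -> np.ndarray:
    """Use FAISS when it is installed, otherwise fall back to NumPy."""
    k = min(k, catalogue.shape[0])
    if faiss is not None:
        return top_k_faiss(target, catalogue, k)
    return top_k_numpy(target, catalogue, k)
```

A parity test then only has to assert that `top_k_numpy` and `top_k_faiss` return the same indices on the same inputs; near-ties can still order differently because of float32 rounding, so test vectors should be well separated.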

Why this reads well in an interview

It's a clean story arc: a real product need (find a twin) → the right CS primitive (kNN / cosine) → a measured scaling decision (FAISS) → a correctness guarantee (identical results, NumPy fallback). That's exactly the problem→primitive→trade-off→validation shape interviewers want.

Engineering decisions & trade-offs

  • Provenance envelope. Every live datapoint is wrapped with source · license · fetched_at · confidence · freshness_score. If a source is down, the API returns an explicit available: false envelope instead of a fabricated number. This is a deliberate trade of "always show something" for "never lie".
  • Compliance gating. Listing portals prohibit scraping in their ToS, so a RobotsGate plus a SOURCE_REGISTRY make the listing-portal adapter refuse to run by design; a licensed feed can later plug in as a new permitted source.
  • Graceful degradation everywhere. FAISS→NumPy, PostGIS→in-memory, XGBoost→scikit-learn. The app demos and runs even when the heavy dependency is missing.
  • Production-grade ingestion plumbing. Per-host rate limiting, retry/backoff honoring Retry-After, endpoint failover and on-disk TTL caching.
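
To make the provenance envelope and the registry gate concrete, here is a minimal sketch. The field names follow the list above, but the `Source` and `Envelope` classes, the `fetch` helper, the registry entries and the `reason` field are assumptions for illustration, not the project's actual definitions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass(frozen=True)
class Source:
    name: str
    license: str
    permitted: bool  # False for sources whose terms forbid automated access


# Hypothetical registry entries for illustration.
SOURCE_REGISTRY = {
    "osm_overpass": Source("OpenStreetMap Overpass", "ODbL", permitted=True),
    "listing_portal": Source("Listing portal", "proprietary", permitted=False),
}


@dataclass(frozen=True)
class Envelope:
    available: bool
    source: str
    license: str
    fetched_at: str | None = None
    confidence: float | None = None
    freshness_score: float | None = None
    value: Any = None
    reason: str | None = None  # assumed extra field explaining unavailability


def fetch(source_id: str, fetcher: Callable[[], Any], confidence: float) -> Envelope:
    """Wrap a fetch so callers always get provenance, never a made-up value."""
    src = SOURCE_REGISTRY[source_id]
    if not src.permitted:
        return Envelope(False, src.name, src.license, reason="source not permitted")
    try:
        value = fetcher()
    except Exception as exc:  # source down, timeout, bad payload, ...
        return Envelope(False, src.name, src.license, reason=f"fetch failed: {exc}")
    now = datetime.now(timezone.utc).isoformat()
    # Freshly fetched data gets freshness 1.0; a cache layer would decay it with age.
    return Envelope(True, src.name, src.license, now, confidence, 1.0, value)
```

The key property is that both failure branches return `available=False` with no `value`, so the frontend has nothing to render as if it were a real number.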

Challenges & how I solved them

  • Honest forecasting. Black-box CAGR felt untrustworthy, so I anchored forecasts to a real twin's observed trajectory and bounded growth by development phase (logistic S-curve ceiling) rather than letting a regressor extrapolate to the moon.
  • Explainability. Interviewers (and users) ask "why this score?" — TreeSHAP attributions and transparent, documented score formulas answer it.
  • Running without infra. The in-memory + fallback layers mean a reviewer can pip install and run without Postgres/FAISS/XGBoost.
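
The twin-anchored forecast with an S-curve ceiling can be sketched as below. This is one plausible reading of the idea, not the project's formula: the function names, the `lag_years` alignment, and the `saturation`, `rate` and `midpoint` parameters are all assumptions.

```python
from __future__ import annotations

import math


def logistic_ceiling(step: float, start: float, saturation: float,
                     rate: float, midpoint: float) -> float:
    """S-curve upper bound rising from roughly `start` towards `saturation`."""
    return start + (saturation - start) / (1.0 + math.exp(-rate * (step - midpoint)))


def twin_forecast(
    target_price: float,
    twin_prices: list[float],
    lag_years: int,
    horizon: int,
    saturation: float,
    rate: float = 0.5,
    midpoint: float = 5.0,
) -> list[float]:
    """Project the target by replaying the twin's growth from `lag_years` ago.

    twin_prices is the twin's yearly price series, oldest first, ending today.
    The twin is assumed to have been where the target is now `lag_years` ago.
    """
    start = len(twin_prices) - 1 - lag_years
    if start < 0 or horizon > lag_years:
        raise ValueError("twin history does not cover the requested horizon")
    forecast = []
    for step in range(1, horizon + 1):
        # Multiple the twin actually achieved `step` years after the aligned year.
        growth = twin_prices[start + step] / twin_prices[start]
        ceiling = logistic_ceiling(step, target_price, saturation, rate, midpoint)
        forecast.append(min(target_price * growth, ceiling))
    return forecast
```

The forecast can never exceed what the twin actually did, and the ceiling clips it further when the twin's run-up would overshoot the assumed saturation level for the target's development phase.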

Scaling & what I'd improve

  • Move similarity to a persistent FAISS index (or pgvector) and precompute twins offline.
  • Replace TTL file cache with Redis; move ingestion to a queue + workers with a scheduler.
  • Add a proper feature store so the ML and twin engines share one versioned feature definition.
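
Precomputing twins offline could start as small as the sketch below, which ranks every city against every other in one batch. `precompute_twins` is a hypothetical helper; at 100k+ vectors the dense n×n similarity matrix would be replaced by batched queries against a FAISS or pgvector index.

```python
import numpy as np


def precompute_twins(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Return an (n, k) array: row i holds the indices of city i's k nearest twins."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit.T                  # all-pairs cosine similarity, O(n²·d)
    np.fill_diagonal(sims, -np.inf)       # a city is not its own twin
    k = min(k, features.shape[0] - 1)
    return np.argsort(-sims, axis=1)[:, :k]
```

The resulting table can be stored next to the dataset so the API serves twin lookups as a plain read instead of a similarity search.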

Likely interview questions