LandAI — Urban Growth Prediction

A full-stack ML platform that predicts where India's land value will rise by matching emerging Tier-3 cities to their historically similar Tier-2 "twins". FastAPI + React, four analytical engines, transparency-first.

One-liner

LandAI predicts where India's land value will rise before prices move. It finds an emerging Tier-3 city's more-developed "twin", time-shifts that twin's real trajectory onto the target, and forecasts which zones develop over the next 5–10 years — with every number carrying its provenance.

Problem & motivation

Indian land investing is dominated by anecdote and broker incentives. There's plenty of historical price data but no honest, explainable view of future direction for the long tail of Tier-3 cities. LandAI's bet: a city's growth rhymes with a structurally similar city that developed earlier, so similarity plus that twin's observed history is a more defensible forecast than a black-box regressor alone.

The hard constraint I set: never present curated data as live, a heuristic as a forecast, or rule-based NLP as an LLM. That honesty requirement shaped the whole architecture.

Architecture

A React + Vite SPA talks to a FastAPI backend whose route modules fan out to independent engines (ml, nlp, cv, geo, ingestion, services). Persistence is a curated 116-city dataset backed by Postgres/PostGIS, with an in-memory fallback so the app still runs without a database.

Tech stack

Layer	Choices
Frontend	React, Vite, interactive map, client-side watchlist (localStorage)
API	FastAPI, Pydantic, Alembic migrations
ML	XGBoost gradient-boosted regressor + TreeSHAP attribution (scikit-learn fallback)
NLP	TF-IDF + rule-based information extraction
CV	`scipy.ndimage` morphology over urban-footprint rasters
Spatial	`shapely` geometry, PostGIS via GeoAlchemy2 (in-memory fallback)
Similarity	cosine similarity, FAISS-accelerated at scale (NumPy fallback)
Ingestion	OpenStreetMap Overpass + Nominatim (both ODbL)

The four engines

Price model (ML). A trained XGBoost regressor predicts land-price CAGR from infrastructure + demographic features, and emits per-prediction TreeSHAP attributions so each forecast comes with "why" drivers.
Infrastructure signals (NLP). TF-IDF + rule-based extraction turns announcements (highways, airports, metros, industrial corridors) into scored leading indicators with an impact score and a lead time.
Urban-growth raster (CV). Morphology over per-year footprint rasters yields compactness, fragmentation and a dominant growth direction.
Spatial (Geo). shapely growth-ring geometry served as GeoJSON; PostGIS-ready for real spatial queries at scale.
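
To make the infrastructure-signals engine concrete, here is a minimal sketch of the rule-based half only (no TF-IDF). The `RULES` table, the `Signal` class, the `extract_signals` function and every impact score and lead time in it are hypothetical placeholders for illustration, not LandAI's actual rules or calibrated values.

```python
import re
from dataclasses import dataclass

# Hypothetical rule table: (pattern, category, impact in [0, 1], lead time in years).
# All numbers are illustrative placeholders, not calibrated values.
RULES = [
    (re.compile(r"\b(airport|aerodrome)\b", re.I), "airport", 0.9, 6),
    (re.compile(r"\b(metro|rapid transit)\b", re.I), "metro", 0.8, 5),
    (re.compile(r"\b(highway|expressway|ring road)\b", re.I), "highway", 0.7, 4),
    (re.compile(r"\bindustrial corridor\b", re.I), "industrial_corridor", 0.6, 7),
]


@dataclass(frozen=True)
class Signal:
    category: str
    impact: float
    lead_time_years: int
    matched_text: str


def extract_signals(announcement: str) -> list[Signal]:
    """Return one Signal per rule that fires, strongest impact first."""
    signals = []
    for pattern, category, impact, lead in RULES:
        match = pattern.search(announcement)
        if match:
            signals.append(Signal(category, impact, lead, match.group(0)))
    return sorted(signals, key=lambda s: s.impact, reverse=True)
```

Because each rule is a plain, inspectable pattern with a documented score, the output can be labelled honestly as rule-based extraction rather than as an LLM result.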

Notable algorithm — "City DNA" twin matching

Each city is a feature vector (infrastructure, demographics, price history). Finding a twin is a nearest-neighbour search under cosine similarity; at small scale it's a NumPy matmul, and FAISS takes over when the catalogue grows.

Python

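import numpy as np

def most_similar(target: np.ndarray, catalogue: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k catalogue rows most cosine-similar to target, best first."""
    # L2-normalise so dot product == cosine similarity
    # (assumes no all-zero feature vectors)
    t = target / np.linalg.norm(target)
    C = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    sims = C @ t                              # O(n·d)
    k = min(k, len(sims))                     # argpartition requires kth < n
    idx = np.argpartition(-sims, k - 1)[:k]   # unordered top-k
    return idx[np.argsort(-sims[idx])]        # order the k winners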

Brute force: O(n·d) per query — fine for 116 cities.
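FAISS: builds an index once (e.g. IVF/HNSW) so queries become sub-linear, which is the upgrade path when the catalogue is 100k+ vectors. The NumPy and FAISS paths return identical results (guaranteed only for an exact index such as a flat inner-product index; IVF and HNSW are approximate and trade some recall for speed), which made FAISS a safe, test-backed optimization rather than a rewrite.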
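
The two paths above can be sketched behind one function. This is a minimal sketch, not the project's actual code: `top_k_twins`, `_unit_rows` and the `use_faiss` flag are hypothetical names. It uses `faiss.IndexFlatIP`, an exact inner-product index, so on unit-normalised vectors it should agree with the NumPy path up to tie ordering and float32 rounding; an IVF or HNSW index would be approximate.

```python
import numpy as np

try:
    import faiss  # optional dependency
except ImportError:
    faiss = None


def _unit_rows(x) -> np.ndarray:
    """L2-normalise rows as contiguous float32 (the dtype FAISS expects)."""
    x = np.ascontiguousarray(np.atleast_2d(x), dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)  # guard against all-zero rows


def top_k_twins(target, catalogue, k=1, use_faiss=None):
    """Indices of the k most cosine-similar catalogue rows, best first."""
    C = _unit_rows(catalogue)
    t = _unit_rows(target)
    k = min(k, C.shape[0])
    if use_faiss is None:
        use_faiss = faiss is not None       # degrade to NumPy when FAISS is absent
    if use_faiss:
        index = faiss.IndexFlatIP(C.shape[1])  # exact inner-product search
        index.add(C)
        _, idx = index.search(t, k)
        return idx[0]
    sims = C @ t[0]
    return np.argsort(-sims, kind="stable")[:k]
```

Forcing `use_faiss=False` in a test and comparing against the default path is one way to keep the equivalence claim checked on every run.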

Why this reads well in an interview

It's a clean story arc: a real product need (find a twin) → the right CS primitive (kNN / cosine) → a measured scaling decision (FAISS) → a correctness guarantee (identical results, NumPy fallback). That's exactly the problem→primitive→trade-off→validation shape interviewers want.

Engineering decisions & trade-offs

Provenance envelope. Every live datapoint is wrapped with source · license · fetched_at · confidence · freshness_score. If a source is down, the API returns an explicit available: false envelope instead of a fabricated number. This is a deliberate trade of "always show something" for "never lie".
Compliance gating. Listing portals prohibit scraping in their ToS, so a RobotsGate and a SOURCE_REGISTRY make the listing-portal adapter refuse to run by design; a licensed feed can plug in later as a new permitted source.
Graceful degradation everywhere. FAISS→NumPy, PostGIS→in-memory, XGBoost→scikit-learn. The app demos and runs even when the heavy dependency is missing.
Production-grade ingestion plumbing. Per-host rate limiting, retry/backoff honoring Retry-After, endpoint failover and on-disk TTL caching.
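
As an illustration of the provenance envelope described above, here is a minimal sketch using only the standard library. The field names follow the list in the text; the `Envelope` class, the `wrap` and `unavailable` helpers, and the linear freshness decay over a TTL are assumptions made for the example, not the project's actual schema or scoring formula.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Envelope:
    available: bool
    value: Optional[Any]
    source: str
    license: str
    fetched_at: Optional[str]          # ISO-8601 timestamp, None if nothing was fetched
    confidence: Optional[float]
    freshness_score: Optional[float]


def freshness(fetched_at: datetime, now: datetime, ttl: timedelta) -> float:
    """Assumed scoring: 1.0 when just fetched, falling linearly to 0.0 at the TTL."""
    return max(0.0, min(1.0, 1.0 - (now - fetched_at) / ttl))


def wrap(value, source, license, fetched_at, confidence,
         now=None, ttl=timedelta(hours=24)) -> dict:
    """Wrap a live datapoint with its provenance."""
    now = now or datetime.now(timezone.utc)
    score = freshness(fetched_at, now, ttl)
    return asdict(Envelope(True, value, source, license,
                           fetched_at.isoformat(), confidence, score))


def unavailable(source, license) -> dict:
    """Explicit 'no data' envelope: the value stays None, never a made-up number."""
    return asdict(Envelope(False, None, source, license, None, None, None))
```

Keeping `unavailable` as its own constructor makes the "never lie" path a deliberate call rather than a default that might be filled in by accident.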

Challenges & how I solved them

Honest forecasting. Black-box CAGR felt untrustworthy, so I anchored forecasts to a real twin's observed trajectory and bounded growth by development phase (logistic S-curve ceiling) rather than letting a regressor extrapolate to the moon.
Explainability. Interviewers (and users) ask "why this score?" — TreeSHAP attributions and transparent, documented score formulas answer it.
Running without infra. The in-memory + fallback layers mean a reviewer can pip install and run without Postgres/FAISS/XGBoost.
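
The logistic ceiling mentioned above can be sketched as follows. This is a generic logistic curve under stated assumptions, not the project's actual forecasting code: `logistic_path` and `implied_cagr` are hypothetical names, and how the ceiling and rate would be derived (for example from the twin's observed trajectory) is left out.

```python
import math


def logistic_path(p0: float, ceiling: float, rate: float, years: int) -> list[float]:
    """Logistic trajectory P(t) = K / (1 + A * exp(-r * t)) with A = (K - p0) / p0.

    Starts at p0 and approaches, but never exceeds, the ceiling K.
    """
    if not 0 < p0 < ceiling:
        raise ValueError("need 0 < p0 < ceiling")
    a = (ceiling - p0) / p0
    return [ceiling / (1.0 + a * math.exp(-rate * t)) for t in range(years + 1)]


def implied_cagr(path: list[float]) -> float:
    """Compound annual growth rate implied by the first and last points."""
    return (path[-1] / path[0]) ** (1.0 / (len(path) - 1)) - 1.0
```

Compared with compounding the same rate indefinitely, the implied CAGR shrinks as the horizon lengthens, which is the bounded behaviour the S-curve is meant to enforce.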

Scaling & what I'd improve

Move similarity to a persistent FAISS index (or pgvector) and precompute twins offline.
Replace TTL file cache with Redis; move ingestion to a queue + workers with a scheduler.
Add a proper feature store so the ML and twin engines share one versioned feature definition.