One-liner
LandAI predicts where India's land value will rise before prices move. It finds an emerging Tier-3 city's more-developed "twin", time-shifts that twin's real trajectory onto the target, and forecasts which zones develop over the next 5–10 years — with every number carrying its provenance.
Problem & motivation
Indian land investing is dominated by anecdote and broker incentives. There's plenty of historical price data but no honest, explainable view of future direction for the long tail of Tier-3 cities. LandAI's bet: a city's growth rhymes with a structurally-similar city that developed earlier, so similarity + that twin's observed history is a more defensible forecast than a black-box regressor alone.
The hard constraint I set: never present curated data as live, a heuristic as a forecast, or rule-based NLP as an LLM. That honesty requirement shaped the whole architecture.
Architecture
A React + Vite SPA talks to a FastAPI backend whose route modules fan
out to independent engines (ml, nlp, cv, geo, ingestion, services).
Persistence is a curated 116-city dataset backed by Postgres/PostGIS, with an
in-memory fallback so the app still runs without a database.
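
The in-memory fallback can be pictured roughly as below. This is a minimal sketch, not the project's actual code: `City`, `CityStore`, `InMemoryCityStore`, `StoreUnavailable`, `connect_postgres_store` and `make_store` are all hypothetical names, and the Postgres connection is stubbed out so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class City:
    name: str
    state: str


class CityStore(Protocol):
    def get(self, name: str) -> Optional[City]: ...


class StoreUnavailable(Exception):
    """Raised when the database-backed store cannot be used."""


class InMemoryCityStore:
    """Serves the curated dataset from a dict; needs no database."""

    def __init__(self, cities: list[City]) -> None:
        self._by_name = {c.name.lower(): c for c in cities}

    def get(self, name: str) -> Optional[City]:
        return self._by_name.get(name.lower())


def connect_postgres_store(database_url: str) -> CityStore:
    # Hypothetical placeholder: a real version would open a Postgres/PostGIS
    # connection here. In this sketch it always reports "unavailable".
    raise StoreUnavailable(database_url)


def make_store(database_url: Optional[str], curated: list[City]) -> CityStore:
    """Prefer the database when one is configured and reachable, else fall back."""
    if database_url:
        try:
            return connect_postgres_store(database_url)
        except StoreUnavailable:
            pass  # fall through to the in-memory store
    return InMemoryCityStore(curated)
```

Because both stores satisfy the same small interface, route handlers would not need to know which one they received.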
Tech stack
| Layer | Choices |
|---|---|
| Frontend | React, Vite, interactive map, client-side watchlist (localStorage) |
| API | FastAPI, Pydantic, Alembic migrations |
| ML | XGBoost gradient-boosted regressor + TreeSHAP attribution (scikit-learn fallback) |
| NLP | TF-IDF + rule-based information extraction |
| CV | scipy.ndimage morphology over urban-footprint rasters |
| Spatial | shapely geometry, PostGIS via GeoAlchemy2 (in-memory fallback) |
| Similarity | cosine similarity, FAISS-accelerated at scale (NumPy fallback) |
| Ingestion | OpenStreetMap Overpass + Nominatim (both ODbL) |
The four engines
- Price model (ML). A trained XGBoost regressor predicts land-price CAGR from infrastructure + demographic features, and emits per-prediction TreeSHAP attributions so each forecast comes with "why" drivers.
- Infrastructure signals (NLP). TF-IDF + rule-based extraction turns announcements (highways, airports, metros, industrial corridors) into scored leading indicators with an impact score and a lead time.
- Urban-growth raster (CV). Morphology over per-year footprint rasters yields compactness, fragmentation and a dominant growth direction.
- Spatial (Geo). `shapely` growth-ring geometry served as GeoJSON; PostGIS-ready for real spatial queries at scale.
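
To make the rule-based half of the infrastructure-signal engine concrete, here is a minimal sketch of keyword rules mapping an announcement to an impact score and a lead time. The `RULES` table, its numbers, and the names `Signal` and `extract_signal` are illustrative assumptions, not the project's real values or API, and the TF-IDF stage is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical rules: keyword -> (impact score in [0, 1], lead time in years).
RULES = {
    "airport": (0.9, 6),
    "metro": (0.8, 5),
    "highway": (0.7, 4),
    "industrial corridor": (0.6, 7),
}


@dataclass(frozen=True)
class Signal:
    kind: str
    impact: float
    lead_time_years: int


def extract_signal(text: str) -> Optional[Signal]:
    """Return the highest-impact infrastructure signal mentioned in text, if any."""
    lowered = text.lower()
    hits = [
        Signal(kind, impact, lead)
        for kind, (impact, lead) in RULES.items()
        # Word boundaries avoid matching keywords inside longer words.
        if re.search(rf"\b{re.escape(kind)}\b", lowered)
    ]
    return max(hits, key=lambda s: s.impact, default=None)
```

Keeping the rules in a plain table is what makes the output auditable: every score traces back to one visible row.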
Notable algorithm — "City DNA" twin matching
Each city is a feature vector (infrastructure, demographics, price history). Finding a twin is a nearest-neighbour by cosine similarity problem; at small scale it's a NumPy matmul, and FAISS takes over when the catalogue grows.
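```python
import numpy as np


def most_similar(target: np.ndarray, catalogue: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k catalogue rows most similar to target, best first."""
    # L2-normalise so that the dot product equals cosine similarity
    t = target / np.linalg.norm(target)
    C = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    sims = C @ t  # O(n·d)
    k = min(k, len(sims))
    # Partition on k - 1 so the call stays valid when k == n
    idx = np.argpartition(-sims, k - 1)[:k]
    return idx[np.argsort(-sims[idx])]
```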
- Brute force: O(n·d) per query — fine for 116 cities.
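- FAISS: builds an index once (e.g. IVF/HNSW) so queries become sub-linear; this is the upgrade path once the catalogue reaches 100k+ vectors. The NumPy and FAISS paths return identical results, which made FAISS a safe, test-backed optimization rather than a rewrite. Strictly, that equivalence holds for an exact index such as `IndexFlatIP`; approximate indexes like IVF and HNSW trade some recall for speed.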
It's a clean story arc: a real product need (find a twin) → the right CS primitive (kNN / cosine) → a measured scaling decision (FAISS) → a correctness guarantee (identical results, NumPy fallback). That's exactly the problem→primitive→trade-off→validation shape interviewers want.
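
The FAISS-with-NumPy-fallback idea can be sketched as follows. This is an illustration under assumptions, not the project's code: `normalise` and `top_k` are hypothetical names, the FAISS path uses the exact `IndexFlatIP` index (so it should agree with brute force), and rows are assumed to have non-zero norm.

```python
import numpy as np

try:  # FAISS is optional; fall back to NumPy when it is not installed
    import faiss
except ImportError:
    faiss = None


def normalise(rows: np.ndarray) -> np.ndarray:
    """L2-normalise each row so inner product equals cosine similarity."""
    rows = np.ascontiguousarray(rows, dtype=np.float32)  # FAISS expects float32
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


def top_k(target: np.ndarray, catalogue: np.ndarray, k: int = 1) -> np.ndarray:
    """Indices of the k most cosine-similar catalogue rows, best first."""
    q = normalise(target.reshape(1, -1))
    c = normalise(catalogue)
    if faiss is not None:
        index = faiss.IndexFlatIP(c.shape[1])  # exact inner-product search
        index.add(c)
        _, idx = index.search(q, k)
        return idx[0]
    sims = (c @ q.T).ravel()  # brute-force fallback, O(n·d)
    return np.argsort(-sims)[:k]
```

In a real service the index would be built once and reused across queries; it is rebuilt per call here only to keep the sketch short.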
Engineering decisions & trade-offs
- Provenance envelope. Every live datapoint is wrapped with `source · license · fetched_at · confidence · freshness_score`. If a source is down, the API returns an explicit `available: false` envelope instead of a fabricated number. This is a deliberate trade of "always show something" for "never lie".
- Compliance gating. Listing portals prohibit scraping in their ToS, so a `RobotsGate` + a `SOURCE_REGISTRY` make that adapter refuse to run by design; a licensed feed plugs in as a new permitted source.
- Graceful degradation everywhere. FAISS→NumPy, PostGIS→in-memory, XGBoost→scikit-learn. The app demos and runs even when the heavy dependency is missing.
- Production-grade ingestion plumbing. Per-host rate limiting, retry/backoff honoring `Retry-After`, endpoint failover and on-disk TTL caching.
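
A provenance envelope along these lines might look like the sketch below. The field names `source`, `license`, `fetched_at`, `confidence`, `freshness_score` and `available` come from the description above; the `value` field, the `Envelope` class and the `wrap` / `unavailable` helpers are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Envelope:
    available: bool
    value: Optional[Any] = None  # assumed field holding the datapoint itself
    source: Optional[str] = None
    license: Optional[str] = None
    fetched_at: Optional[str] = None  # ISO 8601 timestamp in UTC
    confidence: Optional[float] = None
    freshness_score: Optional[float] = None


def wrap(value: Any, source: str, license: str,
         confidence: float, freshness_score: float) -> dict:
    """Attach provenance to a datapoint that was actually fetched."""
    fetched_at = datetime.now(timezone.utc).isoformat()
    return asdict(Envelope(True, value, source, license,
                           fetched_at, confidence, freshness_score))


def unavailable(source: str) -> dict:
    """Explicit 'no data' envelope; never substitutes a made-up value."""
    return asdict(Envelope(available=False, source=source))
```

A caller can then branch on `available` alone, and a missing source can never be mistaken for a zero or a stale number.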
Challenges & how I solved them
- Honest forecasting. Black-box CAGR felt untrustworthy, so I anchored forecasts to a real twin's observed trajectory and bounded growth by development phase (logistic S-curve ceiling) rather than letting a regressor extrapolate to the moon.
- Explainability. Interviewers (and users) ask "why this score?" — TreeSHAP attributions and transparent, documented score formulas answer it.
- Running without infra. The in-memory + fallback layers mean a reviewer can `pip install` and run without Postgres/FAISS/XGBoost.
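
As one possible reading of the "twin trajectory plus logistic ceiling" idea, the sketch below shifts a twin's observed yearly growth forward in time and caps it by development phase. Everything here is an assumption for illustration: the function names, the `phase` scale (0 = undeveloped, 1 = saturated), the `steepness` default and the exact shape of the ceiling are not taken from the project.

```python
import math


def logistic_ceiling(phase: float, max_cagr: float, steepness: float = 8.0) -> float:
    """Growth cap that falls along an S-curve as a zone matures.

    phase is assumed to lie in [0, 1]; at phase 0.5 the cap is max_cagr / 2.
    """
    return max_cagr / (1.0 + math.exp(steepness * (phase - 0.5)))


def bounded_forecast(twin_cagr_by_year: dict[int, float], lag_years: int,
                     phase: float, max_cagr: float) -> dict[int, float]:
    """Shift the twin's observed yearly CAGR forward by lag_years, capped."""
    cap = logistic_ceiling(phase, max_cagr)
    return {
        year + lag_years: min(cagr, cap)
        for year, cagr in twin_cagr_by_year.items()
    }
```

The cap is what stops a single boom year in the twin's history from being replayed at full strength onto a target that is already well developed.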
Scaling & what I'd improve
- Move similarity to a persistent FAISS index (or pgvector) and precompute twins offline.
- Replace TTL file cache with Redis; move ingestion to a queue + workers with a scheduler.
- Add a proper feature store so the ML and twin engines share one versioned feature definition.