Why this matters
When an interviewer says "you mentioned similarity search — how does that work?", they're testing whether you understand the algorithm under your own project. Each item below is: the idea → the DSA primitive → complexity → the one-paragraph answer.
Cosine-similarity twin matching (LandAI)
Each city is a feature vector; the "twin" is the nearest neighbour by cosine similarity = dot product of L2-normalized vectors. Brute force is a matrix multiply; FAISS (IVF/HNSW index) makes it sub-linear at scale.
- Primitive: kNN / vector dot product; top-k via
argpartition(quickselect). - Complexity: O(n·d) brute per query; ~O(log n) or better with an index.
Gradient-boosted trees + TreeSHAP (LandAI)
XGBoost is an additive ensemble: each tree corrects the previous trees' residuals (gradient boosting). Inference walks each tree root→leaf. TreeSHAP computes each feature's signed contribution to a prediction (game-theoretic Shapley values, computed efficiently for trees).
- Primitive: binary tree traversal; greedy split selection at train time.
- Complexity: inference O(#trees · depth) per prediction.
TF-IDF infrastructure signals (LandAI)
Turn documents into sparse weighted vectors: term frequency × inverse document frequency down-weights common words. Ranking relevant announcements is then cosine similarity over sparse vectors.
- Primitive: hash maps (term → count/weight), sparse dot product.
- Complexity: O(total tokens) to build; O(nnz) per similarity.
Morphology over urban rasters (LandAI)
scipy.ndimage runs connected-components + dilation/erosion over per-year
footprint grids to measure compactness, fragmentation and growth direction — the
same flood-fill/BFS idea as "number of islands", on an image grid.
- Primitive: grid BFS/DFS (connected components); convolution-style passes.
- Complexity: O(pixels).
Technical indicators & DCF (StockVision)
A moving average / EMA over a price series is a sliding-window aggregate —
O(n) with incremental update rather than recomputing each window. A DCF is a
discounted sum: each future cash flow divided by (1+r)^t, summed.
- Primitive: sliding window; prefix/rolling aggregate.
- Complexity: O(n) for indicators over n bars.
Live price tick & leaderboard (StockStump)
Each price update is O(1) (read price, apply delta, write to Redis). The leaderboard is a sorted set (Redis ZSET) — a skip-list/heap-backed structure giving O(log n) updates and O(log n + k) range reads, instead of re-sorting all users on every change.
- Primitive: hash map (O(1) price), ordered set / skip list (leaderboard).
- Complexity: O(1) tick; O(log n) rank update.
"Twin matching is cosine kNN — O(n·d) brute force, fine at 116 cities; I'd move to a FAISS HNSW index for sub-linear queries at a million." That single sentence shows you know the primitive, the cost, and the upgrade path.