Retrieval at Scale

Assumed knowledge

Dense vs. sparse retrieval
- Dual-encoder (“two-tower”) = query & document encoded separately into single vectors. E.g. Dense Passage Retrieval (DPR) – Karpukhin et al. EMNLP 2020 (aclanthology.org).
- Sparse, term-based retrieval = BM25 and its probabilistic relevance model. Intro tutorial: “What is BM25…” GeeksforGeeks, Jul 23 2025 (geeksforgeeks.org).
Approximate nearest neighbor (ANN) search
- Graph‐based (HNSW), clustering (IVF-PQ), quantization. See Faiss v1.11.0 release notes (github.com).
Late interaction retrieval
- Multi-vector representations allowing token-level interaction at search time. See ColBERT (SIGIR 2020) (arxiv.org).
Retrieval-augmented generation (RAG)
- LLMs augmented with external datastore retrieval (e.g. Fusion-in-Decoder, Atlas) (jmlr.org, arxiv.org).
Infrastructure at scale
- Disk-based vector indexes (DiskANN) vs. in-memory (HNSW). DiskANN in SQL Server 2025 public preview May 19 2025 (techcommunity.microsoft.com).
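
For the sparse, term-based side of the list above, BM25 scoring can be sketched without any library. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `bm25_scores`, the toy corpus, and the defaults `k1=1.5` and `b=0.75` are chosen for the example, and the IDF uses the common smoothed form with `+ 1` inside the logarithm.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
corpus = [
    "dense passage retrieval encodes queries and passages".split(),
    "bm25 ranks documents by term frequency and document length".split(),
    "graph indexes speed up nearest neighbor search".split(),
]
scores = bm25_scores("bm25 term frequency".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
```

Only the second document shares terms with the query, so it is the only one with a non-zero score. For real prototyping, a package such as rank_bm25 is the more practical route.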

Areas still evolving (tracked in DeltaDrops):

Late interaction efficiency & compression (ColBERTv2 → PLAID → SPLATE).
Learned sparse retrievers (SPLADE-v3, Mistral-SPLADE).
Generative retrieval & hybrid models (Atlas, LIGER, GeAR).
Vector infrastructure: on-disk indexes, quantization, continuous indexing, scale to trillions.
Hybrid sparse‐dense and multi-stage pipelines (BM25 → dense → cross-encoder).
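
One common way to merge a sparse and a dense ranking before a final re-ranking stage is reciprocal rank fusion (RRF). The source does not name a fusion method, so RRF is an assumption here, as are the document IDs, the function name `reciprocal_rank_fusion`, and the conventional constant `k=60`. A rough sketch:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by doc ID.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical first-stage outputs for the same query.
bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d5", "d3", "d9"]
candidates = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents ranked well by both retrievers (`d1`, `d3`) rise to the top. In a multi-stage pipeline the fused candidate list would then go to a more expensive scorer such as a cross-encoder.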

What to know

Traditional two-tower (dual-encoder) dense retrieval (e.g. DPR) is fast because every query and document is compressed into a single vector, but that compression can lose fine-grained token-level interactions (aclanthology.org, arxiv.org).
Late interaction (ColBERT) stores per-token embeddings and computes MaxSim at query time for richer matching. ColBERTv2 applies aggressive residual compression & denoised supervision to cut footprint 6–10× while improving quality (arxiv.org).
The PLAID engine cuts ColBERTv2 search latency by up to 7× on GPU and 45× on CPU using centroid interaction and centroid pruning, while preserving SOTA accuracy. Latency is tens of ms on GPU and tens to a few hundred ms on CPU, even at 140 M passages (arxiv.org).
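SPLATE adapts ColBERTv2 for CPU-friendly candidate generation by mapping its frozen token embeddings to a sparse vocabulary space (SPLADE-style), so candidates can come from a traditional sparse index. Re-ranking 50 documents retrieved in under 10 ms matches PLAID ColBERTv2’s effectiveness (arxiv.org).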
Learned sparse retrievers: SPLADE-v3 pushes SPLADE past 40 MRR@10 on the MS MARCO dev set and improves BEIR out-of-domain results by 2% (arxiv.org). Mistral-SPLADE uses a decoder-only LLM backbone to further improve BEIR performance, and its authors report it as SOTA among learned sparse retrievers (arxiv.org).
Generative retrieval (sequence-to-sequence models used as retrievers) and RAG: Atlas, a retrieval-augmented T5 that uses Fusion-in-Decoder, reaches over 42% accuracy on Natural Questions with only 64 training examples, outperforming a 540B-parameter model by 3% with 50× fewer parameters (jmlr.org, arxiv.org).
Hybrid generative–dense retrieval: LIGER combines a generative candidate set with dense re-ranking to improve cold-start recall on recommendation benchmarks (Amazon Beauty, Steam), narrowing the gap with dense-only methods (reddit.com).
Graph-enhanced RAG (GeAR) uses graph expansion around retrieved docs to boost multi-hop QA, improving MuSiQue performance by >10% and reducing token/iteration count (arxiv.org).
On-disk vector indexes: DiskANN is integrated into SQL Server 2025 (public preview May 19 2025) and Azure Database for PostgreSQL (GA May 19 2025). Microsoft reports 10× faster queries and up to 96× lower memory use vs. HNSW‐pgvector (techcommunity.microsoft.com).
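
The late-interaction scoring described above for ColBERT (per-token embeddings, MaxSim at query time) can be sketched in a few lines. This is a minimal pure-Python illustration, not ColBERT's implementation: the 2-dimensional vectors and the names `maxsim_score`, `doc_a`, and `doc_b` are made up for the example, and it assumes the embeddings are already L2-normalized so that a dot product acts as cosine similarity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token embeddings (hypothetical, unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match per query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here `doc_a` scores higher because each query token finds an exact token-level match. A single-vector dual encoder has no such per-token matching, because each side is reduced to one vector before scoring.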
Core libraries & frameworks:
- Faiss (v1.11.0) continues to add RaBitQ, HNSW improvements, sharding, GPU support for quantized indexes (github.com).
- Hugging Face: transformers includes DPR model classes; ColBERTv2 and SPLADE checkpoints are distributed on the Hugging Face Hub.
- Open-source engines: Vespa, Weaviate, Milvus, Qdrant.
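
Learned sparse retrievers in the SPLADE family turn masked-language-model logits into term weights over the vocabulary. A rough sketch of the max-pooling variant follows, assuming the logits have already been computed by a model: each vocabulary term gets the weight max over input tokens of log(1 + ReLU(logit)), and relevance is the dot product of the two sparse vectors. The toy logits, the 4-term vocabulary, and the names `splade_pool` and `sparse_dot` are hypothetical.

```python
import math


def splade_pool(token_logits):
    """Collapse per-token vocabulary logits into a sparse {term_id: weight} dict."""
    vocab_size = len(token_logits[0])
    weights = {}
    for j in range(vocab_size):
        # log(1 + ReLU(logit)), max-pooled over the input tokens.
        w = max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        if w > 0.0:  # keep only non-zero entries so the vector stays sparse
            weights[j] = w
    return weights


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b[j] for j, w in a.items() if j in b)


# Toy logits: 2 input tokens over a hypothetical 4-term vocabulary.
query_logits = [[2.0, -1.0, 0.0, 0.5], [0.0, -3.0, 0.0, 1.5]]
doc_logits = [[1.0, 0.0, -2.0, 0.0], [0.0, 0.0, -1.0, 3.0]]

q = splade_pool(query_logits)
d = splade_pool(doc_logits)
score = sparse_dot(q, d)
```

Terms whose logits are never positive drop out entirely, which is what lets the result be served from an inverted index like a BM25 posting list.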

Starter sources

Late interaction & multi-vector

ColBERT: “Efficient and Effective Passage Search via Contextualized Late Interaction over BERT” (SIGIR ’20) (arxiv.org)
ColBERTv2: “Effective and Efficient Retrieval via Lightweight Late Interaction” (NAACL ’22) (aclanthology.org)
PLAID: “An Efficient Engine for Late Interaction Retrieval” (CIKM ’22) (arxiv.org)
SPLATE: “Sparse Late Interaction Retrieval” (SIGIR ’24) (arxiv.org)

Learned sparse retrieval

SPLADE-v3: “New baselines for SPLADE” (arXiv Mar 2024) (arxiv.org)
Mistral-SPLADE: “LLMs for better Learned Sparse Retrieval” (arXiv Aug 2024) (arxiv.org)
Adapter-based SPLADE: Pal et al., “Parameter-Efficient Sparse Retrievers…” (arXiv Mar 2023) (arxiv.org)

Dense & two-tower

DPR: “Dense Passage Retrieval for Open-Domain QA” (EMNLP ’20) (aclanthology.org, arxiv.org)
Faiss: GitHub, “facebookresearch/faiss” (v1.11.0 changelog) (github.com)

Generative & hybrid retrieval

Atlas: Izacard et al., “Few-shot Learning with Retrieval-Augmented Language Models” (JMLR 2023; arXiv Aug 2022) (jmlr.org, arxiv.org)
LIGER: Meta AI, “LeveragIng dense retrieval for GEnerative Retrieval” (arXiv Nov 2024; summary on Reddit) (reddit.com)
GeAR: “Graph-enhanced Agent for RAG” (arXiv Dec 2024) (arxiv.org)
Survey: “The Survey of Retrieval-Augmented Text Generation…” (arXiv Apr 2024) (arxiv.org)

Infrastructure & indexing

DiskANN in SQL Server 2025 (public preview May 19 2025) & Azure PostgreSQL GA May 19 2025 (techcommunity.microsoft.com)
Vector DB quantization & SSD use cases: KIOXIA blog (PCIe 5.0 SSDs & DiskANN) (blog-us.kioxia.com)
Vector DB integration: Azure Cosmos DB + DiskANN (Jun 2024) (techcommunity.microsoft.com)

Key people & orgs

Matei Zaharia, Omar Khattab, Christopher Potts (ColBERT family)
Patrick Lewis, Danqi Chen, Wen-tau Yih (DPR)
Stéphane Clinchant, Hervé Déjean, Thibault Formal (SPLADE)
Gautier Izacard, Sebastian Riedel (Atlas)
Microsoft Research (DiskANN), Meta AI, Fujitsu Research

Tools & libraries

Faiss (CPU/GPU, quantization, sharding)
DiskANN (SQL Server 2025, Azure PostgreSQL)
DPR & ColBERT implementations on Hugging Face
SPLADE via 🤗 transformers and official library
Qdrant, Milvus, Vespa, Weaviate for end-to-end retrieval systems
rank_bm25 for quick BM25 prototyping (PyPI) (geeksforgeeks.org)

This baseline equips you to dive deeper into retrieval innovations. Expect regular DeltaDrops on late interaction advances, sparse/dense hybrids, generative retrieval trends, and fast-growing vector-search infrastructure.