Retrieval at Scale

Assumed knowledge

  • Dense vs. sparse retrieval
    • Dual-encoder (“two-tower”) = query & document encoded separately into single vectors. E.g. Dense Passage Retrieval (DPR) – Karpukhin et al. EMNLP 2020 (aclanthology.org).
    • Sparse, term-based retrieval = BM25 and its probabilistic relevance model. Intro tutorial: “What is BM25…” GeeksforGeeks, Jul 23 2025 (geeksforgeeks.org).
  • Approximate nearest neighbor (ANN) search
    • Graph‐based (HNSW), clustering (IVF-PQ), quantization. See Faiss v1.11.0 release notes (github.com).
  • Late interaction retrieval
    • Multi-vector representations allowing token-level interaction at search time. See ColBERT (SIGIR 2020) (arxiv.org).
  • Retrieval-augmented generation (RAG)
    • LLMs augmented with external datastore retrieval (e.g. Fusion-in-Decoder, Atlas) (jmlr.org, arxiv.org).
  • Infrastructure at scale
    • Disk-based vector indexes (DiskANN) vs. in-memory (HNSW). DiskANN in SQL Server 2025 public preview May 19 2025 (techcommunity.microsoft.com).
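To make the sparse side of the dense vs. sparse distinction concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace-tokenized documents, the common defaults k1 = 1.5 and b = 0.75, and the non-negative IDF variant used by Lucene; the function name `bm25_scores` and the toy corpus are illustrative, not taken from any library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (assumption for illustration only).
docs = [
    "dense passage retrieval with dual encoders".split(),
    "bm25 is a sparse term based ranking function".split(),
    "graph based ann search with hnsw".split(),
]
scores = bm25_scores("sparse bm25 ranking".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Documents that share no term with the query score exactly zero, which is the property that lets sparse retrievers use an inverted index instead of scanning every document.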

Areas still evolving (tracked in DeltaDrops):

  • Late interaction efficiency & compression (ColBERTv2 → PLAID → SPLATE).
  • Learned sparse retrievers (SPLADE-v3, Mistral-SPLADE).
  • Generative retrieval & hybrid models (Atlas, LIGER, GeAR).
  • Vector infrastructure: on-disk indexes, quantization, continuous indexing, scale to trillions.
  • Hybrid sparse‐dense and multi-stage pipelines (BM25 → dense → cross-encoder).
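The multi-stage pattern in the last bullet (BM25 → dense → cross-encoder) can be sketched as a cascade in which each stage re-scores only the survivors of the previous one. The three scorers below are hypothetical toy stand-ins (term overlap for BM25, character-trigram Jaccard for a dense model, word-bigram overlap for a cross-encoder); a real pipeline would plug actual models into the same shape.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def cascade(query: str, corpus: Sequence[str],
            stages: List[Tuple[Scorer, int]]) -> List[str]:
    """Apply each (scorer, keep) stage to the survivors of the previous stage."""
    candidates = list(corpus)
    for scorer, keep in stages:
        ranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
        candidates = ranked[:keep]
    return candidates


def term_overlap(query: str, doc: str) -> float:
    # Toy stand-in for BM25: number of distinct shared terms.
    return float(len(set(query.split()) & set(doc.split())))


def trigram_jaccard(query: str, doc: str) -> float:
    # Toy stand-in for dense similarity: Jaccard over character trigrams.
    def grams(text: str) -> set:
        return {text[i:i + 3] for i in range(len(text) - 2)}
    q, d = grams(query), grams(doc)
    return len(q & d) / len(q | d) if (q | d) else 0.0


def bigram_overlap(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: shared word bigrams (order-sensitive).
    def bigrams(text: str) -> set:
        words = text.split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))


corpus = [
    "hnsw graph index for ann search",
    "disk based index for vector search",
    "bm25 term weighting for text search",
    "compressing token embeddings for late interaction",
]
query = "disk index for vector search"
# Cheap scorers see many candidates; expensive ones see few.
stages = [(term_overlap, 3), (trigram_jaccard, 2), (bigram_overlap, 1)]
top = cascade(query, corpus, stages)
```

The design choice is the shrinking `keep` budget: the cheapest scorer touches the whole corpus, while the most expensive one only sees a handful of candidates.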

What to know

  • Traditional two-tower (dual encoder) dense retrieval (e.g. DPR) is fast, but compressing each text into a single vector can lose fine-grained token-level interactions (aclanthology.org, arxiv.org).
  • Late interaction (ColBERT) stores per-token embeddings and computes MaxSim at query time for richer matching. ColBERTv2 applies aggressive residual compression & denoised supervision to cut footprint 6–10× while improving quality (arxiv.org).
  • The PLAID engine cuts ColBERTv2 search latency by up to 7× on GPU and 45× on CPU via centroid interaction and centroid pruning, preserving SOTA quality. Latency is tens of ms on GPU (tens to a few hundred ms on CPU), even at 140 M passages (arxiv.org).
  • SPLATE adapts ColBERTv2 for CPU by mapping token embeddings to a sparse vocabulary (via SPLADE), enabling <10 ms candidate generation and matching PLAID’s effectiveness (arxiv.org).
  • Learned sparse retrievers: SPLADE-v3 pushes SPLADE past 40 MRR@10 on the MS MARCO dev set and improves BEIR out-of-domain results by 2% (arxiv.org). Mistral-SPLADE uses a decoder-only LLM backbone to further improve BEIR performance and is reported as SOTA among learned sparse retrievers at publication (arxiv.org).
  • Generative retrieval (sequence-to-sequence models as retrievers) and RAG: Atlas, a retrieval-augmented T5 with Fusion-in-Decoder, reaches over 42% accuracy on Natural Questions with only 64 training examples, outperforming a 540B-parameter model by 3% despite having 50× fewer parameters (jmlr.org, arxiv.org).
  • Hybrid generative–dense retrieval: LIGER combines a generative candidate set with dense re-ranking to improve cold-start recall on recommendation benchmarks (Amazon Beauty, Steam), narrowing the gap with dense-only methods (reddit.com).
  • Graph-enhanced RAG (GeAR) uses graph expansion around retrieved docs to boost multi-hop QA, improving MuSiQue performance by >10% while using fewer tokens and iterations than other multi-step retrieval systems (arxiv.org).
  • On-disk vector indexes: DiskANN is integrated into SQL Server 2025 (public preview May 19 2025) and Azure Database for PostgreSQL (GA May 19 2025). Microsoft reports up to 10× faster queries and up to 96× lower memory vs. HNSW-pgvector (techcommunity.microsoft.com).
  • Core libraries & frameworks:
    • Faiss (v1.11.0) continues to add RaBitQ, HNSW improvements, sharding, GPU support for quantized indexes (github.com).
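    • Hugging Face: transformers ships DPR model classes; ColBERTv2 and SPLADE checkpoints are hosted on the Hub (SPLADE loads as a masked-LM model, ColBERT runs through its own library).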
    • Open-source engines: Vespa, Weaviate, Milvus, Qdrant.
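The contrast between single-vector scoring and ColBERT-style MaxSim in the first two bullets can be sketched with hand-picked 2-D token embeddings. This is a toy illustration under stated assumptions: the vectors are invented, and mean pooling is used as a simple stand-in for a dual encoder's single vector (DPR itself uses the encoder's [CLS] representation).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool(vecs):
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]


def single_vector_score(query_vecs, doc_vecs):
    """Dual-encoder stand-in: one pooled vector per side, one dot product."""
    return dot(mean_pool(query_vecs), mean_pool(doc_vecs))


# Invented unit-length token embeddings (assumption for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_exact = [[1.0, 0.0], [0.0, 1.0]]   # one exact match per query token
doc_blurry = [[0.6, 0.8], [0.8, 0.6]]  # every token partially matches both

late_exact = maxsim(query, doc_exact)
late_blurry = maxsim(query, doc_blurry)
pooled_exact = single_vector_score(query, doc_exact)
pooled_blurry = single_vector_score(query, doc_blurry)
```

In this toy case MaxSim ranks the document with exact token matches first, while the pooled score prefers the blurry one, because pooling averages away the token-level structure. The cost is storage: late interaction keeps one vector per token, which is what ColBERTv2's compression and PLAID's pruning target.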

Starter sources

Late interaction & multi-vector

  • ColBERT: “Efficient and Effective Passage Search via Contextualized Late Interaction over BERT” (SIGIR ’20) (arxiv.org)
  • ColBERTv2: “Effective and Efficient Retrieval via Lightweight Late Interaction” (NAACL ’22) (aclanthology.org)
  • PLAID: “An Efficient Engine for Late Interaction Retrieval” (CIKM ’22) (arxiv.org)
  • SPLATE: “Sparse Late Interaction Retrieval” (SIGIR ’24) (arxiv.org)

Learned sparse retrieval

  • SPLADE-v3: “New baselines for SPLADE” (arXiv Mar 2024) (arxiv.org)
  • Mistral-SPLADE: “LLMs for better Learned Sparse Retrieval” (arXiv Aug 2024) (arxiv.org)
  • Adapter-based SPLADE: Pal et al., “Parameter-Efficient Sparse Retrievers…” (arXiv Mar 2023) (arxiv.org)

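Dense & two-tower

  • DPR: Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering” (EMNLP 2020) (aclanthology.org)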

Generative & hybrid retrieval

  • Atlas: Izacard et al., “Few-shot Learning with Retrieval-Augmented Language Models” (JMLR 2023; arXiv Aug 2022) (jmlr.org, arxiv.org)
  • LIGER (“LeveragIng dense retrieval for GEnerative Retrieval”): Meta AI, “Unifying Generative and Dense Retrieval for Sequential Recommendation” (arXiv Nov 2024; summary on Reddit) (reddit.com)
  • GeAR: “Graph-enhanced Agent for RAG” (arXiv Dec 2024) (arxiv.org)
  • Survey: “The Survey of Retrieval-Augmented Text Generation…” (arXiv Apr 2024) (arxiv.org)

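Infrastructure & indexing

  • Faiss v1.11.0 release notes (github.com)
  • DiskANN in SQL Server 2025 public preview announcement, May 19 2025 (techcommunity.microsoft.com)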

Key people & orgs

  • Matei Zaharia, Omar Khattab, Christopher Potts (ColBERT family)
  • Patrick Lewis, Danqi Chen, Wen-tau Yih (DPR)
  • Stéphane Clinchant, Hervé Déjean, Thibault Formal (SPLADE)
  • Gautier Izacard, Sebastian Riedel (Atlas)
  • Microsoft Research (DiskANN), Meta AI, Fujitsu Research

Tools & libraries

  • Faiss (CPU/GPU, quantization, sharding)
  • DiskANN (SQL Server 2025, Azure PostgreSQL)
  • DPR & ColBERT implementations on Hugging Face
  • SPLADE via 🤗 transformers and official library
  • qdrant, Milvus, Vespa, Weaviate for end-to-end retrieval systems
  • rank_bm25 for quick BM25 prototyping (PyPI) (geeksforgeeks.org)

This baseline equips you to dive deeper into retrieval innovations. Expect regular DeltaDrops on late interaction advances, sparse/dense hybrids, generative retrieval trends, and fast-growing vector-search infrastructure.