Retrieval at Scale | Drop for 2026-01-19

TL;DR

Five items worth your time since Jan 3, 2026: (1) Faiss mainline fixed correctness issues in multi‑bit RaBitQ for inner‑product search; (2) Vespa published a data‑driven guide on embedding/quantization trade‑offs (binary, INT8, FP16) with hybrid retrieval takeaways; (3) FaTRQ proposes far‑memory–aware refinement that avoids paging full‑precision vectors, reporting big throughput gains; (4) SVFusion outlines a CPU‑GPU co‑processing architecture for real‑time vector search with streaming updates; (5) Elastic 9.2.4 shipped security fixes—recommended for Lucene/Elasticsearch‑based stacks.

Faiss (Jan 12–14): multi‑bit RaBitQ correctness fixes for inner‑product search

  • Key facts and current state of the topic
    • RaBitQ (binary quantization) was integrated in recent Faiss releases; many teams are piloting multi‑bit RaBitQ for higher recall at similar footprint. (github.com)
  • Important context and background information
    • After 1.13.2, Faiss main landed multiple correctness fixes for multi‑bit RaBitQ with IP metric (distance computation and filtering), plus refactors that reduce sharding OOM risk. If you rely on multi‑bit RaBitQ for IP, these matter for ranking fidelity. (github.com)
  • Recent developments or changes
    • Jan 12–14 commits fix IP distance and filtering logic in multi‑bit RaBitQ. Until a new tag ships, consider building from main, or at least compare results on your own workloads between 1.13.2 and main. (github.com)
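
One way to run the 1.13.2 vs. main comparison is to score each build's top‑k ids against brute‑force inner‑product ground truth. The sketch below is pure Python and makes no Faiss calls; `ids_tagged` and `ids_main` are hypothetical placeholders for whatever ids each build returns for the same query on your index, and the toy vectors are made up for illustration.

```python
import heapq
from typing import List, Sequence


def exact_top_k_ip(query: Sequence[float],
                   vectors: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Brute-force ground truth: ids of the k vectors with the largest inner product."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    # Ties are broken by lower id so the result is deterministic.
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: (-scores[i], i))


def recall_at_k(approx_ids: Sequence[int], truth_ids: Sequence[int]) -> float:
    """Fraction of the ground-truth ids that the approximate result recovered."""
    if not truth_ids:
        return 0.0
    return len(set(approx_ids) & set(truth_ids)) / len(truth_ids)


# Toy data (assumption: stands in for your corpus and one query).
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.2]
truth = exact_top_k_ip(query, vectors, k=2)

# Hypothetical outputs of the two builds for the same query.
ids_tagged = [0, 2]
ids_main = [0, 1]

print("recall@2 tagged:", recall_at_k(ids_tagged, truth))
print("recall@2 main:  ", recall_at_k(ids_main, truth))
```

Averaging this over a few thousand real queries, with and without filters, should show whether the fixes change ranking on your data.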

Vespa (Jan 14): embedding and quantization trade‑offs, with hybrid retrieval takeaways

  • Key facts and current state of the topic
    • Vespa benchmarked popular embedding models across precisions (FP32/FP16/INT8/binary), hardware (Graviton/T4), and hybrid scoring variants, and found that hybrid (BM25 + vectors) consistently beats vectors‑only. (blog.vespa.ai)
  • Important context and background information
    • Practical takeaways include: bfloat16 vectors show negligible quality loss vs. FP32; INT8 model weights speed CPU inference ~2.7–3.4×; some ModernBERT‑based models retain ~98% quality under binary vectors; and binary Hamming enables much higher candidate counts within the same budget. (blog.vespa.ai)
  • Recent developments or changes
    • Posted Jan 14, 2026. It includes an interactive leaderboard and concrete config snippets, which are useful for planning cost/latency knobs in hybrid and late‑interaction pipelines. (blog.vespa.ai)
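
To make the "binary Hamming enables much higher candidate counts" point concrete, here is a minimal two‑stage sketch: a cheap Hamming scan over sign‑bit codes, then full‑precision rescoring of the shortlist. It is plain Python, not Vespa configuration; the function names, the sign‑threshold binarization, and the toy vectors are assumptions for illustration.

```python
from typing import List, Sequence


def binarize(vec: Sequence[float]) -> int:
    """Pack the sign of each dimension into one integer (bit i is set if vec[i] > 0)."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def search(query: Sequence[float],
           vectors: Sequence[Sequence[float]],
           codes: Sequence[int],
           n_candidates: int,
           k: int) -> List[int]:
    q_code = binarize(query)
    # Stage 1: Hamming scan over binary codes (ties broken by id).
    shortlist = sorted(range(len(codes)),
                       key=lambda i: (hamming(q_code, codes[i]), i))[:n_candidates]

    # Stage 2: rescore only the shortlist with the full-precision inner product.
    def inner_product(i: int) -> float:
        return sum(q * v for q, v in zip(query, vectors[i]))

    return sorted(shortlist, key=lambda i: (-inner_product(i), i))[:k]


# Toy corpus (assumption: real embeddings have hundreds of dimensions).
vectors = [[0.9, 0.8, -0.1], [0.5, -0.4, 0.3], [-0.7, -0.6, 0.2], [0.8, 0.9, -0.3]]
codes = [binarize(v) for v in vectors]
query = [1.0, 0.7, -0.2]
print(search(query, vectors, codes, n_candidates=2, k=1))
```

Because stage 1 touches one bit per dimension, `n_candidates` can be raised well beyond what a float scan would allow in the same budget; stage 2 then recovers ordering among vectors whose codes tie.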

FaTRQ (Jan 15): tiered residual quantization for far‑memory refinement

  • Key facts and current state of the topic
    • ANN stacks often pay a large “verification/refinement” cost by paging full‑precision vectors from SSD/object store after coarse search. (arxiv.org)
  • Important context and background information
    • FaTRQ proposes progressive distance estimation using compact residuals stored in far memory (with ternary codes) and early stopping when candidates fall outside top‑k. (arxiv.org)
  • Recent developments or changes
    • The authors report up to 9× throughput gains vs. a strong GPU ANN system and 2.4× better storage efficiency; they prototype a CXL Type‑2 accelerator. Treat as research to validate on your embeddings/latency targets. (arxiv.org)
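
The early‑stopping idea can be illustrated with a much simpler stand‑in than FaTRQ's estimator: classic early abandoning on partial squared‑L2 distances. Each chunk read below plays the role of one far‑memory fetch; since the partial sum is a lower bound on the full distance, a candidate is dropped as soon as that bound reaches the current k‑th best. This is not the paper's ternary residual scheme, and `refine_top_k`, `chunk`, and the toy data are all assumptions.

```python
import heapq
from typing import List, Sequence, Tuple


def refine_top_k(query: Sequence[float],
                 candidates: Sequence[Sequence[float]],
                 k: int,
                 chunk: int = 2) -> Tuple[List[int], int]:
    """Return (ids of the k nearest candidates by squared L2, number of chunks read)."""
    heap: List[Tuple[float, int]] = []  # max-heap on distance, stored as (-dist, id)
    chunks_read = 0
    for idx, vec in enumerate(candidates):
        partial = 0.0
        abandoned = False
        for start in range(0, len(query), chunk):
            chunks_read += 1  # stands in for one fetch from the slow tier
            partial += sum((q - v) ** 2
                           for q, v in zip(query[start:start + chunk],
                                           vec[start:start + chunk]))
            # partial only grows, so it is a lower bound on the full distance.
            if len(heap) == k and partial >= -heap[0][0]:
                abandoned = True
                break
        if abandoned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-partial, idx))
        else:
            heapq.heapreplace(heap, (-partial, idx))
    ordered = sorted((-neg_dist, idx) for neg_dist, idx in heap)
    return [idx for _, idx in ordered], chunks_read


query = [0.0, 0.0, 0.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],   # far away: abandoned after its first chunk
    [0.0, 0.0, 0.0, 0.5],
]
ids, chunks_read = refine_top_k(query, candidates, k=2)
print(ids, chunks_read)
```

The savings grow with the share of candidates that are clearly outside the top‑k, which is the regime the paper targets.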

SVFusion: CPU‑GPU co‑processing for real‑time vector search with streaming updates

  • Key facts and current state of the topic
    • Real‑time vector search must balance high QPS with frequent inserts/deletes and limited GPU memory. (arxiv.org)
  • Important context and background information
    • SVFusion sketches a hierarchical index with CPU‑GPU‑disk collaboration, workload‑aware vector caching, CUDA multi‑streaming, and concurrency control under mixed update/query loads. (arxiv.org)
  • Recent developments or changes
    • Reported results show 20.9× higher throughput on average and 1.3×–50.7× lower latency vs. baselines at comparable recall under streaming workloads. That makes it a promising direction for systems that need ads‑scale freshness. (arxiv.org)
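
As a rough intuition for workload‑aware caching, the sketch below keeps the most frequently requested vectors in a small fast tier and admits a newcomer only when it is requested more often than the coldest resident. It is a toy frequency‑based policy, not SVFusion's design; `HotVectorCache`, its admission rule, and the dict standing in for the slow tier are all hypothetical.

```python
from typing import Dict, List


class HotVectorCache:
    """Small 'fast tier' that favors frequently requested vector ids."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.store: Dict[int, List[float]] = {}
        self.requests: Dict[int, int] = {}  # per-id request counts
        self.hits = 0
        self.misses = 0

    def get(self, vec_id: int, slow_tier: Dict[int, List[float]]) -> List[float]:
        self.requests[vec_id] = self.requests.get(vec_id, 0) + 1
        if vec_id in self.store:
            self.hits += 1
            return self.store[vec_id]
        self.misses += 1
        vec = slow_tier[vec_id]  # stands in for a CPU-memory or disk read
        if self.capacity <= 0:
            return vec
        if len(self.store) < self.capacity:
            self.store[vec_id] = vec
            return vec
        coldest = min(self.store, key=lambda i: self.requests[i])
        # Admit only if the newcomer is hotter than the coldest resident.
        if self.requests[vec_id] > self.requests[coldest]:
            del self.store[coldest]
            self.store[vec_id] = vec
        return vec


slow_tier = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
cache = HotVectorCache(capacity=2)
for vec_id in [1, 1, 2, 3, 3, 1]:
    cache.get(vec_id, slow_tier)
print(sorted(cache.store), cache.hits, cache.misses)
```

A production policy would also need to decay counts over time and stay coherent under inserts and deletes, which is where the paper's concurrency control comes in.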

Elastic 9.2.4: security update for Lucene/Elasticsearch stacks

  • Key facts and current state of the topic
    • Elastic Stack 9.2.4 released Jan 13, 2026; the security advisories recommend upgrading due to an LZ4 Java information‑disclosure vulnerability affecting prior 9.2.x (and other branches). (elastic.co)
  • Important context and background information
    • Lucene/Elasticsearch underlie many filtered‑ANN and hybrid pipelines; patching minimizes operational risk in candidate generation and ranking tiers. (elastic.co)
  • Recent developments or changes
    • Upgrade guidance and affected version details are in Elastic’s release notes/forum; prioritize clusters running vector + lexical search. (elastic.co)
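
For a quick fleet audit, a small helper can flag 9.2.x nodes that are older than 9.2.4. The sketch only encodes the one fixed version named above; it returns `None` for any other branch, since the fixed versions for those branches are not stated here and should be taken from Elastic's advisory. The function names are hypothetical.

```python
from typing import Optional, Tuple

# Fixed version for the 9.2 branch, as stated in the item above.
FIXED_9_2 = (9, 2, 4)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch', ignoring any '-suffix' such as '-SNAPSHOT'."""
    major, minor, patch = text.strip().split(".")[:3]
    return int(major), int(minor), int(patch.split("-")[0])


def needs_upgrade(version: str) -> Optional[bool]:
    """True/False for 9.2.x; None when the branch is outside what this sketch knows."""
    parsed = parse_version(version)
    if parsed[:2] != FIXED_9_2[:2]:
        return None  # check Elastic's advisory for this branch
    return parsed < FIXED_9_2


for v in ["9.2.3", "9.2.4", "8.19.0"]:
    print(v, needs_upgrade(v))
```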