Retrieval at Scale | Drop for 2025-12-10

TL;DR

Five notable retrieval updates since Nov 27, 2025: (1) Faiss v1.13.1 adds multi‑bit RaBitQ and bakes PANORAMA into Flat/HNSW for faster verification; (2) Amazon S3 Vectors became GA at re:Invent, scaling to 2B vectors/index and integrating with Bedrock KBs and OpenSearch; (3) Amazon OpenSearch Service introduced serverless GPU‑accelerated index builds and auto‑optimization for vector indexes; (4) Vespa shipped SIMD (Google Highway) speedups that lower exact and ANN latency on x86 and Graviton; (5) a new arXiv paper proposes query‑adaptive ef for HNSW (Ada‑ef), reporting large latency reductions while meeting target recall.

Faiss v1.13.1: multi‑bit RaBitQ + PANORAMA integrated

Key facts and current state of the topic
- Faiss remains a primary ANN workhorse; quantization and verification‑stage costs often bound recall/latency at scale. v1.13.1 was released on December 3, 2025. (github.com)
Important context and background information
- PANORAMA accelerates the final distance‑verification stage, and RaBitQ improves binary quantization fidelity. With both integrated into a stable Faiss release, the aim is lower end‑to‑end latency at unchanged recall; confirm on your own data. (github.com)
Recent developments or changes
- v1.13.1 adds multi‑bit RaBitQ (2–9 bits), integrates PANORAMA into IndexFlatL2Panorama and IndexHNSWFlatPanorama, and optimizes the scalar quantizer. Worth A/B testing against PQ and the earlier RaBitQ at your target recall. (github.com)
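
For the A/B test suggested above, the comparison metric is usually recall@k against exact ground truth. The following is a minimal sketch in plain NumPy, with no Faiss calls. The helper names `exact_topk` and `recall_at_k` are hypothetical, and the "approximate" results come from a coarsely rounded copy of the data, which only stands in for a real quantized index.

```python
import numpy as np

def exact_topk(xb, xq, k):
    """Brute-force L2 top-k ids, used here as ground truth."""
    # Squared L2 distance expanded as ||q||^2 - 2 q.b + ||b||^2
    d2 = (xq ** 2).sum(1, keepdims=True) - 2.0 * xq @ xb.T + (xb ** 2).sum(1)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(candidate_ids, truth_ids):
    """Mean fraction of true top-k neighbours found in each candidate list."""
    hits = [len(set(c) & set(t)) for c, t in zip(candidate_ids, truth_ids)]
    return sum(hits) / (truth_ids.shape[0] * truth_ids.shape[1])

rng = np.random.default_rng(0)
xb = rng.standard_normal((2000, 32)).astype("float32")  # database vectors
xq = rng.standard_normal((50, 32)).astype("float32")    # query vectors
truth = exact_topk(xb, xq, k=10)

# Stand-in for an approximate index (assumption): search a rounded copy of xb
approx = exact_topk(np.round(xb), xq, k=10)

print("exact vs exact :", recall_at_k(truth, truth))
print("approx vs exact:", recall_at_k(approx, truth))
```

In a real comparison, `approx` would be the id matrix returned by each candidate index's search call, with latency measured at the configuration where every candidate reaches the same recall.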

Amazon S3 Vectors GA: durable, low‑cost vector storage at scale

Key facts and current state of the topic
- Announced GA on December 2, 2025: up to 2 billion vectors per index (40× preview), up to 10,000 indexes per vector bucket, and native APIs for storing and querying vectors, with no infrastructure to manage. (aws.amazon.com)
Important context and background information
- Integrates with Amazon Bedrock Knowledge Bases and OpenSearch, enabling tiered/hybrid storage for RAG, semantic search, and agent memory with up to 90% lower costs vs. specialized vector DBs (per AWS). (aws.amazon.com)
Recent developments or changes
- Frequent‑query latencies target ~100 ms or less, and GA expands coverage from 5 to 14 Regions. Both matter for globally distributed retrieval and cost‑aware cold‑vector tiers. (aws.amazon.com)
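
To put the GA limits in perspective, here is a back‑of‑envelope sketch of what they imply. The 1024‑dimension float32 embedding is an assumption for illustration only, and the figures cover raw vector payload, not metadata, service overhead, or billing.

```python
GA_VECTORS_PER_INDEX = 2_000_000_000  # from the GA announcement
PREVIEW_MULTIPLIER = 40               # GA limit is quoted as 40x preview
MAX_INDEXES_PER_BUCKET = 10_000       # from the GA announcement

# Implied preview limit per index
preview_limit = GA_VECTORS_PER_INDEX // PREVIEW_MULTIPLIER

def raw_float32_bytes(num_vectors, dim):
    """Raw payload size: 4 bytes per float32 component."""
    return num_vectors * dim * 4

DIM = 1024  # assumption: a typical embedding width, not an AWS figure
full_index_tb = raw_float32_bytes(GA_VECTORS_PER_INDEX, DIM) / 1e12
bucket_ceiling = GA_VECTORS_PER_INDEX * MAX_INDEXES_PER_BUCKET

print(f"implied preview limit : {preview_limit:,} vectors/index")
print(f"one full index, raw   : {full_index_tb:.3f} TB at dim={DIM}")
print(f"bucket ceiling        : {bucket_ceiling:,} vectors")
```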

Amazon OpenSearch Service: GPU‑accelerated builds + auto‑optimized vector indexes

Key facts and current state of the topic
- New capabilities (Dec 2, 2025) let you build billion‑scale vector indexes in under an hour using serverless GPU acceleration, and auto‑optimize index settings to meet specified latency/recall targets. (aws.amazon.com)
Important context and background information
- Auto‑optimization explores k‑NN algorithms, quantization, and engine parameters; GPUs activate on‑demand, reducing indexing time and cost without provisioning GPU instances. (aws.amazon.com)
Recent developments or changes
- AWS reports up to 10× faster builds at ~25% of indexing cost; features are available on OpenSearch Service 3.1+ (GPU) and 2.17+ (auto‑optimize) in selected Regions, so plan controlled rollouts. (aws.amazon.com)

Vespa: SIMD (Google Highway) acceleration for exact distance and HNSW

Key facts and current state of the topic
- Vespa’s December 3, 2025 newsletter details Google Highway–based SIMD kernels that speed exact vector distance and improve HNSW performance on Intel and AWS Graviton. (blog.vespa.ai)
Important context and background information
- Exact search can beat ANN under heavy filtering; faster exact distance helps hybrid and filtered pipelines while also improving HNSW feed/search throughput. (blog.vespa.ai)
Recent developments or changes
- Reported gains include roughly 20% lower HNSW latency and 20–25% lower exact‑search latency on x86, plus up to 27% lower exact latency on Graviton and higher QPS for binary Hamming. The gains arrive automatically in Vespa Cloud or via upgrade. (blog.vespa.ai)
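
As an illustration of the exact, filtered binary‑Hamming case mentioned above, here is a minimal NumPy sketch. It is not Vespa code and uses no SIMD kernels; the function names, the sign‑based binarisation, and the `allowed` id filter are all assumptions for the example.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack_signs(x):
    """Binarise float vectors by sign and pack 8 bits per byte."""
    return np.packbits(x > 0, axis=1)

def hamming_topk(codes, query_code, k, allowed=None):
    """Exact Hamming top-k, optionally restricted to ids that pass a filter."""
    ids = np.arange(len(codes)) if allowed is None else np.asarray(allowed)
    # XOR marks differing bits; the table lookup counts them per byte
    dist = POPCOUNT[codes[ids] ^ query_code].sum(axis=1, dtype=np.int64)
    order = np.argsort(dist, kind="stable")[:k]
    return ids[order], dist[order]

rng = np.random.default_rng(0)
codes = pack_signs(rng.standard_normal((1000, 64)))  # 64 bits -> 8 bytes each
allowed = np.arange(7, 1000, 10)                     # a filter keeping 10% of ids
ids, dists = hamming_topk(codes, codes[7], k=5, allowed=allowed)
print(ids, dists)
```

With a selective filter, the scan touches only the surviving ids, which is why faster exact distance kernels matter for filtered pipelines.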

Adaptive‑ef for HNSW: distribution‑aware exploration with recall targets

Key facts and current state of the topic
- A December 7, 2025 arXiv paper proposes Ada‑ef, which dynamically sets HNSW’s ef per query to approximately meet a declarative recall target. (arxiv.org)
Important context and background information
- Static ef is distribution‑agnostic and can over/under‑search skewed workloads; a learned, update‑friendly model can cut wasted compute while preserving recall. (arxiv.org)
Recent developments or changes
- On real embeddings (e.g., OpenAI/Cohere), authors report up to 4× lower online query latency versus learning‑based adaptive baselines, with 50× less offline compute and 100× less offline memory to train the model. Promising for p95‑bounded candidate stages. (arxiv.org)
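
To make the idea of per‑query ef concrete, here is a deliberately naive sketch. It is not the Ada‑ef method: the paper uses a model to choose ef, whereas this heuristic doubles ef until the top‑k ids stop changing, at the cost of repeated searches. `search_fn(query, k, ef)` is a hypothetical callable that would wrap an HNSW library, and the toy search function below only simulates easy and hard queries.

```python
def search_with_adaptive_ef(search_fn, query, k, ef_start=16, ef_max=512):
    """Double ef until the top-k id set is stable or ef_max is reached.

    search_fn(query, k, ef) is a hypothetical wrapper returning k ids.
    Returns the final ids and the largest ef that was actually used.
    """
    ef = max(ef_start, k)
    prev = list(search_fn(query, k, ef))
    while ef < ef_max:
        ef = min(ef * 2, ef_max)
        cur = list(search_fn(query, k, ef))
        if set(cur) == set(prev):
            return cur, ef  # results unchanged: stop exploring
        prev = cur
    return prev, ef

def make_toy_search(converges_at):
    """Toy stand-in: results are only correct once ef >= converges_at."""
    def search_fn(query, k, ef):
        ids = list(range(k))
        if ef < converges_at:
            ids[-1] = 1000 + ef  # a wrong neighbour that changes with ef
        return ids
    return search_fn

easy_ids, easy_ef = search_with_adaptive_ef(make_toy_search(16), None, k=10)
hard_ids, hard_ef = search_with_adaptive_ef(make_toy_search(100), None, k=10)
print("easy query stopped at ef =", easy_ef)
print("hard query stopped at ef =", hard_ef)
```

The easy query stops after one confirmation step while the hard query keeps widening, which is the behaviour a static ef cannot provide. A model‑based approach aims for the same effect without the repeated searches.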