Retrieval at Scale | Drop for 2026-06-13

TL;DR

Weaviate 1.38.0 is generally available: HFresh, its disk-oriented vector index, graduates to GA, while Namespaces and Nested Object Filtering arrive in preview for multi-tenant and semi-structured workloads. (github.com)
Milvus 2.6.18 (June 5) adds nullable vectors, element-level search over Struct arrays, HTTP/2 for the proxy, and scheduler and stability improvements, all useful for hybrid and constantly updated corpora. (milvus.io)
FLASH-MAXSIM (arXiv, May 28) introduces fused GPU kernels for MaxSim in late-interaction models, reporting speedups of up to 3.9× (A100) and 4.7× (H100), roughly 16× less inference memory and 28× less training memory, with exact ranking preserved. (arxiv.org)
Vespa shows classic sparse retrieval still has headroom: a BM25-plus-features recipe lifted MSMARCO MRR@10 by ~10.8% while generalizing better out of domain. (blog.vespa.ai)
Azure AI Search adds a “retrieval reasoning effort” knob (preview 2026-05-01 API) that trades latency and cost against depth in agentic retrieval, including an optional second retrieval pass for harder queries. (learn.microsoft.com)

Key facts and current state of the topic
- Weaviate 1.38.0 is now generally available; HFresh, a disk‑oriented index inspired by SPFresh, moves to GA for fresher, lower‑RAM retrieval. Namespaces (control‑plane/data isolation) and Nested Object Filtering are introduced in preview. (github.com)
Important context and background information
- Late-interaction and heavily filtered workloads often outgrow in-RAM HNSW. A disk-optimized first stage plus multi-tenant isolation can reduce tail-latency variance and simplify large multi-team clusters. (github.com)
Recent developments or changes
- The 1.38.0 release notes enumerate HFresh performance and footprint optimizations and the initial Namespaces and Nested Object Filtering APIs. Plan canaries on filter-heavy shards before broad rollout. (github.com)

Key facts and current state of the topic
- Milvus v2.6.18 adds nullable vector fields (skip missing embeddings at query time) and element‑level vector search over Struct arrays (returns the matched element/offset). It also enables HTTP/2 on the proxy and improves QueryNode/QueryCoord scheduling. (milvus.io)
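A minimal sketch of the two query semantics described above, written as brute-force pure Python rather than against the Milvus client. The function names (`search_nullable`, `search_elements`), the row layout, and the use of cosine similarity are all assumptions made for illustration; the real API and scoring are defined by the Milvus release notes and documentation.

```python
from math import sqrt
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_nullable(
    query: Vector, rows: List[Tuple[str, Optional[Vector]]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Nullable-vector semantics (assumed): rows with no embedding are skipped."""
    scored = [(rid, cosine(query, vec)) for rid, vec in rows if vec is not None]
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:top_k]


def search_elements(
    query: Vector, rows: List[Tuple[str, List[Vector]]], top_k: int = 3
) -> List[Tuple[str, int, float]]:
    """Element-level semantics (assumed): score every array element and
    report which element (by offset) matched, not just the parent row."""
    hits = []
    for rid, elements in rows:
        for offset, vec in enumerate(elements):
            hits.append((rid, offset, cosine(query, vec)))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits[:top_k]


if __name__ == "__main__":
    rows = [("a", [1.0, 0.0]), ("b", None), ("c", [0.0, 1.0])]
    print(search_nullable([1.0, 0.0], rows))  # row "b" never appears

    docs = [("d1", [[0.0, 1.0], [1.0, 0.0]]), ("d2", [[0.6, 0.8]])]
    print(search_elements([1.0, 0.0], docs, top_k=2))
```

The point of the second function is the return shape: a hit carries the parent id and the offset of the matching element, which is what makes per-element signals usable downstream.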
Important context and background information
- These features help hybrid/near‑real‑time pipelines where embeddings arrive asynchronously and where fine‑grained signals (arrays/Structs) matter for relevance or compliance tagging. (milvus.io)
Recent developments or changes
- Release date: June 5, 2026. The notes call out deadline-aware admission, better recovery under load, and reduced allocations, all aimed at steadier p95/p99 latency during spikes. (milvus.io)

Key facts and current state of the topic
- Late‑interaction (e.g., ColBERT/ColPali) is memory‑bound when materializing full token‑by‑token similarity tensors. FLASH‑MAXSIM fuses compute and reduction, avoiding tensor materialization. (arxiv.org)
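To make the memory argument concrete, here is a small CPU-only sketch of MaxSim scoring computed two ways: once by materializing the full query-token by document-token similarity matrix, and once by streaming over document tokens in blocks while keeping only a running maximum per query token. This is not the paper's GPU kernel; the function names and the `block` parameter are illustrative assumptions, and both functions assume a non-empty document.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_materialized(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Reference version: builds the full |Q| x |D| similarity matrix first."""
    sim = [[dot(q, d) for d in doc_tokens] for q in query_tokens]
    return sum(max(row) for row in sim)


def maxsim_streaming(
    query_tokens: List[Vector], doc_tokens: List[Vector], block: int = 2
) -> float:
    """Fused-style version: similarity and max-reduction happen together,
    so only one running maximum per query token is ever stored."""
    best = [float("-inf")] * len(query_tokens)
    for start in range(0, len(doc_tokens), block):
        chunk = doc_tokens[start:start + block]
        for i, q in enumerate(query_tokens):
            for d in chunk:
                score = dot(q, d)
                if score > best[i]:
                    best[i] = score
    return sum(best)


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    D = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
    print(maxsim_materialized(Q, D), maxsim_streaming(Q, D))
```

Both functions return the same score; the streaming one needs O(|Q|) extra storage instead of O(|Q| x |D|), which is the property the fused kernels exploit at GPU scale.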
Important context and background information
- Production late‑interaction stacks often hit GPU memory/throughput ceilings in both training and serving; IO‑aware kernels can unlock larger batches and corpora without changing model semantics. (arxiv.org)
Recent developments or changes
- The paper reports speedups of up to 3.9× (A100) and 4.7× (H100), ~16× less inference memory and ~28× less training memory, with 100% top-20 agreement vs. an FP32 reference. Worth piloting for ColBERT-class rerankers. (arxiv.org)

Key facts and current state of the topic
- Vespa’s May 29 analysis re-implements BM25 on MSMARCO and shows that adding a small set of rank features lifts MRR@10 from 0.1901 to 0.2106 (a relative gain of ~10.8%) and reportedly doubles generalization on a held-out set. (blog.vespa.ai)
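For readers who want to check the arithmetic, the sketch below computes MRR@k from ranked lists and the relative lift between two scores. The function names and the toy data layout are assumptions for illustration; only the two MRR@10 values come from the write-up.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10
) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.
    Queries with no relevant document in the top k contribute 0."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranking in rankings.items():
        wanted = relevant.get(qid, set())
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in wanted:
                total += 1.0 / rank
                break
    return total / len(rankings)


def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Values quoted in the item above.
    print(f"{relative_lift(0.1901, 0.2106):.1%}")  # 10.8%
```

Note that the ~10.8% figure is a relative gain; the absolute gain is about 0.02 MRR@10.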
Important context and background information
- For ad/search use cases, lexical/sparse baselines (and feature engineering) still provide strong, cost‑efficient candidates and can ease pressure on ANN/LLM budgets. (blog.vespa.ai)
Recent developments or changes
- The write‑up includes configs/plots; consider hybrid BM25+sparse/dense candidates with lightweight learned rerankers before heavier stages. (blog.vespa.ai)
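One common way to merge lexical and dense candidate lists before a reranker is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not the method from the Vespa write-up; the function name, the document ids, and the conventional constant `k=60` are assumptions.

```python
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60, top_n: int = 10
) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))
    over the lists it appears in. Ties are broken by document id."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


if __name__ == "__main__":
    bm25_hits = ["d1", "d2", "d3"]   # hypothetical lexical candidates
    dense_hits = ["d3", "d1", "d4"]  # hypothetical dense candidates
    print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

RRF uses only ranks, so it needs no score calibration between BM25 and embedding similarities, which makes it a cheap baseline to compare against a learned reranker.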

Key facts and current state of the topic
- A new preview setting (2026-05-01 API) lets you pick minimal, low, or medium “reasoning effort” for agentic retrieval. It controls LLM-based query planning, subquery fan-out, and an optional second pass when initial results are weak. (learn.microsoft.com)
Important context and background information
- Useful when tuning recall/latency/token‑spend trade‑offs for mixed knowledge sources; “medium” can trigger a classifier‑gated iterative search and higher ranking limits in supported regions. (learn.microsoft.com)
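As a rough mental model of a gated second pass, the sketch below runs one search, and only at the higher effort level issues a follow-up query when the first results look weak. It is not the Azure AI Search API: `retrieve`, `search`, `rewrite`, and the score threshold standing in for the classifier gate are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Hit = Tuple[str, float]  # (document id, relevance score)


def retrieve(
    query: str,
    search: Callable[[str], List[Hit]],
    effort: str = "low",
    rewrite: Optional[Callable[[str], str]] = None,
    min_score: float = 0.5,
) -> List[Hit]:
    """Single pass by default; at effort="medium" a second pass runs only
    if the best first-pass score falls below `min_score` (assumed gate)."""
    results = search(query)
    if effort != "medium" or rewrite is None:
        return results
    best = max((score for _, score in results), default=0.0)
    if best >= min_score:
        return results  # gate says the first pass was good enough
    second = search(rewrite(query))
    merged: Dict[str, float] = {}
    for doc_id, score in results + second:
        merged[doc_id] = max(score, merged.get(doc_id, score))
    return sorted(merged.items(), key=lambda hit: -hit[1])


if __name__ == "__main__":
    index = {"q": [("a", 0.2)], "q expanded": [("b", 0.9), ("a", 0.3)]}
    expand = lambda q: q + " expanded"
    print(retrieve("q", index.__getitem__, effort="low", rewrite=expand))
    print(retrieve("q", index.__getitem__, effort="medium", rewrite=expand))
```

The extra latency and token spend are paid only on queries that fail the gate, which is the trade-off the effort setting is meant to expose.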
Recent developments or changes
- Docs updated June 11, 2026; the setting is available via knowledge-base defaults or per-request overrides. The feature remains in preview, so validate behavior and costs before production. (learn.microsoft.com)