Retrieval at Scale | Drop for 2026-06-13

TL;DR

  • Weaviate 1.38.0 is released: HFresh (a disk‑oriented vector index) graduates to GA, and Namespaces and Nested Object Filtering arrive in preview for multi‑tenant and semi‑structured workloads. (github.com)
  • Milvus 2.6.18 (June 5) adds nullable vectors, element‑level search over Struct arrays, HTTP/2 for the proxy, and scheduler/stability improvements, all useful for hybrid and frequently updated corpora. (milvus.io)
  • FLASH‑MAXSIM (arXiv, May 28) introduces fused GPU kernels for MaxSim in late‑interaction models, reporting speedups of up to 3.9× (A100) and 4.7× (H100), roughly 16× less inference memory and 28× less training memory, and 100% top‑20 agreement with an FP32 reference. (arxiv.org)
  • Vespa shows that classic sparse retrieval still has headroom: a BM25 plus rank‑features recipe lifted MSMARCO MRR@10 by ~10.8% while generalizing better out‑of‑domain. (blog.vespa.ai)
  • Azure AI Search adds a “retrieval reasoning effort” knob (preview, 2026‑05‑01 API) that trades latency and cost against depth in agentic retrieval, including an optional second retrieval pass for harder queries. (learn.microsoft.com)

Weaviate 1.38.0 GA: HFresh graduates; Namespaces and Nested Object Filtering (preview)

  • Key facts and current state of the topic
    • Weaviate 1.38.0 is now generally available; HFresh, a disk‑oriented index inspired by SPFresh, moves to GA for fresher, lower‑RAM retrieval. Namespaces (control‑plane/data isolation) and Nested Object Filtering are introduced in preview. (github.com)
  • Important context and background information
    • Late‑interaction and heavily filtered workloads often outgrow in‑RAM HNSW. A disk‑optimized first stage plus multi‑tenant isolation can reduce tail‑latency variance and simplify large multi‑team clusters. (github.com)
  • Recent developments or changes
    • The 1.38.0 release notes enumerate HFresh performance/footprint optimizations and the initial Namespaces and Nested‑filtering APIs. Plan canaries on filter‑heavy shards before a broad rollout. (github.com)

Milvus 2.6.18 (June 5): nullable vectors, element‑level Struct search, HTTP/2

  • Key facts and current state of the topic
    • Milvus v2.6.18 adds nullable vector fields (skip missing embeddings at query time) and element‑level vector search over Struct arrays (returns the matched element/offset). It also enables HTTP/2 on the proxy and improves QueryNode/QueryCoord scheduling. (milvus.io)
  • Important context and background information
    • These features help hybrid/near‑real‑time pipelines where embeddings arrive asynchronously and where fine‑grained signals (arrays/Structs) matter for relevance or compliance tagging. (milvus.io)
  • Recent developments or changes
    • Release date: June 5, 2026. The notes call out deadline‑aware admission, better recovery under load, and reduced allocations, all aimed at steadier p95/p99 latency during spikes. (milvus.io)
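
To make the two query‑side features concrete, here is a minimal pure‑Python sketch of the semantics described above: skipping rows whose vector is missing, and returning the offset of the best‑matching element inside an array of element vectors. This is a conceptual illustration only, not the Milvus API; the function names (`search_nullable`, `search_elements`), the in‑memory dict layout, and the choice of inner product as the metric are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    """Plain inner product; the metric is an assumption for this sketch."""
    return sum(x * y for x, y in zip(a, b))


def search_nullable(
    rows: Dict[str, Optional[Vector]], query: Vector, k: int
) -> List[Tuple[str, float]]:
    """Top-k by inner product, skipping rows whose vector is still None."""
    scored = [
        (row_id, inner_product(vec, query))
        for row_id, vec in rows.items()
        if vec is not None  # embedding has not arrived yet: skip, do not fail
    ]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)[:k]


def search_elements(
    rows: Dict[str, List[Vector]], query: Vector
) -> List[Tuple[str, int, float]]:
    """For each row, return (row_id, offset, score) of its best element."""
    hits = []
    for row_id, elements in rows.items():
        if not elements:
            continue  # nothing to match in an empty array
        offset, score = max(
            ((i, inner_product(e, query)) for i, e in enumerate(elements)),
            key=lambda pair: pair[1],
        )
        hits.append((row_id, offset, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    rows = {"a": [1.0, 0.0], "b": None, "c": [0.0, 1.0]}
    print(search_nullable(rows, [1.0, 0.0], k=2))

    structs = {"doc1": [[0.0, 1.0], [1.0, 0.0]], "doc2": [[0.5, 0.5]]}
    print(search_elements(structs, [1.0, 0.0]))
```

In this toy layout, row "b" never appears in results because its embedding is absent, and the element search reports which array position matched, which is the signal a downstream compliance or highlighting step would consume.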

FLASH‑MAXSIM: fused GPU kernels for late‑interaction (MaxSim) scoring

  • Key facts and current state of the topic
    • Late‑interaction scoring (e.g., ColBERT/ColPali) becomes memory‑bound when it materializes the full token‑by‑token similarity tensor. FLASH‑MAXSIM fuses the similarity computation with the max/sum reduction, so the tensor is never materialized. (arxiv.org)
  • Important context and background information
    • Production late‑interaction stacks often hit GPU memory/throughput ceilings in both training and serving; IO‑aware kernels can unlock larger batches and corpora without changing model semantics. (arxiv.org)
  • Recent developments or changes
    • The paper reports speedups of up to 3.9× (A100) and 4.7× (H100), roughly 16× less inference memory and 28× less training memory, and 100% top‑20 agreement with an FP32 reference. It is worth piloting for ColBERT‑class rerankers. (arxiv.org)
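
The idea of fusing compute and reduction can be illustrated on the CPU. The sketch below is not the paper's kernel; it is a minimal pure‑Python analogue that contrasts a MaxSim implementation which builds the full query‑by‑document similarity matrix with one that keeps only a running maximum per query token while streaming over document tokens in chunks. The function names and the `chunk_size` parameter are assumptions for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_materialized(query: List[Vector], doc: List[Vector]) -> float:
    """MaxSim via the full |Q| x |D| similarity matrix (memory-hungry)."""
    sim = [[dot(q, d) for d in doc] for q in query]
    # For each query token take its best document token, then sum.
    return sum(max(row) for row in sim)


def maxsim_streaming(
    query: List[Vector], doc: List[Vector], chunk_size: int = 2
) -> float:
    """MaxSim with O(|Q|) extra state: similarity and max are fused."""
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), chunk_size):
        chunk = doc[start:start + chunk_size]
        for i, q in enumerate(query):
            for d in chunk:
                s = dot(q, d)
                if s > running_max[i]:
                    running_max[i] = s  # reduce immediately, store nothing else
    return sum(running_max)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
    print(maxsim_materialized(query, doc), maxsim_streaming(query, doc))
```

Both functions return the same score for a non‑empty document; the streaming variant simply never holds more than one running maximum per query token, which is the property a fused GPU kernel exploits at much larger scale.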

Vespa: BM25 (+ rank features) re‑examined; 10.8% MRR@10 lift, better generalization

  • Key facts and current state of the topic
    • Vespa’s May 29 analysis re‑implements BM25 on MSMARCO and shows that adding a small set of rank features lifts MRR@10 from 0.1901 to 0.2106 (a ~10.8% relative gain) while doubling generalization on a held‑out set. (blog.vespa.ai)
  • Important context and background information
    • For ad/search use cases, lexical/sparse baselines (and feature engineering) still provide strong, cost‑efficient candidates and can ease pressure on ANN/LLM budgets. (blog.vespa.ai)
  • Recent developments or changes
    • The write‑up includes configs and plots. Consider hybrid BM25 plus sparse/dense candidate generation with lightweight learned rerankers before heavier stages. (blog.vespa.ai)
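
For readers who want to check the headline number or reproduce the metric on their own runs, here is a small sketch of MRR@k and the relative‑lift arithmetic. The helper names (`mrr_at_k`, `relative_lift`) and the dict‑based run/qrels layout are assumptions for illustration and are not taken from the Vespa write‑up; only the two MRR@10 values come from the item above.

```python
from typing import Dict, List, Set


def mrr_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for query_id, ranking in runs.items():
        relevant = qrels.get(query_id, set())
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(runs)


def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Figures quoted in the item above.
    print(f"{relative_lift(0.1901, 0.2106):.1%}")  # 10.8%

    # Toy example: first relevant hits at ranks 2 and 3.
    runs = {"q1": ["d3", "d1"], "q2": ["d9", "d8", "d2"]}
    qrels = {"q1": {"d1"}, "q2": {"d2"}}
    print(mrr_at_k(runs, qrels))
```

The lift is relative, not absolute: the absolute MRR@10 gain is 0.0205, which is about 10.8% of the 0.1901 baseline.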

Azure AI Search: “retrieval reasoning effort” (preview) for agentic retrieval

  • Key facts and current state of the topic
    • A new preview setting (2026‑05‑01 API) lets you pick minimal, low, or medium “reasoning effort” for agentic retrieval. The level controls LLM‑based query planning, subquery fan‑out, and an optional second pass when initial results are weak. (learn.microsoft.com)
  • Important context and background information
    • Useful when tuning recall/latency/token‑spend trade‑offs for mixed knowledge sources; “medium” can trigger a classifier‑gated iterative search and higher ranking limits in supported regions. (learn.microsoft.com)
  • Recent developments or changes
    • Docs updated June 11, 2026; the setting is available via knowledge‑base defaults or per‑request overrides. The feature remains in preview, so validate behavior and costs before production. (learn.microsoft.com)
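
The control flow implied by the three levels can be sketched as follows. This is a hypothetical illustration of how planning, fan‑out, and a gated second pass might compose; it is not Azure's implementation and does not use the Azure SDK. Every name here (`agentic_retrieve`, `plan_subqueries`, `looks_weak`, `refine`) is invented for the example, and the mapping of "minimal" to "no LLM planning" and of the second pass to "medium" only is an assumption to verify against the docs.

```python
from typing import Callable, List

EFFORT_LEVELS = ("minimal", "low", "medium")


def agentic_retrieve(
    query: str,
    effort: str,
    search: Callable[[str], List[str]],
    plan_subqueries: Callable[[str], List[str]],
    looks_weak: Callable[[str, List[str]], bool],
    refine: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Hypothetical effort-gated retrieval loop (not a real service API)."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")

    # Assumption: "minimal" skips LLM planning and runs the query as-is.
    subqueries = [query] if effort == "minimal" else plan_subqueries(query)
    results = [hit for sq in subqueries for hit in search(sq)]

    # Assumption: only "medium" may run one classifier-gated follow-up pass.
    if effort == "medium" and looks_weak(query, results):
        for sq in refine(query, results):
            results.extend(search(sq))

    # De-duplicate while preserving first-seen order.
    return list(dict.fromkeys(results))


if __name__ == "__main__":
    # Stub callables standing in for a search backend, planner, and classifier.
    search = lambda q: [f"{q}-hit"]
    plan = lambda q: [f"{q} a", f"{q} b"]
    weak = lambda q, hits: len(hits) < 3
    refine = lambda q, hits: [f"{q} c"]
    for level in EFFORT_LEVELS:
        print(level, agentic_retrieve("x", level, search, plan, weak, refine))
```

A harness like this, with real latency and token counters wired into the stubs, is one way to compare cost per level on a sample of production queries before choosing a default.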