~/beyond-clip — field-guide.sh
user@multimodal:~$ cat field_guide.md | render --theme=geek

Beyond CLIP

A first-principles field guide to multimodal embedding — from CLIP to the post-embedding agent stack. Not a neutral survey. A field manual for those who train, ship, and operate retrieval at production scale.

◉ UPDATED · APR 2026 📄 100+ PAPERS ⟁ 2ND EDITION ◆ OPEN FIELD GUIDE
NOTE · READ ME FIRST

This is not a neutral survey. It is a field guide for people who build, train, or deploy multimodal retrieval at production scale. It takes positions, names the trade-offs other surveys hide, and flags where the community is fooling itself. Every quantitative claim is either tied to a primary source or explicitly marked as a snapshot.

TL;DR · Four claims

C–L–I is the only map

Every architectural debate since CLIP is a point on the Compression · Localization · Instruction Pareto surface. "Dual-tower vs MLLM", "single-vector vs late interaction", "zero-shot vs instructed" are three projections of the same triangle. A production stack is never one model — it is a C-axis first stage, an L-axis structured stage for documents, and an I-axis reasoning rerank. The triangle is not a choice, it is a stack.

Era 3 = MLLM encoder + late interaction

An 8B Qwen3-VL-Embedding hits 77.82 on MMEB-V2 (Apr 2026) — roughly ten points above any pure dual-tower. Nemotron-ColEmbed-V2-8B leads ViDoRe V3 at 63.42 NDCG@10 — a late-interaction configuration that was economically impractical until MUVERA (NeurIPS 2024) compressed multi-vector storage ~32× with ~10% higher recall and ~90% lower latency than PLAID on BEIR subsets. Both unlocks are real; neither is universal.

Benchmarks lie — by design and by use

The spread across the MMEB-V2 top 20 is less than one architectural generation's worth of progress. Most "new SOTA" is instruction-format tuning on data that overlaps with training. On held-out splits, public numbers drop several points. A single MMEB-V2 score quoted without CSV-vs-paper provenance and without held-out verification is meaningless; gaps below ~2 points are noise. The next real breakthrough is evaluation, not models.

Compute migrates to rerank

Every 2025–2026 winner (MoCa, CoCoA, TTE, UniME-V2, RzenEmbed, AutoThinkRAG) spends compute where the 2023 generation did not: bidirectional CPT, EOS-reconstruction pretext, MLLM-as-judge negatives, reasoning traces before embedding. Extra thinking at rerank → +7.5 NDCG@10; at expansion → +1.1; inside retriever → ≈0. Scaling the encoder is over; the next five years are about where in the pipeline the compute lives.


Part I · The C–L–I Framework

A multimodal retriever maps a heterogeneous item (image, text, page, clip, audio, …) into a geometry in which semantically similar means geometrically close. Every architectural choice trades along three near-orthogonal axes:

| Axis | Name | What it controls | Cheap if… | Expensive if… |
|---|---|---|---|---|
| C | Compression | Semantics fit into a single fixed-dim vector | Corpus semantics simple per item; queries are global | Items are dense (pages, long video, multi-entity scenes) |
| L | Localization | Granularity of within-item alignment (1-vec ↔ multi-vec ↔ token MaxSim) | Items have few sub-units; MaxSim compute is cheap | Items hold many retrievable sub-units (words/frames) |
| I | Instruction adaptivity | Encoder output depends on a per-query task instruction | Workload is uniform | Heterogeneous RAG with per-query task semantics |
[Figure: the C–L–I triangle — C · Compression (single-vector capacity), L · Localization (within-item granularity), I · Instruction (task conditioning) — with representative systems at each vertex and the production stack drawn as C-first · L-middle · I-rerank.]
FIG 1.1 · The C–L–I triangle. Every encoder is a point; every production pipeline is a path.

Four representative systems, projected on the triangle:

| System family | C | L | I | Sweet spot |
|---|---|---|---|---|
| CLIP / SigLIP / jina-clip / MetaCLIP 2 | high (one 512–1024d vec) | none | none | Uniform photo retrieval, classification, fast first stage |
| ColPali / ColQwen3 / Nemotron-ColEmbed | low (1000+ vecs/item) | high (token-patch MaxSim) | low–mid | Visual documents, slides, dense pages |
| E5-V / VLM2Vec / GME / Qwen3-VL-Embedding / Seed1.6 | medium (one 2–4k-d vec) | low | high (task-conditioned) | Instructed retrieval, heterogeneous RAG, composition |
| Think-Then-Embed / UniME-V2 / AutoThinkRAG | medium | low–mid | very high (externalized reasoning) | Reasoning-heavy, ambiguous, compositional queries |

Three load-bearing properties of the frame, each with its falsifier.

  • The axes are near-orthogonal but not independent — pushing C weakly helps L and I; pushing I barely helps C or L; most papers concentrate gains on one axis. A single architectural change that lifts all three by comparable magnitudes at fixed budget would refute this; none has been reported.
  • Every benchmark selects an axis. MMEB-V2 is I-heavy; ViDoRe is L-heavy; MIEB tries to balance C across eight categories; Flickr/MSCOCO weight C almost alone. A benchmark whose per-task scorecard cleanly separates C/L/I contributions would falsify the selectivity claim; MIEB gestures toward this but does not close it.
  • Production stacks always hybridize. The 2026 retrieval pipeline is a C-axis first stage (SigLIP 2 / jina-clip-v2 / MetaCLIP 2) + an L-axis structured stage for document corpora (ColQwen3 / Nemotron-ColEmbed-V2) + an I-axis reranker or reasoner (Qwen3-VL-Embedding / Think-Then-Embed / MLLM-as-judge). A single-model deployment that dominates heterogeneous workloads without a separate reranker or dual index would falsify this; its appearance would reorganize the field.

Five years of history, in three moves

ERA 1 · 2021 – 2023
Push C, ignore L and I

CLIP (2021) → OpenCLIP → SigLIP → EVA-CLIP-18B. Single-vector dual-tower, contrastive on web image-text pairs. It worked, and it hit a ceiling: no localization, no instruction conditioning, saturation on anything compositional (ARO, MMVP), long-text (Long-CLIP), or high-resolution (PS3).

ERA 2 · 2023 – 2025
Keep C, seed L and I

Three parallel waves: data replaces architecture as the primary lever (VeCLIP, LaCLIP, DAC, DFN, DCI, DreamLIP, Long-CLIP — +5–10 R@1 at fixed architecture); L gets a foothold with ColPali (Jun 2024), first credible demo that patch-level MaxSim beats global-vector retrieval on visual docs; I takes root with UniIR, MMEB-V1, MagicLens, M-BEIR, then E5-V / VLM2Vec turning MLLMs into instruction-conditioned embedders.

ERA 3 · 2025 – 2026
I becomes first-class

MLLM contrastive fine-tunes produce 2–4k-d vectors with materially higher capacity than Era-1 dual-towers; instruction conditioning becomes table stakes; MUVERA + 2025–2026 DB wave turns late interaction from niche tool to default primitive for large, dynamic corpora; reasoning-augmented embedders (TTE, UniME-V2, AutoThinkRAG) move compute to inference time, ahead of the embedding itself. Era-3 retrieval is a stack, not a model.

[Figure: 2021→2026 timeline — Era 1 (2021–2023, push C, ignore L+I: CLIP, OpenCLIP, SigLIP, EVA-CLIP), Era 2 (2023–2025, keep C, seed L+I: VeCLIP, MMEB-V1, ColPali, VLM2Vec, MUVERA), Era 3 (2025–2026, I first-class, C unlocked, L practical: MoCa, TTE, UniME-V2, Qwen3-VL, Nemotron).]
FIG 1.2 · From contrastive dual-towers to reasoning-augmented MLLM stacks.
IMPORTANT

From here on, read every section as answering a single question: which axis is being pushed, at what cost to the other two?


Part II · The Two Unlocks: MLLM Encoders & Late Interaction

Era 3 is defined by two structural changes that happen to be near-orthogonal on C–L–I: a causal decoder turning out to be a better C+I encoder than any bidirectional alternative, and late interaction turning out to be affordable at scale. Each solves a problem the other cannot touch. The modality gap — a long-standing worry about the CLIP geometry — is dissolved as a side effect of both.

2.1 Why a causal MLLM is secretly a good encoder

On paper, using a causal decoder for retrieval is strange. Attention is triangular; the last token sees everything but nothing sees the last token; retrieval similarity is symmetric and causal LMs are asymmetric by construction. Yet E5-V, VLM2Vec, GME, Seed1.6-Embedding, Qwen3-VL-Embedding, KaLM-Embedding, MoCa, RzenEmbed — all follow the same recipe (run a forward pass, pool the last hidden state, done) and beat every bidirectional alternative at comparable scale. Four mechanisms explain why; a fifth explains where it breaks.

The last-token bottleneck is a feature, not a bug. Next-token prediction forces every preceding token to contribute signal to the final position — otherwise the predictor cannot be coherent. A well-pretrained causal LM therefore already contains, hidden inside it, an encoder whose output is the last hidden state: a task-agnostic compressed summary. Contrastive fine-tuning rotates this summary into retrieval-friendly subspace; it rarely has to build the summary from scratch. This is Jiang et al.'s argument for text (LLM2Vec); it generalizes to multimodal inputs because cross-modal tokens pass through the same residual stream.
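For concreteness, a minimal sketch of that recipe in code — the model name and the summarization prompt are placeholders standing in for whichever causal backbone and template a given paper uses:

```python
# Minimal last-token pooling from a causal backbone. The model name and prompt
# template below are placeholders (assumptions), not the exact recipe of E5-V
# or any other specific paper.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"   # placeholder; any causal LM / MLLM backbone works in principle
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, torch_dtype=torch.float32).eval()

def embed(text: str) -> torch.Tensor:
    # A summarization-style prompt routes content toward the final position.
    prompt = f"{text}\nSummarize the above in one word:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)                     # base model, no LM head
    vec = out.last_hidden_state[0, -1]            # last-token pooling
    return torch.nn.functional.normalize(vec, dim=-1)

doc_vec = embed("A line chart of Q3 revenue by region, with Europe leading.")
query_vec = embed("Which region had the highest Q3 revenue?")
print(float(query_vec @ doc_vec))                 # cosine similarity of unit vectors
```

Contrastive fine-tuning then trains exactly this pooled vector; the forward pass does not change.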

Cross-modal interleaving shares the compression subspace across modalities. In CLIP, image and text hidden states are produced by different networks glued by a shared projection. In an MLLM embedder, they pass through the same transformer stack, interleaved in a single residual stream, with attention free to mix them at every layer. The functional subspace used for retrieval is therefore shared by construction. This is the C-axis property that dissolves the modality gap (Liang et al., NeurIPS 2022) at the pooled token: CLIP's narrow disjoint cones are a free-energy minimum of the dual-tower contrastive loss landscape — no gradient reason to close them. MLLMs do not abolish the gap (hidden states still cluster by modality layer-by-layer) but they relocate it to the subspace used for retrieval, which is all retrieval needs. The papers that explicitly address modality geometry — MoCa, UniME, Kuaishou's Modality Curation — pick up additional points on cross-modal compositional queries where the residual gap still bites.

Instruction prompts act as soft task heads. The E5-V prompt Summarize the above image in one word: does the job of a "task head" in CLIP's text tower — it tells the model which retrieval task it is performing and empirically routes information through different parts of the last hidden state. MMEB-style per-task instructions generalize this: the encoder learns a small, fast router from instruction to retrieval subspace. This is pure I-axis capability with no CLIP analogue.

The minority report: bidirectional beats causal at the margin. Two independent groups now show that flipping continual pretraining to bidirectional attention plus a masked or EOS-reconstruction pretext recovers measurable points. MoCa (arXiv 2506.23115, Renmin U. + Microsoft Research Asia) runs a modality-aware continual pretraining stage with bidirectional attention + text MLM + image MAE, then contrastive fine-tuning: MoCa-7B reaches 71.5 on MMEB-V1 vs 67.1 for a bidirectional-only baseline (≈ +4.4) and vs 69.8 for mmE5. CoCoA (arXiv 2603.01471, ICT/CAS + Baidu) replaces the pretext with collaborative attention + EOS-based reconstruction, reaching 70.6 at 7B. The effect is real, reproducible across objectives, and cheap — roughly one extra pretraining epoch. Open base models (Qwen3-VL, InternVL3) are still purely causal. This is low-hanging fruit; see §VI for why it lands by Q4 2026.

The 2026 turn: compute before the embedding. A newer family shows the next axis beyond bigger encoders. Think-Then-Embed (TTE) (arXiv 2510.05014, Meta + UCF + NYU, ICLR 2026) first generates a reasoning trace with a small MLLM reasoner, then conditions the embedder on both the query and the trace — reported 7% absolute MMEB improvement over recently proposed baselines on the 78-task average. UniME-V2 (arXiv 2510.13515, Alibaba, AAAI 2026) replaces the brittle similarity heuristic in contrastive hard-negative mining with MLLM-as-a-judge soft scoring. RzenEmbed (arXiv 2510.27350, Qihoo 360) adds hardness-weighted contrastive loss; AutoThinkRAG (arXiv 2603.05551) routes by query complexity and reports 82.13% on DocBench with 18.9% fewer per-query tokens than naïve multimodal RAG. Each is a migration of compute from training to inference, from encoder to reasoner — the same dynamic that played out in chat LLMs with o1/Claude reasoning modes, arriving in retrieval roughly one year later. The 2026–2027 recipe almost certainly combines causal MLLM backbone + bidirectional continual pretraining + inference-time reasoning trace + MLLM-as-judge hard-negative mining. Anyone selling one lever is selling a quarter of the answer.
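The pattern itself is tiny once the models exist. A minimal sketch of the reason-then-embed control flow — the general pattern only, not the TTE authors' implementation; both callables are stand-ins:

```python
# The reason-then-embed pattern in the abstract: a small reasoner writes a short
# trace, and the embedder encodes query + trace together. General pattern only;
# both callables are stand-ins for whatever models a deployment uses.
from typing import Callable, Sequence

def think_then_embed(
    query: str,
    reason: Callable[[str], str],               # small MLLM reasoner -> trace text
    embed: Callable[[str], Sequence[float]],    # instruction-conditioned embedder
) -> Sequence[float]:
    trace = reason(query)                       # inference-time compute, spent before embedding
    return embed(f"{query}\n\nReasoning: {trace}")
```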

2.2 Late interaction as within-item localization

ColPali / ColQwen3 / ColNomic / Nemotron-ColEmbed-V2 dominate visual-document retrieval (Nemotron-ColEmbed-V2-8B leads ViDoRe V3 at 63.42 NDCG@10 per NVIDIA's Feb-2026 pipeline snapshot) and barely move Flickr30K / MSCOCO text↔image retrieval. Under C–L–I this is exactly what you expect: a PDF page rendered to pixels produces ~1,000 patches, each aligned to a retrievable sub-unit (word, cell, axis label). MaxSim over query tokens against these patch tokens is ColBERT on a document that happens to be pixels. On photographs the same patches correspond to texture and lighting — units that are not individually retrievable. Page queries are multi-concept ("Q3 revenue in Europe") and MaxSim aggregates multi-token relevance; photo queries are typically single-concept and a single global vector losslessly compresses them.
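The scoring rule itself is small. A minimal sketch, assuming query and page have already been encoded into per-token vectors:

```python
# ColBERT/ColPali-style MaxSim: each query token keeps the similarity of its
# best-matching document patch, and the per-token maxima are summed.
import torch

def maxsim(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: [n_q, d], doc_tokens: [n_d, d]
    q = torch.nn.functional.normalize(query_tokens, dim=-1)
    d = torch.nn.functional.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                           # [n_q, n_d] token-patch similarities
    return sim.max(dim=-1).values.sum()     # best patch per query token, summed

# Toy shapes: a 12-token query against a ~1,030-patch page in 128-d.
score = maxsim(torch.randn(12, 128), torch.randn(1030, 128))
```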

This is a framing point the community keeps missing: the architecture community calls this "late interaction"; the retrieval community should call it within-item localization. It is a different problem that happens to share the dense-vector toolkit. L-axis is useful only when items have queryable sub-units; on anything else, the ~1000× storage tax is unrecoverable.

2.3 The MUVERA era

Late interaction's Achilles' heel was always storage. At 10M pages × ~1,030 tokens × 128-d fp16, raw vectors alone come to ~2.6 TB before any compression, and MaxSim compute scales linearly with token count. Three waves of compression closed the gap:

| Technique | Year | Effect | Residual cost |
|---|---|---|---|
| PLAID centroid clustering (ColBERT-v2) | 2023 | ~8–10× storage reduction; candidate-filter + rerank | Slightly slower query path |
| Binary / int8 quantization of MaxSim tensors | 2024 | ~4–8× storage; recall −1 to −3 pts | QAT helps |
| MUVERA: Fixed Dimensional Encodings (arXiv 2405.19504, NeurIPS 2024) | 2024 → DB 2025–26 | Combined with PQ-256-8: ~32× storage vs raw at negligible quality loss. Vs PLAID on 6 BEIR subsets: ~10% higher recall, ~90% lower latency. Weaviate's blog reports ~70% memory reduction on LoTTE. | Requires DB support |
[Figure: before/after MUVERA — a ColPali page at 1,030 tokens × 128d (~2.6 TB per 10M pages, MaxSim O(Nq × Nd) per query) vs a single FDE + PQ-256-8 vector (~80 GB per 10M pages, cheap inner product on any ANN index): ~32× storage, +10% recall, −90% latency.]
FIG 2.1 · MUVERA projects variable-length multi-vector sets into a single fixed-dim vector that preserves MaxSim-style relevance under inner product.

MUVERA's mechanism is worth stating plainly: it projects a variable-length multi-vector set into a single fixed-dimensional vector compatible with any standard ANN index, preserving MaxSim-style relevance under inner product. The breakthrough was not the 2024 paper itself — it was the 2025–2026 DB integration that turned the model-side choice into a configuration toggle. Weaviate 1.31 exposes it as a first-class Encoding.muvera(...) option; Qdrant's FastEmbed line treats it as an encoder-side post-processing step; Vespa provides native tensor MaxSim; LanceDB and Milvus are catching up. Before this wave, L-axis retrieval was a tool for medium, static, high-value document corpora. After it, L-axis retrieval is a candidate for large, dynamic corpora — and the C-axis single-vector default starts looking weaker on document-heavy workloads.
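The back-of-envelope arithmetic behind those storage figures, as a sketch (the ~8 KB-per-page post-FDE figure is an assumption chosen to reproduce the ~80 GB snapshot above):

```python
# Back-of-envelope storage arithmetic for the figures quoted above.
pages, tokens_per_page, dim, bytes_fp16 = 10_000_000, 1_030, 128, 2

raw_bytes = pages * tokens_per_page * dim * bytes_fp16
print(f"raw multi-vector: {raw_bytes / 1e12:.1f} TB")          # ~2.6 TB

fde_bytes_per_page = 8_192        # assumed ~8 KB/page after FDE + PQ-256-8
fde_bytes = pages * fde_bytes_per_page
print(f"FDE + PQ: {fde_bytes / 1e9:.0f} GB (~{raw_bytes / fde_bytes:.0f}x smaller)")
```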

The 2026 practitioner defaults fall out cleanly. Natural-image retrieval (photos, social content, products) stays single-vector; late interaction offers no material gain and still costs ~30× storage even post-MUVERA. Visual-document retrieval (PDFs, slides, UI screenshots, scans, charts, diagrams) is now decisively late-interaction; ColQwen3-7B or Nemotron-ColEmbed-V2-8B is mainstream where it was niche 18 months ago. Mixed corpora (a web page with both a photo and a table) demand a two-track index — single-vector for the photo, multi-vector for the document segments. No single model handles both optimally, and MUVERA does not change that — it just makes the different problem affordable at scale.

[Figure: the Era-3 production cascade — query (text / image / hybrid + instruction) → Stage 1 · C, single-vector first stage (SigLIP 2 · jina-clip-v2 · MetaCLIP 2, top-10K, cheap and broad) → Stage 2 · L, late interaction + MUVERA (ColQwen3 · Nemotron-ColEmbed-V2, top-100) → Stage 3 · I, reasoner/reranker, +7.5 NDCG@10 (Qwen3-VL-Embedding · TTE · MLLM-as-judge, top-10, expensive and precise) → answer with citations.]
FIG 2.2 · The triangle is not a choice — it is a stack. Each stage narrows candidates while increasing per-item compute.
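As control flow, the cascade in Fig 2.2 is three narrowing stages. The sketch below stubs out the stage implementations, which in practice are whatever first-stage encoder, multi-vector index, and reranker a deployment uses:

```python
# Skeleton of the C→L→I cascade: each stage narrows the candidate set while
# spending more compute per item. Stage implementations are stubs / assumptions,
# not any particular vendor's API.
from typing import Callable, Sequence

def cascade(
    query: str,
    stage_c: Callable[[str, int], Sequence[str]],                  # cheap single-vector ANN
    stage_l: Callable[[str, Sequence[str], int], Sequence[str]],   # MaxSim over doc-like hits
    stage_i: Callable[[str, Sequence[str], int], Sequence[str]],   # reasoning reranker
) -> Sequence[str]:
    top_10k = stage_c(query, 10_000)          # C: broad, cheap recall
    top_100 = stage_l(query, top_10k, 100)    # L: late interaction over candidates
    top_10 = stage_i(query, top_100, 10)      # I: expensive per-item reasoning
    return top_10
```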

Part III · Data, Instruction, Benchmarks: Why the Numbers Lie

This section merges three debates that are usually treated separately but are in fact one: how do we know a model is good, and what does "good" reward? Data, instruction conditioning, and benchmark design are the three knobs through which a paper's headline number gets manufactured, and all three are now contaminated enough that a single score without provenance is indistinguishable from a marketing claim.

3.1 The data stratification, and its ceiling

Every paper since 2024 attributes >80% of gains to data engineering. This is correct in a narrow regime and overclaimed everywhere else. The decomposition that holds across regimes:

| Regime | Dominant driver | Representative evidence |
|---|---|---|
| < 500M image-text pairs | Architecture | SigLIP sigmoid loss, EVA-CLIP masking, FLIP token drop — each delivers 3–8 pt gains regardless of data volume |
| 500M – 5B pairs | Data cleaning + re-captioning | VeCLIP / DreamLIP / LaCLIP / DAC / DFN / DCI / Long-CLIP: alt-text → VLM caption lifts R@1 by 5–10 pts |
| 5B – ~50B pairs, decent captions | Synthetic hard negatives + task mixtures | MegaPairs / mmE5 / Modality Curation / B3 — data shape (mixture across modalities and tasks) beats data volume |
| > 50B tokens, rich task formats | Inference-time techniques | Seed1.6 / Qwen3-VL territory — instruction engineering, MLLM-as-judge hard-negative mining (UniME-V2), reasoning augmentation (TTE), prompt routing |

This resolves the apparent contradiction between "scale is all you need" (DataComp, LAION) and "data quality is all you need" (DFN, Seed1.6). Both are right at different altitudes. The MegaPairs → mmE5 → Modality-Curation arc is the one that matters for 2026 practitioners: MegaPairs (BAAI, Dec 2024) synthesized hundreds of millions of image-image-text triplets with no human labelers, ending the "labelling-bounded" myth; mmE5 (Renmin U. + Microsoft, Feb 2025) showed multilinguality is nearly free if you synthesize; Modality Curation (NEU + Kuaishou, May 2025) showed batch composition matters, yielding free MMEB points. The collective — and rarely stated — implication: the data pipeline has replaced the architecture as the primary IP of an embedding lab. "Reproduce from our weights and code" has quietly become the gold standard, because "reproduce from scratch" is no longer feasible outside well-funded labs.

The ceiling nobody quite admits: once the 78-task MMEB mixture is saturated with synthetic + mined pairs, the marginal point gets expensive enough that inference-time compute (reasoning, MLLM-as-judge rerank, test-time retrieval) is now cheaper per point than more training data. This is the regime TTE, UniME-V2, AutoThinkRAG exploit. Expect the 2026–2027 paper balance to tilt back toward architecture + inference, away from data volume — a full cycle from 2022.

3.2 Instruction conditioning: product feature disguised as modeling technique

Every competitive 2025–2026 model exposes an instruction field. MMEB-V2 bakes 78-task instructions into the eval. The quiet, ugly consequence: the model learns to branch on the instruction, and leaderboard numbers start reflecting how well it switches tasks rather than how well it represents content. Strip the instruction and many 2026 SOTA models drop several points on zero-shot — a gap that rarely appears in press releases. Small prompt rewordings can move NDCG@5 by several points; this is under-reported. Three observations that practitioners keep relearning the hard way: instructions help only if your workload actually provides them; zero-shot and instructed retrieval are different skills (MagicLens, UniME, ColPali do both well; heavily-instructed models lean on the prompt branch); style drift is a production risk. Under C–L–I, instruction conditioning is a pure I-axis lever — valuable where your workload has task heterogeneity, a net loss where it does not.
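At the string level the I-axis lever is nothing more than query formatting. The sketch below is illustrative only; the "Instruct:/Query:" template is an assumption — every model card specifies its own, and small rewordings can move retrieval metrics:

```python
# Illustrative only: what instruction conditioning looks like at the string level.
# The "Instruct:/Query:" template is an assumption, not any specific model's spec.
def format_query(query: str, task_instruction: str | None = None) -> str:
    if task_instruction is None:
        return query                                        # zero-shot path
    return f"Instruct: {task_instruction}\nQuery: {query}"  # instructed path

instructed = format_query(
    "red trail-running shoes with a rock plate",
    "Retrieve the product image that matches the shopping query.",
)
zero_shot = format_query("red trail-running shoes with a rock plate")
# The two strings can land in different retrieval subspaces — that is the gap
# the paragraph above warns about.
```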

3.3 Benchmarks: saturation, contamination, and what a score actually means

From MMEB-V1 (Nov 2024) to MMEB-V2 (Apr 2026), the top score moved explosively in the first year and is now saturating. Three symptoms. Per-task ceiling effects — on the easier MMEB-V2 tasks, the top-20 models are within 1–2 points, and further progress is noise. Instruction-format overfitting — several submissions gained points by tuning the MMEB task-prompt formatting without retraining the encoder. Data leakage — the current generation is trained on corpora that plausibly include MMEB-related train splits (MMEB reuses M-BEIR, UniIR, and public retrieval sets). Not fraud; the normal hygiene problem every benchmark eventually faces (ImageNet, GLUE, SuperGLUE). But in a field moving this fast, leaderboards overstate generalization; an external production deployment typically sees the public MMEB-V2 score minus 3–5 points.

Under C–L–I, the differences between benchmarks stop being mysterious:

| Benchmark | Primary axis | Winning architecture (Apr 2026) |
|---|---|---|
| MMEB-V2 (TIGER-Lab) | I (instruction + composition + task-switching) | Large MLLM + rich instruction data (Qwen3-VL-Embedding-8B, Seed1.6-Embedding) |
| MIEB (MTEB team) | Balanced C (encoder quality across 8 categories) | Balanced MLLMs + strong classical encoders |
| ViDoRe V3 (Illuin) | L (localization-heavy visual-doc retrieval) | Late interaction (Nemotron-ColEmbed-V2, ColQwen3) |
| M-BEIR (TIGER-Lab) | I (instruction-aware multimodal retrieval) | Reasoning-augmented retrievers (TTE, RAR) |
| MTEB v2 (MTEB team) | C on text, partial I | Text-only giants (Harrier-OSS-27B, KaLM-Embedding-Gemma3-12B) |
| CoIR (ACL 2025) | Code retrieval C | SFR-Embedding-Code-2B_R (≈ 67.4 aggregate, per model card) |

Cross-leaderboard snapshot — April 2026

Every number is a single-configuration snapshot from a primary source (leaderboard CSV, model card, or technical report). They move week-to-week. Gaps <2 points are noise.

| Model | MMEB-V2 | MTEB v2 (ML) | ViDoRe V3 | License | Axis |
|---|---|---|---|---|---|
| Qwen3-VL-Embedding-8B | 77.82 (CSV #1) | 67.9 | — | Apache-2.0 | C+I |
| Seed1.6-Embedding-1215 | 75.08 (CSV #2) | — | — | Closed API | C+I |
| IFM-TTE-7B (Think-Then-Embed) | 73.06 / 74.1 (CSV / paper) | — | — | Open | I+reason |
| RzenEmbed-v2-7B | 71.12 / ~72.9 | — | — | Weights on HF | C+I |
| UniME-V2 (LLaVA-OV-7B) | 71.2 / 59.12 (paper / CSV) | — | — | Open | I |
| VLM2Vec-V2-2B | 58–59 | — | — | Apache-2.0 | C+I |
| GME-Qwen2-VL-7B | 57.8 | — | — | Apache-2.0 | C+I |
| Nemotron-ColEmbed-V2-8B | — | — | 63.42 (#1) | Open | L |
| Nemotron-ColEmbed-V2-4B | — | — | 61.54 | Open | L |
| Harrier-OSS-v1-27B | — | 74.3 | — | Apache-2.0 | C (text) |
| KaLM-Embedding-Gemma3-12B | — | 72.32 | — | MIT | C (text, ML) |
| Gemini-Embedding-001 | — | 68.37 | — | Closed API | C (breadth) |

Why two MMEB-V2 columns for some rows? The leaderboard CSV penalizes models that skip subsets more than the authors' own Table numbers do. For a fair within-column comparison, pick the CSV. For a paper-claim comparison, pick the Table. Always name which. Corrections to community folklore: RzenEmbed and UniME-V2 do not sit at MMEB-V2 ~77; TTE is mid-70s, not 76; the 76.9 sometimes quoted for Seed1.6 is the Qwen3 paper's 78-task recomputation, not a MIEB ranking. Never quote a single number without naming the axis, the CSV-vs-paper distinction, and whether the evaluation used held-out splits. This is not pedantry — it is the only defense against a saturating benchmark.


Part IV · Deployment Economics

Leaderboards never show the numbers that dominate a production budget. What follows is a single-configuration baseline on one H100 SXM, fp16, batch tuned per model, CUDA 12.x, realistic image distribution. Treat ±20% as noise.

| Model | Axis | VRAM | Index 10M imgs | p99 latency | Storage (10M × vec) |
|---|---|---|---|---|---|
| SigLIP 2 SO400M-L | C | ~18 GB | ~6 h | ~8 ms | ~23 GB (1152d) |
| MetaCLIP 2 G/14 | C | ~40 GB | ~16 h | ~15 ms | ~26 GB (1280d) |
| jina-clip-v2 | C | ~24 GB | ~8 h | ~12 ms | ~20 GB (1024d) |
| GME-Qwen2-VL-2B | C+I | ~36 GB | ~40 h | ~55 ms | ~31 GB (1536d) |
| GME-Qwen2-VL-7B v2 | C+I | ~60 GB | ~5 d | ~120 ms | ~72 GB (3584d) |
| Qwen3-VL-Embedding-8B | C+I | ~60 GB | ~6–8 d | ~140 ms | ~72 GB (3584d) |
| ColPali v1.3 | L | ~22 GB | ~10 h | ~30 ms + MaxSim | ~2.6 TB (raw) |
| ColQwen3-7B | L | ~48 GB | ~6 d | ~110 ms + MaxSim | ~2.6 TB (raw) |
| Nemotron-ColEmbed-V2-8B | L | ~56 GB | ~7 d | ~130 ms + MaxSim | ~2.6 TB (raw) |
| + MUVERA / FDE | L | same | same | single inner product | ~80 GB (~32× smaller) |

Three rules of thumb. Every 2× in encoder parameters → roughly 2.5–3× in indexing and 1.5–2× in query latency — MLLMs pay in dynamic resolution and variable-length attention. Multi-vector without compression ≈ 1000× storage vs single-vector; MUVERA cuts this to ~30×, which is the regime where hybrid C+L stacks become a default rather than an exotic choice. Query latency in MLLM encoders is decode-bound — they are still autoregressively producing special embedding tokens, and vLLM/SGLang-style batched decoding offers several-× headroom that most stacks do not yet use. This is the first target of 2026-H2 inference-infra work (see §VI-Near).

Matryoshka + binary is not a free lunch. The marketing claim — "cut storage 64× with <1% quality loss" — holds only on the easiest workloads. MRL truncation (3584 → 512) costs ~1–2% on easy tasks and materially more on compositional / multi-entity / visual-document workloads; 1-bit binary quantization adds percentage points more. Published combined-loss-under-5% results typically come from classification-style probes, not retrieval on hard heterogeneous data. The way to recover the marketing claim is not to deny it — it is binary recall + full-precision rerank: retrieve the top-100 with MRL + binary, re-score those 100 with fp16. Hybrid Recall@10 sits within ~1% of full precision at ≥32× smaller storage. Matryoshka + binary is a first-stage technique, not a full-stack technique; if you are not reranking, do not binarise.
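A minimal brute-force sketch of that pattern, assuming pre-computed corpus embeddings; a real deployment would put the binary codes behind an ANN index rather than scanning them:

```python
# MRL truncation + 1-bit binarization for cheap recall, then fp16 re-scoring of
# the survivors. Brute force for clarity; not tied to any index library.
import numpy as np

def binarize(x: np.ndarray) -> np.ndarray:
    return (x > 0).astype(np.uint8)                    # 1-bit sign quantization

def search(query: np.ndarray, corpus: np.ndarray, mrl_dim: int = 512,
           k_coarse: int = 100, k_final: int = 10) -> np.ndarray:
    qb, cb = binarize(query[:mrl_dim]), binarize(corpus[:, :mrl_dim])
    hamming = (qb[None, :] != cb).sum(axis=1)          # cheap binary first stage
    cand = np.argsort(hamming)[:k_coarse]
    exact = corpus[cand] @ query                       # full-precision rerank
    return cand[np.argsort(-exact)[:k_final]]

corpus = np.random.randn(100_000, 3584).astype(np.float32)   # stand-in embeddings
hits = search(np.random.randn(3584).astype(np.float32), corpus)
```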

What breaks when you put ColPali in production. Storage explosion is mitigated by MUVERA but the compressed index still costs ~30× a single-vector corpus — budget accordingly. Serving MaxSim is the DB question, not the model question: Qdrant ships native multivector + MaxSim and is the default for ColPali/ColQwen deployments; Weaviate 1.31 adds native MUVERA via Encoding.muvera(...); Vespa exposes tensor MaxSim; LanceDB has multi-vector columns with client-side MaxSim; Milvus 2.5 supports multi-vector fields with MaxSim as a group_by workaround, with 3.0 on the roadmap; FAISS stays client-side. Pick the DB on this feature, not on headline benchmarks. Freshness is hard — each document is thousands of inserts, and bulk-reindex cadences that worked for single-vector break here, although MUVERA's compressed encoding updates incrementally like a normal vector. Chart and screenshot brittleness dominates model choice on low-DPI inputs; input preprocessing matters more than which ColQwen you picked. And MaxSim at high QPS can become the bottleneck, not the encoder — MUVERA's ~90% latency reduction on the BEIR subsets is the only reason this is tractable at scale.
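For concreteness, creating a ColPali-style multi-vector collection in Qdrant looks roughly like the sketch below; the collection name and sizes are placeholders, and the config field names should be verified against the qdrant-client version you deploy:

```python
# Sketch of a ColPali-style multi-vector collection in Qdrant. Collection name
# and sizes are placeholders; verify the config fields against your qdrant-client
# version before relying on this.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="visual_docs",
    vectors_config=models.VectorParams(
        size=128,                                   # per-token embedding dimension
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM   # MaxSim at query time
        ),
    ),
)
# Each point then stores one vector per page patch (~1,030 vectors per page).
```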

CAUTION · CHOOSE ON ECONOMICS

ColPali is the right answer for medium-sized, high-value, static or slow-changing document collections. Post-MUVERA, it is also defensible for large, dynamic corpora. For casual web-scale PDFs, ticket attachments, low-value user uploads, an OCR + text-embedding pipeline is often cheaper end-to-end. Choose architecture on corpus economics, not benchmark scores.


Part V · Open Weights, Closed APIs, and the Moats That Will Remain

The top of almost every 2026 multimodal leaderboard is populated by Chinese labs — Alibaba (Qwen3-VL, GME, UniME), ByteDance (Seed1.6), Tencent (KaLM, LLaVE), BAAI (MegaPairs, BGE), Ant (M2-Encoder, GroupRank), Kuaishou (Modality Curation, UniECS), Xiaohongshu (LamRA), Qihoo 360 (RzenEmbed). This is structural, not cyclical. Three reasons, stated without hyperbole. Strong open-weights MLLM bases in the 2–14B class — Qwen2.5-VL, Qwen3-VL, InternVL3, GLM-4.1V, DeepSeek-VL2 are strong open starting points, and embedding quality inherits directly from base quality. Vertical integration of data pipelines — ByteDance, Alibaba, Kuaishou, Xiaohongshu operate among the world's largest image-text interactive corpora; "data > architecture" structurally favors them. Incentive asymmetry — Chinese labs tend to publish open weights to stake international-reputation claims, while several Western labs gate their best models behind APIs. Public leaderboards therefore systematically over-represent open-weights Chinese output and undercount closed Western models (OpenAI, Anthropic, Cohere) that do not submit.

The narrative has to be stated carefully. The Western open-weights scene is not empty. Meta released Llama 4 (Scout / Maverick) as natively multimodal open weights in April 2025, and Llama 3.2 Vision earlier; a Llama-4-based embedder is the obvious next step. LLaVA-OneVision-1.5 (arXiv 2509.23661) competes with Qwen2.5-VL on several benchmarks. Microsoft's Harrier-OSS-27B tops text MTEB v2 at 74.3. Meta FAIR / NVIDIA / Google DeepMind still lead on narrow fronts (Perception Encoder, SigLIP 2, PS3, Nemotron-ColEmbed-V2). Salesforce + Waterloo (VLM2Vec), IBM (DAC), Apple (DFN, VeCLIP), Jina AI, Snowflake (Arctic-Embed 2.0), Cohere (Embed v4), Voyage (voyage-multimodal-3), Nomic all ship production-grade multimodal or text embedders outside China. The honest summary: the 2–14B open-weights multimodal embedding space is disproportionately Chinese, and the production-API breadth space is disproportionately Western. Neither side dominates the other's end of that spectrum.

Where closed APIs will keep their moats into 2027–2028 is the set of modalities that are data-gated rather than compute-gated. Audio embeddings and unified T+I+V+Audio+PDF: Gemini Embedding is the only credible unified multimodal product as of April 2026 — the experimental gemini-embedding-exp-03-07 reaches mean-task ≈ 68.32 on the Multilingual MTEB leaderboard, and no open-weights peer matches its modality breadth. Long-context PDFs (>200 pages, mixed text + figures) — Gemini and Voyage-multimodal-3 have bespoke tokenizers; open weights lag. 3D / CAD / point clouds — almost entirely absent from the open ecosystem. Tabular + image joint retrieval — Cohere Embed v4 handles text + images + mixed PDFs; truly native tabular encoding is rare on both sides. These are smaller markets than photo search, but they are where the closed-API moat compounds, because the training corpora themselves are not open-sourceable.

IMPORTANT · GEOGRAPHY OF MOATS

The Chinese-dominance narrative is half structural, half benchmark design. If Anthropic submitted a Claude multimodal embedder, or OpenAI submitted text-embedding-3-multimodal, the top of several leaderboards would shuffle. They haven't, because they don't compete on public benchmarks. If your use case involves audio, long PDFs, 3D, or tabular+visual joint retrieval, do not wait for open weights to catch up — they may not before 2028.


Part VI · Forecast 2026 → 2030 · Near, Mid, Far

Forecasts are only useful if they are falsifiable. Each bet below names a mechanism, a signal to track, and a falsifier. Probabilities are subjective and calibrated against the author's track record in adjacent fields. Brier-score them in 2027, 2028, 2030. Many bets in the far tier are load-bearing for the thesis of the guide itself: if the far tier is wrong, the C–L–I frame is probably over-stated.

[Figure: three horizons of falsifiable bets — Near (2026 Q3 – 2027 Q4: bidirectional CPT, MUVERA default, Embed-vLLM), Mid (2027 – 2028 Q4: trust-tier eval, long-context ⊃ RAG, retrieval → tool), Far (2029 – 2030+: agent memory, living evals, multi-facet plumbing).]
FIG 6.1 · Near = plumbing; Mid = structural; Far = paradigm.
Horizon · Near · 12–18 months · 2026 Q3 – 2027 Q4

Where compute moves and which plumbing ships

This horizon is about where compute moves and which plumbing ships. Bets here are dominated by mechanisms already demonstrated in papers and awaiting productization.

| # | Bet | P(true) | Signal / Falsifier |
|---|---|---|---|
| N1 | Bidirectional continual pretraining is standard for top-5 MLLM embedders | ~75% | At least one of {Qwen3-VL-Embedding v2, InternVL3-Embedding, Seed1.7-Embedding} ships with bidirectional continual or MLM-MAE pretext language in the card. Falsifier: all stay purely causal and still win by >2 pts |
| N2 | Test-time compute is allocated at rerank, not query expansion, not retriever-internal CoT | ~85% | LlamaIndex / Haystack / DSPy default templates read "single-vector first + multi-vector middle + reasoning rerank on top-k". Falsifier: a published system shows >5 NDCG@10 gain from retriever-internal CoT on a held-out bench |
| N3 | MUVERA-class FDE is the default for new multi-vector deployments on the major DBs | ~70% | Weaviate / Qdrant / Vespa / LanceDB default deployment docs recommend FDE. Falsifier: <30% adoption in late-interaction workloads by end-2026 |
| N4 | An open-source Embed-vLLM / SGLang-embed stack yields ≥3× throughput on Qwen3-VL-Embedding vs HF Transformers | ~65% | Public release with reproducible benchmark. Falsifier: none by mid-2027 |
| N5 | Reasoning-augmented embedders (TTE / UniME-V2 family) hold ≥2 top-5 MMEB-V2 slots | ~70% | MMEB-V2 CSV top-5 in Dec 2026. Falsifier: ≤1 TTE-style entry |
| N6 | A first alpha of a consolidated multimodal benchmark (MTEB-MM or equivalent) lands, with held-out splits and C/L/I scorecard | ~55% | MTEB team RFC. Falsifier: nothing by Q2 2027 |
| N7 | Llama-4-based or Claude-Haiku-based open-weights embedder enters MMEB-V2 top-10 | ~55% | Specifically, a Western 8–14B open-weights VLM-based embedder appears in the top-10. Falsifier: no Western open-weights entry in the top-10 through end-2026 |

The load-bearing implication of the near tier is that the 2026 recipe (causal MLLM + bidirectional continual + reasoning rerank + MUVERA) crystallizes as the default rather than an edge configuration. After that happens, further gains on MMEB-V2 are bounded by benchmark hygiene, not by model capability.

Horizon · Mid · 2–3 years · 2027 – 2028 Q4

What replaces the leaderboard, what squeezes out of long-context LLMs

This horizon is about what replaces the leaderboard and what squeezes out of long-context LLMs. Bets here are structural, not model-level.

M1 · Public multimodal benchmarks bifurcate into a "trust tier" (held-out, private split, C/L/I scorecard) and a "marketing tier" (public static leaderboards). Mechanism: saturation + contamination + instruction-format gaming have crossed the threshold where a score is uninterpretable without provenance; institutional buyers (cloud vendors, regulated industries) pay for trusted eval. P ≈ 65%. Falsifier: public MMEB / MTEB remain the referenced source for enterprise RFPs through 2028.

M2 · Long-context LLMs (>4M tokens) absorb a meaningful slice of "retrieval" tasks that are currently RAG. Mechanism: the point where dumping a 500-page document into context beats retrieving three chunks has moved from "theoretical" to "Gemini 2.5 in production" — the remaining economic constraint is inference cost, which halves roughly every 9–12 months. P ≈ 55% that >20% of current small-corpus RAG workloads migrate to in-context by end-2028. Falsifier: token-per-dollar inference cost fails to drop by 4× vs April 2026.

M3 · The line between "retrieval" and "tool use" dissolves at the agent layer. Mechanism: MCP-style tool protocols already treat "search the KB" as one capability among many; reasoning-augmented retrievers (§II.1) already run inference-time logic that is indistinguishable from a tool call. P ≈ 70% that the default agent framework by 2028 treats retrieval as a capability rather than a separable pipeline stage. Falsifier: standalone RAG frameworks (LlamaIndex, Haystack) continue as dominant paradigm and agent frameworks stay separate.

M4 · A unified open-weights T + I + V + Audio embedder reaches within 2 pts of Gemini Embedding's multilingual MTEB score on held-out splits. Mechanism: the data gap closes slowly but the model-side gap is one pretraining run. P ≈ 45% by end-2028. Falsifier: no open-weights model matches Gemini on any unified eval by 2028.

M5 · Embodied retrieval — for robots, AR glasses, and offline agents — becomes a distinct sub-field. Mechanism: the scene in front of a wearable device is a new kind of "corpus" (spatially indexed, time-streamed, identity-bound), and existing embedding stacks do not model the spatial or temporal structure. Early signals: Meta Aria, visual SLAM + VLM hybrids, Twelve Labs' video work. P ≈ 60% that a benchmark for this lands by 2028. Falsifier: no such benchmark exists and the use case remains niche research.

M6 · "Retrieval as preference alignment" — the retriever itself is RLHF'd on human or LLM judge signal, not just contrastive-trained. Mechanism: MLLM-as-judge in UniME-V2 is already a weak version; the next step is closing the loop with user-interaction data. Large consumer platforms (search, social, e-commerce) have the data moat; small labs do not. P ≈ 70% for at least one major retrieval paper per year in this direction starting 2027. Falsifier: contrastive InfoNCE remains the de-facto training objective across all top-10 papers through 2028.

M7 · Chip and energy geography splits the embedding stack into two non-interoperable lineages. Mechanism: export controls, national-AI policies, and domestic chip supply lines (Ascend / Cambricon vs NVIDIA) combine with sovereign data rules (EU AI Act, China's generative-AI provisions) to make cross-border deployment of certain embedder + corpus combinations legally or operationally infeasible. P ≈ 55% that by end-2028 a "sovereign-embedding" tier of vendors explicitly markets on compliance + locality. Falsifier: one global open-weights stack continues to dominate without legal fragmentation.

Horizon · Far · 3–5+ years · 2029 – 2030+

Where the C–L–I frame itself comes under pressure

This horizon is where the C–L–I frame itself comes under pressure. Far-tier bets are deliberately large-scope — falsifying them would reshape the field, not just reorder a leaderboard.

F1 · Embedding as a standalone artifact enters its late phase; "retrieval" becomes a subset of agent memory. Mechanism: four convergent trends kill the standalone embedder for new projects while leaving installed fleets alive for years. First, long-context LLMs swallow small-corpus RAG. Second, reasoning-augmented retrievers push the embedder into the reasoner's latent state — Think-Then-Embed generalizes to "embedding is a function of the reasoner's working memory at the moment of retrieval", not of the document alone. Third, agent memory architectures (episodic + semantic + procedural, à la Voyager, MemGPT, Letta) treat dense vectors as one substrate among several — structured KV, symbolic indices, code-as-memory. Fourth, MLLM-as-judge reranking eats the top of the quality distribution, leaving the embedder as a cheap candidate-gen layer. The practical effect: by 2030, greenfield retrieval projects will look like agent + tool registry + hybrid memory rather than encoder + ANN + reranker; "embedding model" becomes a commodity inside the agent, not the center of the stack. P ≈ 60%. Falsifier: standalone embedding APIs (OpenAI, Cohere, Voyage, Jina) grow their embedding-specific revenue share through 2030.

F2 · The CLIP / SigLIP / ColPali generation becomes "classic layer" infrastructure — paid for, under-maintained, everywhere. Mechanism: even as the frontier moves to agent memory, the installed base of vector indices, product search engines, recommendation pipelines, and academic benchmarks running on single-vector and ColPali encoders is measured in exabytes of persisted vectors. Migration cost is gigantic. Every previous ML generation (word2vec, ELMo, BERT base, ResNet-50) survived a decade past its SOTA obsolescence in production; this one will too. P ≈ 85%. Falsifier: a migration event (a major cloud deprecating CLIP-era encoders, a forced standardization) wipes the installed base by 2030.

F3 · Evaluation moves from static public leaderboards to living benchmarks and private evals; publication norms follow. Mechanism: (M1) plus the gradual realization that any public test set, once named in a tweet, is compromised within one training cycle. Living benchmarks — dynamically generated queries against held-out corpora, with periodic rotation — become the only reputationally defensible eval. Academic papers start citing distributions of scores rather than single numbers. P ≈ 55%. Falsifier: MMEB-V4 / MTEB v3 dominate publications in 2029 on a static-leaderboard model.

F4 · Governance and auditability become first-class features of embedding stacks. Mechanism: embeddings leak memorized content (membership inference is easier than on generative LLMs); vector databases hold personal data; multimodal embedders have copyright exposure on training images. Enterprise procurement already asks "is your training data auditable"; regulators will. By 2030, a production embedder ships with a training-data provenance document, a membership-inference bound, and a differential-privacy story, or it does not get bought by regulated industry. P ≈ 70%. Falsifier: no major enterprise procurement in 2029–2030 requires audit trails for embedding models.

F5 · The "one embedding per item" assumption dies for a large class of items. Mechanism: a single-vector representation of "a video", "a codebase", "a person", "a scientific paper" is a dimensional straitjacket. The successor is a bundle of representations (spatial, temporal, semantic, stylistic, metadata) whose relevant subspace is selected dynamically by the querying agent — late interaction generalized to arbitrary attribute axes. MUVERA-style compression extends from "multi-token" to "multi-facet". P ≈ 50%. Falsifier: single-vector remains the dominant format for 90%+ of new production deployments in 2030.

F6 · An embedding-free retrieval architecture beats a dense one on a major benchmark. Mechanism: hybrid stacks of structured indices (code ASTs, scene graphs, symbolic memory), LLM-generated query-specific indices, and neural attention over raw corpora — rather than pre-computed dense vectors — reclaim the top of some leaderboard. Precedent: SPLADE in text showed sparse can beat dense; the multimodal analogue has not landed but is plausible. P ≈ 40%. Falsifier: every top-5 entry on every major retrieval benchmark through 2030 uses pre-computed dense vectors as the primary primitive.

F7 · Embedding geography bifurcates into distinct technical lineages, not just distinct vendors. Mechanism: M7 compounds over five years. By 2030, the Western lineage optimizes for API breadth + governance; the Chinese lineage optimizes for open-weights scale + domestic deployment; a third lineage (EU / sovereign clouds) optimizes for auditability + privacy. Interoperability exists at the interface level (vectors are vectors) but not at the training-data or compliance level. P ≈ 45%. Falsifier: a single global embedding stack (most likely a hyperscaler's) dominates all three geographies in 2030.

F8 · Human-in-the-loop retrieval is the dominant consumer paradigm; vector similarity is invisible to the user. Mechanism: the "search bar returns ten links" UX is already eroding — Perplexity, ChatGPT Search, Google AI Mode replace it with a conversational reformulation loop. In that loop, embedding is a background primitive, not a product surface; the product is the conversation. P ≈ 75%. Falsifier: classical search UX (query → ranked list) remains the dominant consumer modality in 2030.

Bets I am not making

For calibration, three non-bets for the 12–24 month window. No single MLLM encoder will "rule them all" — the C–L–I triangle is real and stacks will hybridize, not collapse. No >82 MMEB-V2 model will win on all tasks equally — saturation plus contamination make this a diminishing-returns problem; specialised models will win specialised axes. Open-weights audio-text embeddings will not reach Gemini Embedding parity by end-2027 — the data gap is too wide.

Where the field is actually stuck — the open problems that would move the far tier

If you want a paper that is read in 2028 rather than filed, pick one of these and actually solve it. These are not "gaps" — they are unsolved problems whose solutions would change the probabilities in the far-tier table above.

  1. Long-video retrieval (>30 min). R@1 on LongVideoBench-Retrieval is materially below short-video baselines. Temporal reasoning, scene-change detection, efficient video-token compression are all open. Closed APIs (Twelve Labs Marengo-2.7, Pegasus-1.2) are partial answers.
  2. Fine-grained visual grounding at retrieval time. ColPali localizes within a document; nothing localizes well within a photo or a long video frame. Snappy (arXiv 2512.02660) is a partial answer via patch-to-region relevance propagation.
  3. Calibration across modalities. Cosine similarity of T-T, I-I, and T-I pairs is not on the same scale even after training, so thresholds must be set per modality; no paper formalizes this (a minimal calibration sketch follows this list).
  4. Incremental / streaming multi-vector indexing. MUVERA helps, but "add 10K pages to a 10M collection" is still thousands of inserts per document and requires periodic recentering.
  5. Compositional generalization evaluation. ARO, MMVP, SPEC are small and somewhat adversarial; a real-world compositional-generalization benchmark would cost more to build than any single lab has funded.
  6. Embodied / spatial retrieval. No consensus representation for "the scene in front of me, indexed by location and time".
  7. Preference-aligned retrievers. Contrastive InfoNCE is a weak proxy for user utility. Replacing it with a judge-in-the-loop or RL-from-interaction objective at scale is open.
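On item 3, the common production workaround is per-modality threshold calibration rather than a principled fix. A minimal sketch, assuming a small labeled validation set of pairs per modality combination:

```python
# Per-modality threshold calibration: fit a separate cosine-similarity cutoff per
# pair type (text-text, image-image, text-image) on a labeled validation set,
# instead of assuming one global threshold.
import numpy as np

def calibrate_threshold(sims: np.ndarray, labels: np.ndarray) -> float:
    # Pick the cutoff that maximizes F1 over candidate thresholds.
    best_t, best_f1 = 0.0, -1.0
    for t in np.linspace(sims.min(), sims.max(), 200):
        pred = sims >= t
        tp = float(np.sum(pred & (labels == 1)))
        prec = tp / max(pred.sum(), 1)
        rec = tp / max((labels == 1).sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-9)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# thresholds = {pair: calibrate_threshold(val_sims[pair], val_labels[pair])
#               for pair in ("text-text", "image-image", "text-image")}
```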

Part VII · Reference

7.1 Paper list — Era 1–2 foundations (2021 – mid-2025)

| arXiv | Paper | Abbr. | Affiliation |
|---|---|---|---|
| 2103.00020 | Learning Transferable Visual Models From Natural Language Supervision | CLIP | OpenAI |
| 2202.06767 | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Wukong | Huawei |
| 2205.01917 | CoCa: Contrastive Captioners are Image-Text Foundation Models | CoCa | Google Research |
| 2210.01936 | When and Why Vision-Language Models Behave Like Bags-of-Words, and What to Do About It? | ARO | Stanford |
| 2210.08402 | LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models | LAION-5B | LAION |
| 2211.01335 | Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | Chinese CLIP | Alibaba |
| 2211.06679 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | AltCLIP | BAAI |
| 2212.00794 | Scaling Language-Image Pre-training via Masking | FLIP | Meta FAIR |
| 2212.07143 | Reproducible Scaling Laws for Contrastive Language-Image Learning (CVPR 2023) | OpenCLIP | LAION |
| 2303.15343 | Sigmoid Loss for Language Image Pre-Training | SigLIP | Google |
| 2303.15389 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | EVA-CLIP | BAAI |
| 2304.14108 | DataComp: In Search of the Next Generation of Multimodal Datasets | DataComp | consortium |
| 2305.19595 | Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models | DAC | IBM |
| 2305.20088 | Improving CLIP Training with Language Rewrites | LaCLIP | Google |
| 2309.17425 | Data Filtering Networks | DFN | Apple |
| 2310.07699 | VeCLIP: Improving CLIP Training via Visual-enriched Captions | VeCLIP | Apple |
| 2310.13355 | SILC: Improving Vision Language Pretraining with Self-Distillation | SILC | Google |
| 2311.17136 | UniIR: Training and Benchmarking Universal Multimodal Information Retrievers | UniIR | Waterloo |
| 2312.00081 | Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding | SPEC | Fudan |
| 2312.08578 | Revisiting the Role of Language Priors in Vision-Language Models (DCI) | DCI | Meta FAIR |
| 2401.06209 | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | MMVP | NYU |
| 2401.09865 | Improving Fine-grained Understanding in Image-Text Pre-training | SPARC | DeepMind |
| 2401.15896 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining | M2-Encoder | Ant Group |
| 2402.03216 | BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation | BGE M3 | BAAI |
| 2402.04252 | EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters | EVA-CLIP-18B | BAAI |
| 2403.15378 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | Long-CLIP | Shanghai AI Lab |
| 2403.17007 | DreamLIP: Language-Image Pre-training with Long Captions | DreamLIP | ZJU + Ant + SJTU + USTC + EIT + NEU |
| 2403.19651 | MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions | MagicLens | Google |
| 2404.04125 | No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | — | Tübingen |
| 2405.13777 | No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models | — | DeepMind |
| 2405.16915 | Multilingual Diversity Improves Vision-Language Representations | — | UW |
| 2405.19504 | MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings (NeurIPS 2024) | MUVERA | Google Research |
| 2406.11251 | Unifying Multimodal Retrieval via Document Screenshot Embedding | DSE | Waterloo / JHU |
| 2407.01449 | ColPali: Efficient Document Retrieval with Vision Language Models | ColPali | illuin-tech |
| 2407.01523 | MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | MMLongBench-Doc | HKU / Alibaba |
| 2407.02883 | CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | CoIR | consortium |
| 2407.12580 | E5-V: Universal Embeddings with Multimodal Large Language Models | E5-V | BUAA + Microsoft |
| 2410.05160 | VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | VLM2Vec | Waterloo + Salesforce |
| 2411.01106 | SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding | SV-RAG | — |
| 2411.02571 | MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs | MM-Embed | NVIDIA |
| 2412.01720 | LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant (CVPR 2025) | LamRA | SJTU + Xiaohongshu |
| 2412.04378 | VladVA: Discriminative Fine-tuning of LVLMs | VladVA | Samsung AI |
| 2412.08802 | jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images | jina-clip-v2 | Jina |
| 2412.14475 | MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval | MegaPairs | BAAI |
| 2412.16855 | GME: Improving Universal Multimodal Retrieval by Multimodal LLMs | GME | Alibaba |
| 2502.08468 | mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data | mmE5 | RUC + Microsoft |
| 2502.14786 | SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features | SigLIP 2 | Google |
| 2503.04812 | LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning | LLaVE | Tencent |
| 2503.07891 | Gemini Embedding: Generalizable Embeddings from Gemini | Gemini Embedding | Google |
| 2503.19900 | CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning | CAFe | Meta |
| 2503.19903 | Scaling Vision Pre-Training to 4K Resolution | PS3 | NVIDIA |
| 2504.01017 | Scaling Language-Free Visual Representation Learning | LF-Vision | Meta FAIR |
| 2504.10471 | MIEB: Massive Image Embedding Benchmark (130 tasks, 38 languages) | MIEB | MTEB team |
| 2504.13181 | Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network | Perception Encoder | Meta FAIR |
| 2504.17432 | Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs | UniME | Alibaba |
| 2505.11293 | B3: Breaking the Batch Barrier for Vision Language Model Contrastive Learning | B3 | Duke |
| 2505.11651 | MIRACL-VISION: A Large, Multilingual, Visual Document Retrieval Benchmark | MIRACL-VISION | — |
| 2505.19650 | Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval | Modality Curation | NEU + Kuaishou |
| 2506.04997 | Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings | Light-ColPali | — |
| 2506.05176 | Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models | Qwen3-Embedding | Alibaba |
| 2506.18902 | jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval | jina-embeddings-v4 | Jina |
| 2506.23115 | MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings | MoCa | RUC + Microsoft |
| 2507.04590 | VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents | VLM2Vec-V2 | Waterloo + Salesforce |
| 2507.22062 | MetaCLIP 2: A Worldwide Scaling Recipe | MetaCLIP 2 | Meta FAIR |

7.2 Paper list — Era 3: MLLM-native + reasoning-augmented (Aug 2025 – Apr 2026)

| Reference | Paper / Model | Abbr. | Affiliation |
|---|---|---|---|
| 2508.13843 | UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion | UniECS | Kuaishou |
| 2509.23661 | LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training | LLaVA-OV-1.5 | EvolvingLMMs |
| 2510.05014 | Think-Then-Embed: Generative Context Improves Multimodal Embeddings (ICLR 2026) | TTE | Meta + UCF + NYU |
| 2510.13515 | UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning (AAAI 2026) | UniME-V2 | Alibaba |
| 2510.27350 | RzenEmbed: Towards Comprehensive Multimodal Retrieval | RzenEmbed | Qihoo 360 |
| 2511.21121 | Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval | VisionRAG | Academic |
| 2512.02660 | Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation | Snappy | Academic |
| 2601.04720 | Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking | Qwen3-VL-Embedding | Alibaba Qwen |
| 2602.03442 | A-RAG: Agentic Retrieval-Augmented Generation with Dynamic Tool Use | A-RAG | Academic |
| 2602.03992 | Nemotron-ColEmbed-V2: Scaling Late-Interaction Multimodal Embedders for Visual Document Retrieval | Nemotron-ColEmbed-V2 | NVIDIA |
| 2602.07125 | Reasoning-Augmented Representations for Multimodal Retrieval | RAR | Academic |
| 2603.01471 | CoCoA: Collaborative Attention with EOS-Reconstruction for Bidirectional Multimodal Embedding | CoCoA | ICT/CAS + Baidu |
| 2603.05551 | AutoThinkRAG: Adaptive Complexity-Aware Reasoning for Multimodal Retrieval-Augmented Generation | AutoThinkRAG | Academic |
| 2603.14635 | Compute Allocation for Reasoning-Intensive Retrieval: Where Extra Thinking Actually Helps | ComputeAlloc | Academic |
| HF | llama-embed-nemotron-8B | llama-embed-nemotron | NVIDIA |
| HF | KaLM-Embedding-Gemma3-12B | KaLM-Gemma3 | Tencent |
| HF | Harrier-OSS-v1-27B (MTEB v2 = 74.3) | Harrier-OSS | Microsoft |
| Seed | Seed1.6-Embedding (closed API) | Seed1.6-Embedding | ByteDance |
| Google | Gemini Embedding 2 (closed API, T+I+V+Audio+PDF) | Gemini Embedding 2 | Google |
| HF blog | Nemotron-ColEmbed-V2 (late-interaction) | Nemotron-ColEmbed-V2 | NVIDIA |
| HF | GME-Qwen2-VL-7B-Instruct (v2) | GME-v2 | Alibaba DAMO |
| ViDoRe V2 | ColQwen2.5-multilingual-v1.0 | ColQwen2.5-mling | illuin-tech |
| ViDoRe V3 | ColQwen3-7B | ColQwen3 | illuin-tech |
| HF | SauerkrautLM-ColQwen3-2B | SauerkrautLM-ColQwen3 | VAGO |
| HF | Nomic Embed Multimodal 7B | Nomic-Embed-MM | Nomic AI |
| Cohere | Cohere Embed v4 | Embed v4 | Cohere |
| Voyage | Voyage Multimodal 3 | Voyage-MM-3 | Voyage |
| HF blog | EmbeddingGemma-300M | EmbeddingGemma | Google |
| HF | Snowflake Arctic-Embed 2.0 | Arctic-Embed 2.0 | Snowflake |
| OpenAI | OpenAI text-embedding-3 | OpenAI v3 | OpenAI |
| MMEB | MMEB-V2 Leaderboard | MMEB-V2 | TIGER-Lab |
| MTEB | MTEB v2 Leaderboard | MTEB v2 | MTEB |

7.3 Ecosystem matrix — choose by corpus × modality × freshness × axis

Vector DBs & ANN indexes

The question that matters is "multi-vector + MUVERA", not headline QPS.

| System | Ver. (Apr 26) | Multi-vec / MaxSim | MUVERA / FDE | Quantization | When to pick |
|---|---|---|---|---|---|
| FAISS | 1.11 | Client-side only | No | Full + RaBitQ (new in 1.11) | Embedded libraries, research, <50M items |
| Milvus | 2.5 GA | Multi-vector fields; MaxSim via group_by; native MaxSim on 3.0 roadmap | 3.0 roadmap | Binary / int8 / PQ / GPU-CAGRA | Cloud-native, >100M items, hybrid search |
| Qdrant | 1.14 | Native multivector + MaxSim | FastEmbed post-processing | Binary / SQ / PQ / MRL truncate | Default for ColPali / ColQwen |
| Weaviate | 1.31 | MaxSim rerank + native MUVERA | Yes (Encoding.muvera(...)) | Binary / SQ / PQ / RQ | Generative-search-heavy apps; post-MUVERA ColPali |
| Vespa | 8.4xx | Native tensor MaxSim | Partial (tensor ops) | Full stack | Largest-scale late-interaction (>100M docs) |
| LanceDB | 0.24 | Multi-vector columns + client-side MaxSim | No official | Binary / int8 / PQ | Video / blob-heavy lakehouse |
| pgvector / pgvectorscale | 0.8 / 0.6 | Client-side MaxSim | No | HNSW / IVFFlat / StreamingDiskANN | Postgres-first + >10M vectors |

Rerankers — use one, always

| Reranker | License | Niche |
|---|---|---|
| Cohere Rerank 3.5 | Closed API | Multilingual, long-context, tabular & code |
| Jina Reranker v2-multimodal | Partial open | Cross-modal text ↔ image |
| Voyage rerank-2 / lite | Closed API | BEIR-competitive |
| mxbai-rerank-large-v2 | Apache-2.0 | Strong open-weights generalist |
| BGE reranker v2-m3 / v2-gemma / v2-minicpm-layerwise | Open | Speed/quality knob via layerwise |
| MLLM-as-reranker (GPT-4o/5, Claude 3.5/4, Gemini 2.x, Qwen2.5-VL, InternVL3) | API / local | Top-k visual rerank, RankGPT-style |
| RankLLM | Apache-2.0 | Listwise LLM rerank library |
| Think-Then-Embed / UniME-V2 as reranker | Open | Reasoning-aware rerank — the 2026 new entry |

Multimodal RAG frameworks

| Framework | Highlight |
|---|---|
| LlamaIndex 0.12 / 0.13 | MultiModalVectorStoreIndex, tightest DB integrations |
| Haystack 2.12 | MultiModalTextEmbedder, Cohere / Jina rerankers |
| DSPy 2.6 / 3.0 preview | Multimodal signatures + MIPROv2 / BootstrapFewShot |
| Byaldi 0.0.8+ | ColPali / ColQwen2 / ColSmol wrapper |
| PyLate | Training + inference for late-interaction (MUVERA-ready) |
| RAGatouille | User-friendly ColBERT / PLAID |
| Morphik 1.x | OSS multimodal RAG server; ColPali + VLM routing |
| RAGFlow 0.16+ | Document-centric RAG with OCR/layout + ColPali |
| Pixeltable | Declarative multimodal tables with CLIP / YOLO / VLM UDFs |

Stack recipes · April 2026

Each annotated with the axes it pushes.

  • 100K product images, millions of queries/day. C only · jina-clip-v2 → FAISS IndexHNSWFlat → mxbai-rerank-large-v2 → LLM answer. Skip MLLM encoders and ColPali. (A minimal index sketch for this recipe follows the list.)
  • 100K enterprise PDFs, chart/table/slide retrieval. L+I · ColQwen2.5-multilingual / Nemotron-ColEmbed-V2 → Qdrant multivector (or Weaviate 1.31 with MUVERA) → Claude / GPT-4o/5 rerank → VLM answer. The killer app, now affordable at scale.
  • 100M social-media images, compositional queries. C first + I rerank · SigLIP 2 first stage (MRL + binary) → Qwen3-VL-Embedding-8B rerank on top-1K → VLM answer. Reserve the MLLM encoder for rerank.
  • Heterogeneous instruction-heavy RAG traffic. I dominant · Think-Then-Embed or UniME-V2 reasoning rerank on top of a Qwen3-VL-Embedding / RzenEmbed first stage. Monitor zero-shot vs instructed drift.
  • Long-form videos (≥1 hr). C+L, still open · Marengo-2.7 / Pegasus-1.2 (Twelve Labs) for now; open weights not yet competitive. Re-evaluate Q3 2026.
  • Mixed audio + PDF + image. C breadth · Gemini Embedding is the credible unified option in April 2026. Budget for API spend.
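A minimal sketch of the first recipe's index stage, with random stand-ins for jina-clip-v2 embeddings; the reranker and answer steps are out of scope here:

```python
# Index stage of the first recipe: cosine-style HNSW over L2-normalized vectors.
# Embeddings are random stand-ins for jina-clip-v2 outputs.
import faiss
import numpy as np

d = 1024                                            # jina-clip-v2 vector dimension
corpus = np.random.randn(100_000, d).astype(np.float32)
faiss.normalize_L2(corpus)                          # L2 distance == cosine ranking on unit vectors

index = faiss.IndexHNSWFlat(d, 32)                  # 32 links per HNSW node
index.add(corpus)

def first_stage(query_vec: np.ndarray, k: int = 100) -> np.ndarray:
    q = query_vec.astype(np.float32)[None, :]
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return ids[0]                                   # candidate ids for the reranker

candidates = first_stage(np.random.randn(d))
```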

7.4 Methodology & epistemic hygiene

Primary sources: arXiv papers (linked), Hugging Face model cards, MMEB / MTEB / ViDoRe / MIEB public leaderboards (snapshotted April 2026), official vendor docs. Secondary: author-run benchmarks on an internal H100 SXM node for §IV; single-configuration, ±20% noise. Tertiary: vendor blogs, release notes, conference talks — linked, used only where primary references are unavailable.

Load-bearing vs illustrative. The C–L–I frame and the §VI forecasts are load-bearing — the thesis depends on them. Specific numeric claims about individual models (MMEB scores, latency, storage) are illustrative — accurate to the snapshot, aging within 6–12 months.

| Type | Examples | Epistemic status |
|---|---|---|
| Load-bearing | C–L–I frame; late interaction = within-item localization; saturation + contamination understate generalization by several points; the forecast bets | Stands behind these; expects reasonable aging |
| Directionally correct | MUVERA turns L-axis into a first-class primitive; test-time compute belongs in rerank; 2–14B open-weights multimodal disproportionately Chinese | Strongly held but context-dependent |
| Illustrative / snapshot | Specific benchmark numbers; exact VRAM; DB version numbers | Ages within 6–12 months; re-verify |
| Speculative | Far-tier bets (F1–F8) | Calibrated opinions, not fact |

Known gaps. A guide that hides its gaps is worse than one that names them. As of April 2026, this document does not cover: adversarial robustness (typographic attacks, watermark poisoning, OOD shift — relevant for moderation, T&S, content-auth pipelines); privacy / membership-inference on contrastive models; audio-text embedding in the depth applied to vision; 3D / point-cloud / CAD; on-device / edge embedding; non-search uses of embeddings (RL reward models, moderation, dedup, near-dup); legal / licensing analysis of training-data provenance. If you build in any of these areas, supplement this guide; do not substitute it.

Freshness contract. Embeddings move fast; so do benchmarks. Treat every number older than this edition as stale. Living leaderboards linked in §7.1–7.2 are authoritative; this guide is a frame for reading them, not a replacement.

IMPORTANT · DISCLOSURE

The author works in this field and has production stakes. Readers should discount any claim that conveniently justifies a deployment choice the author has already committed to. The guide aims for a forecasting track record that can be graded publicly (§VI); that is the cleanest correction mechanism we have.


Acknowledgements

Thanks to the authors of every paper, model, and leaderboard linked here. Particular debt to the MTEB / MMEB / MIEB / ViDoRe benchmark teams for transparent leaderboards; the ColBERT / ColPali lineage for making late interaction legible; the MUVERA authors for making it affordable; the LLM2Vec / E5-V / VLM2Vec / GME / Qwen-VL-Embedding / Seed1.6 teams for the MLLM-as-encoder recipe; and the many practitioners whose deployment post-mortems informed §IV.

Errors and opinions are the author's alone.

Citation

@misc{li2026beyondclip,
  author       = {Wei Li},
  title        = {Beyond CLIP: A First-Principles Field Guide to Multimodal Embedding, from CLIP to the Post-Embedding Agent Stack},
  howpublished = {\url{https://github.com/BIGBALLON/BeyondCLIP}},
  year         = {2026},
  note         = {2nd edition, April 2026}
}