Improving Long-Context Retrieval with Multi-Prefix Embedding

Zhenglin Yu¹, Xueguang Ma¹*, Shengyao Zhuang², Zhichao Xu³, Luyu Gao⁴, Crystina Zhang¹, Jimmy Lin¹
¹University of Waterloo, ²University of Queensland, ³University of Utah, ⁴Carnegie Mellon University
*Corresponding author. x93ma@uwaterloo.ca

Abstract

Long-context document retrieval exposes a fundamental tension: single-vector representations compress entire documents into one embedding, inevitably losing fine-grained information, while token-level multi-vector methods preserve detail at prohibitive storage cost. We propose Multi-Prefix Embedding (MPE), which advances the long-context capability of large language model embedding models by partitioning documents into chunks separated by EOS tokens and encoding the entire sequence in a single forward pass. MPE extracts one embedding per prefix position, preserving cross-chunk context through causal attention while enabling fine-grained chunk-level matching via MaxSim scoring. Training MPE requires only document-level relevance labels, sidestepping the need for chunk-level annotations. Comprehensive experiments on MLDR-en, BrowseComp-Plus, and LongEmbed demonstrate the advantage of MPE over competitive single-vector and multi-vector baselines.

Multi-Prefix Embedding

Multi-Prefix Embedding (MPE). A long document is split into chunks and concatenated with EOS tokens into a single sequence. The encoder processes the sequence in one forward pass, and prefix embeddings are extracted at EOS_1, ..., EOS_K. Relevance is computed via MaxSim.

Method

Given a long document with L tokens, MPE partitions it into K consecutive chunks of length s by inserting an EOS token after every s−1 content tokens. The entire sequence is encoded in a single forward pass through a pretrained causal language model. Because attention is left-to-right, the hidden state at each EOS position conditions on all preceding tokens — each prefix embedding captures both its local chunk and all prior context.
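As a minimal sketch of the sequence-construction step (pure Python; the function name is illustrative, not from the paper's code), each chunk holds s−1 content tokens followed by one EOS token, and the EOS positions are where prefix embeddings would later be read out of the encoder's hidden states:

```python
def build_mpe_sequence(tokens, s, eos_id):
    """Partition `tokens` into chunks of s-1 content tokens, each followed
    by an EOS token, so every chunk occupies s positions in the sequence.
    Returns the interleaved sequence and the EOS positions at which
    prefix embeddings are extracted after the single forward pass."""
    seq, eos_positions = [], []
    for i in range(0, len(tokens), s - 1):
        seq.extend(tokens[i : i + s - 1])  # up to s-1 content tokens
        seq.append(eos_id)                 # chunk boundary marker
        eos_positions.append(len(seq) - 1) # hidden state read-out index
    return seq, eos_positions
```

Because the language model is causal, the hidden state at each recorded EOS position summarizes that chunk plus everything before it, which is exactly the prefix-embedding property described above.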

MaxSim Scoring

Query–document similarity is defined as the maximum inner product between the query embedding and any prefix embedding. The max operator reflects the localized nature of relevance: a query typically matches a specific region of a document. We train with standard contrastive loss using cross-device negatives. At search time, all prefix embeddings are indexed in a FAISS flat inner-product index.
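The scoring rule itself reduces to one matrix–vector product and a max. A minimal NumPy sketch (illustrative names; the real system scores against a FAISS flat inner-product index rather than in-memory arrays):

```python
import numpy as np

def maxsim_score(query_emb, prefix_embs):
    """MaxSim relevance: the largest inner product between the query
    embedding (d,) and any of the K prefix embeddings (K, d).
    Also returns the index of the winning chunk, which can serve as
    a source-attribution signal."""
    sims = prefix_embs @ query_emb   # (K,) inner products, one per prefix
    best = int(np.argmax(sims))      # chunk that best matches the query
    return float(sims[best]), best
```

Returning the argmax alongside the score is what makes the later attribution analysis possible: the winning chunk marks where in the document the model believes the relevant region lies.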

Random Prefix-Length Augmentation

Training with a fixed chunk size can overfit the model to a specific granularity. We introduce random prefix-length augmentation: for each training passage, the chunk size is sampled uniformly from [smin, smax]. This exposes the model to diverse prefix boundaries and enables a single model to generalize across evaluation chunk sizes without retraining.
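The augmentation step amounts to one uniform draw per training passage. A sketch with the [32, 1024] range used by MPE-Rand in this paper (function name illustrative):

```python
import random

def sample_chunk_size(s_min=32, s_max=1024, rng=random):
    """Random prefix-length augmentation: draw a per-passage chunk size
    uniformly from [s_min, s_max] (inclusive), so training exposes the
    model to many prefix granularities rather than one fixed size."""
    return rng.randint(s_min, s_max)
```

Each passage in a batch gets its own sampled size before sequence construction, so no single granularity dominates training.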

Configurations

We compare five settings that progressively introduce components:

Setting            | MaxSim | Train | Cross-Chunk Attn | Rand-Size
-------------------|--------|-------|------------------|----------
Single-vector      |   ×    |   ✓   |        ✓         |    ×
MaxP               |   ✓    |   ×   |        ×         |    ×
MaxP-Train         |   ✓    |   ✓   |        ×         |    ×
MPE Fixed-N        |   ✓    |   ✓   |        ✓         |    ×
MPE-Rand[a,b]      |   ✓    |   ✓   |        ✓         |    ✓

Results

Main Results (nDCG@10)

Method              | MLDR-en | BC+   | NarrativeQA | 2WikiMQA | SummScreen | QMSum
--------------------|---------|-------|-------------|----------|------------|------
BM25                | 0.679   | 0.016 | 0.715       | 0.965    | 0.976      | 0.813
jina-v2-base-en     | 0.370   | –     | 0.394       | 0.740    | 0.935      | 0.389
OpenAI-Ada-002      | 0.387   | –     | 0.411       | 0.801    | 0.918      | 0.400
E5-Mistral-7B       | 0.433   | –     | 0.449       | –        | –          | –
BGE-M3              | 0.489   | –     | 0.487       | 0.780    | 0.940      | 0.355
BGE-M3 (multi-vec.) | 0.558   | –     | 0.554       | –        | –          | –
--------------------|---------|-------|-------------|----------|------------|------
Single-vector       | 0.548   | 0.132 | 0.544       | 0.865    | 0.980      | 0.564
MaxP                | 0.758   | 0.103 | 0.604       | 0.944    | 0.908      | 0.550
MaxP-Train          | 0.776   | 0.104 | 0.557       | 0.945    | 0.903      | 0.569
MPE Fixed-64        | 0.783   | 0.122 | 0.576       | 0.940    | 0.925      | 0.652
MPE-Rand[32,1024]   | 0.760   | 0.153 | 0.579       | 0.945    | 0.949      | 0.705

Baseline results from original papers. All our methods (below divider) use Qwen3-Embedding (0.6B); multi-vector methods evaluated at chunk size 64. Bold = best among dense methods (excluding BM25). LongEmbed and BrowseComp-Plus results are zero-shot (trained on MLDR-en only).

Key Findings

  • Multi-vector representations drive the biggest gains. MaxP improves MLDR-en from 0.548 to 0.758 by simply chunking at inference time, demonstrating that multiple embeddings recover fine-grained signals that a single vector compresses away.
  • Cross-chunk context helps on structured documents. MPE Fixed-64 outperforms MaxP-Train on BrowseComp-Plus (0.122 vs. 0.104), suggesting contextualization across chunk boundaries mitigates structural fragmentation.
  • Granularity robustness is critical for transfer. On BrowseComp-Plus, all fixed-granularity multi-vector methods underperform single-vector, but MPE-Rand (0.153) is the only variant to surpass it.

End-to-End Agent Evaluation

MPE-Rand as the retrieval backend in an agentic QA pipeline on BrowseComp-Plus (Gemini 3 Flash Preview agent):

Retriever          | Acc. (%) | Recall (%) | Avg. Calls | Cal. Err. (%)
-------------------|----------|------------|------------|--------------
BM25               | 34.94    | 37.78      | 18.85      | 51.17
Single-vector      | 42.29    | 48.00      | 17.40      | 46.99
MPE-Rand[32,1024]  | 51.45    | 57.70      | 15.78      | 39.74

MPE-Rand improves end-to-end accuracy from 42.29% to 51.45% (+21.7% relative) while requiring fewer search calls.

Granularity Robustness


Granularity mismatch on MLDR-en. Fixed-size MPE specialists (stars = matched train–eval size) degrade sharply at mismatched evaluation granularities, while MPE-Rand traces the upper envelope with a single model.

Fixed-size specialists degrade sharply at mismatched evaluation sizes. MPE-Rand[32,1024] achieves performance comparable to each specialist's peak at the corresponding chunk size, consistently avoiding steep performance drops. This decouples training granularity from evaluation granularity: a single trained model can be deployed at any chunk size — larger chunks for a compact index, finer chunks for higher recall — without retraining.

Analysis: Where Does MaxSim Select?


MaxSim-selected chunk positions vs. Gemini-annotated ground-truth answer positions on MLDR-en (714 passages; 64-token chunks, 128 chunks per document). Left: overlaid position distributions with nearly identical means (chunk 71.9 vs. 71.0). Right: per-passage scatter plot showing strong rank correlation (Spearman ρ = 0.77).

MaxSim localizes the correct chunk within ±1 position for 65.7% of passages, with Spearman ρ = 0.77 (p < 10⁻¹³⁸). This confirms that MPE's improvement comes from preserving and identifying relevant chunks, providing a natural source attribution mechanism: the MaxSim-selected chunk can be directly surfaced to users as supporting evidence.

Citation

If you find this work useful, please cite:

@inproceedings{yu2025mpe,
    title={Improving Long-Context Retrieval with Multi-Prefix Embedding},
    author={Yu, Zhenglin and Ma, Xueguang and Zhuang, Shengyao and Xu, Zhichao and Gao, Luyu and Zhang, Crystina and Lin, Jimmy},
    booktitle={LateInteraction Workshop},
    year={2025}
}