
Prompting-based Methods for Text Ranking Using Large Language Models


Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a wide variety of NLP tasks. Recently, there has been a growing interest in applying LLMs to zero-shot text ranking. This article describes a recent paradigm that uses prompting-based approaches to directly utilize LLMs as rerankers in a multi-stage ranking pipeline.

Text retrieval is a central component of several knowledge-intensive Natural Language Processing (NLP) applications. It refers to the task of identifying and ranking the most relevant documents, passages, sentences, or other information snippets in response to a given user query. The quality of text retrieval plays a crucial role in downstream knowledge-intensive decision-making tasks, such as web search, open-domain question answering, and fact verification, by supplying the factual knowledge these tasks rely on. In large-scale industrial applications, this task is implemented as a multi-stage ranking pipeline composed of a retriever and a reranker. Popular choices for retrievers include BM25, a traditional zero-shot lexical retriever, and Contriever, an unsupervised dense retriever. Pairing either retriever with UPR as the reranker ({BM25, Contriever} + UPR) forms one of the state-of-the-art zero-shot multi-stage ranking pipelines.

Given a corpus $C = \{D_1, D_2, \ldots, D_n\}$ containing a collection of documents and a query $q$, the retriever model efficiently returns a list of $k$ documents from $C$ (where $k \ll n$) that are most relevant to the query $q$, with ranking quality typically evaluated by metrics such as normalized Discounted Cumulative Gain (nDCG) or average precision. The reranker then improves the relevance ordering by reranking the list of $k$ candidates, as measured by either the same or a different metric. The reranker is usually a more effective but computationally more expensive model than the retriever.
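
To make the two-stage structure concrete, here is a minimal sketch of a retrieve-then-rerank pipeline. It assumes the `rank_bm25` package for the first stage; the tiny corpus, the query, and the `rerank` placeholder are purely illustrative and not tied to any specific system discussed in this article.

```python
# A minimal retrieve-then-rerank sketch (requires: pip install rank_bm25).
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a lexical retrieval function based on term statistics.",
    "Dense retrievers embed queries and documents into a shared vector space.",
    "Rerankers reorder a small candidate list produced by a first-stage retriever.",
]
query = "how does a multi-stage ranking pipeline work"

# Stage 1: retrieve the top-k candidates with BM25 (k << corpus size in practice).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
k = 2
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]

# Stage 2: hand the k candidates to a (placeholder) reranker, e.g. an LLM-based one.
def rerank(query, candidates):
    return candidates  # identity placeholder; later sections show LLM-based scoring

reranked = rerank(query, [corpus[i] for i in top_k])
print(reranked)
```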


Zero-shot instructional permutation-based reranking: average results of ChatGPT and GPT-4 on passage reranking benchmarks compared with BM25 and supervised monoT5. Source:[^3]

Large Language Models (LLMs) have demonstrated impressive performance on a wide range of NLP tasks. LLM-based text retrievers excel at contextualizing user queries and documents in natural language, often handling long-form or even conversational inputs. These LLMs have also been adapted for zero-shot and few-shot document ranking tasks through various prompting strategies. Sun et al. showed that GPT-4 with zero-shot prompting surpassed supervised systems on nearly all datasets and outperformed previous state-of-the-art models by an average nDCG improvement of 2.7, 2.3, and 2.7 on TREC, BEIR, and Mr. TyDi, respectively. Among the proprietary LLMs, GPT-4 outperformed Cohere's Rerank, Anthropic's Claude 2, and Google's Bard.

Recent works, such as InPars, Promptagator, HyDE, use LLMs as auxiliary tools to generate synthetic queries or documents to augment the training data for retrievers or rerankers. Interested readers can refer to the article linked below to read more about these unsupervised methods. The focus of this article instead will be on the methods that directly use LLMs as rerankers in the multi-stage pipeline.

Zero and Few Shot Text Retrieval and Ranking Using Large Language Models

This article reviews some of the recent proposals from the research community to boost text retrieval and ranking tasks using LLMs.



Based on the type of instruction employed, the ranking strategies for utilizing LLMs in ranking tasks can be broadly categorized into three main approaches: Pointwise, Pairwise, and Listwise methods. Given the user query and candidate documents as input, these methods employ different prompting methodologies to instruct the LLM to output a relevance estimation for each candidate document.

Given a query $q$ and a set of candidate items $D=\{d_1, d_2, \ldots, d_n\}$, the objective is to determine the ranking of these candidates, represented as $R=\{r_1, r_2, \ldots, r_n\}$. Here, $r_i \in \{1,2,\ldots,n\}$ denotes the rank of the candidate $d_{i}$. For example, $r_i=3$ means that the document $d_i$ is ranked third among the $n$ candidates. A ranking model $f(\cdot)$ assigns scores to the candidates based on their relevance to the query, $s_{i}=f(q,d_i)$, and the candidates are then ranked according to these relevance scores: $r_i = \operatorname{argsort}_i(s_1,s_2,\ldots,s_n)$.
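
As a small illustration of this notation, the following sketch (plain Python, with placeholder scores standing in for $f(q, d_i)$) turns relevance scores into ranks by sorting:

```python
# Minimal sketch: turning relevance scores s_i into ranks r_i (1 = most relevant).
# The scores here are placeholders; in practice they come from a ranking model f(q, d_i).
scores = {"d1": 0.42, "d2": 0.91, "d3": 0.17}

# Sort candidate ids by descending score, then assign 1-based ranks.
ordered = sorted(scores, key=scores.get, reverse=True)
ranks = {doc_id: position + 1 for position, doc_id in enumerate(ordered)}

print(ordered)  # ['d2', 'd1', 'd3']
print(ranks)    # {'d2': 1, 'd1': 2, 'd3': 3}
```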

In the pointwise ranking method, the reranker takes both the query and a candidate document and directly generates a relevance score. These independent scores assigned to each document $d_i$ are then used to reorder the candidate set $D$. The relevance score is typically calculated based on how likely the document is to be relevant to the query or how likely the query is to be generated from the document.

This method can be further classified into two popular approaches based on how the ranking score is calculated.


The pointwise relevance generation approach. Source:[^11]

In instructional relevance generation approaches, like that of Liang et al., the LLM is prompted to output either “Yes” or “No” to determine the relevance of a candidate to a given query. The generation probability is then converted to the relevance score:

$$ s_i = \begin{cases}
1 + f(\text{Yes} \mid I_{\text{RG}}(q,d_i)), & \text{if output Yes} \\
1 - f(\text{No} \mid I_{\text{RG}}(q,d_i)), & \text{if output No}
\end{cases} $$

Here $f(\cdot)$ represents the large language model, and $I_{\text{RG}}$ denotes the relevance generation instruction that converts the input $q$ and $d_i$ into a text-based prompt.
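
A minimal sketch of this relevance generation scoring is shown below, assuming a Hugging Face seq2seq checkpoint such as `google/flan-t5-large`; the prompt template and model choice are illustrative rather than the exact setup used by Liang et al.

```python
# Pointwise relevance generation sketch: score a (query, passage) pair by the
# probability of generating "Yes" vs "No" in the first decoder step.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").eval()

def relevance_score(query: str, passage: str) -> float:
    prompt = (f"Passage: {passage}\nQuery: {query}\n"
              "Does the passage answer the query? Answer Yes or No.")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Score the single-step decoder distribution over the whole vocabulary.
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits, dim=-1)
    p_yes, p_no = probs[yes_id].item(), probs[no_id].item()
    # Mirror the piecewise formula: 1 + P(Yes) if "Yes" wins, otherwise 1 - P(No).
    return 1 + p_yes if p_yes >= p_no else 1 - p_no
```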

Query generation approaches, like Sachan et al., use LLMs to generate a query based on the document and measure the probability of generating the actual query.


UPR Overview
UPR uses any retriever and off-the-shelf PLM for passage reordering

An example of this approach is the Unsupervised Passage Re-ranking (UPR) by Sachan et al. UPR follows a zero-shot document reranking approach by applying an off-the-shelf pre-trained language model (PLM). It appends a natural language instruction “Please write a question based on this passage” to the document $d_i$ (or “passage”) tokens and computes the likelihood of query (or “question”) generation conditioned on the passage:

$$ \log p(q \mid d_i) = \frac{1}{|q|} \sum_{t} \log p(q_t \mid q_{<t}, d_i; \Theta) $$

where $\Theta$ denotes the PLM parameters and $|q|$ denotes the number of question tokens. The candidate set of documents is then sorted based on $\log p(q \mid d_i)$. The UPR codebase, including data and checkpoints, is available on GitHub.
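
The sketch below illustrates UPR-style query-likelihood scoring with a Hugging Face T5-family PLM; the instruction wording follows the paper, but the specific checkpoint and prompt formatting here are simplifications rather than the authors' exact implementation.

```python
# UPR-style query-likelihood scoring sketch: rank passages by the average log-likelihood
# of the query tokens conditioned on the passage plus an instruction.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/t5-large-lm-adapt")
model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-large-lm-adapt").eval()

def query_log_likelihood(query: str, passage: str) -> float:
    prompt = f"Passage: {passage}. Please write a question based on this passage."
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(query, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels supplied, the model returns the token-averaged cross-entropy,
        # i.e. -(1/|q|) * sum_t log p(q_t | q_<t, d_i).
        loss = model(**enc, labels=labels).loss
    return -loss.item()

# Candidates are then reranked by descending query_log_likelihood(q, d_i).
```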

In the pairwise ranking strategy, a pair of candidate items ($d_i, d_j$) along with the user query ($q$) serves as the prompt to guide the LLM to determine which of the two documents is more relevant to the given query.


Pairwise Ranking Prompting
Pairwise ranking can either directly use generated text or log-likelihood of the model generating the text given the prompt. Source:[^11]

$$ c_{i,j} = \begin{cases}
1, & \text{if } f(I_{\text{PRP}}(q, d_i, d_j)) = i \\
0, & \text{if } f(I_{\text{PRP}}(q, d_i, d_j)) = j \\
0.5, & \text{else}
\end{cases} $$

Here, $c_{i,j}$ denotes the choice of the LLM $f(\cdot)$, and $I_{\text{PRP}}$ is a specific pairwise comparison instruction employed to instruct the LLM. This approach usually consults the LLM twice (with $I_{\text{PRP}}(q, d_i, d_j)$ and $I_{\text{PRP}}(q, d_j, d_i)$) for every pair $d_i$ and $d_j$ because LLMs exhibit sensitivity to the order of the text in the prompt.

Handling inconsistent outcomes

If both $I_{\text{PRP}}(q, d_i, d_j)$ and $I_{\text{PRP}}(q, d_j, d_i)$ promptings make consistent decisions, we obtain a local ordering $r_i > r_j$ or $r_i < r_j$; otherwise, we assume $r_i = r_j$ and each document gets half a point.

Subsequently, to compute the relevance score of the $i$-th candidate $d_i$, this method compares $d_i$ against all other candidates in the set $D$ and aggregates the final relevance score as $s_i = \sum_{j \neq i} c_{i,j} + (1 - c_{j,i})$; ties in the aggregated scores are typically broken using the initial ranking. The pairwise ranking method has been shown to be more effective than pointwise and listwise methods, but it is also inefficient and hence unsuitable for inference in large-scale industrial systems. Recent studies have proposed a few methods to reduce its time complexity.
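
To make this aggregation concrete, here is a minimal sketch of all-pairs pairwise reranking. The `llm_prefers` callable is a hypothetical stand-in for a prompted LLM call that answers "A", "B", or something malformed:

```python
# All-pairs pairwise aggregation sketch implementing s_i = sum_j c_{i,j} + (1 - c_{j,i}).
from itertools import permutations

def pairwise_rerank(query, docs, llm_prefers):
    scores = {i: 0.0 for i in range(len(docs))}
    for i, j in permutations(range(len(docs)), 2):
        answer = llm_prefers(query, docs[i], docs[j])  # prompt order: (d_i, d_j)
        if answer == "A":        # LLM picked the first document, d_i
            c_ij = 1.0
        elif answer == "B":      # LLM picked the second document, d_j
            c_ij = 0.0
        else:                    # inconsistent or malformed output
            c_ij = 0.5
        # s_i accumulates c_{i,j} here; the reverse-order term (1 - c_{j,i}) is added
        # when the loop reaches the (j, i) permutation.
        scores[i] += c_ij
        scores[j] += 1.0 - c_ij
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```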

Pairwise reranking is often quite effective, but it incurs a high runtime overhead in terms of transformer inferences. For a set of $k$ documents to be ranked, preferences for all $k^2-k$ comparison pairs (excluding self-comparisons) have to be aggregated. Though these calls to the LLM can be parallelized, the $O(k^2)$ calls can be prohibitive in terms of cost. However, some of the comparisons may be redundant in that they can be predicted from other comparisons. A theoretical lower bound on the runtime complexity is $O(k \log{k})$ if the estimated comparisons were consistent and transitive. Some researchers have suggested improving pairwise reranking efficiency, without a significant loss of effectiveness, by sampling only a subset of all pairs.

  • Gienapp et al. studied three sampling methods with pointwise monoT5 and pairwise duoT5 models.

    1. Global Random Sampling: For each of the top-$k$ documents, a fraction of the other $k-1$ documents is randomly chosen as its comparison set.
    2. Neighborhood Window Sampling: A sliding window of size $m \le k-1$ is moved over the top-$k$ documents, and the $m$ documents inside the window are sampled for comparison. The window “wraps around” when it has fewer than $m$ documents.
    3. Skip-Window Sampling: While the neighborhood window sampling method samples from a “local” neighborhood, skip-window sampling enables more “global” comparisons by incorporating a skip size $s$. This method is the same as neighborhood window sampling when $s=1$.

    The study showed that the best combination (Skip-Window Sampling with greedy aggregation) allows for an order of magnitude fewer comparisons at an acceptable loss of retrieval effectiveness, while competitive effectiveness was achieved with about one-third of comparisons. All code and data underlying this experiment are available on GitHub.

  • Mikhailiuk et al. proposed ASAP (Active Sampling for Pairwise comparisons), a sampling strategy for collecting pairwise comparison data. The authors propose a taxonomy for existing methods and divide them into four groups based on the type of approach: passive, sorting, information-gain, and matchmaking. ASAP is a fully Bayesian algorithm that computes the posterior distribution of the score variable $r$ that would arise from each possible pairwise outcome in the next comparison, and then uses this to choose the next comparison based on a criterion of maximum information gain. It also allows for a batch sampling mode by sampling the pairs from a minimum spanning tree. The ASAP codebase is available on GitHub.

  • In pairwise ranking prompting (PRP), Qin et al. proposed two methods to improve pairwise ranking efficiency. They use Heapsort with pairwise preferences from the LLM as the comparator for the sorting algorithm. They also utilize a sliding window approach, in which a window starts from the bottom of the list and compares and swaps adjacent document pairs with a stride of 1 (see the sketch after this list). One sliding-window pass is similar to one pass of the Bubblesort algorithm and requires only $O(k)$ time.


    One pass of PRP’s sliding window approach. k such passes are performed to get top-k ranking results.

    Their experiments show that the PRP-based FLAN-UL2 model with 20B parameters matches and often outperforms the blackbox commercial GPT-4, which has an estimated 50x larger model size.

  • Zhuang et al. proposed a Setwise prompting technique that improves the efficiency of pairwise models. Their work is based on a simple realization that the sorting algorithm employed by pairwise approaches, like PRP, can be accelerated by comparing multiple documents at each step, as opposed to just a pair. Their experiments show that incorporating the setwise prompting significantly improves the efficiency of both pairwise and listwise approaches, while also enhancing the robustness to initial document ordering. The authors also share a systematic evaluation of all existing LLM-based zero-shot approaches under consistent experimental conditions.


    Setwise prompting approach.
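
Returning to PRP's sliding-window idea mentioned above, the sketch below shows one bottom-to-top pass. The `strictly_prefers_later` comparator is a hypothetical wrapper around the two-way prompting that returns True only when the LLM consistently prefers the lower-ranked document of an adjacent pair:

```python
# One PRP-style sliding-window pass (bubble-sort-like), needing only O(k) comparisons.
def sliding_window_pass(query, ranked_docs, strictly_prefers_later):
    docs = list(ranked_docs)
    # Walk from the bottom of the list to the top, swapping adjacent pairs.
    for pos in range(len(docs) - 1, 0, -1):
        if strictly_prefers_later(query, docs[pos - 1], docs[pos]):
            docs[pos - 1], docs[pos] = docs[pos], docs[pos - 1]
    return docs

# Running k such passes bubbles the top-k documents to the front; a single pass is
# often enough when only the very top of the ranking matters.
```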

With scoring-API-based pairwise reranking, assuming a set of $k$ candidate documents to be reranked, preference probabilities for all $k^2-k$ document pairs (excluding self-comparisons) are aggregated using rank aggregation strategies. Static aggregation methods assume that the required pairwise comparisons are done before the aggregation starts, while dynamic aggregation methods decide which documents to compare next based on all previous comparisons. Several approaches are known to work well in practice; some of them are listed below, with a sketch of additive aggregation after the list:

  1. Sorting via KwikSort: KwikSort is an extension of the QuickSort algorithm for data with preferences. It works recursively by randomly picking a pivot document and placing the documents ranked lower or higher than the pivot in separate subsets. This method assumes that the comparisons are consistent and has an expected number of comparisons of $O(k \log{k})$.
  2. Additive Aggregation: Pradeep et al. proposed an approach where the rank of a document is indicated by the sum of the document’s comparison probabilities (potentially transforming the probabilities before summation).
  3. Regression-based Aggregation: Using the Bradley-Terry model, the latent scores for documents can be learned using maximum likelihood estimation such that they optimally correspond to a given set of pairwise comparisons.
  4. Greedy Aggregation: In greedy aggregation methods, a heuristic iteratively selects and removes the best document from a given set and then proceeds with the rest.
  5. Graph-based Aggregation: Comparison can also be interpreted as directed weighted edges between the document nodes. A measure of graph centrality, such as PageRank, can then be used to derive a ranking score.
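
As an example of the additive aggregation strategy above, here is a minimal sketch that aggregates a (purely illustrative) matrix of pairwise preference probabilities into a ranking:

```python
# Additive aggregation over a pairwise preference matrix P, where P[i][j] is the
# (model-estimated) probability that document i beats document j. Values are placeholders.
def additive_aggregate(P):
    k = len(P)
    scores = [sum(P[i][j] for j in range(k) if j != i) for i in range(k)]
    # Rank documents by descending aggregated score.
    return sorted(range(k), key=lambda i: scores[i], reverse=True)

P = [
    [0.0, 0.8, 0.6],
    [0.2, 0.0, 0.3],
    [0.4, 0.7, 0.0],
]
print(additive_aggregate(P))  # [0, 2, 1]
```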

The listwise paradigm generalizes the pairwise paradigm. In the listwise ranking strategy, a set of candidate documents is fed to the LLM, which means the model can attend to all the candidate documents simultaneously while reranking. Each document is identified by a unique identifier like [1], [2], etc. The LLM is then instructed to generate a ranked permutation of these documents, such as [2] > [3] > [1]: $\mathrm{Perm} = f(I_{\text{LIST}}(q, d_1, d_2, \ldots, d_n))$. By framing its goal as text generation, this approach fuses well with existing techniques based on generative models.

Some works also refer to this approach as the Instructional Permutation Generation approach, as it instructs the LLM to directly output a permutation of a group of passages. LLMs that are used as rerankers in a multi-stage pipeline, with prompt engineering being the primary means to accomplish the listwise reranking task, have also been referred to as “prompt decoders.” Listwise approaches can be inefficient because of the substantial number of output tokens required, as each additional token generated by the LLM requires an extra inference step.

The relevance score for each candidate is simply defined as the reciprocal of its rank $s_i = \frac{1}{r_i}$. So the generated permutation can be readily transformed into ranking results $R$.
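
A minimal sketch of this conversion, parsing a generated permutation string into ranks and reciprocal-rank scores:

```python
# Convert a generated permutation string such as "[2] > [3] > [1]" into ranks r_i
# and reciprocal-rank scores s_i = 1 / r_i.
import re

def permutation_to_scores(permutation: str):
    doc_ids = [int(m) for m in re.findall(r"\[(\d+)\]", permutation)]
    ranks = {doc_id: rank for rank, doc_id in enumerate(doc_ids, start=1)}
    scores = {doc_id: 1.0 / rank for doc_id, rank in ranks.items()}
    return ranks, scores

print(permutation_to_scores("[2] > [3] > [1]"))
# ({2: 1, 3: 2, 1: 3}, {2: 1.0, 3: 0.5, 1: 0.3333333333333333})
```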


The listwise permutation approach. Source:[^11]

Due to the token limit on input context, LLMs can only rank a limited number of passages using the listwise ranking approach. To overcome this, a sliding window strategy is employed to allow the LLM to rank an arbitrary number of passages.


Illustration of using a sliding window.

The figure above shows an example of reranking 8 passages using a sliding window of size 4 and a step size of 2, applied in back-to-first order. The first two windows are shown in blue and the last window in yellow. The top two passages of each window carry over and participate in the reranking of the next window.

Formally, for reranking $M$ passages, we define two hyperparameters: the window size $(w)$ and the step size $(s)$. The LLM is first used to rank the candidates from the $(M-w)$-th to the $M$-th passage, and then the window slides by the step size $s$, so the passages in the $(M-w-s)$-th to $(M-s)$-th range are reranked. In each step, the top $(w-s)$ candidates of the current window are preserved and form the next sliding window together with the next $s$ documents. This process of ranking $w$ passages while sliding the window forward with step size $s$ is repeated until all the passages have been reranked.
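
The sketch below implements this back-to-first sliding-window schedule; `llm_rerank_window` is a hypothetical call that returns a window's documents reordered by a listwise prompt:

```python
# Back-to-first sliding-window reranking over M passages.
def sliding_window_rerank(query, docs, window_size, step_size, llm_rerank_window):
    docs = list(docs)
    end = len(docs)
    while end > 0:
        start = max(0, end - window_size)
        # Rerank the current window in place.
        docs[start:end] = llm_rerank_window(query, docs[start:end])
        if start == 0:
            break  # the window has reached the top of the list
        # Slide toward the top; the top (window_size - step_size) documents of this
        # window are revisited in the next window together with the next step_size docs.
        end -= step_size
    return docs

# Example: 8 passages with window size 4 and step size 2 produce the windows
# [4:8], [2:6], and [0:4], matching the figure above.
```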

Recent studies have shown that the top-ranked passages produced by listwise rerankers come from a wide range of initial positions, compared to pointwise methods. While pairwise approaches may elevate a poorly ranked relevant document to a high position, they usually fail to reorder multiple items effectively. Listwise methods, on the other hand, with their large context windows, excel at concurrently promoting several poorly ranked documents into higher positions.

The listwise ranking strategy is highly efficient, but with off-the-shelf models it has mostly been effective only with powerful LLMs, such as Claude and GPT-4, as smaller models seem to lack the capability to effectively reorder a list of input documents. This strategy also relies heavily on intricate prompt engineering and has been shown to generate malformed outputs, such as:

  • Incorrect output format: LLM outputs may not follow the requested format
  • Repetition: output response may contain repeated document IDs
  • Missing: some document IDs might be missing in the LLM response

Pradeep et al. used RankGPT-3.5 as the teacher model to generate ranked lists for training their student model. They found that about 12% of the outputs were malformed and had to be excluded from the student model's training set.
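
In practice, such malformed permutations are usually repaired before use. The sketch below shows one plausible cleanup step, in the spirit of (but not identical to) the post-processing described in these works: it drops repeated or out-of-range identifiers and appends missing ones in their original order.

```python
# Post-process a possibly malformed permutation string.
import re

def clean_permutation(raw_output: str, num_docs: int):
    seen, order = set(), []
    for m in re.findall(r"\[(\d+)\]", raw_output):
        doc_id = int(m)
        # Keep only valid, first-seen document ids.
        if 1 <= doc_id <= num_docs and doc_id not in seen:
            seen.add(doc_id)
            order.append(doc_id)
    # Append missing document ids, preserving their initial ranking order.
    order.extend(i for i in range(1, num_docs + 1) if i not in seen)
    return order

print(clean_permutation("[2] > [2] > [5] > [1]", 4))  # [2, 1, 3, 4]
```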

Among the three approaches described above, the pointwise method is the most efficient, as it computes the relevance score for each document for a given query only once and can be parallelized. However, it may not be a highly effective approach, as it scores each document independently, without any information about the other candidates, and requires the model to yield calibrated pointwise scores. In contrast, the pairwise paradigm sidesteps the calibration issue by considering one-to-one pairwise comparisons. However, LLM inference on all document pairs can be computationally expensive. Listwise ranking is more efficient than the pairwise approach, but its effectiveness has largely been limited to closed-source LLMs like GPT-4, as it has been shown to perform poorly with smaller, open-source models. Listwise methods also struggle to fit a large candidate set of documents into the prompt due to prompt-length constraints, so most practical implementations use the sliding-window method described above.

With LLM prompting, there are two popular modes of ranking documents: generation and scoring (likelihood). For pointwise ranking with the generation approach, the LLM is asked a “yes/no” question about whether the candidate document is relevant to the query, and the normalized likelihood of generating a “Yes” response for each document is used as the relevance score for reranking the candidates. In the scoring approach, a query likelihood method reranks the documents based on the likelihood of generating the actual query.

Closed source LLMs

Note that both of these methods require access to the output logits of the LLM to compute the likelihood scores. Hence, it is not possible to implement these approaches with closed-source LLMs, like GPT-4, whose APIs do not expose logit values.


Generation vs Scoring
Generation and Scoring strategies a) Pointwise b) Pairwise, and c) Listwise. Source:[^4]

Pairwise ranking naturally supports both generation and scoring LLM APIs. Listwise ranking approaches follow the more efficient permutation generation process for directly generating the ranked list of document identifiers without producing any intermediate relevance score.

| Approach | Complexity | Generation API | Scoring API | Requires Calibration | Batch Inference |
| --- | --- | --- | --- | --- | --- |
| Pointwise | $O(n)$ | No | Yes | Yes | Yes |
| Pairwise (all pairs) | $O(n^2 - n)$ | Yes | Yes | No | Yes |
| Pairwise (heapsort) | $O(n \log_{2}{n})$ | Yes | Yes | No | No |
| Listwise | $O(k \cdot n)$ | Yes | No | No | Yes |

The emergence of LLMs has brought a paradigm shift to natural language processing, and there has been growing interest in harnessing their power for text ranking. Most existing approaches exploit LLMs as auxiliary tools for content generation (e.g., queries or passages). This article reviewed a recent research direction that directly prompts LLMs to perform unsupervised ranking using pointwise, pairwise, or listwise techniques. In the next article, we will take a closer look at some of the challenges associated with this theme and at strategies for more effective and efficient LLM-based reranking. We will also explore the latest efforts to train ranking-aware LLMs.


