Probing Ranking LLMs: Mechanistic Interpretability in Information Retrieval

Tanya Chowdhury, University of Massachusetts Amherst, MA, USA, and James Allan, University of Massachusetts Amherst, MA, USA
(2018)
Abstract.

Transformer networks, especially those with performance on par with GPT models, are renowned for their powerful feature extraction capabilities. However, the nature and correlation of these features with human-engineered ones remain unclear. In this study, we delve into the mechanistic workings of state-of-the-art, fine-tuning-based passage-reranking transformer networks.

Our approach involves a probing-based, layer-by-layer analysis of neurons within ranking LLMs to identify individual or groups of known human-engineered and semantic features within the network’s activations. We explore a wide range of features, including lexical, document structure, query-document interaction, advanced semantic, interaction-based, and LLM-specific features, to gain a deeper understanding of the underlying mechanisms that drive ranking decisions in LLMs.

Our results reveal a set of features that are prominently represented in LLM activations, as well as others that are notably absent. Additionally, we observe distinct behaviors of LLMs when processing low versus high relevance queries and when encountering out-of-distribution query and document sets. By examining these features within activations, we aim to enhance the interpretability and performance of LLMs in ranking tasks. Our findings provide valuable insights for the development of more effective and transparent ranking models, with significant implications for the broader information retrieval community. All scripts and code necessary to replicate our findings are made available.


1. Introduction

Motivation: For many decades, the domain of passage retrieval and reranking has predominantly used algorithms grounded in statistical human-engineered features, typically derived from the query, the document set, or their interactions. In particular, the features in training and evaluation datasets such as MSLR (Qin and Liu, 2013) are largely derived from features manually developed to support successful statistical ranking systems. However, recent advances have seen state-of-the-art passage retrieval and ranking algorithms increasingly pivot to neural network-based models. The introduction of Large Language Models (LLMs) has notably widened the performance disparity between neural-based and traditional statistical rankers.

Access to open-source ranking models, particularly those exhibiting performance comparable to GPTs, provides a unique opportunity to explore the inner mechanics of transformer-based architectures. Neural networks are recognized for their robust feature extraction capabilities. However, the representation of these features remains elusive, making it challenging to discern their nature and potential correlation with features engineered by humans. It is thus difficult to develop human-understandable explanations, and hard for engineers to analyze how networks produce their results.

Despite their efficacy, neural networks present a challenge in terms of transparency: their feature representations are complex and the location of critical features within the network remains obscure. In this context, mechanistic interpretability (Räuker et al., 2023) aims to demystify the internal workings of transformer architectures, showing how LLMs process and learn information differently compared to traditional methods. Our research is motivated by the desire to determine whether the statistical features traditionally valued in algorithms like BM25 and tf*idf retain their relevance within LLM architectures. This study aims to bridge the gap between neural and statistical methodologies by using probes to generate hypotheses on the inner workings of LLMs, offering insights that could enhance the field of information retrieval by enabling more intuitive explanations and better support for model analysis and development.

Summary: This study is built upon the Llama2-7b/13b architecture, specifically RankLlama (Ma et al., 2023), a LoRA fine-tuned variant of Llama-2 optimized for passage reranking tasks using the MS MARCO dataset. RankLlama, a point-wise ranker, demonstrates substantial accuracy improvements over its non-LLM counterparts (Ma et al., 2023). Our analysis concentrates on extracting activations from the MLP unit of each transformer block, which is posited to contain the key feature extractors (Geva et al., 2020). We aggregate these activations for each input sequence (query-document pair) and assign labels to these pairs corresponding to each feature that is a target of our probe. Subsequently, we employ a regression with regularization to correlate these activations with the labels. Our findings reveal a pronounced representation of certain MSLR features within the RankLlama model, while others are markedly absent. These insights allow us to formulate hypotheses concerning the underlying circuitry of ranking LLMs.

Research Questions: In this study, we aim to explore several internal mechanistic aspects of ranking LLMs through probing techniques. Specifically, we seek to determine whether known statistical information retrieval (IR) features are present within the activations of LLMs. We are also interested in identifying groups of features and understanding how they may combine or interact within the LLM’s activations. Additionally, we investigate whether LLMs contain components that mimic similarity scores from models like BERT or T5. Another key question we address is whether LLMs extract the same features for highly relevant documents as they do for those with low relevance. Finally, this inquiry extends to examining whether the features extracted by LLMs remain consistent or change when the model encounters out-of-distribution queries or documents. By answering these questions, we aim to gain a deeper understanding of the inner workings of ranking LLMs and the extent to which they align with traditional IR methodologies.

1.1. Contributions and Findings

  (1) We discovered that several human-engineered metrics, such as covered query term number, covered query term ratio, min of term frequency, min and mean of stream length normalized term frequency, and variance of tf*idf, are represented in LLM feature extractors. In contrast, certain features, including sum of stream length normalized term frequency, max of stream length normalized term frequency, and BM25, show no discernible representation in LLM activations.

  (2) We identified specific combinations of MSLR features, such as the sum of covered query term ratio, mean stream length normalized term frequency, and variance of tf*idf, along with the sum’s square and cube, all of which exhibit a strong correlation with LLM activations.

  (3) We found that the activation patterns of the listed features remain consistent even when the model encounters out-of-distribution queries or documents on RankLlama 7b. However, they do not remain consistent for certain features of RankLlama 13b, suggesting potential overfitting during fine-tuning.

  (4) All models and datasets used in this study are open source. The scripts and code used in this study are provided along with the work.

2. Background & Related Work

We first discuss concepts in mechanistic interpretability, in particular sparse probing. We then touch on inner interpretability approaches in information retrieval.

2.1. Mechanistic Interpretability & Sparse Probing

Mechanistic Interpretability is the ambitious goal of gaining an algorithmic-level understanding of any deep neural network’s computations (Räuker et al., 2023). This goal can be pursued in numerous ways: by studying weights (e.g., weight masking (Csordás et al., 2020; Wortsman et al., 2020), continual learning (De-Arteaga et al., 2019)), by studying individual neurons (e.g., excitement-based (Zhou et al., 2014), gradient-based (Ancona et al., 2019), perturbation- and ablation-based (Zhou et al., 2018)), by studying subnetworks (e.g., sparsity-based (Meister et al., 2021), circuit-analysis-based (Fiacco et al., 2019)), and by studying representations (e.g., tokens (Li et al., 2021), attention (Clark et al., 2019), probing (Belinkov, 2022)). In this study, we focus on studying the activations of individual neurons and also some groups of neurons with the help of probing regressors. For this, we use concepts from a host of the techniques mentioned above.

While the concept of reverse-engineering specific neurons within large language models (LLMs) is relatively new, existing studies such as (Geva et al., 2020, 2022) illustrate that the feed-forward layers of transformers, comprising two-thirds of the model’s parameters, function as key-value memories. These layers activate in response to specific inputs, a mechanism we aim to demystify by reverse-engineering the activation function of these neurons. A notable challenge in this endeavor is the phenomenon of superposition, where early layers in LLMs select and store a vast array of features—often exceeding the number of available neurons—in a linear combination across multiple neurons. In contrast, later layers tend to focus on more abstract features, discarding those deemed non-essential (Gurnee et al., 2023).

Probing aims to determine if a given representation effectively captures a specific type of information, as discussed by Belinkov (Belinkov, 2022). This technique employs transfer learning to test whether embeddings contain information pertinent to a target task. The three essential steps in probing include: (1) obtaining a dataset with examples that exhibit variation in a particular quality of interest, (2) embedding these examples, and (3) training a model on these embeddings to assess if it can learn the quality of interest. This method is versatile as it can utilize any inner representation from any model. However, a limitation of probing methods is that a successful probe does not necessarily mean that the probed model actually utilizes that information about the data (Ravichander et al., 2020). Belinkov (Belinkov, 2022) provides a comprehensive survey on probing methods for large language models, discussing their advantages, disadvantages, and complexities. Gurnee et al. (Gurnee et al., 2023) further introduce sparse probing, where they mine features of interest over groups of representations up to $k$ in size. We build upon sparse probing in this work.

2.2. Inner Interpretability in IR

Most work on interpretability in information retrieval (Anand et al., 2022) has been model extrinsic (Chowdhury et al., 2023, 2024). Among model intrinsic interpretability methods, some works focus on using gradient-based methods to identify important neurons (Chen et al., 2024; Fernando et al., 2019). Although gradient-based methods give an accurate perspective on the flow of information, they are too fine-grained to give a human-level understanding of what is happening. A number of works have attempted to examine the inner workings of neural retrievers to understand if they satisfy IR axioms and/or to spot known features. Fan et al. (Fan et al., 2021) probe fine-tuned BERT models on three different tasks, namely document retrieval, answer retrieval and passage retrieval, to find out if these different forms of relevance lead to different modellings. Zhan et al. (Zhan et al., 2020) study the attention patterns of BERT after fine-tuning on the document ranking task and note how BERT dumps duplicate attention weights on high-frequency tokens (like periods). Similarly, Choi et al. (Choi et al., 2022) study attention maps in BERT and report the discovery of the IDF feature within BERT attention patterns. ColBERT investigations (Formal et al., 2021, 2022) study its term-matching mechanism and conclude that it captures a notion of term importance, which is enhanced by fine-tuning. MacAvaney et al. (MacAvaney et al., 2022) propose ABNIRML, a set of diagnostic probes for neural retrieval models that allow searching for features like writing styles, factuality, sensitivity and word order in models like ColBERT and T5.

These studies, however, were conducted in the pre-LLM era and hence examine much smaller models. It is unknown if the findings from BERT will hold for LLMs like Llama, given that BERT is an encoder-only model whereas most modern LLMs are decoder-only models. While most of the above works focus on attention heads and patterns, our work focuses on probing the MLP activation layers, which are now believed to be the primary feature extraction location within the LLM (Gurnee et al., 2023). To the best of our knowledge, this is one of the first works to study neuron activations in LLMs for a large set of statistical and semantic features.

Figure 1. The RankLlama model (7B) internal architecture with 32 layers and 4096 dimensional vectors. Similarly, the 13B model contains 40 layers and 5120 dimensional vectors. Note the score layer, added to the general Llama architecture during fine-tuning using LoRA.

3. Identifying Context Neurons

We describe our probing pipeline to identify context neurons - neurons that are sensitive to or encode desired features - in the LLM architecture.

Ranking Model: For our ranking LLM interpretability study, we select RankLlama (Ma et al., 2023), an open-source pointwise reranker that has been fine-tuned from Llama-2 on the MS MARCO dataset using LoRA. Given an input sequence (query Q, document D), RankLlama computes the relevance score as:

\begin{align*}
\mathit{input} &= \text{`query: \{Q\} document: \{D\}</s>'} \\
\mathit{Sim}(Q, D) &= \mathit{Linear}(\mathit{Decoder}(\mathit{input})[-1])
\end{align*}

where $\mathit{Linear}(\cdot)$ is a linear projection layer that projects the last layer representation of the end-of-sequence token to a scalar. The model is fine-tuned using a contrastive loss function. RankLlama is accessible on Hugging Face (https://huggingface.co/castorini/rankllama-v1-7b-lora-passage) and has demonstrated near state-of-the-art performance on the MS MARCO Dev set, as noted by Ma et al. (Ma et al., 2023). We experiment on both the 7b and 13b parameter models fine-tuned for passage reranking. At the time of writing, RankLlama seems to be the only open-source LLM fine-tuned for reranking, and thus the only choice for a mechanistic investigation. However, this study can be carried out for any fine-tuning-based LLM ranker when such models are made available in the future.
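The scoring rule above can be illustrated with a toy sketch in plain Python. This is not RankLlama's code: `format_input`, `score`, the tiny hidden states, and the projection weights are hypothetical stand-ins for the tokenizer, the decoder output, and the learned linear layer, and the zero default bias is an assumption.

```python
def format_input(query: str, document: str) -> str:
    """Build the pointwise prompt; </s> is the end-of-sequence marker."""
    return f"query: {query} document: {document}</s>"


def score(hidden_states, weight, bias=0.0):
    """Project the last token's final-layer representation to a scalar.

    hidden_states: per-token vectors (stand-in for Decoder(input)).
    weight, bias:  stand-ins for the learned Linear(.) layer (bias assumed 0).
    """
    last = hidden_states[-1]  # representation of the end-of-sequence token
    return sum(w * h for w, h in zip(weight, last)) + bias


prompt = format_input("what is bm25", "BM25 is a ranking function.")
toy_states = [[0.1, 0.2, 0.3], [0.5, -1.0, 2.0]]  # 2 tokens, 3 dimensions
toy_weight = [1.0, 0.5, 0.25]
relevance = score(toy_states, toy_weight)
```

Only the final position contributes to the score, which is why the end-of-sequence token is appended to every input.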

Internal Architecture: We refer to the internal architecture of RankLlama in Figure 1. Each block of the Llama transformer architecture can be divided into two main groups: the multi-head self-attention (MHA) layers and the MLP layers. The Llama-2 7b and 13b architectures consist of 32 and 40 such symmetrical blocks, where each component has a dimensionality of 4096 and 5120 respectively. The feed-forward sublayer takes in the output of the multi-head attention layer and performs two linear transformations over it, with an element-wise non-linear function $\sigma$ in between.

\[
f_i^l = \mathbf{W}_v^l \times \sigma\left(\mathbf{W}_k^l a_i^l + b_k^l\right) + b_v^l
\]

where $a_i^l$ is the output of the MHA block in layer $l$, and $\mathbf{W}_v^l$, $\mathbf{W}_k^l$, $b_k^l$, and $b_v^l$ are learned weight and bias matrices.

This MLP layer applies pointwise nonlinearities to each token independently and therefore performs the majority of feature extraction for the LLM. Additionally, the MLP layers account for two-thirds of the total parameters. As a result, our study focuses on the value of the residual stream (also known as hidden state) for token $t$ at layer $l$, right after applying the MLP layers. For further details about the transformer architecture math, we point the readers to Elhage et al. (Elhage et al., 2021).
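The feed-forward equation above can be traced on a toy example. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the matrices are tiny hand-picked values, and $\sigma$ is taken to be ReLU purely for readability (the equation leaves $\sigma$ generic).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(x):
    """Assumed non-linearity, chosen only for illustration."""
    return max(0.0, x)


def feed_forward(a, W_k, b_k, W_v, b_v, sigma=relu):
    """Compute f = W_v * sigma(W_k a + b_k) + b_v for a single token."""
    hidden = [sigma(h + b) for h, b in zip(matvec(W_k, a), b_k)]
    return [o + b for o, b in zip(matvec(W_v, hidden), b_v)]


a = [1.0, -1.0]                             # toy MHA output for one token
W_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 -> 3 expansion
b_k = [0.0, 0.0, 0.5]
W_v = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]    # 3 -> 2 projection
b_v = [0.0, 1.0]
f = feed_forward(a, W_k, b_k, W_v, b_v)
```

Because the function acts on one token vector at a time, applying it to every position independently gives the per-token behaviour described above.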

Activation Sourcing: We set up PyTorch forward hooks to mine activations from RankLlama. We tokenize and feed the query-document pairs to the pointwise LLM (as shown above) and capture activations corresponding to each input sequence across all layers. Each layer’s activation has dimension (number of tokens) × (number of hidden units), where the number of hidden units is 4096 in Llama 7b and 5120 in Llama 13b. To reduce the complexity of the computation, we aggregate activations across tokens in an input sequence within a layer. Following Gurnee et al. (Gurnee et al., 2023), we try out mean and max activation aggregation to ultimately obtain a 4096/5120 dimensional activation vector per layer. This corresponds to aggregated activations for 4096/5120 neurons in each layer, which is the focal point of our study. We quantize and store these activation tensors to be retrieved later for further experiments. Intuitively, we have captured the internals of the model at this point, in which we can start searching for desired features.
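The token-level aggregation step can be sketched as follows. This is a plain-Python illustration with a hypothetical `aggregate` helper and a 3-token, 3-unit toy matrix; in the actual setting the matrix would have 4096 or 5120 columns.

```python
def aggregate(token_activations, how="mean"):
    """Collapse a (tokens x hidden_units) activation matrix to one vector."""
    columns = list(zip(*token_activations))  # one tuple per hidden unit
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation: {how}")


# Toy layer output: 3 tokens, 3 hidden units.
layer_activations = [
    [0.0, 2.0, -1.0],
    [1.0, 0.0, 3.0],
    [2.0, 4.0, 1.0],
]
mean_vec = aggregate(layer_activations, "mean")
max_vec = aggregate(layer_activations, "max")
```

Either way, the result has one entry per hidden unit, so sequences of different lengths yield vectors of the same size.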

Target Features: The MSLR dataset (Qin and Liu, 2013) provides a comprehensive list of features that have been historically recognized as highly effective for ranking models (Han and Lei, 2018; Hochma et al., 2024). Consequently, we aim to search for a subset of these MSLR features from the RankLlama activations. Specifically, we focus on mining the following 19 features from the MSLR dataset within the LLM (“stream” means the document or passage text): Covered query term number, Covered query term ratio, Stream length, IDF (Inverse Document Frequency), Sum of term frequency, Min of term frequency, Max of term frequency, Mean of term frequency, Variance of term frequency, Sum of stream length normalized term frequency, Min of stream length normalized term frequency, Max of stream length normalized term frequency, Mean of stream length normalized term frequency, Variance of stream length normalized term frequency, Sum of tf*idf, Min of tf*idf, Max of tf*idf, Mean of tf*idf, Variance of tf*idf, BM25.
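To make a few of these feature definitions concrete, the sketch below computes them for one query-stream pair. The tokenization (lowercasing and whitespace splitting), the use of population variance, and the function and key names are assumptions for illustration; the MSLR definitions themselves do not prescribe them here.

```python
from collections import Counter
from statistics import mean, pvariance


def mslr_features(query: str, stream: str) -> dict:
    """Compute a handful of MSLR-style features for one query-stream pair."""
    q_terms = query.lower().split()   # assumed tokenization
    s_terms = stream.lower().split()
    counts = Counter(s_terms)
    tf = [counts[t] for t in q_terms]            # term frequency per query term
    norm_tf = [f / len(s_terms) for f in tf]     # stream length normalized tf
    covered = sum(1 for f in tf if f > 0)
    return {
        "covered_query_term_number": covered,
        "covered_query_term_ratio": covered / len(q_terms),
        "stream_length": len(s_terms),
        "sum_tf": sum(tf),
        "min_tf": min(tf),
        "max_tf": max(tf),
        "mean_tf": mean(tf),
        "var_tf": pvariance(tf),                 # population variance (assumed)
        "mean_norm_tf": mean(norm_tf),
    }


example = mslr_features("neural ranking models",
                        "Neural ranking with neural networks")
```

The tf*idf variants follow the same pattern, with each term frequency multiplied by that term's inverse document frequency before aggregation.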

In addition to MSLR features, we mine for known query-document similarity metrics within the LLM architecture. For this we consider traditional query-document similarity scores like tf*idf cosine score, Euclidean score, Manhattan score, KL-divergence score and Jensen-Shannon divergence score, as well as popular semantic relevance metrics like BERT and T5 scores.
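The traditional similarity scores listed above can be written compactly over a pair of equal-length vectors. The sketch below uses the textbook definitions; how the query and document vectors are built (for example, tf*idf weights or normalized term distributions) and how a distance is turned into a score are assumptions left to the caller, and the `eps` guard is added only to avoid division by zero.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against q_i = 0."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

KL divergence is asymmetric and unbounded when the two distributions have disjoint support, whereas the Jensen-Shannon divergence stays finite in that case.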

Each feature is probed individually; for instance, the probe for stream length is dedicated solely to that feature without searching for any others. Although many of these features are known to be correlated and may share neurons, this study does not include correlation analysis amongst features.

Figure 2. Probing for statistical features from the MSLR dataset in RankLlama 7b model. The graph lines indicate the presence of a particular feature along the layers of the LLM. Some features like min of tf*idf show consistent presence across the layers. Other features like covered query term number, covered query term ratio, mean of stream length normalized term frequency and variance of tf*idf show increasing prominence from the 1st layer to the last, ultimately playing an important role in ranking decision making. Other MSLR features like sum of stream length normalized term frequency, max of stream length normalized term frequency, and sum of tf*idf show negative correlation with RankLlama decision making.

Probing Datasets: To facilitate probing (Sajjad et al., 2022), we need to curate a dataset of input sequences to study the model. We select query-document pairs from the MS MARCO test set (Nguyen et al., 2016) for this purpose. For each query, we include documents that are highly relevant, highly irrelevant, and of intermediate relevance. Our input sequences for the ranking LLM consist of these query-document pairs. Note that it is important for each probing dataset to be balanced, i.e., to contain a uniform number of samples across the range of values of the feature being probed (Gurnee et al., 2023; Belinkov, 2022). For instance, when analyzing the BM25 feature, our dataset included a balanced ratio of query-document pairs with both low and high BM25 values. An unbalanced probing dataset can lead to a biased analysis, as the probe may overfit to the dominant range of values. For each input sequence, we calculate the expected value of the feature being studied. Consequently, we compute the values for each of the 19 MSLR features and 7 similarity scores for our query-document pairs.
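One simple way to obtain a balanced probing set is to bin the feature values and draw the same number of pairs from each bin. The sketch below illustrates this idea only; equal-width bins, the bin count, the per-bin quota, and the `balanced_sample` helper are all assumptions rather than the procedure used in the paper.

```python
import random
from collections import defaultdict


def balanced_sample(pairs, num_bins=4, per_bin=2, seed=0):
    """Draw up to `per_bin` items from each equal-width bin of feature values.

    pairs: list of (example_id, feature_value) tuples.
    """
    values = [v for _, v in pairs]
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all values match
    bins = defaultdict(list)
    for item in pairs:
        idx = min(int((item[1] - lo) / width), num_bins - 1)
        bins[idx].append(item)
    rng = random.Random(seed)
    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


# Skewed toy data: most pairs have a low feature value.
toy_values = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.1, 3.9, 4.0]
pairs = list(enumerate(toy_values))
sample = balanced_sample(pairs)
```

In the toy data six of the ten pairs fall in the lowest bin, but only two of them are kept, so low values no longer dominate the probe's training signal.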

Identifying context neurons: After collecting activations and feature labels across a wide range of input sequences (query-document pairs), we start the process of identifying context neurons. Unlike most prior research that employs classification-based probes (MacAvaney et al., 2022; Fan et al., 2021), we utilize regularized regression-based probes to capture the continuous nature of the features under study. We first split the activations and their corresponding labels into training, validation and test sets (60:20:20).
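A 60:20:20 split of this kind can be sketched as a shuffled index partition. The helper name, the fixed seed, and the rounding of split sizes are illustrative assumptions.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle 0..n-1 and partition into train / validation / test indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```

The same indices are applied to the activation vectors and to the feature labels so that each activation stays paired with its label.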

Subsequently, we perform a layer-wise analysis. For each layer and feature, we employ a sparse probe by fitting the activation vectors of the training split to the corresponding feature labels. To maintain the sparsity of the context neurons, we use Lasso (L1 regularization) with $\alpha = 0.1$. We perform 5-fold cross validation to avoid overfitting and provide a more reliable estimate. After fitting the activations to the feature’s labels, we compute the $R^2$ score to measure how well the regression curve explains the variability of the labels. $R^2$ is at most 1, with 1 indicating that the curve perfectly explains the variability in the data; on held-out data it can fall below 0 when the probe predicts worse than the mean of the labels. A high $R^2$ score indicates the presence of a neuron that is sensitive to or activates for the feature being studied, whereas we read a negative $R^2$ as a negative correlation between the feature and the neuron’s activations.
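The sparse probe and its $R^2$ score can be sketched in plain Python. This is a minimal coordinate-descent Lasso, assuming the common objective $\frac{1}{2n}\lVert y - Xw - b\rVert_2^2 + \alpha\lVert w\rVert_1$; it is not the implementation used in the study, cross validation is omitted, and the two-neuron toy data are made up.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def fit_lasso(X, y, alpha=0.1, n_iter=100):
    """Coordinate-descent Lasso on centered data; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    residual = [t - y_mean for t in y]  # residual of the all-zero model
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            col = [row[j] for row in Xc]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            residual = [r - (new_w - w[j]) * c for r, c in zip(residual, col)]
            w[j] = new_w
    b = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, b


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: neuron 0 tracks the label (y = 2 * x0), neuron 1 is small noise.
X = [[0.0, 0.3], [1.0, -0.2], [2.0, 0.1], [3.0, -0.1], [4.0, 0.2], [5.0, -0.3]]
y = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
w, b = fit_lasso(X, y, alpha=0.1)
pred = [b + sum(wj * xj for wj, xj in zip(w, row)) for row in X]
fit_r2 = r2_score(y, pred)
```

On the toy data the L1 penalty drives the weight of the noise neuron to exactly zero and leaves a single non-zero weight, which is the sense in which the probe singles out candidate context neurons.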

Experimental Details: For our probing experiments, we select 500 queries from the MS MARCO dev set and retrieve the top 1000 documents for each query from the MS MARCO collections corpus using a BM25 ranker. We then compute RankLlama activations for each query-document pair using both the 7b and 13b models. The results reported in the following section are after mean aggregation and quantization for efficient storage. The size of each probing dataset is greater than 10,000 to ensure incorporation of a wide range of values for the feature being probed. The 1000 documents retrieved for a query serve as the query’s corpus for idf calculation. We use the bert-base and t5-base models for BERT and T5 score computations. We use the well-known Okapi implementation of BM25 in our experiments.
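For reference, an Okapi-style BM25 score with IDF computed over a per-query corpus can be sketched as below. The parameter values $k_1 = 1.2$ and $b = 0.75$, the non-negative IDF variant, and the toy corpus are assumptions for illustration, not the settings used in the experiments.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi-style BM25 of one document against a tokenized corpus."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        # Assumed IDF variant: the +1 inside the log keeps it non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        tf = counts[term]
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / denom
    return score


# Toy per-query corpus standing in for the retrieved documents of a query.
corpus = [
    ["bm25", "ranking", "function"],
    ["neural", "ranking"],
    ["cooking", "pasta", "recipe", "tips"],
]
query = ["bm25", "ranking"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

Restricting the corpus to the documents retrieved for a query, as described above, means that document frequencies and the average length are computed per query rather than over the full collection.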

4. Research Questions

4.1. Known features within RankLlama

4.1.1. MSLR features

We designed sparse probes for each of the selected MSLR features and conducted experiments on each layer of the Llama-2-7b architecture. All experiments were run with both mean and max activation aggregation over the input sequence. Our findings (Figure 2) are categorized into two groups: (i) features that exhibit a strong fit in certain layers, and (ii) features that do not correlate with the LLM activations in any layer.

Positives: Features that achieve an $R^2$ score greater than 0.85 in any particular layer are considered strong positives. This means it is highly likely that neurons in that layer have extracted the particular feature. We found that several metrics, namely covered query term number, covered query term ratio, min of term frequency, mean of stream length normalized term frequency, variance of tf*idf and min of tf*idf, frequently exhibit low mean squared errors and high $R^2$ scores. This indicates that there exist MLP neurons within the investigated layers that perform feature extractions similar to these MSLR features. Additionally, we observed that the $R^2$ scores obtained with mean and max activation aggregation are quite similar, with minor exceptions, suggesting either aggregation method is usable.

Negatives: Certain features failed to achieve an $R^2$ score greater than 0.1 in the final layer, often achieving a highly negative $R^2$. This indicates that the LLM does not consider these features important in their current form. Features from the MSLR set that fell into this category include: sum and max of tf*idf, sum and max of stream length normalized term frequency, max and variance of term frequency, stream length, and sum of term frequency. Thus, while the mean of stream length normalized term frequency is strongly represented in the LLM, the sum and max of normalized term frequency are not represented at all.
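The two thresholds above translate into a simple labelling rule over a feature's per-layer $R^2$ scores. The sketch below is an illustration only: the function name is hypothetical, and the "inconclusive" label for features meeting neither criterion is an assumption, since the text does not name that case.

```python
def categorize(layer_r2, strong=0.85, weak=0.1):
    """Label a feature from its per-layer R^2 scores (first to last layer)."""
    if max(layer_r2) > strong:   # strong fit in at least one layer
        return "positive"
    if layer_r2[-1] <= weak:     # no usable fit in the final layer
        return "negative"
    return "inconclusive"        # assumed label for the remaining cases


labels = {
    "feature_a": categorize([0.20, 0.60, 0.90]),
    "feature_b": categorize([0.30, 0.00, -1.50]),
    "feature_c": categorize([0.30, 0.50, 0.60]),
}
```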

Figure 3. Plot showing $R^2$ scores of statistical query-document distance metrics when used to probe RankLlama-7b. Scores indicate that these distance metrics are not encapsulated within RankLlama as is.

4.1.2. Traditional similarity metrics

We also probe the LLM architecture for non-semantic statistical query-document distance metrics like tf*idf cosine score, Euclidean score, Manhattan score, Kullback-Leibler divergence score and Jensen-Shannon divergence score. Our motive is to identify whether Llama includes components that mimic statistical similarity measures. Results for the probe on RankLlama-7b are visualized in Figure 3. We observe that statistical score-based methods do not correlate well with neuron activations. Of the features examined, the Euclidean score shows the highest correlation, with $R^2$ reaching 0.6 before dipping ahead of the final layer. Given its well-known success in the past, it is ironic that the tf*idf cosine score between the query and document shows the least correlation when probed for within RankLlama activations.

4.1.3. BERT and T5 scores

BERT and T5 are neural models that were widely used for retrieval and reranking tasks before the advent of LLMs. LLMs are much larger in size compared to BERT models. As a result, it is of interest to see if BERT and T5 subnetworks are present within RankLlama activations. We design probes to mine for the cosine distance between the BERT and T5 embeddings of the query and the document. Our observations (presented graphically in supplementary material) show that both BERT and T5 obtain moderate $R^2$ scores in our probes, reaching 0.7 and 0.82 on RankLlama 7b and 13b respectively. The BERT and T5 scores follow each other across the LLM layers. This indicates that the LLM potentially does not encode BERT and T5 subnetworks as-is. We compare our findings to related works on BERT probing in the Discussion (Section 6).

Figure 4. Graph showing $R^2$ scores of feature groups $(QTR+STF+VTFIDF)$, $(QTR+STF+VTFIDF)^2$ and $(QTR+STF+VTFIDF)^3$ over the layers of the Llama 7b and 13b architectures, where $QTR$ represents covered query term ratio, $STF$ represents mean stream length normalized term frequency and $VTFIDF$ represents variance of tf*idf.

4.2. Feature Groups within RankLlama

In the previous subsection, we found that several MSLR features like covered query term number, covered query term ratio, mean of stream length normalized term frequency, min of tf/tf*idf and variance of tf*idf appear to be modelled within LLM activations, especially within the later layers. These features also seem to be a fair choice to model relevance based on intuition. However, these features might not be present directly but in combination with one another or individually with different exponents. To test for this possibility, we probe for combinations of these features within different layers of the LLM, focusing on various combinations of covered query term ratio, mean of stream length normalized term frequency and variance of tf*idf. We find strong indication for their presence within the activations of the later layers of the LLM. For example, if QTR represents covered query term ratio, STF represents mean of stream length normalized term frequency and VTFIDF represents variance of tf*idf, we find high scores on average for all of QTR+STF, QTR+VTFIDF, STF+VTFIDF, QTR+STF+VTFIDF, QTR*STF, STF*VTFIDF, QTR*VTFIDF, and QTR*STF*VTFIDF when probing the last layer of the LLM. This potentially indicates that the LLM has, over the course of its layers, learned some representation of $(QTR+STF+VTFIDF)^k$. In Figure 4, we show the probing performance of a sum of those three features and their exponents, relative to BM25. We observe that the sum of this feature group, its square, as well as its cube consistently show a strong correlation to neuron activations within RankLlama.
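The probe targets for such feature groups can be generated mechanically from the individual feature values of a query-document pair. The sketch below enumerates the pairwise and three-way sums and products, plus powers of the full sum; the helper name and the toy values are hypothetical, and any normalization applied to the features before combining them is not specified here and is therefore omitted.

```python
from itertools import combinations
from math import prod


def group_targets(features: dict, max_power=3):
    """Build sum, product, and power-of-sum targets from named features."""
    targets = {}
    names = sorted(features)
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            values = [features[n] for n in combo]
            targets["+".join(combo)] = sum(values)
            targets["*".join(combo)] = prod(values)
    total = sum(features.values())
    full_sum = "+".join(names)
    for k in range(2, max_power + 1):
        targets[f"({full_sum})^{k}"] = total ** k
    return targets


# Toy feature values for a single query-document pair.
toy = {"QTR": 0.5, "STF": 0.25, "VTFIDF": 0.25}
targets = group_targets(toy)
```

Each resulting value is then used as the regression label in place of a single feature, with one probe fitted per target and per layer.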

4.3. Out-of-distribution queries and documents

RankLlama 7b and 13b were obtained by fine-tuning Llama-2 7b and 13b with LoRA on the MS MARCO dev set (Ma et al., 2023). It is of interest to see whether the fine-tuned LLM extracts different features from in-distribution and out-of-distribution query-document pairs. We therefore probe RankLlama with query-document pairs from two other datasets. First, we use the BEIR Scidocs dataset (Thakur et al., 2021), which comprises scientific documents and queries. We select 200 queries and use BM25 to retrieve the top 1000 documents for these queries. We then probe the MS MARCO fine-tuned RankLlama with the resulting BEIR activations. We repeat the process with the SoDup dataset (Lambert et al., 2023), in which each query is a Stack Overflow question and the candidates are other Stack Overflow questions that are potential duplicates. For each question, we select known duplicate questions from the dataset and treat them as relevant documents. We again pick 200 queries from this dataset for our out-of-distribution probing experiments. We probe for each of the 19 MSLR features on both RankLlama model sizes and compare probing results between the in-distribution and out-of-distribution datasets. We report our observations in Figure 5. For example, for mean of tf, our probes cannot find any activations representing the feature in RankLlama 7b, with either in-distribution or out-of-distribution data. However, when probing RankLlama 13b, the probes find this feature within in-distribution activations but not within out-of-distribution activations.

We see some variance in results between RankLlama 7b and 13b in this set of experiments. On RankLlama 7b, the probes fare similarly on most features between in-distribution and out-of-distribution data. The only exception is sum of stream length normalized tf, which has a negative $R^2$ score for both in-distribution and out-of-distribution data, making its magnitude insignificant. On RankLlama 13b, however, the probes yield different results between in-distribution and out-of-distribution data. In particular, they are strongly correlated with stream length, sum and mean of tf, and mean of tf*idf on the in-distribution data, but not on the out-of-distribution data. A likely reason is that RankLlama 13b has overfit to the MS MARCO dev set and hence relies on features like stream length, which are known not to generalize when mapping query-document relevance (Belinkov, 2022).
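The remark about negative scores can be illustrated with the standard definition of the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below is a minimal standard-library implementation with made-up numbers; the variable names and values are hypothetical and are not data from our experiments. It shows that a probe that predicts worse than the constant mean of the target gets a negative score, which only signals "no fit" rather than a meaningful strength of correlation.

```python
from statistics import fmean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    return 1.0 - ss_res / ss_tot


# Hypothetical feature values and probe outputs (illustrative only).
feature = [1.0, 2.0, 3.0, 4.0]
good_probe = [1.1, 1.9, 3.2, 3.8]   # tracks the feature closely
mean_probe = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean
bad_probe = [4.0, 1.0, 4.0, 1.0]    # worse than predicting the mean

r2_good = r2_score(feature, good_probe)  # close to 1
r2_mean = r2_score(feature, mean_probe)  # exactly 0
r2_bad = r2_score(feature, bad_probe)    # negative
```

Under this definition, any score at or below zero means the probe explains none of the variance of the feature, so the size of a negative value should not be compared across features or datasets.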

Figure 5. Probing RankLlama 7b and 13b with in-distribution vs out-of-distribution datasets. We observe that most MSLR features, when probed for in Llama 7b, show similar performance with both in-distribution and out-of-distribution datasets. This is, however, not the case with Llama 13b, where certain features like stream length and sum of term frequency show a strong presence in the in-distribution dataset probe, even though they are unlikely to influence a ranking decision. This suggests overfitting on the MS MARCO dataset in RankLlama 13b.

5. Validating Probes

While probing primarily serves to generate hypotheses, these can be independently validated using various methods. In this section, we aim to verify the findings from our probing experiments, focusing on neurons that showed strong correlations with probed features within the activations of the RankLlama model. We concentrate on the probing patterns in the last layer, as this layer is directly responsible for ranking decisions. To ensure the robustness of these findings, we employ feature attribution analysis using Shapley values (Sundararajan and Najmi, 2020). Shapley values allow us to quantitatively assess the contributions of different neurons and neuron groups to the model’s predictions, thus providing an indirect validation of the hypotheses generated through probing. Internally, the approach conducts an ablation study that measures each neuron’s individual contribution to the ranking decision and averages these contributions according to game-theoretic principles (Chowdhury et al., 2024).

We experiment on both RankLlama models and focus on the features found to have a strong correlation ($R^2 > 0.85$) within neurons of the final layer. We work with the same MS MARCO query-document pairs as used in the probing experiments and use the change in NDCG score on addition of a feature as its value function for feature attribution computation (Lundberg and Lee, 2017).
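As an illustration of the averaging that Shapley values perform, the sketch below computes exact Shapley values for three hypothetical neurons by enumerating all coalitions. The function `toy_ndcg`, the neuron names and all numbers are assumptions made for this example; they stand in for the NDCG-based value function and are not the attribution method or data used in our experiments. Exact enumeration is exponential in the number of neurons, so it is only feasible for tiny examples; with thousands of neurons per layer, an approximation is required in practice.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value_fn):
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value_fn(s | {p}) - value_fn(s))
        values[p] = total
    return values


def toy_ndcg(active):
    """Hypothetical NDCG when only the neurons in `active` are kept."""
    score = 0.5                      # baseline with all three ablated
    if "n1" in active:
        score += 0.3
    if "n2" in active:
        score += 0.1
    if "n1" in active and "n2" in active:
        score += 0.06                # interaction, split equally by Shapley
    return score                     # "n3" never changes the score


neurons = ["n1", "n2", "n3"]
phi = shapley_values(neurons, toy_ndcg)
total_gain = toy_ndcg(frozenset(neurons)) - toy_ndcg(frozenset())
```

In this toy setting the attributions sum to the total NDCG gain of activating all three neurons (the efficiency property), and the neuron that never affects the score receives an attribution of zero.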

We run our probing validation experiments for a hundred query-document pairs on each RankLlama model and compute Shapley values of each neuron in the last layer of the LLMs. We report average attributions across neuron groups containing 1 neuron, 2-3 neurons and 4+ neurons. Our results can be found in Table 1. We find that the average feature attribution of neurons in our identified neuron groups is within the 99th percentile in 79 out of the 100 inputs in RankLlama 7b and 84 out of the 100 inputs in RankLlama 13b. This indicates that the identified neurons within the LLM are instrumental in making the ranking decision, thus validating the location of our probes.

Table 1. Validating probe findings: Shapley values computed for neurons identified as important in the last layer of RankLlama. The table reports the percentage of inputs for which the identified neurons fall within the 99th percentile of feature attributions. Note that there are 4096 neurons in RankLlama 7b and 5120 in RankLlama 13b. The columns refer to averages for features found in 1 neuron, 2-3 neurons and 4+ neurons.
| Model | 1 neuron | 2-3 neurons | 4+ neurons |
|---|---|---|---|
| RankLlama 7b | 84 | 79 | 74 |
| RankLlama 13b | 87 | 82 | 78 |
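One way to compute a percentage of the kind reported in Table 1 is sketched below. It assumes a nearest-rank percentile and counts an input as a hit when the mean attribution of the identified neuron group is at or above the 99th-percentile attribution over all neurons in the layer. Both choices, along with the function names and the toy attributions, are assumptions made for illustration and may differ from the exact procedure behind the table.

```python
from math import ceil
from statistics import fmean


def percentile_threshold(values, percentile):
    """Nearest-rank percentile of a list of attribution scores."""
    ranked = sorted(values)
    rank = max(1, ceil(percentile * len(ranked) / 100))
    return ranked[rank - 1]


def hit_rate(attributions_per_input, neuron_group, percentile=99.0):
    """Percentage of inputs where the group's mean attribution is in the top tail."""
    hits = 0
    for attributions in attributions_per_input:
        group_mean = fmean([attributions[i] for i in neuron_group])
        if group_mean >= percentile_threshold(attributions, percentile):
            hits += 1
    return 100.0 * hits / len(attributions_per_input)


# Two hypothetical inputs over a toy layer of 200 neurons.
input_a = [float(i) for i in range(200)]   # neuron i has attribution i
input_b = list(reversed(input_a))          # neuron i has attribution 199 - i
group = [198, 199]                         # hypothetical identified neurons

rate = hit_rate([input_a, input_b], group)
```

Here the group is among the most highly attributed neurons for the first input but among the least for the second, so the sketch reports a hit rate of 50 percent.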

6. Discussion

Comparing Notes with Previous Probing Work: A number of studies have probed BERT and T5 with the aim of understanding whether they encode concepts like term frequency and inverse document frequency. Formal et al. (Formal et al., 2021) study the matching process of ColBERT (Khattab and Zaharia, 2020) and conclude that the model is able to capture a notion of term importance and relies on exact matches for important terms. This is in agreement with our findings, where we find strong correlations with term matching features like covered query term number/ratio and stream length normalized term frequency within LLM activations. In a later work, Formal et al. (Formal et al., 2022) measure the out-of-domain zero-shot capabilities of BERT/T5 models in lexical matching on the BEIR dataset, and show that these models fail to generalize lexical matching to out-of-domain datasets or to terms not seen at training time. This finding holds for RankLlama as well, where the model is unable to generalize lexical matching to terms not seen beforehand (Figure 5, RankLlama 13b out-of-distribution vs in-distribution probing discrepancies).

Different types of neurons: A monosemantic neuron is a neuron within a neural network that responds to a single, specific feature or concept in the input data (Bau et al., 2020). However, large neural networks often try to extract more features than they have neurons. In such scenarios, the model has to superpose several features on single neurons in order to compress the desired features into the limited number of neurons. This gives rise to polysemantic neurons, i.e., neurons encoding more than one feature. This phenomenon is usually witnessed in the early layers of the LLM architecture and is difficult to disentangle using linear probes. There is unfortunately no known method to distinguish monosemantic from polysemantic neurons via probing. Another common phenomenon in sparse LLMs is that multiple neurons within a layer together represent a particular human-known concept. Among the features probed in our study, 23% were found in a single neuron, 32% were distributed between 2-3 neurons and the other 45% were distributed across 4 or more neurons.

Limitations of Probing: While probing techniques are widely used to understand the internal workings of LLMs, there are several limitations of using probing for this purpose. (1) Probing techniques reveal correlations between specific features and neuron activations, but not a causal analysis of the decision making process. (2) The insights gained from probing are heavily dependent on the dataset used for probing. If the dataset is not representative of the model’s typical input, the results may not accurately reflect the model’s general behavior. (3) LLMs often rely on complex interactions between multiple features. Probing techniques may fail to capture these interactions, leading to an incomplete understanding of how the model makes decisions. (4) Probing can identify features but may mislead regarding their role or significance in the model’s decision-making process. However, many of the limitations of probing can be addressed by employing various techniques, such as creating balanced probing datasets and independently validating probing results through methods like ablation studies and feature attribution analyses. Ultimately, probing is an effective tool for generating hypotheses about the internal workings of LLMs, provided that these hypotheses are subsequently validated through independent methods.

7. Conclusion and Future Work

In this study, we probed the Llama-2 7b and 13b large language models, each fine-tuned for passage reranking, to look for a subset of features from the MSLR dataset. These human-engineered features, which include query-only, document-only, and query-document interaction features, are recognized as significant in ranking tasks. Using a layer-wise probe, we discovered that the activations in most layers could accurately replicate select features such as min tf*idf, covered query term number/ratio, min of term frequency, and mean of stream length normalized term frequency. This suggests that the fine-tuned Llama network deems these features important. Conversely, we found no correspondence for features like sum and max of tf*idf and sum and max of term frequency in the probed neurons. This indicates the absence of these features in RankLlama’s neural feature extractors. We also reported observations on feature groups that show a strong correlation with RankLlama neurons and hypothesized some abstract features mined by the LLM. Finally, we compared correlated features between in-distribution and out-of-distribution datasets and encountered an apparent case of the LLM overfitting during fine-tuning. These findings enhance our understanding of open-source ranking LLMs.

We used ridge-regression-based linear probes in this study. While linear probes are simple and interpretable, they are unable to disentangle distributed features that combine in a non-linear way. In the future, it would be interesting to use non-linear probes, such as decision trees or simple neural nets, to disambiguate non-linear concepts. The long-term objectives of this endeavor include: (i) identifying potential modifications to existing MSLR features, such that they align further with LLM activations, (ii) deciphering and reverse-engineering segments of the LLM that do not correspond to recognized MSLR features, and (iii) ultimately cataloging all features deemed significant by LLMs and plugging them back into simpler statistical models to enhance the performance and interpretability of statistical ranking models.

These directions will help further explain the underlying mechanisms of ranking LLMs and improve their interpretability.

References

  • Anand et al. (2022) Avishek Anand, Lijun Lyu, Maximilian Idahl, Yumeng Wang, Jonas Wallat, and Zijian Zhang. 2022. Explainable information retrieval: A survey. arXiv preprint arXiv:2211.02405 (2022).
  • Ancona et al. (2019) Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. 2019. Gradient-based attribution methods. Explainable AI: Interpreting, explaining and visualizing deep learning (2019), 169–191.
  • Bau et al. (2020) David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. 2020. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences 117, 48 (2020), 30071–30078.
  • Belinkov (2022) Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics 48, 1 (2022), 207–219.
  • Chen et al. (2024) Catherine Chen, Jack Merullo, and Carsten Eickhoff. 2024. Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1401–1410.
  • Choi et al. (2022) Jaekeol Choi, Euna Jung, Sungjun Lim, and Wonjong Rhee. 2022. Finding Inverse Document Frequency Information in BERT. arXiv preprint arXiv:2202.12191 (2022).
  • Chowdhury et al. (2023) Tanya Chowdhury, Razieh Rahimi, and James Allan. 2023. Rank-lime: local model-agnostic feature attribution for learning to rank. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval. 33–37.
  • Chowdhury et al. (2024) Tanya Chowdhury, Yair Zick, and James Allan. 2024. RankSHAP: a Gold Standard Feature Attribution Method for the Ranking Task. arXiv preprint arXiv:2405.01848 (2024).
  • Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341 (2019).
  • Csordás et al. (2020) Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. 2020. Are neural nets modular? inspecting functional modularity through differentiable weight masks. arXiv preprint arXiv:2010.02066 (2020).
  • De-Arteaga et al. (2019) Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In proceedings of the Conference on Fairness, Accountability, and Transparency. 120–128.
  • Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread 1 (2021), 1.
  • Fan et al. (2021) Yixing Fan, Jiafeng Guo, Xinyu Ma, Ruqing Zhang, Yanyan Lan, and Xueqi Cheng. 2021. A linguistic study on relevance modeling in information retrieval. In Proceedings of the Web Conference 2021. 1053–1064.
  • Fernando et al. (2019) Zeon Trevor Fernando, Jaspreet Singh, and Avishek Anand. 2019. A study on the Interpretability of Neural Retrieval Models using DeepSHAP. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. 1005–1008.
  • Fiacco et al. (2019) James Fiacco, Samridhi Choudhary, and Carolyn Rose. 2019. Deep neural model inspection and comparison via functional neuron pathways. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5754–5764.
  • Formal et al. (2021) Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. A white box analysis of ColBERT. In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43. Springer, 257–263.
  • Formal et al. (2022) Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2022. Match your words! a study of lexical matching in neural information retrieval. In European Conference on Information Retrieval. Springer, 120–127.
  • Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680 (2022).
  • Geva et al. (2020) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2020. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913 (2020).
  • Gurnee et al. (2023) Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610 (2023).
  • Han and Lei (2018) Xinzhi Han and Sen Lei. 2018. Feature selection and model comparison on microsoft learning-to-rank data sets. arXiv preprint arXiv:1803.05127 (2018).
  • Hochma et al. (2024) Yael Hochma, Yuval Felendler, and Mark Last. 2024. Efficient Feature Ranking and Selection using Statistical Moments. IEEE Access (2024).
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48.
  • Lambert et al. (2023) Nathan Lambert, Lewis Tunstall, Nazneen Rajani, and Tristan Thrush. 2023. HuggingFace H4 Stack Exchange Preference Dataset. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
  • Li et al. (2021) Belinda Z Li, Maxwell Nye, and Jacob Andreas. 2021. Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737 (2021).
  • Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017).
  • Ma et al. (2023) Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319 (2023).
  • MacAvaney et al. (2022) Sean MacAvaney, Sergey Feldman, Nazli Goharian, Doug Downey, and Arman Cohan. 2022. ABNIRML: Analyzing the behavior of neural IR models. Transactions of the Association for Computational Linguistics 10 (2022), 224–239.
  • Meister et al. (2021) Clara Meister, Stefan Lazov, Isabelle Augenstein, and Ryan Cotterell. 2021. Is sparse attention more interpretable? arXiv preprint arXiv:2106.01087 (2021).
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. (2016).
  • Qin and Liu (2013) Tao Qin and Tie-Yan Liu. 2013. Introducing LETOR 4.0 Datasets. CoRR abs/1306.2597 (2013). http://arxiv.org/abs/1306.2597
  • Räuker et al. (2023) Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. 2023. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 464–483.
  • Ravichander et al. (2020) Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. 2020. Probing the probing paradigm: Does probing accuracy entail task relevance? arXiv preprint arXiv:2005.00719 (2020).
  • Sajjad et al. (2022) Hassan Sajjad, Nadir Durrani, and Fahim Dalvi. 2022. Neuron-level interpretation of deep nlp models: A survey. Transactions of the Association for Computational Linguistics 10 (2022), 1285–1303.
  • Sundararajan and Najmi (2020) Mukund Sundararajan and Amir Najmi. 2020. The many Shapley values for model explanation. In International conference on machine learning. PMLR, 9269–9278.
  • Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021).
  • Wortsman et al. (2020) Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, and Ali Farhadi. 2020. Supermasks in superposition. Advances in Neural Information Processing Systems 33 (2020), 15173–15184.
  • Zhan et al. (2020) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. An analysis of BERT in document ranking. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 1941–1944.
  • Zhou et al. (2014) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856 (2014).
  • Zhou et al. (2018) Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. 2018. Revisiting the importance of individual units in cnns via ablation. arXiv preprint arXiv:1806.02891 (2018).