On the Noise Robustness of In-Context Learning
for Text Generation

Hongfu Gao1,2, Feipeng Zhang2, Wenyu Jiang1,3, Jun Shu4, Feng Zheng5, Hongxin Wei1
1Department of Statistics and Data Science, Southern University of Science and Technology
2School of Economics and Finance, Xi’an Jiaotong University
3National Key Laboratory for Novel Software Technology, Nanjing University
4School of Mathematics and Statistics, Xi’an Jiaotong University
5Department of Computer Science and Engineering, Southern University of Science and Technology
Work done while working at SUSTech as a visiting scholar. Corresponding author ([email protected])
Abstract

Large language models (LLMs) have shown impressive performance on downstream tasks by in-context learning (ICL), which heavily relies on the quality of demonstrations selected from a large set of annotated examples. Recent works claim that in-context learning is robust to noisy demonstrations in text classification. In this work, we show that, on text generation tasks, noisy annotations significantly hurt the performance of in-context learning. To circumvent the issue, we propose a simple and effective approach called Local Perplexity Ranking (LPR), which replaces the “noisy” candidates with their nearest neighbors that are more likely to be clean. Our method is motivated by analyzing the perplexity deviation caused by noisy labels and decomposing perplexity into inherent perplexity and matching perplexity. Our key idea behind LPR is thus to decouple the matching perplexity by performing the ranking among the neighbors in semantic space. Our approach can prevent the selected demonstrations from including mismatched input-label pairs while preserving the effectiveness of the original selection methods. Extensive experiments demonstrate the effectiveness of LPR, improving the EM score by up to 18.75 on common benchmarks with noisy annotations. Our code is available at https://github.com/ml-stat-Sustech/Local-Perplexity-Ranking

1 Introduction

Large language models (LLMs) have shown remarkable performance on downstream tasks by in-context learning (ICL) with only a few task demonstrations [7, 10]. Without requiring explicit parameter updates, in-context learning consistently outperforms zero-shot inference on various tasks (e.g., classification and generation), making it a compelling alternative to supervised fine-tuning [13, 16]. In particular, the success of ICL heavily relies on the quality of demonstrations selected from a large set of annotated examples [21, 29, 51, 60]. For those candidates, input-label mappings solicited from humans [61, 73] or LLMs [58] can often be noisy, especially in complex tasks. This gives rise to the importance of noise-robust ICL, which aims to construct effective demonstrations in the presence of noisy and erroneous labels.

Previous works show that in-context learning on classification tasks is fairly robust to label noise in the in-context demonstrations [9, 12, 32, 37, 54, 55]. However, it remains unclear how noisy labels affect the performance of ICL on text generation tasks. In this work, we present the first study on in-context learning with a noisy annotated dataset for generation. Surprisingly, we empirically find that label noise in the demonstrations significantly degrades ICL’s performance on generation tasks, which is different from previous results on classification. Moreover, increasing the number of selected demonstrations with a fixed noise rate or utilizing more effective selection methods (e.g., TopK [28] and DPP [62]) will intensify the negative effect of noisy labels. This motivates our method, which can universally improve the noise robustness of existing selection methods for in-context learning.

In this paper, we show that the issue of noisy annotations can be mitigated through the perplexity ranking of noisy candidates (i.e., input-label pairs) during selection. Our method, Local Perplexity Ranking (dubbed LPR), is motivated by our analysis of the perplexity deviation caused by noisy labels (i.e., incorrect answers). We find that wrong answers generally result in a higher perplexity of large language models compared to correct ones, in response to the same question. To explain this phenomenon, we decompose the perplexity into two components: inherent perplexity, which measures the task complexity of the question and the correct answer, and matching perplexity, which assesses the perplexity deviation caused by noisy outputs.

Therefore, our key idea behind Local Perplexity Ranking is to decouple the matching perplexity by performing the ranking among the neighbors in semantic space. This can be achieved by ranking candidates’ perplexity alongside their nearest neighborhoods, which usually have similar levels of inherent perplexity. In particular, we replace each low-rank candidate selected by existing methods (e.g., random, TopK, and DPP) with its nearest neighbor that is highly ranked. In effect, our LPR strategy can prevent the selected demonstrations from containing mismatched input-label pairs while preserving the effectiveness of the original selection methods. In this way, we ensure the correctness and relevancy of demonstrations, thereby improving the noise-tolerant ability of in-context learning.

To verify the effectiveness of our method, we conduct extensive evaluations on six text generation datasets: NQ [22], WebQ [5], SQuAD [46], SCIQ [56], GeoQuery [39], and NL2Bash [27]. The results demonstrate that local perplexity ranking can largely improve the noise robustness of all existing selection methods under both irrelevant and relevant noise. For example, on SCIQ with $60\%$ irrelevant label noise, LPR improves the exact match score of the TopK method from 29.31 to 48.06, a direct improvement of $\bm{18.75}$. Moreover, our method can be easily adopted in practice. The performance of LPR is insensitive to the hyperparameters, including the threshold $\gamma$ and the number of local neighbors $k$. This approach can effectively generalize to various LLMs to improve their noise robustness with in-context learning.

Our contributions are summarized as follows:

  • We present the first study to show that annotation quality is crucial for in-context learning in text generation, where noisy annotations significantly hurt the performance. Neither increasing the number of demonstrations nor switching to another selection method bridges the gap.

  • We propose Local Perplexity Ranking (LPR), a simple and effective method to enhance the noise robustness of in-context learning. The key idea is to decouple the matching perplexity by performing the ranking among the neighbors of each candidate in semantic space.

  • We empirically show that LPR can improve the noise robustness of existing demonstration selection methods in ICL across various types of label noise. In addition to text generation, we also validate the effectiveness of our method in text classification tasks.

2 Preliminary

2.1 In-context learning for generation

We consider in-context learning (ICL) of large language models (LLMs) in generation tasks, where we aim to generate text outputs $\bm{y}=(y_{1},...,y_{|\bm{y}|})$ (i.e., token sequences) conditioned on the inputs $\bm{x}=(x_{1},...,x_{|\bm{x}|})$ and the context $\bm{C}_{K}$. In particular, the context $\bm{C}_{K}=\{(\bm{x}_{i},\bm{y}_{i})\}^{K}_{i=1}$ contains $K$ task demonstrations (e.g., input-output pairs), selected from a large annotated dataset with $N$ examples $\mathcal{D}=\{(\bm{x}_{j},\bm{y}_{j})\}_{j=1}^{N}$. Given a new test input text $\bm{x}_{test}$, we generate the output $\bm{y}_{test}$ via large language models as

$$\bm{y}_{test}\sim\mathcal{P}_{LLM}(\bm{y}_{test}\mid\{(\bm{x}_{i},\bm{y}_{i})\}^{K}_{i=1},\bm{x}_{test}), \tag{1}$$

where $\sim$ refers to decoding strategies (e.g., greedy decoding and nucleus sampling [17, 62]). Generation with the ICL procedure is especially attractive as it does not require updating the parameters of large language models, which is often expensive and impractical.
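As a rough illustration of Equation 1, the sketch below assembles the text that a language model would be conditioned on. The function name `build_icl_prompt`, the `Output:` marker (borrowed from the layout shown in Table 1), and the blank-line separator are assumptions made for illustration rather than the exact template used in our experiments; the decoding step itself is omitted.

```python
def build_icl_prompt(demonstrations, test_input, sep="\n\n"):
    """Concatenate K (input, output) demonstrations followed by the test input.

    The "Output:" marker and the separator are illustrative assumptions.
    """
    blocks = [f"{x} Output: {y}" for x, y in demonstrations]
    # The test input ends with an open "Output:" slot for the model to fill.
    blocks.append(f"{test_input} Output:")
    return sep.join(blocks)


demos = [
    ("Question: What is frozen water called?", "Ice"),
    ("Question: What planet do we live on?", "Earth"),
]
prompt = build_icl_prompt(demos, "Question: What gas do plants absorb?")
```

The resulting string would then be passed to the language model together with a decoding strategy of choice.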

Existing studies show that the selection strategy of demonstration plays a crucial role in the ICL performance [25, 31, 43, 44, 47]. A naive method is to randomly sample the demonstrations from annotated examples without repetition [36]. To introduce the relevancy, TopK [28] proposes to select the closest examples to the test input in the embedding space

$$\bm{C}_{K}=\mathcal{R}_{K}(\bm{x}_{test})=\operatorname{TopK}_{\bm{x}}(s(\bm{x}_{test},\bm{x})),$$

where $\mathcal{R}$ is a retriever and $s(\bm{x}_{test},\bm{x})$ denotes the cosine similarity score between $\bm{x}_{test}$ and examples $\bm{x}$ from the annotated dataset. We use $\operatorname{TopK}$ to denote the top $K$ examples ranked by the score.
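The retrieval step can be sketched in a few lines, assuming the inputs have already been embedded by some sentence encoder. The helper names `cosine` and `top_k_indices` and the toy two-dimensional embeddings are hypothetical and serve only to make the ranking concrete.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_indices(test_emb, pool_embs, k):
    """Return indices of the k pool embeddings most similar to test_emb."""
    ranked = sorted(
        range(len(pool_embs)),
        key=lambda j: cosine(test_emb, pool_embs[j]),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]


# Toy 2-d embeddings standing in for sentence-encoder outputs (assumption).
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected = top_k_indices([1.0, 0.1], pool, k=2)
```

Note that this selection looks only at the inputs, so a retrieved example is used regardless of whether its annotation is correct.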

These selection strategies focus on the inputs of demonstrations, assuming that all examples are labeled correctly in the large dataset [28, 36, 62]. However, collecting a large-scale dataset with perfectly correct labels is challenging and expensive, especially for generation tasks [2, 64]. In practice, researchers often use crowdsourcing [61, 73] or large language models (LLMs) [58] such as GPT-4 [38] to create input-output pairs for new tasks, which inevitably leads to some mistakes in the annotations. This motivates us to analyze the issue of label quality in ICL for generation tasks.

2.2 Setting of noisy annotations

Given a large-scale dataset with noisy annotations $\tilde{\mathcal{D}}=\{(\bm{x}_{j},\tilde{\bm{y}}_{j})\}_{j=1}^{N}$, the selected demonstrations might contain mismatched input-output pairs $(\bm{x},\tilde{\bm{y}})$, i.e., the output $\tilde{\bm{y}}$ might not be a correct answer to the input $\bm{x}$. Conditioned on the noisy demonstrations, the output is generated via ICL as

$$\bm{y}_{test}\sim\mathcal{P}_{LLM}(\bm{y}_{test}\mid\{(\bm{x}_{i},\tilde{\bm{y}}_{i})\}^{K}_{i=1},\bm{x}_{test}). \tag{2}$$

In the real world, noisy annotations may arise from unintentional mistakes or limited knowledge, resulting in various types of noise in the demonstrations. In this work, we define two categories of noisy annotations based on the input-output relevance, as follows:

Irrelevant noise assumes that the generation of noisy annotations is conditionally independent of inputs. For example, crowdsource workers may make mistakes accidentally, introducing random words or sentences in annotations. This can be simulated by reconstructing the output with random words from a subset that does not contain tokens presented in the original input-output pairs.

Relevant noise is a more realistic setting where the corrupted output is relevant to the inputs despite its incorrectness. This type of corruption may occur due to the limited knowledge of annotators and LLMs. We simulate the relevant noise by generating related yet incorrect outputs using ChatGPT-4.
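To make the irrelevant-noise setting concrete, the following sketch corrupts a fixed fraction of outputs with random words that do not appear in the original input-output pair. It is a simplified illustration under several assumptions: whitespace-level words instead of model tokens, a small hypothetical vocabulary `vocab`, and a replacement of the same word count as the original output. The name `inject_irrelevant_noise` is introduced here for illustration only.

```python
import random


def inject_irrelevant_noise(dataset, vocab, noise_rate, seed=0):
    """Replace a fraction of outputs with random words unrelated to the pair.

    Assumes vocab always contains at least one word absent from each pair.
    """
    rng = random.Random(seed)
    n_noisy = round(noise_rate * len(dataset))
    noisy_ids = set(rng.sample(range(len(dataset)), n_noisy))
    corrupted = []
    for j, (x, y) in enumerate(dataset):
        if j in noisy_ids:
            # Exclude every word that occurs in the original input or output.
            seen = set(x.lower().split()) | set(y.lower().split())
            allowed = [w for w in vocab if w.lower() not in seen]
            length = max(1, len(y.split()))
            y = " ".join(rng.choice(allowed) for _ in range(length))
        corrupted.append((x, y))
    return corrupted, noisy_ids


data = [
    ("What is the smallest unit of the organ?", "Cells"),
    ("What planet do we live on?", "Earth"),
    ("What gas do plants absorb?", "Carbon dioxide"),
    ("What is frozen water called?", "Ice"),
    ("What star is closest to us?", "The Sun"),
]
vocab = ["river", "violin", "granite", "lantern", "Cells", "Earth"]
noisy_data, noisy_ids = inject_irrelevant_noise(data, vocab, noise_rate=0.4)
```

Relevant noise is not covered by this sketch, since it requires a model to produce related yet incorrect outputs.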

Table 1: An illustration of the effect of three different types of annotated dataset for in-context learning. The middle column is in-context demonstrations, and the last column is the Llama2-7B [49] model prediction. The model tends to learn the label of the demonstration.
Test Input: Support: All forms of life are built of at least one cell. A cell is the basic unit of the structure and function of living things. Question: What are the smallest structural and functional units of all living organisms? Output:

| Setting | In-Context Demonstration | Prediction |
| --- | --- | --- |
| Clean | Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: Cells | Cells |
| Irrelevant | Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: Earth | Earth |
| Relevant | Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: tissues | tissues |

In Table 1, we present an ICL example of question answering (QA) tasks to illustrate the difference between the two noisy settings. In this example, the clean annotation for the test input is “Cells”. For noisy annotations, the irrelevant noise is randomly sampled as “Earth”, while the relevant noise “tissues” appears in the support passage of the in-context demonstration. We proceed by analyzing the empirical effects of noisy annotations in generation tasks.

3 Empirical study of noisy ICL in text generation

In this section, we investigate the impact of noisy annotations on the performance of in-context learning for text generation. In particular, we conduct experiments on three types of generation tasks: question answering (NQ [22], WebQ [5]), reading comprehension (SQuAD [46], SCIQ [56]), and code generation (GeoQuery [39], NL2Bash [27]). To simulate the noise, we generate noisy annotations with a pre-defined probability (e.g., 20%, 40%, 60%) in the annotated datasets. We use the output of an input from a different generation task as irrelevant noise, and adopt ChatGPT-4 to generate relevant yet false outputs as relevant noise. Furthermore, we compare the performance of noisy ICL with demonstrations across various set sizes (e.g., 2, 4, 8) and selection methods, including Random [36], TopK [28] and DPP [62]. Following previous work [16, 28, 62], we report the average Exact Match (EM) score with Llama2-7B [49].

ICL is not robust to noisy annotations in text generation.

Figure 1 presents the empirical results of ICL methods with noisy annotations. The results show that both types of noise significantly deteriorate the performance of in-context learning on text generation tasks, which differs from the observations of ICL on classification tasks [9, 12, 32, 37, 54, 55]. In particular, a higher noise rate in annotated datasets leads to poorer performance of in-context learning. Moreover, irrelevant noise has a more negative influence than relevant noise, possibly because relevant outputs may still benefit inference through task recognition [40].

The impact of demonstration selection.

To provide a deeper understanding of noisy annotations, we analyze the performance of noisy ICL across different demonstration settings, including the set size (i.e., $K$) and selection methods. Results in Figure 1 show that, under the noisy settings, selecting a larger set of demonstrations does not enhance, and may even worsen, the performance of text generation. For example, the ICL performance with $K=8$ is generally lower than that with $K=2$, which is inconsistent with the clean setting. In addition, the advantages of the more powerful selection methods (i.e., TopK and DPP) are neutralized in the presence of noisy annotations.

Through the empirical analysis, we find that noisy annotations significantly hurt the performance of ICL in text generation tasks. More importantly, neither increasing the number of demonstrations nor switching to an existing selection method such as DPP bridges the gap. This motivates us to design noise-robust methods, which can universally improve the noise robustness of in-context learning.

Figure 1: Average ICL performance with noisy annotations in various generation tasks across different demonstration settings. Both types of noise significantly deteriorate the performance of in-context learning on text generation tasks. The black line denotes zero-shot performance.

4 Methodology

In this section, we first analyze the perplexity deviation caused by noisy annotations and introduce the disentanglement of perplexity to explain the phenomenon. In light of this, we propose a novel method – local perplexity ranking – to improve the noise robustness of in-context learning for text generation. Our method can be easily incorporated into existing methods of demonstration selection.

4.1 Perplexity deviation of noisy annotations

Figure 2: The distribution of perplexity of Llama2-7B [49] on clean and noisy annotations for (a) NQ, (b) WebQ, (c) SQuAD, and (d) SCIQ. Examples with noisy annotations indeed obtain higher perplexity than those with clean annotations.

For language models, perplexity measures the degree of uncertainty in generating new tokens. In particular, a low perplexity indicates that the model makes the prediction with high confidence. Therefore, perplexity is commonly used to evaluate the language quality of generated content, e.g., detecting attack prompts [3], out-of-distribution instances [4, 57], hard-to-learn instances [13], and corrupted instances [64]. In light of this, we conjecture that mismatched input-output pairs may result in higher perplexity of LLMs due to their low co-occurrence rate. For instance, in the example presented in Table 1, the term “earth” rarely co-occurs with “cells” and “organ”, so LLMs are more likely to exhibit high perplexity in the input-output pair.

Empirical study

To validate this assumption, we compare the perplexity of clean and noisy annotations in text generation tasks. Specifically, we concatenate each tokenized input-output pair $(\bm{x},\bm{y})$, and obtain the corresponding tokenized sequence $\bm{z}=(z_{1},...,z_{|\bm{z}|})=(x_{1},...,x_{|\bm{x}|},y_{1},...,y_{|\bm{y}|})$, where $|\bm{z}|=|\bm{x}|+|\bm{y}|$. Now, the perplexity of $\bm{z}$ is calculated as:

$$\operatorname{Perplexity}(\bm{z})=\exp\left\{-\frac{1}{|\bm{z}|}\sum^{|\bm{z}|}_{i=1}\log p_{\theta}(z_{i}|z_{<i})\right\}, \tag{3}$$

where $\log p_{\theta}(z_{i}|z_{<i})$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $z_{<i}$, from the given language model parameterized by $\theta$.
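Equation 3 can be evaluated directly once per-token log-likelihoods are available. The sketch below assumes these values have already been obtained from a language model; the function name `perplexity` and the toy probabilities are illustrative and not taken from our experiments.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the negative mean token log-likelihood (Equation 3)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy sequences: identical input tokens, but the last (output) token is
# assigned a much lower probability in the mismatched pair (assumed values).
clean = [math.log(0.5), math.log(0.5), math.log(0.8)]
noisy = [math.log(0.5), math.log(0.5), math.log(0.01)]
ppl_clean = perplexity(clean)
ppl_noisy = perplexity(noisy)
```

Because the average is taken over all $|\bm{z}|$ tokens, a single unlikely output token shifts the perplexity less when the input is long, which is one reason the raw score alone may separate clean and noisy pairs only weakly.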

In Figure 2, we present the perplexity distribution of Llama2-7B [49] on clean and noisy annotations of four datasets. The results illustrate that examples with noisy annotations indeed obtain higher perplexity than those with clean annotations, which confirms our assumption. In particular, relevant noises achieve slightly lower perplexity than irrelevant noises since relevant outputs are close to the inputs despite their erroneous information. However, the deviation of the perplexity distribution caused by noisy annotations is marginal, making it suboptimal to differentiate noisy annotations from clean ones. In the following, we explain this phenomenon with the disentanglement of perplexity.

Disentanglement of perplexity

Given an input-output pair, the perplexity of large language models (LLMs) stems not only from how well the output matches the input, but also from the inherent complexity of the task. For example, a mathematical question with a correct answer can exhibit a higher perplexity than a question of daily life with an incorrect answer. Informally, we decompose the overall perplexity into two components (this disentanglement is conceptual rather than mathematical), as shown below:

$$\operatorname{Perplexity}=\operatorname{Inherent}\;\operatorname{Perplexity}+\operatorname{Matching}\;\operatorname{Perplexity}$$

Here, the inherent perplexity measures how familiar the model is with the task (i.e., the input and the correct output). The matching perplexity quantifies the perplexity deviation caused by noisy outputs, so it can be zero with correct outputs. A higher matching perplexity indicates that the output is more likely to be incorrect for the input. However, directly computing the matching perplexity is non-trivial as clean outputs are unknown. To circumvent the issue, we aim to design an effective method to decouple the matching perplexity from the overall perplexity.

4.2 Local Perplexity Ranking

Intuition

Motivated by the previous analysis, we propose local perplexity ranking (LPR), a general strategy that can improve the noise robustness of in-context learning. Our key idea is to decouple the matching perplexity by performing the ranking among the neighbors in semantic space. Here, our approach is built on two assumptions that are naturally satisfied in the real world:

  1. The clean annotations are the majority in the annotated dataset.

  2. Examples that are semantically similar share the same level of inherent perplexity.

In the literature, Assumption 2 is also supported by previous findings that paragraphs whose representations are close to each other share the same intrinsic task [14, 28, 73]. With the two assumptions, we can approximate the inherent perplexity of a candidate through its neighbors, where most examples are correctly annotated. In other words, the candidate is more likely to be wrongly annotated if its perplexity is relatively higher than its neighbors, and vice versa. With this in mind, we present the details of our approach in the following.

Finding the local neighbors

Given a test input, we first sample a candidate set $\widetilde{\bm{C}}$ with a pre-defined selection strategy, such as Random [36], TopK [28] or DPP [62]. For each candidate $\bm{z}^{*}$, we adopt $k$-Nearest-Neighbors ($k$-NN) to find its local neighbors that are close to the candidate in token space. Formally, the $k$ local neighbors are obtained as: $N_{k}(\bm{z}^{*})=\{\bm{z}_{\pi(1)},\bm{z}_{\pi(2)},...,\bm{z}_{\pi(k)}\}$, where $\pi(i)$ is the index of the example with the $i$-th smallest distance to the candidate. In particular, we use the cosine similarity score to measure the distance between the candidate $\bm{z}^{*}$ and other examples $\bm{z}$:

$$\cos(\bm{z}_{i},\bm{z}^{*})=\frac{\bm{z}_{i}^{\top}\bm{z}^{*}}{||\bm{z}_{i}||_{2}||\bm{z}^{*}||_{2}}.$$

Ranking the perplexity

As discussed above, the local neighbors share the same level of inherent perplexity, which enables the comparison of their matching perplexity. For each candidate $\bm{z}^{*}$, we propose to rank the perplexity of examples in the cluster of local neighbors $\bm{z}^{*}\cup N_{k}(\bm{z}^{*})$. Formally, we first sort all examples in the cluster in increasing order by the perplexity and obtain the original indices for the sorted scores as:

$$\mathcal{I}=\operatorname{argsort}\{\operatorname{Perplexity}(\bm{z}_{n})\}_{n=1}^{k+1},\quad\bm{z}_{n}\in(\bm{z}^{*}\cup N_{k}(\bm{z}^{*})), \tag{4}$$

where $\operatorname{Perplexity}(\cdot)$ is the overall perplexity defined in Equation 3. In this way, the high-ranking examples (i.e., those with lower perplexity) are more likely to be correctly annotated than the low-ranking examples in the sorted list $\mathcal{I}$.
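The ranking in Equation 4 amounts to an argsort over the $k+1$ perplexities in the cluster. The following sketch illustrates it on made-up perplexity values; the helper `rank_positions` is hypothetical, and the 1-based position it returns is one possible reading of how an example's location in the sorted list is counted.

```python
def rank_positions(cluster_ppl):
    """Sort cluster members by perplexity (ascending).

    Returns the argsort order and, for each member, its 1-based position
    in that order (1-based counting is an assumption of this sketch).
    """
    order = sorted(range(len(cluster_ppl)), key=lambda n: cluster_ppl[n])
    positions = [0] * len(cluster_ppl)
    for pos, n in enumerate(order, start=1):
        positions[n] = pos
    return order, positions


# Index 0 is the candidate; indices 1..4 are its k = 4 neighbors (toy values).
order, positions = rank_positions([9.2, 4.1, 5.0, 4.6, 6.3])
```

In this toy cluster the candidate has the highest perplexity among its neighbors, which makes it the most suspicious member.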

Substituting the noisy candidates

To build the final demonstration set, we propose to replace the noisy candidates with their nearest neighbors that are more likely to be clean. In particular, we can determine whether a candidate should be replaced by:

$$g(\bm{z}_{n})=\mathbbm{1}\left(\frac{\operatorname{Loc}(\bm{z}_{n},\mathcal{I})}{k+1}\geq\gamma\right), \tag{5}$$

where $\gamma$ is the pre-defined threshold (e.g., $50\%$), $\mathbbm{1}(\cdot)$ is the indicator function, and $\operatorname{Loc}(\bm{z}_{n},\mathcal{I})$ returns the index of $\bm{z}_{n}$ in the sorted list $\mathcal{I}$. It is worth noting that the proposed method is not sensitive to the value of the hyperparameter $\gamma$, as shown in Subsection 5.1. Then, for those candidates with $g(\bm{z}_{n})=1$, we pick the substitutes from their neighbors by:

$$\min\{i\in N^{k}\mid g(\bm{z}_{\pi(i)})=0\},$$

where $\pi(i)$ is the index of the example with the $i$-th smallest distance to the candidate. After the replacement, we establish the final demonstration set for in-context learning. Notably, our method offers several compelling advantages:

  • Algorithm-agnostic: LPR can be easily incorporated into existing demonstration selection methods, consistently improving the robustness against noisy annotations.

  • Easy to use: LPR does not require heavy hyperparameter tuning, as it is insensitive to the threshold value (see Figure 3). LPR does not introduce much computational cost due to the efficient computation of perplexity (see Table 4).
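The substitution rule of Equation 5 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the released implementation: positions in the sorted list are taken to be 1-based, each neighbor's indicator is evaluated in the same local list as the candidate, and the candidate is kept when no neighbor qualifies. The names `is_flagged` and `lpr_substitute` are hypothetical.

```python
def is_flagged(position, k, gamma):
    """Equation 5, assuming a 1-based position in the perplexity-sorted local list."""
    return position / (k + 1) >= gamma


def lpr_substitute(candidate, neighbors, ppl, gamma=0.5):
    """Return the demonstration to use in place of `candidate`.

    candidate: id of the demonstration picked by the base selection method
    neighbors: ids of its k nearest neighbors, closest first
    ppl:       dict mapping each id to the perplexity of its input-label pair
    """
    k = len(neighbors)
    # Sort the candidate and its neighbors by increasing perplexity.
    ranked = sorted([candidate, *neighbors], key=lambda ex: ppl[ex])
    position = {ex: i + 1 for i, ex in enumerate(ranked)}
    if not is_flagged(position[candidate], k, gamma):
        return candidate  # the candidate ranks high enough, keep it
    # Otherwise take the closest neighbor that is not flagged itself.
    for nb in neighbors:
        if not is_flagged(position[nb], k, gamma):
            return nb
    return candidate  # assumed fallback when every neighbor is flagged
```

With the defaults used in the experiments ($k=4$, $\gamma=50\%$), a flagged candidate sits in position 3 or lower, so under these assumptions at least two unflagged neighbors are always available as substitutes.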

5 Experiments

5.1 Experimental Setup

Datasets. We employ 6 generation datasets for the evaluations, including Open-Domain Question-Answering: NQ [22], WebQ [5]; Reading Comprehension: SQuAD [46] and SCIQ [56]; Code Generation: GeoQuery [39] and NL2Bash [27]. Due to limited space, these tasks’ input/output, statistics, split and evaluation metrics are reported in Appendix A.2.

Models and ICL methods. For the main results, we use Llama-2-7B-Chat [49] as the LLM throughout our experiments. We also provide experiments on other models including Llama2-13B-Chat [49], Mistral-7B [19] and OPT-6.7B [66]. We use the bert-base-uncased sentence encoder as the similarity tokenizer [11, 62]. We conduct experiments with existing demonstration selection methods, including Random [36], TopK [28] and DPP [62]. For hyperparameters, we set the number of neighbors $k=4$ and the threshold $\gamma=50\%$ by default. The details of our implementation are presented in Appendix A.2.
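As an illustration of how the $k$ local neighbors might be retrieved, the sketch below ranks examples by similarity in an embedding space. It assumes that sentence embeddings have already been computed (for example with a BERT-style encoder) and that cosine similarity is the distance measure; the text above does not specify the metric, so both the metric and the function names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_nearest_neighbors(query_id, embeddings, k=4):
    """Return the ids of the k examples most similar to `query_id`, closest first."""
    query = embeddings[query_id]
    others = [ex for ex in embeddings if ex != query_id]
    others.sort(key=lambda ex: cosine_similarity(query, embeddings[ex]), reverse=True)
    return others[:k]
```

For a large annotated pool, an approximate nearest-neighbor index would likely replace this exhaustive sort.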

Table 2: Main results on various datasets. The bold indicates the improvement by integrating LPR.
| Dataset | Method | Clean 0% | Irrelevant 20% | Irrelevant 40% | Irrelevant 60% | Relevant 20% | Relevant 40% | Relevant 60% |
|---|---|---|---|---|---|---|---|---|
| NQ | Random | 14.51±0.51 | 10.97±0.29 | 7.37±0.45 | 4.23±0.46 | 12.00±0.65 | 9.67±0.45 | 6.40±1.02 |
| | +Ours | 15.05±0.10 | 13.31±0.25 | 11.51±0.51 | 8.87±0.74 | 13.74±0.12 | 13.28±0.33 | 9.43±0.52 |
| | TopK | 20.25±0.10 | 13.95±1.14 | 9.97±1.13 | 5.90±1.08 | 16.21±0.22 | 12.22±0.22 | 8.50±0.28 |
| | +Ours | 19.19±0.19 | 17.15±0.50 | 13.54±0.41 | 9.64±0.25 | 17.25±0.69 | 14.82±0.51 | 11.98±0.60 |
| | DPP | 20.35±0.76 | 14.69±0.94 | 9.87±0.49 | 5.97±0.48 | 15.47±1.00 | 11.28±0.42 | 7.89±0.25 |
| | +Ours | 19.68±0.33 | 16.59±0.45 | 13.31±0.57 | 11.18±0.50 | 16.79±0.47 | 14.91±0.18 | 11.94±0.91 |
| WebQ | Random | 20.37±0.64 | 15.18±1.06 | 10.39±0.83 | 4.83±0.17 | 18.29±0.43 | 15.92±0.68 | 13.50±0.17 |
| | +Ours | 21.94±0.64 | 20.32±0.92 | 16.33±0.58 | 12.54±0.29 | 21.51±0.33 | 19.33±0.41 | 16.69±1.11 |
| | TopK | 30.16±0.58 | 22.52±0.64 | 14.52±0.78 | 8.00±1.12 | 27.19±0.27 | 22.82±0.75 | 18.88±1.09 |
| | +Ours | 29.24±0.34 | 26.55±0.24 | 21.67±1.28 | 14.54±1.02 | 28.49±0.43 | 25.44±0.68 | 21.28±0.12 |
| | DPP | 29.40±0.39 | 22.11±0.81 | 13.72±0.27 | 7.33±0.68 | 26.18±1.04 | 21.53±0.61 | 16.80±0.17 |
| | +Ours | 29.92±0.48 | 26.57±0.95 | 21.94±1.05 | 14.85±0.81 | 28.46±1.01 | 25.61±0.78 | 21.35±1.17 |
| SQuAD | Random | 56.50±0.57 | 50.00±0.62 | 39.10±0.88 | 26.20±0.79 | 53.90±0.65 | 49.17±0.62 | 42.03±0.79 |
| | +Ours | 57.73±0.79 | 56.87±0.47 | 48.50±0.86 | 43.00±0.86 | 57.70±1.31 | 53.93±0.33 | 47.93±0.48 |
| | TopK | 56.97±0.69 | 51.83±1.03 | 42.83±1.68 | 29.10±2.92 | 54.77±0.69 | 49.37±1.37 | 41.37±2.09 |
| | +Ours | 57.27±0.62 | 55.40±0.37 | 51.43±1.26 | 41.30±2.65 | 56.90±0.64 | 53.90±1.08 | 48.37±0.66 |
| | DPP | 57.29±0.87 | 50.57±0.33 | 41.63±1.00 | 25.67±2.52 | 56.10±0.59 | 49.57±1.24 | 43.37±0.78 |
| | +Ours | 58.10±0.29 | 56.73±0.61 | 52.53±0.33 | 42.93±0.88 | 57.50±0.54 | 55.90±0.18 | 50.77±0.39 |
| SCIQ | Random | 68.15±0.28 | 59.19±1.57 | 44.19±2.89 | 28.21±2.96 | 64.59±1.42 | 58.39±0.16 | 49.54±0.80 |
| | +Ours | 67.93±0.85 | 65.06±1.34 | 55.57±0.53 | 42.00±2.96 | 66.63±0.94 | 62.70±1.10 | 58.92±1.74 |
| | TopK | 68.62±1.13 | 59.59±1.28 | 45.77±2.68 | 29.31±1.73 | 64.66±1.34 | 58.54±0.12 | 49.47±0.65 |
| | +Ours | 70.06±0.32 | 66.67±0.81 | 57.44±1.04 | 48.06±1.53 | 67.76±0.50 | 63.96±1.71 | 56.32±2.18 |
| | DPP | 67.29±0.35 | 57.69±1.83 | 45.34±1.56 | 28.50±1.78 | 64.88±0.43 | 58.91±0.64 | 50.00±0.85 |
| | +Ours | 70.57±0.45 | 67.86±1.43 | 59.65±2.11 | 45.46±2.72 | 69.16±0.98 | 65.63±0.21 | 56.72±1.37 |
| GeoQuery | Random | 27.97±0.99 | 23.18±0.62 | 17.44±1.56 | 14.10±0.74 | 26.48±0.17 | 26.13±0.05 | 26.25±0.40 |
| | +Ours | 27.27±0.36 | 27.12±0.69 | 25.52±1.02 | 22.23±0.67 | 27.43±0.71 | 27.01±0.05 | 26.73±0.90 |
| | TopK | 44.17±0.09 | 27.28±2.65 | 17.49±2.05 | 9.96±3.08 | 41.31±0.46 | 38.48±0.63 | 34.90±0.69 |
| | +Ours | 43.32±0.05 | 42.25±1.00 | 33.80±1.43 | 24.39±1.08 | 42.59±0.37 | 39.40±0.37 | 37.74±1.23 |
| | DPP | 45.81±0.71 | 31.79±5.93 | 21.54±3.36 | 10.61±0.15 | 42.97±1.96 | 39.91±0.42 | 33.34±0.53 |
| | +Ours | 44.18±0.47 | 43.01±0.02 | 40.94±0.91 | 33.25±1.27 | 41.49±0.11 | 40.62±0.06 | 36.81±0.61 |
| NL2Bash | Random | 27.91±0.37 | 25.37±0.21 | 15.77±0.91 | 8.95±0.65 | 27.20±1.06 | 28.09±0.51 | 26.27±0.56 |
| | +Ours | 29.93±1.18 | 29.09±0.26 | 26.04±2.05 | 22.92±0.39 | 29.01±0.36 | 28.92±0.07 | 26.80±0.55 |
| | TopK | 35.71±0.42 | 27.40±0.26 | 20.00±0.62 | 9.95±0.68 | 32.57±0.13 | 30.21±0.08 | 27.48±0.35 |
| | +Ours | 33.92±0.70 | 32.51±1.59 | 30.50±1.02 | 23.47±1.52 | 31.33±0.04 | 31.39±1.70 | 29.49±0.06 |
| | DPP | 37.77±0.02 | 31.52±0.12 | 23.23±0.34 | 11.16±2.14 | 32.74±0.29 | 32.56±0.61 | 26.72±1.58 |
| | +Ours | 35.85±1.51 | 32.27±0.99 | 32.47±0.40 | 27.84±1.17 | 33.63±0.23 | 32.53±0.57 | 28.96±0.98 |

5.2 Main Results

Can LPR improve the noise-robustness of in-context learning? Table 2 presents the average in-context learning performance of the baselines and our method on six generation tasks, under various types of noisy annotations. A salient observation is that integrating LPR drastically improves the noise robustness of the existing demonstration selection methods. For example, on SCIQ with $60\%$ irrelevant noise, our approach improves the EM score of the naive random selection method from 28.21 to 42.00, a direct improvement of $\bm{13.79}$. Moreover, we show that LPR can boost performance for a wide range of existing demonstration selection methods such as TopK [28] and DPP [62]. For instance, on SCIQ with $60\%$ irrelevant label noise, LPR improves the exact match score of the TopK method from 29.31 to 48.06, a significant direct improvement of $\bm{18.75}$. Our method also establishes strong robustness against all types of noisy annotations. Appendix 7 reports the results with various demonstration sizes.

How does the threshold $\gamma$ affect the noise-robustness of LPR? In Figure 3 (a) and (b), we ablate how the parameter $\gamma$ in our method (cf. Eq. 5) affects the noise-robust performance. The base indicates that all candidate demonstrations are selected without our method. It is noteworthy that LPR is robust to the choice of threshold $\gamma$: even setting $\gamma=75\%$ yields significant EM score improvements. We can also observe that as the threshold $\gamma$ decreases, the noise-robust performance improves, especially under $60\%$ noise conditions. Due to space constraints, we only report the average results of multiple baselines on various generation tasks.

Does LPR work with different numbers of nearest neighbors $k$? We evaluate how the number of nearest neighbors $k$ in our method affects the LPR performance. Specifically, we vary the number of neighbors $k\in\{2,4,6\}$. As shown in Figure 3 (c) and (d), an increase in the number of nearest neighbors beyond 0 leads to an evident improvement in EM score, and the performance starts to saturate with the further addition of neighbors. However, as $k$ increases, the perplexity of more nearest neighbors needs to be calculated, while the improvement is limited. For simplicity, we employ a moderate number of neighbors and use $k=4$ throughout our experiments.

Table 3: Average test performance of the baselines and our method using varying large language models across various noise types. The results are shown as Naive/+Ours. The bold indicates the improved results by integrating LPR.
| Model | Clean 0% | Irrelevant 20% | Irrelevant 40% | Irrelevant 60% | Relevant 20% | Relevant 40% | Relevant 60% |
|---|---|---|---|---|---|---|---|
| Llama2-13B [49] | 45.13/45.27 | 38.58/43.47 | 29.00/39.24 | 18.93/30.46 | 42.18/44.32 | 37.10/41.88 | 30.67/36.76 |
| Mistral-7B [19] | 34.89/34.12 | 32.12/33.59 | 26.28/31.56 | 19.24/27.03 | 33.43/33.91 | 30.52/32.64 | 26.63/30.00 |
| OPT-6.7B [66] | 23.46/24.03 | 17.26/21.31 | 11.32/17.29 | 7.68/12.91 | 20.16/22.40 | 17.58/20.22 | 14.95/17.52 |

Figure 3: The average test performance with different thresholds $\gamma$ and numbers of local neighbors $k$ across various noise types. Panels (a) and (b) analyze how the hyperparameter $\gamma$ affects the performance of LPR. Panels (c) and (d) illustrate the influence of the hyperparameter $k$.

Is LPR effective with different LLMs? To show that our proposed method is model-agnostic, we conduct experiments on a diverse collection of model architectures and present the results in Table 3. From the results, we observe that our method consistently improves the ICL performance when using Llama2-13B [49], Mistral-7B [19] and OPT-6.7B [66]. For instance, with Mistral-7B, using our method boosts the ICL performance using the random selection method from 19.24 to 27.07, an average direct improvement of 7.83 on 6 datasets with 60% irrelevant noisy annotations.

6 Discussion

Global Perplexity Ranking vs. Local Perplexity Ranking. While our method has demonstrated strong promise in in-context learning, one may also ask: can a similar effect be achieved by selecting demonstrations with the lowest perplexity in the whole dataset? In this ablation, we compare our method with a global perplexity ranking method that selects demonstrations with the lowest perplexity values of input-label pairs from a large candidate set (e.g., $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{100}$).

Table 4 presents the performance comparison between our method and the global perplexity ranking method. While both perplexity ranking methods improve the robustness of ICL against noisy annotations, the global approach obtains inferior performance compared to our proposed method in most cases, especially with clean annotations and low noise rates. In terms of efficiency, Table 4 also shows that the local ranking approach requires only about $20\%$ of the time required by the global ranking. This is because our method only calculates the perplexity of the local neighbors for each candidate, instead of using a large candidate pool. Overall, we show that the global ranking method cannot outperform the local ranking while introducing a much higher computational cost.
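For contrast with the local procedure, a global perplexity ranking baseline can be sketched as follows. This is a hedged illustration, not the code used for Table 4: the function names are hypothetical, and the two budget helpers only count how many input-label pairs must be scored under the assumption that the local variant scores each candidate together with its $k$ neighbors.

```python
import heapq


def global_perplexity_select(ppl, num_demos):
    """Pick the `num_demos` examples with the lowest perplexity in the whole pool."""
    return heapq.nsmallest(num_demos, ppl, key=ppl.get)


def global_scoring_budget(pool_size):
    """Global ranking scores every example in the candidate pool once."""
    return pool_size


def local_scoring_budget(num_demos, k):
    """Local ranking scores at most each candidate plus its k neighbors."""
    return num_demos * (k + 1)
```

Note that the global selection ignores the test input entirely, which may explain why it loses the benefit of similarity-based methods such as TopK when annotations are clean.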

Transfer to text classification tasks. Text classification is a common task of in-context learning, which may also suffer from a noisy annotation issue. To this end, we verify the effectiveness of the proposed method in text classification. Here, we consider two classification tasks (SST2 [48] and AGNews [67]) with popular label noise types: the symmetric noise and the asymmetric noise [8, 33]. We report the average accuracy with GPT-Neo-2.7B [6] on datasets with the two noise types. More detailed experimental settings are presented in Appendix A.2.
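The two label noise types can be simulated with a short sketch. It assumes the usual definitions from the label-noise literature: symmetric noise flips a label to a uniformly chosen different class, while asymmetric noise flips it to a fixed, class-dependent target. The function names and the flip mapping used in the example are hypothetical and are not taken from the experimental setup in Appendix A.2.

```python
import random


def symmetric_noise(labels, classes, rate, seed=0):
    """With probability `rate`, replace each label by a uniformly chosen other class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def asymmetric_noise(labels, flip_to, rate, seed=0):
    """With probability `rate`, replace each label by its fixed target in `flip_to`."""
    rng = random.Random(seed)
    # Labels without an entry in `flip_to` are never changed.
    return [flip_to.get(y, y) if rng.random() < rate else y for y in labels]
```

For a binary task such as SST2, the two schemes coincide when each class flips to the other, since there is only one other class to flip to.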

Table 4: Average test performance comparison between global perplexity ranking and local perplexity ranking. The results are shown as Global/Local. Bold numbers are superior results.
| Method | Clean 0% | Irrelevant 20% | Irrelevant 40% | Irrelevant 60% | Relevant 20% | Relevant 40% | Relevant 60% | Time (h) |
|---|---|---|---|---|---|---|---|---|
| Random | 39.32/40.66 | 38.94/38.89 | 34.41/32.98 | 27.82/26.59 | 39.23/39.90 | 36.38/37.31 | 31.76/33.24 | 2.88/0.55 |
| TopK | 40.57/43.94 | 39.94/41.44 | 35.85/36.02 | 31.79/28.38 | 40.33/42.60 | 38.69/39.53 | 33.88/34.48 | 3.06/0.57 |
| DPP | 42.33/44.32 | 40.18/41.94 | 36.20/36.86 | 30.91/28.60 | 40.42/42.98 | 38.49/40.51 | 32.24/35.20 | 3.21/0.64 |
| Average | 40.74/42.97 | 39.68/40.76 | 35.49/35.28 | 30.17/27.86 | 39.99/41.83 | 37.85/39.12 | 32.63/34.31 | 3.05/0.57 |

[Figure 4 panels: (a) SST2, (b) AGNews - Symmetric, (c) AGNews - Asymmetric]
Figure 4: Average test accuracy on SST2 [48] and AGNews [67]. Different colors indicate the selection methods. The solid lines denote existing selection methods, and the dotted lines represent the same methods integrated with our method. We omit the noise type for the binary classification task SST2.

Figure 4 demonstrates that noisy annotations barely hurt the performance of ICL when employing the random demonstration selection method [36]. However, the performance of ICL is significantly compromised when utilizing more effective selection methods like TopK [28] and DPP [62]. After integrating our method, the inference performance of both TopK and DPP improves significantly, which indicates the noise robustness of our method in text classification.

Table 5: Average test performance of the baselines and our method on four generation datasets with extremely high noise ratios (e.g., 60%, 70%, 80%, 90%). The results are shown as Naive/+Ours. The bold indicates the improved results by integrating LPR.

| Method | Irrelevant 60% | Irrelevant 70% | Irrelevant 80% | Irrelevant 90% | Relevant 60% | Relevant 70% | Relevant 80% | Relevant 90% |
|---|---|---|---|---|---|---|---|---|
| Random | 15.80/26.60 | 11.61/16.97 | 7.98/11.24 | 4.79/5.45 | 27.87/33.25 | 24.67/28.29 | 22.51/24.45 | 20.15/21.20 |
| TopK | 18.08/28.08 | 14.62/18.24 | 10.16/10.96 | 6.25/7.17 | 29.55/34.48 | 26.02/29.23 | 23.28/25.87 | 21.21/22.68 |
| DPP | 16.87/28.61 | 15.10/18.01 | 9.93/10.03 | 6.46/7.18 | 29.51/35.19 | 25.85/28.86 | 23.28/25.27 | 20.83/21.95 |

Table 6: Average test performance of the baselines and our method using varying large language models (e.g., OPT-1.3B, OPT-2.7B, OPT-6.7B [66]) across various noise types. The results are shown as Naive/+Ours. The bold indicates the improved results by integrating LPR.

| Model | Clean 0% | Irrelevant 20% | Irrelevant 40% | Irrelevant 60% | Relevant 20% | Relevant 40% | Relevant 60% |
|---|---|---|---|---|---|---|---|
| OPT-1.3B | 13.06/13.22 | 10.48/10.96 | 8.66/9.63 | 5.95/6.41 | 12.21/12.58 | 11.33/11.53 | 10.42/10.81 |
| OPT-2.7B | 15.30/15.70 | 12.68/13.23 | 10.53/11.45 | 7.01/9.02 | 14.15/14.73 | 13.21/14.33 | 11.86/12.85 |
| OPT-6.7B | 23.46/24.03 | 17.26/21.31 | 11.32/17.29 | 7.68/12.91 | 20.16/22.40 | 17.58/20.22 | 14.95/17.52 |

Potential failure cases. Our approach is built on two assumptions that are naturally satisfied in the real world (see Section 4.2). In this section, we conduct experiments on four generation tasks, including NQ, WebQ, SCIQ, and SQuAD, to determine whether our proposed method remains effective when one of these two assumptions is violated. The detailed analysis is presented below.

Assumption 1 (Data): clean annotations are the majority in the annotated dataset. Given a dataset with extremely high noise ratios (e.g., 60%, 70%, 80%, 90%), the perplexity ranking of local neighbors may not reflect the correctness of the annotations, as most (even all) neighbors can be wrongly annotated. To explicitly show that, we conduct an experiment to validate the performance of LPR under extremely high noise ratios. Table 5 presents the average EM score of the baselines and our method. We use Llama2-7B [49] as the LLM throughout these experiments. The results show that the improvements of our approach decrease as the noise ratios increase. For example, when the irrelevant label noise ratio increases from 60% to 90%, the improvement of our method for the TopK method decreases from 10.26 to 0.92.

Assumption 2 (Model): examples that are semantically similar share the same level of inherent perplexity. The model affects the performance of LPR through the concept of inherent perplexity. This assumption cannot hold if the model is not capable of precisely measuring the semantic distance between examples. In this case, the local neighbors may not share the same level of inherent perplexity, so we cannot compare their matching perplexity. To validate this, we conduct experiments with language models of various sizes, including OPT-1.3B, OPT-2.7B and OPT-6.7B [66]. The results in Table 6 reveal that the improvement brought by LPR decreases as the parameter size of the language model decreases. For instance, for 60% irrelevant noise, the improvement of our method decreases from 5.23 to 0.46 when the parameter size of the language model decreases from 6.7B to 1.3B.

7 Conclusion

In this paper, we introduce Local Perplexity Ranking (LPR), a general strategy that can universally enhance the noise robustness of in-context learning on generation tasks. To the best of our knowledge, this work is the first to analyze the noisy annotations in ICL for text generation. Our key idea is to decouple the matching perplexity by performing the ranking among the neighbors in semantic space. In particular, we replace each low-ranked candidate with its nearest neighbor that is highly ranked. Extensive experiments demonstrate that LPR can improve the noise robustness of existing demonstration selection methods in ICL across various noise types. Our approach is easy to use in practice, as it is insensitive to the hyperparameters and does not introduce heavy computational cost.
Limitations. LPR is suboptimal in cases of high noise rates due to the assumption that clean annotations are the majority in the dataset. In addition, we do not provide a theoretical analysis to show how noisy annotations affect ICL, which will be an interesting direction for future research.

8 Acknowledgements

This research is supported by the Shenzhen Fundamental Research Program (Grant No. JCYJ20230807091809020). Feipeng Zhang is supported by the National Natural Science Foundation of China (Grant No. 72171192) and the Youth Innovation Team of Shaanxi Universities. Jun Shu is supported in part by the National Natural Science Foundation of China (Grant No. 12326606). Feng Zheng is supported in part by the National Natural Science Foundation of China (Grant No. 62122035). We gratefully acknowledge the support of the Center for Computational Science and Engineering at the Southern University of Science and Technology for our research.

References

  • [1] Maha Agro and Hanan Aldarmaki. Handling realistic label noise in BERT text classification. In Proceedings of the 6th International Conference on Natural Language and Speech Processing, pages 11–20, 2023.
  • [2] Dmitriy Alexandrov, Anastasiia Zakharova, and Nikolay Butakov. Does noise really matter? investigation into the influence of noisy labels on bert-based question answering system. International Journal of Semantic Computing, pages 1–20, 2024.
  • [3] Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2024.
  • [4] Udit Arora, William Huang, and He He. Types of out-of-distribution texts and how to detect them. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10687–10701, 2021.
  • [5] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, 2013.
  • [6] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. 10.5281/zenodo.5297715, 2021.
  • [7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 1877–1901, 2020.
  • [8] Mingcai Chen, Hao Cheng, Yuntao Du, Ming Xu, Wenyu Jiang, and Chongjun Wang. Two wrongs don’t make a right: Combating confirmation bias in learning with label noise. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14765–14773, 2023.
  • [9] Chen Cheng, Xinzhi Yu, Haodong Wen, Jinsong Sun, Guanzhang Yue, Yihao Zhang, and Zeming Wei. Exploring the robustness of in-context learning with noisy labels. arXiv preprint arXiv:2404.18191, 2024.
  • [10] Julian Coda-Forno, Marcel Binz, Zeynep Akata, Matt Botvinick, Jane Wang, and Eric Schulz. Meta-in-context learning in large language models. Advances in Neural Information Processing Systems, 36:65189–65201, 2023.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • [12] Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. Mitigating label biases for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14014–14031, 2023.
  • [13] Hila Gonen, Srini Iyer, Terra Blevins, Noah Smith, and Luke Zettlemoyer. Demystifying prompts in language models via perplexity estimation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10136–10148, 2023.
  • [14] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Pre-training to learn in context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4849–4870, 2023.
  • [15] Karan Gupta, Sumegh Roychowdhury, Siva Rajesh Kasa, Santhosh Kumar Kasa, Anish Bhanushali, Nikhil Pattisapu, and Prasanna Srinivasa Murthy. How robust are llms to in-context majority label bias? arXiv preprint arXiv:2312.16549, 2023.
  • [16] Shivanshu Gupta, Matt Gardner, and Sameer Singh. Coverage-based example selection for in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13924–13950, 2023.
  • [17] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020.
  • [18] Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, and Xiaodan Liang. Nlip: Noise-robust language-image pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 926–934, 2023.
  • [19] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • [20] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, pages 22199–22213, 2022.
  • [21] Jannik Kossen, Yarin Gal, and Tom Rainforth. In-context learning learns label relationships but is not conventional learning. In International Conference on Learning Representations, 2024.
  • [22] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, pages 452–466, 2019.
  • [23] Itay Levy, Ben Bogin, and Jonathan Berant. Diverse demonstrations improve in-context compositional generalization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1401–1422, 2023.
  • [24] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.
  • [25] Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng Qiu. Unified demonstration retriever for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4644–4668, 2023.
  • [26] Xiaonan Li and Xipeng Qiu. Finding support examples for in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6219–6235, 2023.
  • [27] Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. NL2Bash: A corpus and semantic parser for natural language interface to the linux operating system. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018.
  • [28] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, 2022.
  • [29] Quanyu Long, Yin Wu, Wenya Wang, and Sinno Jialin Pan. Decomposing label space, format and discrimination: Rethinking how llms respond and solve tasks via in-context learning. arXiv preprint arXiv:2404.07546, 2024.
  • [30] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, 2022.
  • [31] Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, and Vincent Y Zhao. Dr.icl: Demonstration-retrieved in-context learning. arXiv preprint arXiv:2305.14128, 2023.
  • [32] Xinxi Lyu, Sewon Min, Iz Beltagy, Luke Zettlemoyer, and Hannaneh Hajishirzi. Z-ICL: Zero-shot in-context learning with pseudo-demonstrations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2304–2317, 2023.
  • [33] Xingjun Ma, Hanxun Huang, Yisen Wang, Simone Romano, Sarah Erfani, and James Bailey. Normalized loss functions for deep learning with noisy labels. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [34] Costas Mavromatis, Balasubramaniam Srinivasan, Zhengyuan Shen, Jiani Zhang, Huzefa Rangwala, Christos Faloutsos, and George Karypis. Which examples to annotate for in-context learning? towards effective and efficient selection. arXiv preprint arXiv:2310.20046, 2023.
  • [35] Aristides Milios, Siva Reddy, and Dzmitry Bahdanau. In-context learning for text classification with many labels. In Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, pages 173–184, 2023.
  • [36] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, 2022.
  • [37] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022.
  • [38] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2024.
  • [39] Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke. MultiTabQA: Generating tabular answers for multi-table question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6322–6334, 2023.
  • [40] Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. What in-context learning “learns” in-context: Disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8298–8319, 2023.
  • [41] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • [42] Gaurav Patel, Jan Allebach, and Qiang Qiu. Seq-UPS: Sequential uncertainty-aware pseudo-label selection for semi-supervised text recognition. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6169–6179, 2023.
  • [43] Keqin Peng, Liang Ding, Yancheng Yuan, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. Revisiting demonstration selection strategies in in-context learning. arXiv preprint arXiv:2401.12087, 2024.
  • [44] Chengwei Qin, Aston Zhang, Anirudh Dagar, and Wenming Ye. In-context learning with iterative demonstration selection. arXiv preprint arXiv:2310.09881, 2023.
  • [45] Linlu Qiu, Peter Shaw, Panupong Pasupat, Tianze Shi, Jonathan Herzig, Emily Pitler, Fei Sha, and Kristina Toutanova. Evaluating the impact of model scale for compositional generalization in semantic parsing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9157–9179, 2022.
  • [46] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.
  • [47] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, 2022.
  • [48] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
  • [49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [50] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, 2023.
  • [51] Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. In Advances in Neural Information Processing Systems, pages 15614–15638, 2023.
  • [52] Hongxin Wei, Huiping Zhuang, Renchunzi Xie, Lei Feng, Gang Niu, Bo An, and Yixuan Li. Mitigating memorization of noisy labels by clipping the model prediction. In Proceedings of the 40th International Conference on Machine Learning, 2023.
  • [53] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, pages 24824–24837, 2022.
  • [54] Jerry Wei, Le Hou, Andrew Kyle Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V Le. Symbol tuning improves in-context learning in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 968–979, 2023.
  • [55] Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023.
  • [56] Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, 2017.
  • [57] Qianhui Wu, Huiqiang Jiang, Haonan Yin, Börje Karlsson, and Chin-Yew Lin. Multi-level knowledge distillation for out-of-distribution detection in text. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7317–7332, 2023.
  • [58] Tongshuang Wu, Haiyi Zhu, Maya Albayrak, Alexis Axon, Amanda Bertsch, Wenxing Deng, Ziqi Ding, Bill Guo, Sireesh Gururaja, Tzu-Sheng Kuo, Jenny T. Liang, Ryan Liu, Ihita Mandal, Jeremiah Milbauer, Xiaolin Ni, Namrata Padmanabhan, Subhashini Ramkumar, Alexis Sudjianto, Jordan Taylor, Ying-Jui Tseng, Patricia Vaidos, Zhijin Wu, Wei Wu, and Chenyang Yang. LLMs as workers in human-computational algorithms? replicating crowdsourcing pipelines with llms. arXiv preprint arXiv:2307.10168, 2023.
  • [59] Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Jingjing Xu, and Yu Qiao. OpenICL: An open-source framework for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 489–498, 2023.
  • [60] Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1423–1436, 2023.
  • [61] Yan Yan, Rómer Rosales, Glenn Fung, Ramanathan Subramanian, and Jennifer Dy. Learning from multiple annotators with varying expertise. Machine Learning, pages 291–327, 2014.
  • [62] Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Tao Yu, and Lingpeng Kong. Compositional exemplars for in-context learning. In Proceedings of the 40th International Conference on Machine Learning, pages 39818–39833, 2023.
  • [63] Xichen Ye, Xiaoqiang Li, Songmin Dai, Tong Liu, Yan Sun, and Weiqin Tong. Active negative loss functions for learning with noisy labels. In Advances in Neural Information Processing Systems, pages 6917–6940, 2023.
  • [64] Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, and Jian Guo. Noisy pair corrector for dense retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11439–11451, 2023.
  • [65] Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, and Tongliang Liu. IDEAL: Influence-driven selective annotations empower in-context learners in large language models. In International Conference on Learning Representations, 2024.
  • [66] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  • [67] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28, 2015.
  • [68] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems, 36, 2024.
  • [69] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. In International Conference on Learning Representations, 2023.
  • [70] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems, pages 46595–46623, 2023.
  • [71] Yu Zhou, Hongtao Xie, Shancheng Fang, and Yongdong Zhang. Semi-supervised text detection with accurate pseudo-labels. IEEE Signal Processing Letters, pages 1272–1276, 2022.
  • [72] Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Adelani, and Dietrich Klakow. Is BERT robust to label noise? a study on learning with noisy labels in text classification. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 62–67, 2022.
  • [73] Zhaowei Zhu, Zihao Dong, and Yang Liu. Detecting corrupted labels without training a model to predict. In Proceedings of the 39th International Conference on Machine Learning, pages 27412–27427, 2022.

Appendix A

A.1 Related Work

In-context learning In-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions based only on contexts augmented with a few demonstrations [7, 34, 35, 36]. The popularity of ICL also raises growing concerns about its instability: depending on the selected demonstrations, ICL’s performance can vary from near state-of-the-art to random [9, 12, 15, 26, 32, 37, 55, 54]. Existing studies show that ICL’s performance is highly sensitive to the order [30, 60], template [37], and labels [50] of the selected demonstrations. On the one hand, some previous studies show that flipping the labels of demonstrations can significantly hurt ICL performance on classification tasks [51, 60]. On the other hand, many studies show that ICL is fairly robust to noisy demonstrations [32, 37, 55, 54]. However, existing studies focus only on classification tasks, and research on generation tasks is limited. We extend the previous findings from text classification to text generation and find that demonstrations selected from noisy annotations significantly hurt ICL performance on generation tasks.

In practice, researchers often use crowdsourcing [61, 73] or large language models (LLMs) [58] such as GPT-4 [38] to create input-output pairs for new tasks, which inevitably introduces some mistakes into the annotations. However, existing demonstration selection methods for generation tasks, such as TopK [28] and DPP [62], consider only the inputs of demonstrations and assume that the demonstrations are selected from a completely clean dataset [16, 28, 62, 65]. In contrast, we propose a training-free demonstration selection method for generation tasks that consistently and significantly improves the robustness of existing methods under noisy annotations.

Learning with noisy labels Label noise is common in many real-world datasets, especially for generation tasks [2, 64]. Existing approaches to learning with noisy labels fall into two types: (1) Training noise-robust models on noisy training datasets, by designing noise-robust loss functions [1, 63, 52, 72] or noise-robust model architectures [2, 18, 64] to mitigate label noise. However, this approach is not suitable for ICL, which usually assumes that users are unable to apply fine-tuning techniques [68]. (2) Detecting noisy labels and reducing their impact, by comparing model predictions with noisy labels [42, 71] or checking the label consensus of nearby features [73]. Unlike the above literature, which focuses on classification tasks, we consider a training-free solution that improves the noise robustness of ICL for generation tasks.

A.2 Experimental Setting

Datasets We conduct experiments on 6 generation tasks; examples from each dataset are shown in Tables 12 and 13. For open-domain question answering, we choose the Natural Questions (NQ) dataset [22] and WebQuestions (WebQ) [5]. For reading comprehension, we choose the Stanford Question Answering Dataset (SQuAD) [46] and the Science Questions (SCIQ) dataset [56]. For code generation tasks, we choose Generating Tabular Answers for Multi-Table Question Answering (GeoQuery) Dataset [39] and Natural Language Interface to the Linux Operating System (NL2Bash) dataset [27]. Following previous studies [16, 25, 62], we report Exact Match (EM) for NQ, WebQ, SQuAD, and SCIQ, and BLEU for NL2Bash and GeoQuery. We collect these datasets from Hugging Face. The train sets are used as example sets, and the test sets are used to evaluate ICL performance. We randomly subsample 20,000 examples from the train set to generate noisy annotations and select demonstrations. We provide a few examples of noisy annotations for each dataset in Tables 14, 15 and 16.

Baselines Our model LPR is essentially a data-centric retriever for in-context demonstration selection. We consider both learning-free and other learning-based retrievers as baselines:

  1. Random randomly selects demonstrations from an example set without repetition [36].

  2. TopK retrieves demonstrations that are semantically similar to a test query sample [28].

  3. DPP uses the original BERT embedding as above without fine-tuning, and adopts MAP inference for subset retrieval [62].
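As a rough illustration of the TopK baseline, the sketch below ranks candidates by cosine similarity between embeddings and keeps the k most similar ones. The function names, the toy 2-D vectors, and the choice of plain cosine similarity are assumptions made for illustration; in practice the embeddings would come from a sentence encoder, and this is not the OpenICL implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topk_demonstrations(query_emb, cand_embs, k):
    """Return the indices of the k candidates most similar to the query,
    most similar first (hypothetical helper, not a library API)."""
    sims = [cosine(query_emb, c) for c in cand_embs]
    order = sorted(range(len(cand_embs)), key=lambda i: -sims[i])
    return order[:k]


# Toy embeddings standing in for encoder outputs (assumed values).
cands = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.2]
selected = topk_demonstrations(query, cands, k=2)
```

Note that this selection looks only at the inputs, which is why a mismatched output attached to a very similar input can still be retrieved.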

Experiment details We run our experiments on NVIDIA L40 GPUs. We adopt a large portion of the code from the OpenICL repository [59, 60]. The whole set of experiments takes around one week on 8 GPUs, and each experiment takes around one hour on a single GPU.

Transfer to classification tasks Following the assumption above, we assume that semantically similar examples share a similar task, which indicates that they should belong to the same class. In this setting, we do not need to calculate the perplexity of the input-output pair; we only check whether the class label of a candidate demonstration agrees with those of its local neighbors. As in generation tasks, we replace the noisy candidates with their nearest neighbors that are more likely to be clean. We investigate whether our local-based method transfers to classification tasks.
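One possible reading of this classification variant is sketched below. The text does not specify how agreement with the neighbors is decided, so the majority vote over the k nearest neighbors, the value k=3, and the helper names are all assumptions for illustration only.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def local_label_filter(cand_idx, embs, labels, k=3):
    """Return the index of the demonstration to use in place of cand_idx.

    Assumed rule: keep the candidate if its label matches the majority label
    of its k nearest neighbours; otherwise replace it with the nearest
    neighbour that carries the majority label.
    """
    sims = [(cosine(embs[cand_idx], e), j) for j, e in enumerate(embs) if j != cand_idx]
    neighbours = [j for _, j in sorted(sims, key=lambda t: -t[0])[:k]]
    majority, _ = Counter(labels[j] for j in neighbours).most_common(1)[0]
    if labels[cand_idx] == majority:
        return cand_idx
    for j in neighbours:  # neighbours are ordered nearest first
        if labels[j] == majority:
            return j
    return cand_idx


# Toy example: index 3 sits inside the "pos" cluster but carries a "neg" label.
embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
replacement = local_label_filter(3, embs, labels)
unchanged = local_label_filter(0, embs, labels)
```

In this toy pool the suspicious candidate at index 3 is swapped for its nearest neighbor at index 0, while the clean candidate at index 0 is kept as is.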

A.3 More empirical results

Empirical study of noisy ICL in text generation In this section, we provide the detailed results on GeoQuery and NL2Bash. Following existing studies [25, 62], we adopt the BLEU score [41] to evaluate ICL performance on code generation tasks. Figure 5 shows that both types of noise significantly degrade the performance of in-context learning on code generation tasks. This phenomenon motivates us to further investigate the noise robustness of in-context learning.

Figure 5: Average results of ICL with noisy annotations in various generation tasks across different demonstration settings. Both types of noise significantly degrade the performance of in-context learning on code generation tasks. The black line denotes zero-shot performance.

Perplexity deviation of noisy annotations In Figure 6, we present the perplexity distribution of Llama2-7B [49] on clean and noisy annotations of the GeoQuery and NL2Bash datasets. As a complement to the earlier results, we observe that examples selected from the noisy annotation set indeed obtain higher perplexity than those collected from clean annotations, which confirms that the deviation also transfers to code generation tasks.
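For reference, perplexity can be read in the standard way, as the exponentiated average negative log-likelihood of the tokens. The sketch below computes it from per-token log-probabilities; the helper name and the toy probabilities are hypothetical and only illustrate why a mismatched output tends to score higher. In the actual experiments the log-probabilities would come from the LLM.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of per-token natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Assumed toy values: a well-matched output gets moderately high token
# probabilities, a mismatched one gets very low probabilities on some tokens.
clean_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
noisy_logprobs = [math.log(0.5), math.log(0.01), math.log(0.05)]

ppl_clean = perplexity(clean_logprobs)
ppl_noisy = perplexity(noisy_logprobs)
```

With these toy numbers the noisy pair has the higher perplexity, mirroring the shift between the clean and noisy distributions described above.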

(a) GeoQuery
(b) NL2Bash
Figure 6: The distribution of perplexity of Llama2-7B [49] on clean and noisy annotations. Examples with noisy annotations indeed obtain higher perplexity than those with clean annotations.
Table 7: Average test performance of Zero-Shot, In-context learning, Chain-of-Thought (CoT) and our proposed method across various noise types. The results are shown as Naive/+Ours. The bold indicates the improved results by integrating LPR.

| Method | Clean 0% | Irrelevant Noise 20% | Irrelevant Noise 40% | Irrelevant Noise 60% | Relevant Noise 20% | Relevant Noise 40% | Relevant Noise 60% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot | 7.46 | | | | | | |
| Zero-Shot-CoT | 10.06 | | | | | | |
| Random-ICL/+Ours | 27.94/28.60 | 24.28/28.11 | 16.61/25.78 | 11.53/22.58 | 26.84/28.27 | 27.11/28.95 | 26.26/26.76 |
| TopK-ICL/+Ours | 39.94/38.62 | 27.34/36.38 | 18.75/32.15 | 9.96/23.93 | 38.94/36.92 | 34.35/36.39 | 31.19/33.62 |
| Manual-CoT/+Ours | 31.91/31.80 | 26.57/30.62 | 17.95/26.64 | 15.30/23.61 | 30.57/32.06 | 29.01/31.02 | 27.13/30.54 |
| Auto-CoT/+Ours | 45.69/45.44 | 30.51/40.10 | 20.51/34.94 | 10.86/27.32 | 41.38/42.78 | 35.91/40.73 | 27.90/37.10 |
Figure 7: (a) and (b) show the average EM scores of the baselines and our method on four generation datasets with smaller noise ratios. (c) and (d) report the average GPT-4 score [70] of the baselines and our method on two long-form, open-domain QA datasets.

Analysis of small noise ratios. In this section, we conduct experiments on datasets with smaller noise ratios (e.g., 5%, 10%, 15%). Figure 7 (a) and (b) present the average EM score on four generation tasks: NQ, WebQ, SCIQ, and SQuAD. They show that our method benefits ICL performance even at a small noise ratio (e.g., 5%).

Open Benchmark Evaluation. Long-form and open-domain QA tasks such as MT-bench [70] and Arena-Hard [24] serve as valuable additions to the current standardized LLM benchmarks. In this section, we conduct experiments on these complex, open-ended tasks to confirm the effectiveness of our method. The results on MT-bench [70] and Arena-Hard [24] are shown in Figure 7 (c) and (d), which present the average answer grading (0-10) [70] of the baselines and our method. Figure 7 shows that our approach significantly improves the efficacy of existing selection methods on long-form question-answering tasks.

Evaluation on baselines not based on demonstration selection. Here, we add a Zero-Shot baseline, as well as several CoT-related baselines: Zero-Shot-CoT [20], Manual-CoT [53], and Auto-CoT [69]. Specifically, Manual-CoT [53] and Auto-CoT [69] require selecting demonstrations from an annotated example set. Table 7 presents the BLEU score of the baselines and our method on the two code generation tasks: GeoQuery [39] and NL2Bash [27]. We use Llama2-7B [49] as the LLM throughout our experiments. The results in Table 7 show that our method outperforms Zero-Shot and Zero-Shot-CoT [20], and improves the noise robustness of Manual-CoT [53] and Auto-CoT [69].

Transfer to similarity score In LPR, we select reference demonstrations for candidate examples using cosine similarity. While cosine similarity captures some aspects of semantic similarity, it is limited to a single embedding [13]. An alternative similarity measure requires an accurate characterization at the word level. One option is to use larger syntactic substructures of the input as terms with BM25, a sparse information retrieval algorithm belonging to the class of TF-IDF measures, which views the test input and the candidates as bags of terms and measures relevance as a weighted recall of these terms:

$$\operatorname{BM25}(\bm{x}_{i},\bm{x}^{*})=\sum^{n}_{i}W_{i}R(q_{i},\bm{x}^{*})$$

where $q_{i}$ is each token of $\bm{x}_{i}$, and $R(q_{i},\bm{x}^{*})$ and $W_{i}$ are the term frequency and inverse document frequency statistics that measure the coverage of a particular term and the relative importance of terms. In this section, we replace the cosine similarity score with the BM25 similarity score to verify the effectiveness of our proposed method.
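To make the weighted-recall view concrete, the sketch below scores candidates against a test input with the common Okapi form of BM25, where the inverse document frequency plays the role of $W_i$ and a saturated term frequency plays the role of $R(q_i, \bm{x}^{*})$. The whitespace tokenization, the defaults k1=1.5 and b=0.75, the toy questions, and the function name are assumptions; the paper does not state which BM25 variant or implementation is used.

```python
import math


def bm25_score(cand_tokens, test_tokens, corpus, k1=1.5, b=0.75):
    """Sum over candidate tokens q of W(q) * R(q, test input).

    corpus is the list of tokenized candidates, used for the IDF statistics
    and the average length (an assumed design choice).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    length_norm = k1 * (1.0 - b + b * len(test_tokens) / avg_len)
    score = 0.0
    for q in cand_tokens:
        df = sum(1 for d in corpus if q in d)                   # document frequency
        w = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)    # W_i (IDF)
        tf = test_tokens.count(q)                               # term frequency in x*
        r = tf * (k1 + 1.0) / (tf + length_norm)                # R(q_i, x*)
        score += w * r
    return score


# Toy candidate pool and test input (assumed examples).
candidates = [s.split() for s in [
    "who wrote hamlet",
    "who painted the mona lisa",
    "capital of france",
]]
test_input = "who wrote macbeth".split()
scores = [bm25_score(c, test_input, candidates) for c in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
```

Here the first candidate shares both "who" and the rarer "wrote" with the test input, so it receives the highest score, while the candidate with no shared terms scores zero.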

Our results in Table 8 show that the improvement still holds when BM25 is used as the similarity score, confirming the superiority and robustness of our method compared with the naive demonstration selection methods. The above result also demonstrates that examples that are semantically similar in both token space and word space share the same level of inherent perplexity.

Reordering Some studies demonstrate that in-context learning is highly sensitive to the ordering of demonstrations when using random demonstrations [25, 30, 60]. Specifically, the same randomly sampled demonstrations in different orders can yield performance ranging from random guessing to near state-of-the-art. In LPR, we reorder exemplars based on their similarities to the test input in ascending order after applying our method, in accordance with common practice [23, 28, 45, 47, 62]. Here we compare our method with and without reordering to explore the effect of reordering on the example-specific demonstrations retrieved by our method.

Across our experiments, Table 9 shows that our method without the reordering process still improves the existing demonstration selection methods across various types of noise. These results indicate that local perplexity ranking, rather than the reordering process, is crucial for the success of noise-robust ICL. Additionally, we believe that high-quality demonstrations are less sensitive to ordering and stabilize in-context learning, which is consistent with previous work [16].
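The reordering step discussed here amounts to a sort by similarity. A minimal sketch, assuming the similarity of each selected demonstration to the test input has already been computed (the helper name and the toy scores are hypothetical):

```python
def reorder_ascending(demos, sims):
    """Order demonstrations by ascending similarity to the test input,
    so the most similar one is placed last in the prompt."""
    order = sorted(range(len(demos)), key=lambda i: sims[i])
    return [demos[i] for i in order]


demos = ["demo_a", "demo_b", "demo_c"]
sims = [0.91, 0.35, 0.62]  # assumed similarity scores to the test input
prompt_order = reorder_ascending(demos, sims)
print(prompt_order)  # ['demo_b', 'demo_c', 'demo_a']
```

Skipping this call and keeping the selection order corresponds to the setting without reordering.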

Various demonstration sizes To verify the effectiveness of our proposed method, we also present the ICL performance with our method across various demonstration set sizes $K$. Concretely, Tables 10 and 11 report the results with 2 and 8 demonstrations and show that our method effectively mitigates the issue of noisy annotations for various demonstration selection methods across demonstration sizes.

Table 8: Average results with BM25 as similarity score. The bold indicates the improvement by integrating LPR.

| Dataset | Method | Clean 0% | Irrelevant Noise 20% | Irrelevant Noise 40% | Irrelevant Noise 60% | Relevant Noise 20% | Relevant Noise 40% | Relevant Noise 60% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NQ | Random | 14.51±0.51 | 10.97±0.29 | 7.37±0.45 | 4.23±0.46 | 12.00±0.65 | 9.67±0.45 | 6.40±1.02 |
| | +Ours | 15.15±0.20 | 13.45±0.85 | 10.98±0.47 | 7.51±0.43 | 14.14±0.48 | 12.08±0.53 | 10.12±0.53 |
| | TopK | 20.25±0.10 | 13.95±1.14 | 9.97±1.13 | 5.90±1.08 | 16.21±0.22 | 12.22±0.22 | 8.50±0.28 |
| | +Ours | 19.35±0.21 | 17.51±0.08 | 14.44±0.41 | 10.25±0.45 | 17.39±0.45 | 14.38±0.90 | 11.71±0.71 |
| | DPP | 20.35±0.76 | 14.69±0.94 | 9.87±0.49 | 5.97±0.48 | 15.47±1.00 | 11.28±0.42 | 7.89±0.25 |
| | +Ours | 19.45±0.97 | 17.29±1.19 | 14.08±0.98 | 10.45±0.68 | 17.07±0.84 | 15.02±0.59 | 12.14±0.75 |
| WebQ | Random | 20.37±0.64 | 15.18±1.06 | 10.39±0.83 | 4.83±0.17 | 18.29±0.43 | 15.92±0.68 | 13.50±0.17 |
| | +Ours | 21.18±0.14 | 19.83±0.71 | 16.40±0.28 | 10.89±0.24 | 20.38±0.71 | 18.54±0.48 | 15.92±0.48 |
| | TopK | 30.16±0.58 | 22.52±0.64 | 14.52±0.78 | 8.00±1.12 | 27.19±0.27 | 22.82±0.75 | 18.88±1.09 |
| | +Ours | 28.82±0.72 | 26.51±0.39 | 22.03±1.26 | 14.74±0.25 | 27.56±0.20 | 25.08±0.36 | 21.58±0.21 |
| | DPP | 29.40±0.39 | 22.11±0.81 | 13.72±0.27 | 7.33±0.68 | 26.18±1.04 | 21.53±0.61 | 16.80±0.17 |
| | +Ours | 29.15±0.21 | 26.30±0.93 | 20.93±1.42 | 13.72±0.57 | 27.83±0.33 | 25.08±0.93 | 20.57±1.27 |
| SQuAD | Random | 56.50±0.57 | 50.00±0.62 | 39.10±0.88 | 26.20±0.79 | 53.90±0.65 | 49.17±0.62 | 42.03±0.79 |
| | +Ours | 56.47±0.25 | 54.73±1.10 | 51.53±1.59 | 43.03±1.51 | 54.77±0.76 | 52.83±0.97 | 49.70±0.08 |
| | TopK | 56.97±0.69 | 51.83±1.03 | 42.83±1.68 | 29.10±2.92 | 54.77±0.69 | 49.37±1.37 | 41.37±2.09 |
| | +Ours | 56.83±0.19 | 55.60±1.45 | 50.33±0.62 | 40.83±2.82 | 55.70±0.99 | 53.07±0.65 | 48.17±1.92 |
| | DPP | 57.29±0.87 | 50.57±0.33 | 41.63±1.00 | 25.67±2.52 | 56.10±0.59 | 49.57±1.24 | 43.37±0.78 |
| | +Ours | 57.20±1.00 | 56.50±0.83 | 52.70±0.86 | 44.73±1.19 | 56.43±1.13 | 53.47±0.81 | 50.57±1.19 |
| SCIQ | Random | 68.15±0.28 | 59.19±1.57 | 44.19±2.89 | 28.21±2.96 | 64.59±1.42 | 58.39±0.16 | 49.54±0.80 |
| | +Ours | 69.25±0.86 | 64.14±1.47 | 54.37±1.88 | 37.64±0.58 | 66.49±1.14 | 62.24±0.86 | 54.19±0.82 |
| | TopK | 68.62±1.13 | 59.59±1.28 | 45.77±2.68 | 29.31±1.73 | 64.66±1.34 | 58.54±0.12 | 49.47±0.65 |
| | +Ours | 70.11±0.36 | 63.79±2.87 | 57.58±1.52 | 38.90±2.93 | 66.55±2.74 | 60.23±5.44 | 51.95±4.59 |
| | DPP | 67.29±0.35 | 57.69±1.83 | 45.34±1.56 | 28.50±1.78 | 64.88±0.43 | 58.91±0.64 | 50.00±0.85 |
| | +Ours | 69.78±1.00 | 64.94±1.42 | 55.34±2.12 | 41.21±1.52 | 67.64±0.86 | 63.85±2.05 | 56.43±2.50 |
| GeoQuery | Random | 27.97±0.99 | 23.18±0.62 | 17.44±1.56 | 14.10±0.74 | 26.48±0.17 | 26.13±0.05 | 26.25±0.40 |
| | +Ours | 29.99±0.50 | 29.35±0.26 | 25.69±0.91 | 25.11±0.64 | 29.77±0.35 | 28.09±0.50 | 26.80±0.55 |
| | TopK | 44.17±0.09 | 27.28±2.65 | 17.49±2.05 | 9.96±3.08 | 41.31±0.46 | 38.48±0.63 | 34.90±0.69 |
| | +Ours | 43.06±0.60 | 41.61±1.00 | 41.19±1.42 | 32.76±0.45 | 40.99±38.71 | 38.87±0.49 | 36.26±0.03 |
| | DPP | 45.81±0.71 | 31.79±5.93 | 21.54±3.36 | 10.61±0.15 | 42.97±1.96 | 39.91±0.42 | 33.34±0.53 |
| | +Ours | 43.92±3.44 | 41.32±3.55 | 38.37±4.19 | 26.78±3.32 | 41.70±1.22 | 39.79±2.13 | 35.34±2.10 |
| NL2Bash | Random | 27.91±0.37 | 25.37±0.21 | 15.77±0.91 | 8.95±0.65 | 27.20±1.06 | 28.09±0.51 | 26.27±0.56 |
| | +Ours | 29.15±0.21 | 26.30±0.93 | 20.93±1.42 | 13.72±0.57 | 28.83±0.33 | 28.08±0.93 | 27.57±1.27 |
| | TopK | 35.71±0.42 | 27.40±0.26 | 20.00±0.62 | 9.95±0.68 | 32.57±0.13 | 30.21±0.08 | 27.48±0.35 |
| | +Ours | 32.42±0.26 | 29.85±2.99 | 30.10±2.11 | 23.67±1.02 | 31.18±38.71 | 31.03±3.80 | 28.84±2.48 |
| | DPP | 37.77±0.02 | 31.52±0.12 | 23.23±0.34 | 11.16±2.14 | 32.74±0.29 | 32.56±0.61 | 26.72±1.58 |
| | +Ours | 36.69±3.30 | 32.63±3.32 | 29.10±4.10 | 23.56±2.65 | 33.18±2.51 | 32.19±3.46 | 28.65±1.80 |
Table 9: Average results without reordering process. The result of The bold indicates the improvement by integrating LPR.
Clean Irelevant Noise Relevant Noise
Dataset Method 0% 20% 40% 60% 20% 40% 60%
NQ Random 14.51±0.51 10.97±0.29 7.37±0.45 4.23±0.46 12.00±0.65 9.67±0.45 6.40±1.02
+Ours 15.35±0.83 14.58±0.33 12.38±0.09 9.24±1.24 14.28±0.46 12.95±0.91 9.93±0.94
TopK 20.25±0.10 13.95±1.14 9.97±1.13 5.90±1.08 16.21±0.22 12.22±0.22 8.50±0.28
+Ours 19.65±0.24 16.88±0.40 13.21±0.38 9.47±0.38 17.42±0.36 14.58±0.26 11.61±0.59
DPP 20.35±0.76 14.69±0.94 9.87±0.49 5.97±0.48 15.47±1.00 11.28±0.42 7.89±0.25
+Ours 18.57±0.24 17.45±0.37 14.48±0.85 11.44±0.29 17.75±0.29 15.45±0.70 12.18±0.87
WebQ Random 20.37±0.64 15.18±1.06 10.39±0.83 4.83±0.17 18.29±0.43 15.92±0.68 13.50±0.17
+Ours 22.08±0.31 20.38±0.74 16.91±0.61 12.16±0.21 21.64±0.71 19.01±0.78 17.06±1.35
TopK 30.16±0.58 22.52±0.64 14.52±0.78 8.00±1.12 27.19±0.27 22.82±0.75 18.88±1.09
+Ours 29.69±0.22 26.96±0.66 22.12±1.08 15.98±0.60 29.07±0.04 27.26±0.40 22.33±1.13
DPP 29.40±0.39 22.11±0.81 13.72±0.27 7.33±0.68 26.18±1.04 21.53±0.61 16.80±0.17
+Ours 29.15±0.21 26.30±0.93 20.93±1.42 13.72±0.57 27.83±0.33 25.08±0.93 20.57±1.27
SQuAD Random 56.50±0.57 50.00±0.62 39.10±0.88 26.20±0.79 53.90±0.65 49.17±0.62 42.03±0.79
+Ours 55.93±0.75 54.23±1.11 51.67±0.39 41.37±0.66 55.67±0.52 53.13±0.63 49.07±0.74
TopK 56.97±0.69 51.83±1.03 42.83±1.68 29.10±2.92 54.77±0.69 49.37±1.37 41.37±2.09
+Ours 57.83±0.97 54.87±0.83 50.97±0.70 39.00±3.12 56.40±0.37 52.77±0.83 47.63±0.94
DPP 57.29±0.87 50.57±0.33 41.63±1.00 25.67±2.52 56.10±0.59 49.57±1.24 43.37±0.78
+Ours 57.47±0.25 57.53±0.97 52.03±0.39 44.00±1.10 57.27±0.40 55.00±0.22 50.27±1.51
SCIQ Random 68.15±0.28 59.19±1.57 44.19±2.89 28.21±2.96 64.59±1.42 58.39±0.16 49.54±0.80
+Ours 68.56±1.17 64.88±1.22 54.94±1.00 40.63±2.62 66.67±1.34 62.41±0.24 54.03±1.69
TopK 68.62±1.13 59.59±1.28 45.77±2.68 29.31±1.73 64.66±1.34 58.54±0.12 49.47±0.65
+Ours 70.00±0.25 66.26±0.35 56.32±1.90 41.03±1.89 68.19±0.13 63.27±0.75 55.17±2.12
DPP 67.29±0.35 57.69±1.83 45.34±1.56 28.50±1.78 64.88±0.43 58.91±0.64 50.00±0.85
+Ours 70.00±0.61 66.26±1.27 56.03±2.04 43.44±2.54 68.70±0.21 63.22±1.84 54.77±2.47
GeoQuery Random 27.97±0.99 23.18±0.62 17.44±1.56 14.10±0.74 26.48±0.17 26.13±0.05 26.25±0.40
+Ours 28.58±0.59 28.60±0.03 28.89±1.77 22.61±0.53 27.80±0.27 28.45±0.32 26.86±0.76
TopK 44.17±0.09 27.28±2.65 17.49±2.05 9.96±3.08 41.31±0.46 38.48±0.63 34.90±0.69
+Ours 45.63±0.11 43.62±0.70 35.05±2.86 26.03±4.93 42.74±0.45 39.81±0.80 35.75±0.03
DPP 45.81±0.71 31.79±5.93 21.54±3.36 10.61±0.15 42.97±1.96 39.91±0.42 33.34±0.53
+Ours 44.73±0.56 45.10±0.50 40.26±1.06 32.54±1.25 41.64±0.64 40.78±0.89 35.09±0.79
NL2Bash Random 27.91±0.37 25.37±0.21 15.77±0.91 8.95±0.65 27.20±1.06 28.09±0.51 26.27±0.56
+Ours 25.54±2.19 25.02±1.25 23.05±2.22 21.28±2.12 27.63±0.58 24.21±0.66 24.09±0.38
TopK 35.71±0.42 27.40±0.26 20.00±0.62 9.95±0.68 32.57±0.13 30.21±0.08 27.48±0.35
+Ours 32.91±0.21 31.33±0.50 29.83±0.31 22.20±0.95 31.39±0.74 31.14±0.46 29.09±1.77
DPP 37.77±0.02 31.52±0.12 23.23±0.34 11.16±2.14 32.74±0.29 32.56±0.61 26.72±1.58
+Ours 38.37±0.32 31.81±1.08 24.27±2.14 13.09±0.05 34.43±0.91 32.32±1.94 28.76±0.88
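To read the table above, it can help to compute the absolute gain of "+Ours" over its baseline column by column. The following is a minimal sketch, not part of the released code: it copies the mean scores of the NQ "Random" and "Random +Ours" rows of Table 9 and subtracts them. The column keys (`clean`, `irr20`, ...) and the helper name `absolute_gains` are hypothetical labels introduced here for illustration.

```python
# Mean scores copied from the NQ rows of Table 9 (standard deviations omitted).
# Column keys are illustrative names: clean, irrelevant noise at 20/40/60%,
# relevant noise at 20/40/60%.
COLUMNS = ["clean", "irr20", "irr40", "irr60", "rel20", "rel40", "rel60"]
random_baseline = [14.51, 10.97, 7.37, 4.23, 12.00, 9.67, 6.40]
random_with_lpr = [15.35, 14.58, 12.38, 9.24, 14.28, 12.95, 9.93]


def absolute_gains(baseline, ours):
    """Return the per-column difference (ours - baseline), rounded to 2 decimals."""
    return [round(o - b, 2) for b, o in zip(baseline, ours)]


gains = dict(zip(COLUMNS, absolute_gains(random_baseline, random_with_lpr)))
for column, gain in gains.items():
    print(f"{column}: {gain:+.2f}")
```

For this pair of rows the gain grows with the irrelevant-noise rate, which matches the trend visible across most rows of the table.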
Table 10: Average in-context learning performance with 2 demonstrations on 6 datasets across various types of noisy annotation (over 3 runs). Bold indicates results improved by integrating LPR.
Clean Irrelevant Noise Relevant Noise
Dataset Method 0% 20% 40% 60% 20% 40% 60%
NQ Random 11.70±0.49 9.27±0.75 6.70±0.81 4.57±0.34 11.04±0.27 8.83±0.54 5.80±0.45
+Ours 12.01±0.71 11.38±0.92 10.71±0.35 8.27±0.75 11.58±0.78 10.38±0.87 8.90±0.51
TopK 14.61±0.49 12.18±0.05 9.53±0.53 6.42±0.38 11.96±0.62 8.89±0.21 7.14±0.21
+Ours 14.28±0.31 12.84±0.31 11.44±0.54 9.31±1.13 12.98±0.31 11.84±0.46 9.77±0.66
DPP 15.48±0.26 12.03±0.12 9.03±0.31 6.34±0.49 12.21±0.21 9.17±0.54 6.87±0.36
+Ours 14.68±0.61 13.91±0.65 12.11±1.28 10.04±1.26 14.44±0.50 12.41±0.33 10.33±0.45
WebQ Random 16.06±0.74 12.95±0.20 9.89±1.38 6.65±0.74 14.32±0.34 13.30±1.45 11.74±0.76
+Ours 16.28±0.65 16.20±0.30 13.36±0.33 14.41±1.22 16.58±0.34 15.21±0.51 13.89±0.43
TopK 20.04±0.13 16.33±0.17 11.88±0.07 8.43±0.45 17.59±1.41 14.82±0.16 11.52±0.65
+Ours 20.50±0.52 19.44±0.59 17.34±0.95 14.41±1.22 20.51±0.45 18.34±0.84 15.87±0.67
DPP 22.68±0.61 18.10±0.86 13.12±0.44 8.66±0.51 19.60±0.17 17.32±0.57 13.97±0.63
+Ours 22.20±0.74 20.43±0.40 17.96±0.78 13.91±0.76 21.67±0.35 20.79±0.83 16.89±0.49
SQuAD Random 45.07±0.37 41.13±1.03 34.77±0.39 27.60±1.40 43.27±0.59 38.83±1.21 35.63±0.62
+Ours 44.57±0.48 42.43±1.03 42.60±1.02 37.40±0.64 46.10±1.98 42.70±0.65 40.03±0.40
TopK 45.13±0.76 40.57±0.94 36.00±0.51 29.17±1.33 42.73±1.03 41.73±0.59 35.57±1.26
+Ours 45.20±0.83 44.37±0.65 41.17±0.58 35.53±1.71 44.07±0.45 42.47±1.14 39.67±1.36
DPP 46.23±1.58 41.67±1.72 35.43±1.76 27.90±0.75 43.33±0.57 40.20±0.50 37.23±0.60
+Ours 46.67±0.59 44.53±0.68 42.17±1.03 37.10±0.41 44.77±0.29 43.23±0.92 41.20±0.22
SCIQ Random 66.48±0.34 62.01±0.96 51.03±2.33 40.36±1.94 64.65±0.92 59.48±0.75 56.89±0.25
+Ours 65.17±0.56 62.64±0.69 57.59±1.01 48.79±1.60 64.08±0.80 61.61±0.33 56.26±1.92
TopK 65.17±0.49 58.50±1.10 50.29±1.06 40.54±1.68 62.76±1.36 58.27±0.73 54.54±0.86
+Ours 67.04±0.50 64.60±0.63 57.81±1.69 49.88±1.79 65.63±0.80 61.26±0.80 55.46±3.17
DPP 67.33±0.74 61.37±0.98 51.49±1.06 41.26±1.75 62.53±1.06 58.16±0.70 53.79±0.51
+Ours 67.24±0.92 64.48±1.60 59.37±0.84 50.57±2.34 66.95±2.41 62.12±1.18 56.84±2.00
GeoQuery Random 24.11±1.06 18.22±0.87 12.35±0.35 7.36±0.51 24.55±0.42 21.55±0.46 20.40±0.41
+Ours 23.67±0.98 22.23±0.27 19.99±0.17 16.29±0.96 22.84±0.68 22.78±0.22 22.34±1.24
TopK 41.48±0.41 32.11±0.69 26.11±1.92 18.57±3.32 40.08±1.57 36.97±1.29 33.27±1.88
+Ours 41.1±0.43 40.82±0.70 41.51±0.76 37.08±0.55 38.72±0.92 36.61±1.10 35.16±0.56
DPP 43.63±0.79 35.91±3.46 25.77±1.34 14.66±0.25 39.11±1.53 35.88±0.98 32.01±2.06
+Ours 41.97±0.05 40.67±1.25 41.01±0.26 36.42±0.12 38.13±0.33 35.11±0.50 33.66±0.05
NL2Bash Random 28.56±0.89 21.45±2.64 19.07±0.75 14.25±2.48 26.87±0.82 25.69±0.26 24.47±0.73
+Ours 26.35±0.20 24.37±0.32 25.44±1.13 20.73±0.29 26.22±0.25 26.54±0.28 26.10±1.75
TopK 31.83±0.10 28.85±0.68 22.3±2.01 17.08±3.47 31.51±0.82 27.73±0.36 24.04±0.91
+Ours 35.10±0.06 34.11±0.51 30.73±0.47 26.04±1.00 34.02±0.41 30.22±0.54 27.12±0.81
DPP 35.13±1.07 31.31±0.85 25.84±0.92 17.28±0.72 34.95±0.43 32.14±0.59 27.61±0.85
+Ours 33.79±0.33 30.84±0.63 30.34±1.02 26.41±1.58 32.88±1.34 31.40±0.59 28.82±0.30
Table 11: Average in-context learning performance with 8 demonstrations on 6 datasets across various types of noisy annotation (over 3 runs). Bold indicates results improved by integrating LPR.
Clean Irrelevant Noise Relevant Noise
Dataset Method 0% 20% 40% 60% 20% 40% 60%
NQ Random 16.25±0.95 11.62±0.24 6.15±0.51 3.17±0.17 12.72±0.44 9.37±0.24 6.17±0.52
+Ours 16.55±0.47 14.08±0.61 11.88±0.69 8.64±0.70 14.58±0.42 13.31±0.64 9.74±0.74
TopK 21.09±0.42 14.91±1.26 8.57±0.40 5.47±0.21 17.98±0.34 12.71±1.02 8.87±0.85
+Ours 20.65±0.09 16.65±0.25 12.41±0.65 8.14±1.33 17.85±0.47 15.45±0.82 12.01±0.65
DPP 19.65±0.31 13.34±0.84 8.64±0.72 5.30±0.00 16.31±0.51 12.48±1.07 8.43±0.48
+Ours 18.68±0.29 16.62±0.45 13.71±0.57 9.54±1.14 17.49±0.41 15.82±0.17 12.22±1.21
WebQ Random 22.70±0.55 15.62±0.34 8.33±0.31 3.22±0.37 19.91±0.32 17.00±0.73 13.52±0.75
+Ours 22.93±0.47 20.62±0.60 16.18±0.89 9.71±0.23 22.36±0.27 20.52±0.28 17.85±1.05
TopK 33.52±0.67 22.49±0.95 12.51±0.92 6.50±0.57 28.69±0.69 24.17±0.31 19.50±0.56
+Ours 31.64±0.10 26.91±0.37 19.52±1.29 12.71±1.50 29.29±0.48 26.32±1.39 21.86±0.60
DPP 31.49±0.27 22.66±0.95 12.51±0.48 5.27±1.43 27.64±0.64 22.90±0.52 17.82±0.00
+Ours 30.39±0.10 26.00±1.01 19.08±0.46 11.47±0.98 28.74±0.43 26.43±1.53 21.97±1.33
SQuAD Random 58.70±0.59 46.63±1.20 27.80±1.42 11.03±0.62 54.37±0.66 46.57±1.02 35.90±1.71
+Ours 57.73±0.79 56.87±0.47 48.50±0.86 33.00±1.31 57.70±0.65 53.93±0.33 47.57±0.90
TopK 58.97±0.42 49.80±1.44 34.87±1.68 15.53±2.23 56.63±0.80 49.60±0.78 36.37±1.16
+Ours 58.33±0.29 55.60±0.16 48.30±0.08 30.27±2.08 58.60±1.56 55.13±1.33 44.40±0.86
DPP 56.93±0.34 49.63±1.35 33.17±0.45 16.50±1.31 55.63±1.11 49.07±0.74 36.47±2.09
+Ours 57.67±0.82 56.30±0.14 50.53±1.14 34.03±2.43 57.83±0.25 53.87±1.25 44.87±2.05
SCIQ Random 68.70±0.16 52.47±0.59 28.46±1.13 12.30±3.25 63.79±0.64 49.82±1.66 37.18±1.23
+Ours 69.54±0.33 62.58±1.22 43.79±2.46 26.49±1.83 66.55±1.70 58.62±0.65 42.13±3.07
TopK 68.91±0.22 53.73±0.22 31.25±1.35 12.87±2.46 62.13±0.77 50.05±1.05 35.32±1.45
+Ours 70.00±0.49 65.23±1.78 47.53±3.64 28.10±4.98 67.87±2.00 58.39±3.14 43.33±4.97
DPP 68.33±0.72 55.80±0.90 34.54±1.92 15.29±3.34 62.64±1.87 52.41±3.41 39.65±2.12
+Ours 68.39±0.43 65.14±1.50 48.67±1.62 29.13±2.44 68.04±1.59 58.22±2.01 46.09±1.70
GeoQuery Random 34.03±0.25 25.49±1.64 13.95±2.4 3.02±0.19 32.83±0.25 30.98±0.16 28.72±0.13
+Ours 32.48±0.46 33.36±1.19 31.07±1.66 23.76±0.92 32.42±0.14 31.03±2.29 29.88±0.84
TopK 45.18±0.48 22.07±4.26 10.12±0.19 3.61±1.39 42.63±0.56 40.43±0.28 34.06±0.76
+Ours 45.08±0.56 41.88±0.10 27.72±2.40 13.09±1.91 42.73±0.53 41.37±0.22 35.53±1.81
DPP 46.71±0.29 25.01±2.15 15.29±0.05 8.10±0.95 44.75±0.48 39.74±0.23 33.43±0.65
+Ours 45.89±0.27 46.09±0.57 35.01±0.95 23.43±2.93 44.91±1.02 40.70±0.59 34.66±0.60
NL2Bash Random 30.17±0.54 21.57±1.84 13.74±2.31 4.37±0.62 27.18±0.71 28.10±0.39 26.75±1.08
+Ours 29.30±0.97 28.38±0.19 27.51±2.27 18.32±1.50 26.58±0.37 28.45±0.09 27.26±0.80
TopK 36.17±1.06 29.69±0.16 16.17±1.05 8.50±1.02 33.35±2.46 32.42±0.77 29.08±1.58
+Ours 35.16±0.03 33.32±0.20 27.73±0.97 17.82±4.71 33.14±0.72 32.75±0.20 29.69±0.39
DPP 37.55±0.56 29.87±3.04 16.65±0.36 6.30±1.20 34.61±0.46 32.24±0.69 28.11±0.64
+Ours 36.93±0.89 34.65±0.23 30.45±1.17 25.36±1.00 34.57±1.14 33.31±1.26 31.90±0.74
Table 12: All the datasets used in the experiments.
Task Dataset Train Set Test Set
Open-Domain QA NQ [22] 20,000 1,000
WebQ [5] 1,261 1,213
Reading Comprehension SQuAD [46] 20,000 1,000
SCIQ [56] 6,059 581
Code Generation GeoQuery [39] 530 253
NL2Bash [27] 5,000 606
Table 13: Templates of tasks. Placeholders (e.g., <Question> and <Answer>) will be replaced by real questions or answers.
Dataset Prompt Example
NQ Question: <Question> Answer: <Answer> Question: The bundles of neurons in the cns are called? Answer: Nucleus
WebQ Question: <Question> Answer: <Answer> Question: Where are the libyan refugees going? Answer: Tunisia
SQuAD Support: <Support> Question: <Question> Answer: <Answer> Support: Among the philosophies that have influenced modern architects and their approach to building design are rationalism, empiricism, structuralism, poststructuralism. Question: Which philosophy followed structuralism? Answer: poststructuralism
SCIQ Support: <Support> Question: <Question> Answer: <Answer> Support: Gravity keeps the Moon orbiting Earth. Gravity keeps the planets orbiting the Sun. Question: What keeps the moon orbiting earth? Answer: gravity
GeoQuery Question: <Question> Answer: <Answer> Question: which state is Kalamazoo in Answer: SELECT city.statename FROM city WHERE city.cityname=kalamazoo
NL2Bash Question: <Question> Answer: <Answer> Question: Add "execute" to the permissions of all directories in the home directory tree Answer: find -type d -exec chmod +x {} \;
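The templates in Table 13 can be turned into a prompt by filling each placeholder and concatenating demonstrations before the test input. The sketch below illustrates this under stated assumptions: the `{question}`-style field names, the helper names `format_demo` and `build_prompt`, and the newline separator between demonstrations are all hypothetical choices made here, not details confirmed by the paper or its repository.

```python
# Templates transcribed from Table 13. The {placeholder} field names and the
# newline separator between demonstrations are assumptions of this sketch.
TEMPLATES = {
    "NQ": "Question: {question} Answer: {answer}",
    "WebQ": "Question: {question} Answer: {answer}",
    "SQuAD": "Support: {support} Question: {question} Answer: {answer}",
    "SCIQ": "Support: {support} Question: {question} Answer: {answer}",
    "GeoQuery": "Question: {question} Answer: {answer}",
    "NL2Bash": "Question: {question} Answer: {answer}",
}


def format_demo(dataset, **fields):
    """Fill one template with a complete (input, label) demonstration."""
    return TEMPLATES[dataset].format(**fields)


def build_prompt(dataset, demos, query):
    """Concatenate demonstrations, then the test input with an empty answer slot."""
    parts = [format_demo(dataset, **demo) for demo in demos]
    # The test input reuses the same template with the answer left blank.
    parts.append(format_demo(dataset, answer="", **query).rstrip())
    return "\n".join(parts)


# Example strings taken from the "Example" column of Table 13.
demo = {"question": "Where are the libyan refugees going?", "answer": "Tunisia"}
query = {"question": "The bundles of neurons in the cns are called?"}
prompt = build_prompt("WebQ", [demo], query)
print(prompt)
```

Leaving the final answer slot empty lets the model complete it, so a noisy label in any demonstration sits directly in the context the model conditions on.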
Table 14: An illustration of the effect of the label of demonstration, with three different types of input-label mapping of demonstration. The middle lines are demonstrations, and the last line is the model prediction. The model tends to learn the label of the demonstration.
NQ Test Question: When did computer become widespread in homes and schools? Answer:
Setting In-Context Demonstration Prediction
Clean Question: When did the internet first become available to the public? Answer: 1980s 1980s
Irrelevant Question: When did the internet first become available to the public? Answer: Crude Oil Crude Oil
Relevant Question: When did the internet first become available to the public? Answer: 2010s 2010s
WebQ Test Question: When did computer become widespread in homes and schools? Answer:
Setting In-Context Demonstration Prediction
Clean Question: When did the internet first become available to the public? Answer: 1980s 1980s
Irrelevant Question: When did the internet first become available to the public? Answer: Crude Oil Crude Oil
Relevant Question: When did the internet first become available to the public? Answer: 2010s 2010s
Table 15: An illustration of the effect of the label of demonstration, with three different types of input-label mapping of demonstration. The middle lines are demonstrations, and the last line is the model prediction. The model tends to learn the label of the demonstration.
SQuAD Input Support: The Super Bowl 50 Host Committee has vowed to be "the most giving Super Bowl ever", and will dedicate 25 percent of all money it raises for philanthropic causes in the Bay Area. The committee created the 50 fund as its philanthropic initiative and focuses on providing grants to aid with youth development, community investment and sustainable environments. Question: What is the name of the fund that focuses on youth, community and sustainable environments? Output:
Setting In-Context Demonstration Prediction
Clean Support: UNFPA works in partnership with governments, along with other United Nations agencies, communities, NGOs, foundations and the private sector, to raise awareness and mobilize the support needed to achieve its mission to promote the rights and health of women and young people. Question: With what sort of agencies does UNFPA work? Output: governments 25 percent
Irrelevant Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: Earth Earth
Relevant Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: tissues 50 fund
SCIQ Input Support: All forms of life are built of at least one cell. A cell is the basic unit of the structure and function of living things. Question: What are the smallest structural and functional units of all living organisms? Output:
Setting In-Context Demonstration Prediction
Clean Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: Cells Cells
Irrelevant Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: Earth Earth
Relevant Support: Cells are organized into tissues, tissues are organized into organs. Question: What is considered the smallest unit of the organ? Output: tissues tissues
Table 16: An illustration of the effect of the label of demonstration, with three different types of input-label mapping of demonstration. The middle lines are demonstrations, and the last line is the model prediction. The model tends to learn the label of the demonstration.
GeoQuery Test Question: How high is the highest point of Alabama? Answer:
Setting In-Context Demonstration
Clean Question: How high is the highest point in Montana? Answer: SELECT highlow.highest.elevation FROM highlow WHERE highlow.statename=’Montana’ Prediction: SELECT highlow.highest.elevation FROM highlow WHERE highlow.statename=’Alabama’
Irrelevant Question: How high is the highest point in Montana? Answer: more than 900 million Prediction: What are the highest point in Alabama
Relevant Question: How high is the highest point in Montana? Answer: SELECT city.cityname FROM city WHERE city.statename=’Montana’ Prediction: SELECT city.cityname FROM city WHERE city.statename=’Alabama’
NL2Bash Test Question: List all files in the current directory tree larger than 1000 kb Answer:
Setting In-Context Demonstration
Clean Question: Find and show all files in the current directory tree that are exactly 1000 kB. Answer: find . -size 1000k Prediction: find . -size +1000k
Irrelevant Question: Find and show all files in the current directory tree that are exactly 2000 kB? Answer: Arizona Department of Water Resources Prediction: 3 files
Relevant Question: Find and show all files in the current directory tree that are exactly 2000 kB? Answer: find . -type f -size 2000 -name "*.err" Prediction: find . -type f -size +1000 -name "*.err"
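The "Irrelevant" rows in Tables 14 to 16 pair a question with an answer that has nothing to do with it. One simple way to simulate this setting at a given noise rate is sketched below. This is an illustrative assumption, not the authors' procedure: the function name `inject_irrelevant_noise`, the external `label_pool`, and the seeding scheme are hypothetical. Relevant noise (a plausible but wrong answer for the same question) needs a task-aware generator and is not covered here.

```python
import random


def inject_irrelevant_noise(examples, noise_rate, label_pool, seed=0):
    """Replace the labels of a random subset of (input, label) pairs.

    examples:   list of (input_text, label_text) tuples
    noise_rate: fraction of examples to corrupt, e.g. 0.2, 0.4, 0.6
    label_pool: unrelated answers to draw from (hypothetical external pool)
    Returns the corrupted list and the sorted indices that were changed.
    """
    rng = random.Random(seed)
    n_noisy = round(noise_rate * len(examples))
    noisy_idx = set(rng.sample(range(len(examples)), n_noisy))
    corrupted = []
    for i, (text, label) in enumerate(examples):
        if i in noisy_idx:
            label = rng.choice(label_pool)  # unrelated answer replaces the label
        corrupted.append((text, label))
    return corrupted, sorted(noisy_idx)


# Toy data; the pool entries reuse the irrelevant answers shown in the tables.
examples = [(f"question {i}", f"answer {i}") for i in range(10)]
pool = ["Crude Oil", "Arizona Department of Water Resources"]
noisy, changed = inject_irrelevant_noise(examples, 0.4, pool, seed=1)
print(changed)
print(noisy[changed[0]])
```

Returning the changed indices makes it possible to check afterwards how many corrupted pairs a selection method actually picks as demonstrations.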

NeurIPS Paper Checklist

  1. Claims

    Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    Answer: [Yes]

    Justification: The abstract and introduction clearly state our scope, motivation, method, experimental results, and contributions. See section 1.

    Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  2. Limitations

    Question: Does the paper discuss the limitations of the work performed by the authors?

    Answer: [Yes]

    Justification: Our approach is suboptimal in cases of high noise rates due to the assumption that clean annotations are the majority in the dataset. In addition, we do not provide a theoretical analysis to show how noisy annotations affect ICL, which will be an interesting direction for future research. See section 7.

    Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  3. Theory Assumptions and Proofs

    Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

    Answer: [Yes]

    Justification: Our approach rests on two assumptions that are naturally satisfied in practice. These assumptions are also supported by previous findings in the literature that paragraphs whose representations are close to each other share the same intrinsic task. See section 1.

    Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  16. 4.

    Experimental Result Reproducibility

  17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

  18. Answer: [Yes]

  19. Justification: We describe our method in detail and provide code and data to make our experiments reproducible. See Section 4.

  20. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:

      (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  21. 5. Open access to data and code

  22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

  23. Answer: [Yes]

  24. Justification: We submit our code and data as supplemental materials. We provide instructions that contain the exact command and environment needed to run to reproduce the results.

  25. Guidelines:

    • The answer NA means that the paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  26. 6. Experimental Setting/Details

  27. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

  28. Answer: [Yes]

  29. Justification: We specify all demonstration selection and test details in Section A.2. The full details can be found in our code, which is provided as supplemental material.

  30. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  31. 7. Experiment Statistical Significance

  32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

  33. Answer: [Yes]

  34. Justification: We report the mean and standard deviation over 3 runs to illustrate the statistical significance of our results. We also conduct ablation studies to confirm the superiority of our method. See Table 2 and Figure 3.

  35. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  36. 8. Experiments Compute Resources

  37. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

  38. Answer: [Yes]

  39. Justification: We run our experiments on 8 NVIDIA L40 GPUs. The detailed compute resources and costs are reported in Section A.2.

  40. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  41. 9. Code Of Ethics

  42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

  43. Answer: [Yes]

  44. Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

  45. Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  46. 10. Broader Impacts

  47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

  48. Answer: [N/A]

  49. Justification: This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. See Section 1.

  50. Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  51. 11. Safeguards

  52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  53. Answer: [N/A]

  54. Justification: All pretrained models and datasets used in this paper are available from Hugging Face. See Sections 5.1 and A.2.

  55. Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  56. 12. Licenses for existing assets

  57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  58. Answer: [Yes]

  59. Justification: We cite the original papers that produced the code packages and datasets we use. See Sections A.2 and 5.1.

  60. Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  61. 13. New Assets

  62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  63. Answer: [Yes]

  64. Justification: The dataset used in this paper has been submitted as supplemental materials.

  65. Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  66. 14. Crowdsourcing and Research with Human Subjects

  67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  68. Answer: [N/A]

  69. Justification: Our paper does not involve crowdsourcing nor research with human subjects.

  70. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  71. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

  72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  73. Answer: [N/A]

  74. Justification: Our paper does not involve crowdsourcing nor research with human subjects.

  75. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.