AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability

Sudhanshu Agrawal   Wonseok Jeon   Mingu Lee
Qualcomm AI Research
{sudhagra, wjeon, mingul}@qti.qualcomm.com
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Abstract

Speculative decoding [1] is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, setting a static draft length can negatively impact performance, especially in scenarios where drafting is expensive and there is a high variance in the number of tokens accepted. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training-free and parameter-free criterion that allows for early stopping of the token drafting process by approximating a lower bound on the expected acceptance probability of the drafted token based on the currently observed entropy of the drafted logits. We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%, as well as other training-free draft-stopping techniques by up to 10%, in a variety of settings and datasets. At the same time, we show that AdaEDL is more robust than these techniques and preserves performance in high-sampling-temperature scenarios. Since it is training-free, in contrast to techniques that rely on the training of dataset-specific draft-stopping predictors, AdaEDL can seamlessly be integrated into a variety of pre-existing LLM systems.

1 Introduction

Large Language Models (LLMs) have been shown to have impressive performance on a variety of tasks including creative writing, summarization, and translation [2]. In particular, in recent years, several foundation models such as Llama2 [3], Llama3 [4], GPT-4 [5], and Claude [6] have been shown to exceed expectations and generalize to coding [7], display agentic abilities [8], interpret images [9], and more. In all such systems, an LLM's job remains the same: prediction of the next token via autoregressive generation. Autoregressive generation is fundamentally sequential in nature, since the prediction of the next token can only occur once the previous token has been generated. This reduces the ability of an LLM to parallelize, creating a bottleneck in the maximum number of generated tokens per second (TPS).

Speculative decoding techniques attempt to introduce parallelism to this system. Consider a scenario where the objective is to perform inference on a target model, say Llama2-7B. A smaller draft model with the same tokenizer as the target model, say TinyLlama-1B [10], is then chosen. Given a prompt, the draft model is allowed to run autoregressively, producing a set of candidate draft tokens. These tokens are then consumed by the target model, which produces logits for each draft token, representing a probability distribution. Rejection sampling [1] then guarantees that the draft tokens which are accepted via this process will preserve the distribution of the original target model. Thus, by running a small model autoregressively and a large model in parallel, the system as a whole experiences an increase in average token rate.

However, a limiting factor of this system is that the number of draft tokens produced, referred to as the draft length (DL), if fixed over multiple rounds of drafting, can reduce the average token rate. This may be caused by over-utilization of a poorly performing draft model or, symmetrically, by under-utilization of the draft model, which does not allow the speculative decoding system to reach its maximum possible performance. For example, since most target models are large foundation models, designed to have high accuracy on a variety of tasks, it is likely that a smaller draft model finetuned to match the target model distribution for a particular task may have varying levels of accuracy when the target model switches tasks. In such scenarios, one can see a high variance in the number of accepted draft tokens. That is, in some drafting rounds, almost all the draft tokens are accepted, whereas in some, almost all are rejected. Figure 1 considers the creative writing task from the Dolly-15k dataset [11]. For this dataset, we see that for a standard speculative decoding system operating with various static draft lengths, the number of accepted tokens follows a normal distribution, with num-accepted-tokens taking on almost every value from 0 to max-draft-length with varying frequency. We include similar figures for the CNN-DM (summarization) [12] and WMT-19 (German-English translation) [13] datasets, along with the details to set up these speculative decoding systems, in Appendix A, Figures 7(a) and 7(b).

Figure 1: The number of accepted tokens across drafting rounds displays a high variance, leading to under or over-utilization of the draft model in static draft length speculative decoding methods.

This motivates the need for an adaptive draft length speculative decoding system where the draft length at every drafting round can be determined on-the-fly. AdaEDL approaches this problem with a go-no-go strategy: while drafting tokens, at every iteration, AdaEDL establishes a draft-stopping criterion by approximating a lower bound on the expected acceptance probability of the drafted token, using the entropy of the draft model logits at that iteration. If the criterion is satisfied, drafting stops and verification by the target model is performed. We provide a theoretical basis for our proposed draft-stopping criterion, deriving how it relates to the acceptance probability of the draft model. To validate our approach, we perform experiments across various maximum draft length settings, across multiple datasets and sampling temperature settings, and for various target and draft model choices, showing that this new draft-stopping strategy is more effective and robust than draft-stopping strategies which simply use the probability value of the most-likely token as a draft-stopping criterion, for example, those used in BiLD [14] and Draft & Verify [15]. Indeed, AdaEDL may be used to complement either of these algorithms. At the same time, our proposed system avoids the need to train an independent network to act as a binary classifier for early draft-stopping for a specific model and task, such as the approaches followed by SpecDec++ [16] and DISCO [17], which makes AdaEDL a simple and straightforward improvement to boost the token rate of speculative decoding LLM systems.

2 Problem setting and existing methods

Following a notation similar to the original speculative decoding formulation, let $TM$ be the target model whose inference we are trying to accelerate. Let $DM$ be a more efficient approximation of this target model, referred to as the draft model. Let us denote the $t^{th}$ token in the prompt by $x_t$. Then the probability distribution of $TM$ at any given token $x_t$ may be denoted as $p_{TM}(x_t|x_{<t})$. Similarly, the probability distribution of the draft model, $DM$, may be denoted as $p_{DM}(x_t|x_{<t})$. As noted in [1], several strategies such as nucleus sampling, top-k sampling, and others may each be viewed as sampling from an adjusted probability distribution. For notational simplicity, we refer to these probabilities, adjusted or not, as $p_{TM}(x)$ and $p_{DM}(x)$. In this notation, we can now describe the various decoding techniques:

Autoregressive decoding

In this baseline technique, we only consider the target model and at every iteration, a new token is sampled as

$x_t \sim p_{TM}(\cdot|x_{<t})$

Speculative decoding

The system first allows a draft model to consume the tokens $x_{<t}$. The draft model then autoregressively generates $L$ draft tokens $d_1, d_2, \cdots, d_L$, where $L$ is the maximum draft length. This can be viewed as sampling a token

$d_i \sim p_{DM}(\cdot|x_{<t}, d_1, \cdots, d_{i-1})$

After drafting, the target model evaluates these draft tokens in parallel, producing probabilities $p_{TM}(x)$. The draft token $d_i$ is accepted if $p_{DM}(x) \leq p_{TM}(x)$, producing a token $x_i = d_i$. Otherwise, it is rejected with probability $1 - p_{TM}(x)/p_{DM}(x)$ and a new token $x_i$ is sampled from an adjusted distribution $x_i \sim p_{TM}'(x) = norm(max(0, p_{TM}(x) - p_{DM}(x)))$.
A token $x_i$ sampled via this method, referred to as rejection sampling, is guaranteed to follow the probability distribution of the original target model, that is, $x_i \sim p_{TM}(x)$ [1].
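The accept/reject step described above can be sketched for a single drafted token as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `verify_draft_token` and the list-based representation of the two distributions are assumptions made for the example.

```python
import random


def verify_draft_token(token, p_dm, p_tm, rng=random):
    """Accept or reject one drafted token; returns (token, accepted).

    p_dm and p_tm are hypothetical lists of probabilities over the same
    vocabulary, and token is an index that was sampled from p_dm.
    """
    # Accepting with probability min(1, p_TM / p_DM) covers both cases in the
    # text: always accept when p_DM <= p_TM, otherwise reject with
    # probability 1 - p_TM / p_DM.
    ratio = p_tm[token] / p_dm[token]
    if rng.random() < min(1.0, ratio):
        return token, True
    # On rejection, resample from norm(max(0, p_TM - p_DM)).
    residual = [max(0.0, t - d) for t, d in zip(p_tm, p_dm)]
    total = sum(residual)
    weights = [r / total for r in residual]
    new_token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return new_token, False
```

Drawing a draft token from `p_dm` and passing it through `verify_draft_token` many times should produce tokens whose empirical frequencies approach `p_tm`, which is the guarantee from [1] that the paragraph above relies on.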

To further improve this process, the draft length $L$ may be made adaptive, to avoid the cost of drafting tokens that are likely to be rejected anyway.

Adaptive draft length via maximum confidence speculative decoding

Some systems consider the top-1 probability of the drafted logits. That is, at a given drafting stage, the system considers the token with the highest probability $\max_x p_{DM}(x|x_{<t}, d_1, \cdots, d_{i-1})$. If this value, the maximum confidence among any possible token, is less than a threshold, $\lambda$, the system is considered to be under-confident and drafting stops.

$\max_x p_{DM}(x) < \lambda$

This is a simple method to avoid wastage during the drafting phase and is effectively employed alongside other techniques in Draft & Verify [15] and BiLD [14]. However, this scheme fails to take into account the overall probability distribution during drafting. This motivates the need for a new drafting stopping mechanism that considers the probabilities of all possible tokens while making a go-no-go decision.
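To make the limitation concrete, the following sketch applies the max-confidence rule to two hypothetical draft distributions that share the same top-1 probability. The function names and the example distributions are illustrative assumptions, and the entropy is computed in nats, a log base the text does not specify.

```python
import math


def max_confidence_should_stop(probs, lam):
    """Stop drafting when the top-1 draft probability falls below lam."""
    return max(probs) < lam


def entropy(probs):
    """Shannon entropy in nats (the log base is an assumption of this sketch)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two hypothetical draft distributions with the same top-1 probability.
two_way_tie = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
long_tail = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

for lam in (0.4, 0.6):
    # The rule only looks at max(probs), so both get the same decision.
    same = (max_confidence_should_stop(two_way_tie, lam)
            == max_confidence_should_stop(long_tail, lam))
    print(lam, same)

# The entropies differ, which is the information the rule discards.
print(entropy(two_way_tie), entropy(long_tail))
```

Whatever threshold is chosen, the two distributions are indistinguishable to the max-confidence rule, even though the second spreads its remaining mass over many more tokens.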

Adaptive draft length via a trained predictor

Several works involve the training of a small network to act as a predictor to determine optimal draft lengths, for example, SpecDec++ [16], which trains a ResNet, and DISCO [17], which trains an FFN as a predictor for early draft stopping. AdaEDL is distinct from these methods and specifically aims to be training-free and parameter-free, similar to draft stopping via maximum confidence, to allow for greater generalization across datasets and models.

3 Adaptive entropy-based draft length speculative decoding (AdaEDL)

AdaEDL operates in the same setup as the above adaptive draft length techniques but instead establishes a stopping criterion using an entropy-based lower bound on the token acceptance probability. If we consider the probability distribution of the draft model, $p_{DM}(x)$, and its corresponding entropy $H_{DM}(x)$, we see that $1 - \sqrt{\gamma H_{DM}(x)}$ serves as an approximate lower bound on the expected acceptance rate. This allows us to formulate a drafting system where drafting is stopped if this quantity falls below a threshold $\lambda$, indicating that the expected acceptance rate is also below this threshold:

$1 - \sqrt{\gamma H_{DM}(x)} < \lambda$

Here, $\gamma > 0$ is a hyperparameter. We include a detailed derivation of this lower-bound approximation in Appendix B. Additionally, we visualize AdaEDL alongside the baseline speculative decoding systems mentioned above in Figure 2, highlighting the novel early draft-stopping criterion we introduce.
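A minimal sketch of this stopping rule inside a drafting loop is given below. It is an illustration under stated assumptions rather than the authors' code: `next_token_probs` and `sample` are hypothetical callables standing in for the draft model and its sampler, the entropy is measured in nats, and the check is performed before a token is sampled (the text does not say whether the token at the stopping step is kept).

```python
import math


def entropy(probs):
    """Shannon entropy in nats; clamped at zero against rounding error."""
    return max(0.0, -sum(p * math.log(p) for p in probs if p > 0.0))


def acceptance_lower_bound(probs, gamma=0.2):
    """Approximate lower bound 1 - sqrt(gamma * H_DM) on the acceptance rate."""
    return 1.0 - math.sqrt(gamma * entropy(probs))


def adaedl_should_stop(probs, lam, gamma=0.2):
    """True when the bound falls below the stopping threshold lam."""
    return acceptance_lower_bound(probs, gamma) < lam


def draft_with_early_stopping(next_token_probs, sample, prefix,
                              max_draft_length, lam, gamma=0.2):
    """Draft up to max_draft_length tokens, stopping early on high entropy.

    next_token_probs(tokens) -> list of probabilities (hypothetical draft model)
    sample(probs) -> token index (hypothetical sampler)
    """
    drafted = []
    for _ in range(max_draft_length):
        probs = next_token_probs(prefix + drafted)
        # Assumption: stop before sampling from a high-entropy distribution.
        if adaedl_should_stop(probs, lam, gamma):
            break
        drafted.append(sample(probs))
    return drafted
```

For example, a one-hot draft distribution has zero entropy and a bound of exactly 1, so drafting continues for any `lam` up to 1, whereas flatter distributions push the bound down toward the threshold.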

Figure 2: AdaEDL performs adaptive early-draft-stopping via an entropy-based lower bound on the expected acceptance rate.

Improving $\lambda$ via dynamic updates

The stopping threshold $\lambda$ can be further optimized by making it responsive to the current acceptance rate statistics of the system. In particular, we aim to achieve a target acceptance rate $\alpha$, increasing the stopping threshold if it is not currently being reached and decreasing it otherwise. By maintaining an exponential moving average of the acceptance rate and using it to update the stopping threshold in this manner, we can achieve better performance over longer generations. We modify the threshold update strategy described in [15] and describe it in Algorithm 1.

Algorithm 1 Dynamic updates for stopping threshold $\lambda$
$L \leftarrow$ max_draft_length
$n_{drafted} \leftarrow$ num_drafted_tokens
$n_{acc} \leftarrow$ num_accepted_tokens
$i \leftarrow$ cur_drafting_round
$\lambda \leftarrow$ cur_stopping_threshold
while $i \geq 1$ do
    $AR_i \leftarrow n_{acc} / n_{drafted}$    ▷ Calculate the current acceptance rate
    $AR \leftarrow \beta_1 AR + (1 - \beta_1) AR_i$
    if $AR < \alpha$ then    ▷ Not yet meeting the target acceptance rate
        $\lambda' \leftarrow \lambda + \epsilon$
    else if $n_{acc} \neq L$ then    ▷ Not yet drafting max possible tokens
        $\lambda' \leftarrow \lambda - \epsilon$
    else
        $\lambda' \leftarrow \lambda$
    end if
    $\lambda \leftarrow \beta_2 \lambda + (1 - \beta_2) \lambda'$    ▷ Update the stopping threshold
end while

Hyperparameters: $\alpha$ = target acceptance rate, $\epsilon$ = step size; $\beta_1$ and $\beta_2$ are used to calculate the exponential moving averages. Refer to Section 4.4 for additional details.

We further discuss the hyperparameters introduced here, $\gamma$, $\lambda$, $\beta_1$, $\beta_2$, $\alpha$, and $\epsilon$, their typical values and ranges, as well as the sensitivity of AdaEDL to these hyperparameters in Section 4.4.
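One round of Algorithm 1 can be sketched in Python as below. This is a hedged illustration: the `ThresholdState` container, the function name, and the guard for a round with no drafted tokens are assumptions introduced for the example, and the algorithm as written does not say how the running average $AR$ is initialised. The default hyperparameter values are those reported in Section 4.4.

```python
from dataclasses import dataclass


@dataclass
class ThresholdState:
    lam: float  # current stopping threshold (lambda)
    ar: float   # running average of the acceptance rate (AR)


def update_threshold(state, n_accepted, n_drafted, max_draft_length,
                     alpha=0.9, eps=0.01, beta1=0.5, beta2=0.9):
    """Apply one iteration of the loop body of Algorithm 1."""
    if n_drafted == 0:
        # Assumption: leave the state unchanged if nothing was drafted.
        return state
    ar_i = n_accepted / n_drafted                 # current acceptance rate
    ar = beta1 * state.ar + (1 - beta1) * ar_i    # moving average of AR
    if ar < alpha:
        lam_new = state.lam + eps                 # below target: raise threshold
    elif n_accepted != max_draft_length:
        lam_new = state.lam - eps                 # target met: lower threshold
    else:
        lam_new = state.lam
    lam = beta2 * state.lam + (1 - beta2) * lam_new  # smoothed threshold update
    return ThresholdState(lam=lam, ar=ar)
```

Under these defaults, a single round moves $\lambda$ by at most $(1 - \beta_2)\epsilon = 0.001$, so the threshold adapts gradually and many drafting rounds are needed to move far from its initial value.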

4 Experimental Results

In all of the following results, AdaEDL refers to our proposed method, described in Section 3, with dynamically updated stopping thresholds as described by Algorithm 1. Max-Confidence-SPD refers to a speculative decoding system which implements the early draft-stopping strategy based on the token with the highest probability described in Section 2, which is further improved by using the same dynamic threshold strategy to have a fair comparison with AdaEDL. Base-SPD refers to vanilla speculative decoding with no early draft-stopping strategy employed, and Autoregressive refers to standard autoregressive decoding of LLMs, which is the baseline that speculative decoding systems attempt to improve on.

We evaluate the performance of AdaEDL across various datasets and tasks: Dolly-15k (creative writing) [11], WMT-19 (German-English translation) [13], and CNN-DM (summarization) [12]. The entire Dolly test dataset (708 samples), the entire WMT-19 test dataset (600 samples), and the first 1000 samples of the CNN-DM test dataset were used to ensure a large sample size. All token rate (TPS) numbers reported are calculated as the averages over all samples in each dataset. All experiments were performed on a single NVIDIA A100 with 80GB of GPU memory in FP32 precision. We acknowledge here that results may vary on different hardware which may lead to faster or slower inference of a draft or target model and in particular, their relative speed. We discuss this point in Appendix B.2.

We show in the following experiments that AdaEDL will always either exceed or match the performance of other decoding systems while also proving less sensitive to chosen hyperparameters and robust to the system’s settings.

4.1 Performance with fast finetuned draft models

In the following experiments, we use Llama2-7B as the target model with sampling temperature set to 0.7. The draft model, Llama2-Drafter-115M, is a 115M parameter model, distilled and finetuned using direct alignment [18] to closely match the target model distribution. As a result, even Base-SPD results in higher acceptance rates than one would observe from an off-the-shelf draft model.

In Table 1, we indicate the performance of AdaEDL across the various datasets and tasks mentioned above - Dolly-15k, WMT-19, and CNN-DM and report the average token rate on each of these datasets. We observe that AdaEDL consistently matches or beats both Max-Confidence-SPD and Base-SPD on all tasks by a competitive margin. In particular, AdaEDL has a significant advantage in the open-ended CNN-DM summarization task.

Table 1: Performance of AdaEDL vs Max-Confidence-SPD vs Base-SPD for various maximum draft lengths and datasets. Target model = Llama2-7B, draft model = Llama2-Drafter-115M, sampling temperature = 0.7. Values are average token rates in tokens per second (TPS); the best value in each column is in bold.

| Max DL | Method | CNN-DM | Dolly-15k | WMT-19 |
|---|---|---|---|---|
| 16 | Autoregressive | 25.74 | 29.02 | 29.80 |
| 16 | Base-SPD | 36.30 | 32.10 | 22.30 |
| 16 | Max-Confidence-SPD | 49.50 | 55.80 | 43.70 |
| 16 | AdaEDL | $\mathbf{54.10}$ | $\mathbf{56.10}$ | $\mathbf{43.90}$ |
| 7 | Autoregressive | 25.74 | 29.02 | 29.80 |
| 7 | Base-SPD | 51.50 | 47.60 | 32.70 |
| 7 | Max-Confidence-SPD | 53.50 | 56.60 | 45.10 |
| 7 | AdaEDL | $\mathbf{56.90}$ | $\mathbf{57.10}$ | $\mathbf{45.20}$ |
| 3 | Autoregressive | 25.74 | 29.02 | 29.80 |
| 3 | Base-SPD | 54.10 | 54.70 | 40.90 |
| 3 | Max-Confidence-SPD | 53.10 | 55.70 | 42.50 |
| 3 | AdaEDL | $\mathbf{55.70}$ | $\mathbf{55.80}$ | $\mathbf{45.00}$ |

4.2 Performance of AdaEDL with expensive draft models

A smaller draft model may be faster at drafting, but sacrifices acceptance rate. Symmetrically, a larger draft model may lead to higher acceptance rates, but lower overall inference speed due to its own cost of inference.

4.2.1 Pythia

The Pythia family of models [19] consists of models of sizes ranging from 70M to 12B parameters. As a result, they represent an ideal set of models to test the performance of various decoding systems when the ratio of the target model size to the draft model size is varied, that is, when $\frac{|TM|}{|DM|}$ is varied.

Figure 3 sets Pythia-6.9B as the target model and considers the performance of Pythia-1B, Pythia-410M, Pythia-160M, and Pythia-70M as draft models for the CNN-DM (summarization) task. We demonstrate that adaptive draft length techniques enable us to use up to $10\times$ larger draft models in a scenario where otherwise, the maximum token rate achievable with Base-SPD would have been 32.2 tokens per second (TPS). We see that when Pythia-160M or Pythia-70M is used as the draft model, autoregressive decoding outperforms all speculative decoding methods with a token rate of 29.7 TPS. Without adaptive draft length techniques, the maximum token rate achievable through Base-SPD is 32.2 TPS with Pythia-1B as a draft model, for an 8% speedup. However, when Pythia-1B is used to draft for Pythia-6.9B with AdaEDL, the system achieves a token rate of 46.4 TPS, for a 56% speedup. Thus, AdaEDL enables speculative decoding in a scenario where autoregressive decoding would normally be the candidate method. This also opens up the possibility of finetuning larger draft models, which would presumably lead to higher acceptance rates without sacrificing performance if AdaEDL were to be enabled. We believe that this is a promising direction of investigation for future work.
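The speedup percentages quoted above follow directly from the reported token rates relative to the 29.7 TPS autoregressive baseline. The short check below reproduces the arithmetic; the helper name `speedup_percent` is introduced only for this example.

```python
def speedup_percent(token_rate, baseline_rate):
    """Relative gain in token rate over a baseline, as a percentage."""
    return (token_rate / baseline_rate - 1.0) * 100.0


AUTOREGRESSIVE_TPS = 29.7  # autoregressive token rate reported in the text

# Base-SPD with Pythia-1B as the draft model (32.2 TPS).
print(round(speedup_percent(32.2, AUTOREGRESSIVE_TPS)))  # 8

# AdaEDL with Pythia-1B as the draft model (46.4 TPS).
print(round(speedup_percent(46.4, AUTOREGRESSIVE_TPS)))  # 56
```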

Figure 3: Target model = Pythia-6.9B with various $|TM|/|DM|$ ratios demonstrates that AdaEDL opens up the possibility of using speculative decoding with much larger draft models. Max draft length = 7, sampling temperature = 1.0, dataset = CNN-DM.

4.2.2 TinyLlama

Motivated by the results in Section 4.2.1, Figure 4 shows the performance of various decoding methods when the target model is Llama2-7B and the draft model is a standard 1B model, TinyLlama-1B [10]. We see that in this case, AdaEDL and Max-Confidence-SPD increase the token rate by 43% as compared to autoregressive decoding, while Base-SPD negatively impacts performance, reducing the token rate by 16%.

Figure 4: With an adaptive draft length decoding scheme, the pairing of target model = Llama2-7B and draft model = TinyLlama-1B surpasses autoregressive and Base-SPD performance. Max draft length = 7, sampling temperature = 1.0, dataset = CNN-DM.

4.3 Performance of AdaEDL across sampling temperatures

In cases where the target distribution is difficult to predict, such as when the chosen sampling temperature is high, we see that Base-SPD is only able to produce modest gains in token rate, sometimes even resulting in poorer performance than autoregressive decoding. In Figure 5 we see that as we increase the sampling temperature from 0.7 to 1.7 on the CNN-DM dataset, the token rate of Base-SPD drops 57% (for draft length 3), and ends up being lower than standard autoregressive decoding. On the other hand, even at high sampling temperatures, AdaEDL provides an 8% boost in token rate over the autoregressive baseline. AdaEDL also consistently outperforms the other 3 decoding methods across lower sampling temperatures and across maximum draft length settings. We perform similar experiments on the Dolly-15k and WMT-19 datasets in Appendix C, Figures 8(a) and 8(b), and see similarly consistent performance from AdaEDL.
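As background for why high temperatures are a harder regime, the sketch below shows that dividing logits by a larger temperature flattens the softmax distribution and raises its entropy. The logits are made up for illustration, the entropy is in nats, and whether the same temperature is applied to the draft model's logits in the experiments is an assumption of this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


hypothetical_logits = [2.0, 1.0, 0.0, -1.0]  # illustrative values only
for temperature in (0.7, 1.0, 1.3, 1.7):
    probs = softmax_with_temperature(hypothetical_logits, temperature)
    # Entropy grows with temperature as the distribution flattens.
    print(temperature, round(entropy(probs), 3))
```

A flatter distribution carries more entropy, which is the quantity the AdaEDL criterion monitors when deciding whether to keep drafting.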

Figure 5: AdaEDL boosts token rate even in high temperature scenarios where Base-SPD is insufficient. Target model = Llama2-7B, draft model = Llama2-Drafter-115M, dataset = CNN-DM.

4.4 Controllability and sensitivity of AdaEDL to hyperparameters

Entropy factor ($\gamma$)

In all our experiments, we set $\gamma = 0.2$ in the equation $1 - \sqrt{\gamma H_{DM}(x)}$. We see that this simple approximation described in Appendix B, when coupled with the dynamic stopping threshold strategy described in Algorithm 1, produces strong results. Further hyperparameter search may be possible to improve this value for a particular dataset and we observe that values in the range $\gamma \in (0, 1)$ typically work best. It may also be possible to dynamically update $\gamma$ using the observed target model distribution as the system runs. We defer these investigations to future works.
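To see how $\gamma$ and $\lambda$ interact, the stopping rule can be rearranged into an equivalent threshold on the entropy itself. The rearrangement below assumes $\lambda \leq 1$, so that $1 - \lambda$ is non-negative and both sides can be squared; the numerical value of $\lambda$ used afterwards is purely illustrative and is not taken from the experiments.

```latex
\[
1 - \sqrt{\gamma H_{DM}(x)} < \lambda
\;\iff\;
\sqrt{\gamma H_{DM}(x)} > 1 - \lambda
\;\iff\;
H_{DM}(x) > \frac{(1 - \lambda)^2}{\gamma}
\]
```

For instance, with $\gamma = 0.2$ and an illustrative $\lambda = 0.5$, drafting stops once $H_{DM}(x)$ exceeds $0.25 / 0.2 = 1.25$, in whatever units the entropy is measured. Increasing $\gamma$ lowers this entropy cutoff and therefore stops drafting earlier for the same $\lambda$.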

Threshold update hyperparameters ($\beta_1$, $\beta_2$, $\epsilon$, $\alpha$)

In all our experiments, we set $\beta_1 = 0.5$, $\beta_2 = 0.9$, $\epsilon = 0.01$, and $\alpha = 0.9$, following [15], for the threshold update step for both AdaEDL and Max-Confidence-SPD.

Figure 6: AdaEDL performs consistently better than Max-Confidence-SPD even at suboptimal $\lambda$. Target model = Llama2-7B, draft model = Llama2-Drafter-115M, dataset = CNN-DM, max draft length = 7.

Stopping threshold ($\lambda$)

The most significant hyperparameter in adaptive draft length methods is the early draft-stopping threshold, $\lambda$. In all our experiments, $\lambda$ is updated dynamically according to the scheme described in Algorithm 1. Regardless, we find that both AdaEDL and Max-Confidence-SPD are sensitive to the initial choice of $\lambda$. We hypothesize that this may be due to the fact that many output generations are short in length, which may not result in enough drafting rounds for the system to converge to an optimal threshold. In Figure 6 we see that AdaEDL, even for sub-optimal $\lambda$ choices, consistently outperforms Max-Confidence-SPD when its $\lambda$ is also chosen sub-optimally. At the same time, AdaEDL performs better than this baseline if optimal $\lambda$ are chosen for both methods, as we see marked by the dashed line and in Section 4.1. We conduct experiments across datasets (CNN-DM, Dolly-15k, WMT-19), sampling temperatures (0.7, 1.0, 1.3, 1.7), and maximum draft length settings (3, 7, 16), and observe similar trends in Appendix D, Figures 9(a), 9(b), and 9(c), showing that AdaEDL can consistently boost token rate without the need for fine-grained hyperparameter search.

5 Conclusion

In this work, we present AdaEDL, an early stopping criterion for drafting in speculative decoding systems which uses the entropy of the draft model to estimate a lower bound on the current token's acceptance rate. We show the efficacy of this new method across datasets, sampling temperatures, draft lengths, and choice of target and draft models, whether fine-tuned or off-the-shelf. AdaEDL boosts the performance of existing speculative decoding systems while also enabling efficient usage of much larger draft models which, if finetuned in future works, could potentially result in even more impressive gains in token rate. AdaEDL is training-free and parameter-free, is not dependent on a given dataset, and also offers a relaxed choice in hyperparameters, making it a simple, plug-and-play improvement to a variety of pre-existing speculative decoding LLM systems.

References

  • Leviathan et al. [2023] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023. URL https://arxiv.org/abs/2211.17192.
  • Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
  • Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, 
Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, 
Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. 
The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
  • OpenAI et al. [2024] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel 
Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
  • Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback, 2022. URL https://arxiv.org/abs/2212.08073.
  • Rozière et al. [2024] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code Llama: Open foundation models for code, 2024. URL https://arxiv.org/abs/2308.12950.
  • Liu et al. [2023a] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2023a. URL https://arxiv.org/abs/2308.03688.
  • Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023b. URL https://arxiv.org/abs/2304.08485.
  • Zhang et al. [2024a] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model, 2024a. URL https://arxiv.org/abs/2401.02385.
  • Conover et al. [2023] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
  • See et al. [2017] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. URL https://www.aclweb.org/anthology/P17-1099.
  • [13] Wikimedia Foundation. ACL 2019 fourth conference on machine translation (WMT19), shared task: Machine translation of news. URL http://www.statmt.org/wmt19/translation-task.html.
  • Kim et al. [2023] Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, and Kurt Keutzer. Speculative decoding with big little decoder, 2023. URL https://arxiv.org/abs/2302.07863.
  • Zhang et al. [2024b] Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding, 2024b. URL https://arxiv.org/abs/2309.08168.
  • Huang et al. [2024] Kaixuan Huang, Xudong Guo, and Mengdi Wang. SpecDec++: Boosting speculative decoding via adaptive candidate lengths, 2024. URL https://arxiv.org/abs/2405.19715.
  • Mamou et al. [2024] Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, and Roy Schwartz. Dynamic speculation lookahead accelerates speculative decoding of large language models, 2024. URL https://arxiv.org/abs/2405.04304.
  • Goel et al. [2024] Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. Direct alignment of draft model for speculative decoding with chat-fine-tuned LLMs, 2024. URL https://arxiv.org/abs/2403.00858.
  • Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373.
  • Levin et al. [2008] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. 2008. URL https://api.semanticscholar.org/CorpusID:117035435.
  • Rioul [2024] Olivier Rioul. A historical perspective on Schützenberger-Pinsker inequalities (extended version). Information Geometry, May 2024. ISSN 2511-249X. doi: 10.1007/s41884-024-00138-z. URL http://dx.doi.org/10.1007/s41884-024-00138-z.
  • Arora et al. [2023] Kushal Arora, Timothy J. O’Donnell, Doina Precup, Jason Weston, and Jackie C. K. Cheung. The stable entropy hypothesis and entropy-aware decoding: An analysis and algorithm for robust natural language generation, 2023. URL https://arxiv.org/abs/2302.06784.

Appendix A Acceptance rate variance for standard speculative decoding systems

We see that across tasks such as summarization (CNN-DM) in Figure 7(a), translation (WMT-19) in Figure 7(b), and creative writing (Dolly-15k) in Figure 7(c), standard speculative decoding systems display a large variance in the number of tokens accepted per drafting round. This is observed across draft lengths 3, 7, and 16. The effect is particularly pronounced in the CNN-DM and Dolly-15k datasets, in which we observe standard deviations of $\sim 2$ tokens even with a maximum draft length of only 7 (i.e., a standard deviation of roughly 29% of the maximum draft length), opening up room for significant optimization. Experiments are conducted with Llama2-7B as the target model and a 115M draft model fine-tuned via direct alignment with the target model distribution [18].

(a) Dataset: CNN-DM
(b) Dataset: WMT-19
(c) Dataset: Dolly-15k
Figure 7: Across datasets and draft lengths, the number of accepted tokens across drafting rounds displays a high variance, leading to under- or over-utilization of the draft model in static draft length methods. Target model = Llama2-7B, draft model = Llama2-Drafter-115M, sampling temperature = 1.0.

Appendix B Derivation of entropy-based draft-stopping criterion

Let $p_{DM}(x)$ and $p_{TM}(x)$ be the currently observed probability distributions of the draft and target models given some prefix $x_{<t}$. Following [1], the acceptance probability $\beta$ of a token drafted from $p_{DM}(x)$, defined via rejection sampling, is

$$\beta = \mathbb{E}_{x \sim p_{DM}(x)} \begin{cases} 1 & \text{if } p_{DM}(x) \leq p_{TM}(x) \\ \dfrac{p_{TM}(x)}{p_{DM}(x)} & \text{if } p_{DM}(x) > p_{TM}(x) \end{cases}$$

On discrete domains, the total variation distance between these distributions is defined as [20]:

$$TVD(p_{DM} \| p_{TM}) = \dfrac{1}{2} \sum_{x} \left\lvert p_{TM}(x) - p_{DM}(x) \right\rvert$$

We see in [1] that $TVD(p_{DM} \| p_{TM})$ is related to the acceptance probability $\beta$ as

$$\beta = 1 - TVD(p_{DM} \| p_{TM})$$
$$\implies TVD(p_{DM} \| p_{TM}) = 1 - \beta$$
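As a small numerical illustration of the identity $\beta = 1 - TVD(p_{DM} \| p_{TM})$, the sketch below evaluates both sides on a toy vocabulary. The two distributions and the helper names `acceptance_probability` and `total_variation_distance` are assumptions made for this example only and are not taken from the experiments in this paper.

```python
def acceptance_probability(p_dm, p_tm):
    """Expected acceptance probability of a token drafted from p_dm.

    Mirrors the piecewise definition: a drafted token x is accepted with
    probability 1 if p_dm(x) <= p_tm(x), and p_tm(x) / p_dm(x) otherwise.
    """
    beta = 0.0
    for d, t in zip(p_dm, p_tm):
        if d == 0.0:
            continue  # tokens with zero draft probability are never drafted
        accept = 1.0 if d <= t else t / d
        beta += d * accept  # expectation over x ~ p_dm
    return beta


def total_variation_distance(p_dm, p_tm):
    """TVD on a discrete domain: half the L1 distance."""
    return 0.5 * sum(abs(t - d) for d, t in zip(p_dm, p_tm))


# Hypothetical next-token distributions over a 4-token vocabulary.
p_dm = [0.70, 0.20, 0.05, 0.05]
p_tm = [0.50, 0.30, 0.10, 0.10]

beta = acceptance_probability(p_dm, p_tm)
tvd = total_variation_distance(p_dm, p_tm)
print(beta, 1.0 - tvd)
```

On this toy pair, the two printed quantities agree up to floating-point error, as the identity predicts.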

Moreover, by Pinsker’s inequality [21], we may relate the total variation distance to the Kullback-Leibler divergence, $KLD(p_{DM} \| p_{TM})$:

$$TVD(p_{DM} \| p_{TM}) \leq \sqrt{\frac{1}{2} KLD(p_{DM} \| p_{TM})}$$
$$\implies 1 - \beta \leq \sqrt{\frac{1}{2} KLD(p_{DM} \| p_{TM})}$$

This gives us

$$1 - \sqrt{\frac{1}{2} KLD(p_{DM} \| p_{TM})} \leq \beta$$
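The bound above can be checked numerically. The following is a minimal sketch, assuming the KL divergence is measured in nats (natural logarithm), which is the convention under which Pinsker's inequality takes the form used here. The toy distributions and function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def pinsker_lower_bound(p_dm, p_tm):
    """Lower bound 1 - sqrt(KLD / 2) on the acceptance probability."""
    return 1.0 - math.sqrt(0.5 * kl_divergence(p_dm, p_tm))


# Hypothetical next-token distributions over a 4-token vocabulary.
p_dm = [0.70, 0.20, 0.05, 0.05]
p_tm = [0.50, 0.30, 0.10, 0.10]

# Exact acceptance probability: sum_x min(p_dm(x), p_tm(x)).
beta = sum(min(d, t) for d, t in zip(p_dm, p_tm))
bound = pinsker_lower_bound(p_dm, p_tm)
print(bound, beta)
```

Note that the bound becomes negative, and therefore uninformative, whenever the KL divergence exceeds 2 nats.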

Further, $KLD(p_{DM} \| p_{TM})$ relates to the cross-entropy $CE(p_{DM}, p_{TM})$ via

$$KLD(p_{DM} \| p_{TM}) = CE(p_{DM}, p_{TM}) - H_{DM}$$

where $H_{DM}$ is the entropy of the draft model distribution.

Note that, while drafting, we do not yet have access to the target model distribution $p_{TM}(x)$. We do know, however, that since $KLD \geq 0$,

$$CE(p_{DM}, p_{TM}) \geq H_{DM}$$
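The decomposition of the KL divergence into cross-entropy minus entropy, and the resulting inequality $CE(p_{DM}, p_{TM}) \geq H_{DM}$, can be verified on a toy example. This sketch assumes the convention $CE(p, q) = -\sum_x p(x) \log q(x)$ with natural logarithms. The distributions and function names are illustrative assumptions.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    """Cross-entropy CE(p, q) = -sum_x p(x) log q(x) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    """KL(p || q) computed directly from its definition."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token distributions over a 4-token vocabulary.
p_dm = [0.70, 0.20, 0.05, 0.05]
p_tm = [0.50, 0.30, 0.10, 0.10]

h_dm = entropy(p_dm)
ce = cross_entropy(p_dm, p_tm)
kld = kl_divergence(p_dm, p_tm)
print(ce - h_dm, kld)  # the two values should coincide
```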

In this work, we choose a linear approximation of $CE(p_{DM}, p_{TM})$ via a positive factor $\gamma' \in (0, 1)$:

$$CE(p_{DM}, p_{TM}) = (1 + \gamma') H_{DM}$$

This is motivated by the observation that in LLM systems, most of the variation seen in the cross-entropy between the draft and target model occurs due to the high entropy of the draft model. LLMs suitable to be target models follow the stable entropy hypothesis [22] with reasonable generations lying in a narrow entropy band.

Substituting this approximation, we have that

$$\begin{aligned}
1 - \sqrt{\frac{1}{2} KLD(p_{DM} \| p_{TM})} &\leq \beta \\
1 - \sqrt{\frac{1}{2} KLD(p_{DM} \| p_{TM})} &= 1 - \sqrt{\frac{1}{2} \left( CE(p_{DM}, p_{TM}) - H_{DM} \right)} \\
&\approx 1 - \sqrt{\frac{1}{2} \left( (1 + \gamma') H_{DM} - H_{DM} \right)} \\
&= 1 - \sqrt{\gamma H_{DM}}
\end{aligned}$$

where $\gamma = \gamma' / 2$ absorbs the constant factor.

Thus, we see that the value $1 - \sqrt{\gamma H_{DM}}$ acts as an approximate lower bound on the acceptance probability $\beta$. By stopping the drafting of a new token when this lower-bound estimate falls below a threshold $\lambda$, we attempt to ensure that the acceptance probability of any token we do draft is at least this threshold. If the estimate does not meet the threshold, we choose not to draft the next token.

Thus, the draft-stopping criterion

$$1 - \sqrt{\gamma H_{DM}(x)} < \lambda$$

implies that if drafting continues because $1 - \sqrt{\gamma H_{DM}(x)} \geq \lambda$, then $\lambda$ serves as an approximate lower bound on the acceptance probability of the drafted token, i.e.,

$$\beta \geq \lambda$$
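The stopping rule derived above can be sketched in a few lines. This is a minimal illustration, not the implementation used in our experiments: the function name `should_stop_drafting`, the use of natural logarithms for the entropy, and the values of `gamma` and `lam` are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_stop_drafting(draft_logits, gamma, lam):
    """Return True when the estimate 1 - sqrt(gamma * H_DM) falls below lam."""
    h_dm = entropy(softmax(draft_logits))
    lower_bound = 1.0 - math.sqrt(gamma * h_dm)
    return lower_bound < lam


# Hypothetical draft logits over a 4-token vocabulary.
confident = [8.0, 0.0, 0.0, 0.0]   # peaked distribution, low entropy
uncertain = [1.0, 1.0, 1.0, 1.0]   # uniform distribution, entropy = log(4)

# gamma and lam are placeholder values, not tuned settings.
print(should_stop_drafting(confident, gamma=0.2, lam=0.6))
print(should_stop_drafting(uncertain, gamma=0.2, lam=0.6))
```

With these placeholder values, the low-entropy draft distribution allows drafting to continue, while the uniform one triggers an early stop.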

B.1 Computational cost

The cost of computing entropy is $O(N)$ on a single thread, where $N$ is the size of the vocabulary. That said, this operation is highly parallelizable since the $N$ operations are independent. Thus, the overhead of AdaEDL is at most $O(N)$, but may be significantly reduced if implemented efficiently.
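One way to realize the $O(N)$ cost is to compute the entropy directly from the draft logits without materializing the probability vector. The sketch below is an assumption about how this could be done and uses the identity $H = \log S - W / S$, where $S = \sum_i e^{z_i - m}$, $W = \sum_i e^{z_i - m}(z_i - m)$, and $m = \max_i z_i$. The function name is hypothetical.

```python
import math


def entropy_from_logits(logits):
    """Entropy (in nats) of softmax(logits) using two O(N) passes."""
    m = max(logits)          # pass 1: maximum, for numerical stability
    s = 0.0                  # running sum of exp(z - m)
    w = 0.0                  # running sum of exp(z - m) * (z - m)
    for z in logits:         # pass 2: accumulate both sums
        shifted = z - m
        e = math.exp(shifted)
        s += e
        w += e * shifted
    return math.log(s) - w / s


print(entropy_from_logits([0.0] * 8))       # uniform over 8 tokens
print(entropy_from_logits([1000.0, 0.0]))   # large logits do not overflow
```

Each term of the two sums depends on a single logit, so on parallel hardware both sums could be computed with a standard parallel reduction.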

B.2 Impact of hardware chosen

An additional consideration is that the speedup achievable by a speculative decoding system depends on the cost of running the draft model, which depends on its size and the nature of the hardware it runs on. For example, a larger draft model may have a higher acceptance rate, but is also more expensive to run. An ideal system is one that balances these factors, taking into account the expected acceptance rate and the computational cost incurred on a particular hardware. Future work that studies draft model cost across various processors would be valuable for designing such a system.
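To make this trade-off concrete, the sketch below evaluates the expected walltime improvement factor from the analysis in [1], $(1 - \alpha^{k+1}) / ((1 - \alpha)(kc + 1))$, for a static draft length $k$. It relies on the simplifying assumptions of that analysis: each drafted token is accepted independently with probability $\alpha$, and one draft step costs a fraction $c$ of one target step. The parameter names and numeric values are hypothetical and are not measurements from our hardware.

```python
def expected_speedup(alpha, draft_len, cost_ratio):
    """Expected walltime improvement under the i.i.d. acceptance model of [1].

    alpha:      per-token acceptance probability (assumed constant)
    draft_len:  number of tokens drafted per round
    cost_ratio: time of one draft step divided by time of one target step
    """
    if alpha == 1.0:
        expected_tokens = draft_len + 1.0  # limit of the geometric sum
    else:
        expected_tokens = (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)
    return expected_tokens / (draft_len * cost_ratio + 1.0)


# Same acceptance rate, but a cheap versus an expensive draft model.
print(expected_speedup(alpha=0.8, draft_len=7, cost_ratio=0.05))
print(expected_speedup(alpha=0.8, draft_len=7, cost_ratio=0.30))
```

Under this simplified model, raising the relative draft cost shrinks the speedup even when the acceptance rate is unchanged, which is why hardware-dependent draft cost matters when choosing a draft model.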

Appendix C Effect of target model sampling temperature

AdaEDL consistently outperforms the other three decoding methods across sampling temperatures. This trend is reflected across maximum draft lengths 3, 7, and 16 and across datasets, as seen for the Dolly-15k dataset (Figure 8(a)), the WMT-19 dataset (Figure 8(b)), and the CNN-DM dataset (Figure 8(c)).

(a) Dataset = Dolly-15k (creative writing)
(b) Dataset = WMT-19 (German-English translation)
(c) Dataset = CNN-DM (summarization)
Figure 8: AdaEDL boosts token rate even in high temperature scenarios where Base-SPD is insufficient. Target Model = Llama2-7B, Draft Model = Llama2-Drafter-115M.

Appendix D Sensitivity to stopping thresholds

We see that AdaEDL is less sensitive to the choice of the stopping threshold $\lambda$, outperforming Max-Confidence-SPD even when a suboptimal $\lambda$ is chosen. This is reflected across temperatures and datasets, as seen in Figures 9(a), 9(b), and 9(c).

(a) Dataset = CNN-DM (summarization)
(b) Dataset = Dolly-15k (creative writing)
(c) Dataset = WMT-19 (German-English translation)
Figure 9: AdaEDL maintains a margin over other methods even for sub-optimal stopping threshold choices. Max DL = 7, TM = Llama2-7B, DM = Llama2-Drafter-115M.