AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
Abstract
Speculative decoding [1] is a powerful technique that attempts to circumvent the autoregressive constraint of modern Large Language Models (LLMs). The aim of speculative decoding techniques is to improve the average inference time of a large target model without sacrificing its accuracy, by using a more efficient draft model to propose draft tokens which are then verified in parallel. The number of draft tokens produced in each drafting round is referred to as the draft length and is often a static hyperparameter chosen based on the acceptance rate statistics of the draft tokens. However, setting a static draft length can negatively impact performance, especially in scenarios where drafting is expensive and there is a high variance in the number of tokens accepted. Adaptive Entropy-based Draft Length (AdaEDL) is a simple, training-free and parameter-free criterion which allows for early stopping of the token drafting process by approximating a lower bound on the expected acceptance probability of the drafted token based on the currently observed entropy of the drafted logits. We show that AdaEDL consistently outperforms static draft-length speculative decoding by 10%-57%, as well as other training-free draft-stopping techniques by up to 10%, in a variety of settings and datasets. At the same time, we show that AdaEDL is more robust than these techniques and preserves performance in high-sampling-temperature scenarios. Since it is training-free, in contrast to techniques that rely on the training of dataset-specific draft-stopping predictors, AdaEDL can seamlessly be integrated into a variety of pre-existing LLM systems.
1 Introduction
Large Language Models (LLMs) have been shown to have impressive performance on a variety of tasks including creative writing, summarization, and translation [2]. In particular, in recent years, several foundation models such as Llama2 [3], Llama3 [4], GPT-4 [5], and Claude [6] have been shown to exceed expectations and generalize to coding [7], display agentic abilities [8], interpret images [9], and more. In all such systems, an LLM’s job remains the same: prediction of the next token via autoregressive generation. Autoregressive generation is fundamentally sequential in nature, since the prediction of the next token can only occur once the previous token has been generated. This reduces the ability of an LLM to parallelize, creating a bottleneck in the maximum number of generated tokens per second (TPS).
Speculative decoding techniques attempt to introduce parallelism to this system. Consider a scenario where the objective is to perform inference on a target model, say Llama2-7B. A smaller draft model with the same tokenizer as the target model, say TinyLlama-1B [10], is then chosen. Given a prompt, the draft model is allowed to run autoregressively, producing a set of candidate draft tokens. These tokens are then consumed by the target model, which produces logits for each draft token, representing a probability distribution. Rejection sampling [1] then guarantees that the draft tokens which are accepted via this process will preserve the distribution of the original target model. Thus, by running a small model autoregressively and a large model in parallel, the system as a whole experiences an increase in average token rate.
However, a limiting factor of this system is that the number of draft tokens produced, referred to as the draft length (DL), can reduce the average token rate if it is fixed over multiple rounds of drafting. This may be caused by over-utilization of a poorly performing draft model or, symmetrically, by under-utilization of the draft model, which prevents the speculative decoding system from reaching its maximum possible performance. For example, since most target models are large foundation models, designed to have high accuracy on a variety of tasks, it is likely that a smaller draft model finetuned to match the target model distribution for a particular task will have varying levels of accuracy when the target model switches tasks. In such scenarios, one can see a high variance in the number of accepted draft tokens. That is, in some drafting rounds, almost all the draft tokens are accepted, whereas in others, almost all are rejected. Figure 1 considers the creative writing task from the Dolly-15k dataset [11]. For this dataset, we see that for a standard speculative decoding system operating with various static draft lengths, the number of accepted tokens follows a normal distribution, taking on almost every possible value with varying frequency. We include similar figures for the CNN-DM (summarization) [12] and WMT-19 (German-English translation) [13] datasets, along with the details to set up these speculative decoding systems, in Appendix A Figures 7(a), 7(b).
This motivates the need for an adaptive draft length speculative decoding system where the draft length at every drafting round can be determined on-the-fly. AdaEDL approaches this problem with a go-no-go strategy: while drafting tokens, at every iteration, AdaEDL establishes a draft-stopping criterion by approximating a lower bound on the expected acceptance probability of the drafted token using the entropy of the draft model logits at that iteration. If the criterion is satisfied, drafting stops and verification by the target model is performed. We provide a theoretical basis for our proposed draft-stopping criterion, deriving how it relates to the acceptance probability of the draft model. To validate our approach, we perform experiments across various maximum draft length settings, across multiple datasets and sampling temperature settings, and for various target and draft model choices, showing that this new draft-stopping strategy is more effective and robust than draft-stopping strategies which simply use the probability value of the most-likely token as a draft-stopping criterion, for example, those used in BiLD [14] and Draft & Verify [15]. Indeed, AdaEDL may be used to complement either of these algorithms. At the same time, our proposed system avoids the need to train an independent network to act as a binary classifier for early draft-stopping for a specific model and task, such as the approaches followed by SpecDec++ [16] and DISCO [17], which makes AdaEDL a simple and straightforward improvement to boost the token rate of speculative decoding LLM systems.
2 Problem setting and existing methods
Following a notation similar to the original speculative decoding formulation, let be the target model whose inference we are trying to accelerate. Let be a more efficient approximation of this target model, referred to as the draft model. Let us denote the token in the prompt by . Then the probability distribution of at any given token may be denoted as . Similarly, the probability distribution of the draft model, may be denoted as . As noted in [1], several strategies such as nucleus sampling, top-k sampling, and others may each be viewed as sampling from an adjusted probability distribution. For notational simplicity, we refer to these probabilities, adjusted or not, as and . In this notation, we can now describe the various decoding techniques:
Autoregressive decoding
In this baseline technique, we only consider the target model and at every iteration, a new token is sampled as
Speculative decoding
The system first allows a draft model to consume the tokens . The draft model then autoregressively generates draft tokens , where is the maximum draft length. This can be viewed as sampling a token
After drafting, the target model evaluates these draft tokens in parallel, producing probabilities . The draft token is accepted if , producing a token . Otherwise, it is rejected with probability and a new token is sampled from an adjusted distribution . A token sampled via this method, referred to as rejection sampling, is guaranteed to follow the probability distribution of the original target model, that is, [1].
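The accept/reject step described above can be sketched in a few lines. This is a minimal illustration of the rejection sampling rule from [1], not the implementation used in our experiments; the function name `verify_draft` and the list-of-probabilities representation are assumptions made for the example.

```python
import random


def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Accept or reject drafted tokens following the rule in [1].

    draft_tokens: token ids proposed by the draft model.
    q_dists[i] / p_dists[i]: draft / target probabilities at draft position i.
    p_dists holds one extra entry: the target distribution that follows the
    last draft token (used only when every draft token is accepted).
    Returns the tokens emitted in this round.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the draft token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual distribution norm(max(0, p - q))
        # and discard the remaining draft tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token was accepted: sample one more token from the target.
    last = p_dists[-1]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted
```

Every round emits at least one token, so a rejected draft costs drafting time but never stalls generation; this is why drafting tokens that are unlikely to be accepted is pure overhead.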
To further improve this process, the draft length may be made adaptive, to avoid the cost of drafting tokens that are likely to be rejected anyway.
Adaptive draft length via maximum confidence speculative decoding
Some systems consider the top-1 probability of the drafted logits. That is, at a given drafting stage, the system considers the token with the highest probability . If this value, the maximum confidence among any possible token, is less than a threshold, , the system is considered to be under-confident and drafting stops.
This is a simple method to avoid wastage during the drafting phase and is effectively employed alongside other techniques in Draft & Verify [15] and BiLD [14]. However, this scheme fails to take into account the overall probability distribution during drafting. This motivates the need for a new drafting stopping mechanism that considers the probabilities of all possible tokens while making a go-no-go decision.
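As a rough illustration of this limitation, the sketch below implements the maximum-confidence rule and applies it to two draft distributions that share the same top-1 probability but differ in how the remaining mass is spread. The function name and the example distributions are hypothetical.

```python
def should_stop_max_confidence(q, threshold):
    """Stop drafting when the draft model's top-1 probability is below the threshold."""
    return max(q) < threshold


# Two draft distributions with the same top-1 probability (0.5):
two_way = [0.5, 0.5]                        # mass split between two candidates
long_tail = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # remaining mass spread over five tokens

# The rule returns the same decision for both, because it only inspects max(q).
same_decision = (
    should_stop_max_confidence(two_way, 0.6)
    == should_stop_max_confidence(long_tail, 0.6)
)
```

A criterion that depends on the whole distribution, such as its entropy, can tell these two cases apart.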
Adaptive draft length via a trained predictor
Several works involve the training of a small network to act as a predictor to determine optimal draft lengths, for example, SpecDec++ [16], which trains a ResNet, and DISCO [17], which trains an FFN as a predictor for early draft stopping. AdaEDL is distinct from these methods and specifically aims to be training-free and parameter-free, similar to draft stopping via maximum confidence, to allow for greater generalization across datasets and models.
3 Adaptive entropy-based draft length speculative decoding (AdaEDL)
AdaEDL operates in the same setup as the above adaptive draft length techniques but instead establishes a stopping criterion using an entropy-based lower bound on the token acceptance probability. If we consider the probability distribution of the draft model, and its corresponding entropy , we see that serves as an approximate lower bound on the expected acceptance rate. This allows us to formulate a drafting system where drafting is stopped if this criterion falls below a threshold , indicating that the expected acceptance rate is also below this threshold:
Here, is a hyperparameter. We include a detailed derivation of this lower-bound approximation in Appendix B. Additionally, we visualize AdaEDL alongside the baseline speculative decoding systems mentioned above in Figure 2, highlighting the novel early draft-stopping criteria we introduce.
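To make the shape of such a rule concrete, the sketch below computes the entropy of a draft distribution and compares an entropy-based score against a threshold. The specific form `1 - sqrt(gamma * H)`, the use of natural-log entropy, and the example values of `gamma` and `threshold` are assumptions of this sketch; the derivation in Appendix B is the authoritative statement of the bound.

```python
import math


def entropy(q):
    """Shannon entropy (in nats) of a probability list; zero entries contribute nothing."""
    return max(0.0, -sum(p * math.log(p) for p in q if p > 0.0))


def acceptance_lower_bound(q, gamma):
    """ASSUMED form of the approximate lower bound: 1 - sqrt(gamma * H(q))."""
    return 1.0 - math.sqrt(gamma * entropy(q))


def should_stop_entropy(q, gamma, threshold):
    """Stop drafting when the estimated lower bound falls below the threshold."""
    return acceptance_lower_bound(q, gamma) < threshold
```

Under this assumed form, a peaked draft distribution has entropy near zero and a score near one, so drafting continues, while a flat distribution lowers the score and triggers an early stop. Unlike a top-1 rule, the score changes whenever any part of the distribution changes.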
Improving via dynamic updates
The stopping threshold can be further optimized by making it responsive to the current acceptance rate statistics of the system. In particular, we aim to achieve a target acceptance rate - increasing the stopping threshold if it is not currently being reached and decreasing it otherwise. By maintaining an exponential moving average of the acceptance rate and using it to update the stopping threshold in this manner, we can achieve better performance over longer generations. We modify the threshold update strategy described in [15] and describe it in Algorithm 1.
We further discuss the hyperparameters introduced here, , , , , , and , their typical values and ranges, as well as the sensitivity of AdaEDL to these hyperparameters in Section 4.4.
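The update described above can be sketched as a small controller that tracks an exponential moving average of the acceptance rate and nudges the stopping threshold toward a target. This is only an illustration of the idea, not a transcription of Algorithm 1: the class name, the parameter names (`target_rate`, `step`, `rate_decay`, `threshold_decay`), and the initialization of the moving average are all assumptions.

```python
class ThresholdController:
    """Hypothetical sketch of a dynamic stopping-threshold update."""

    def __init__(self, threshold, target_rate, step, rate_decay, threshold_decay):
        self.threshold = threshold
        self.target_rate = target_rate
        self.step = step
        self.rate_decay = rate_decay
        self.threshold_decay = threshold_decay
        # Assumption: start the moving average at the target acceptance rate.
        self.rate_ema = target_rate

    def update(self, n_accepted, n_drafted):
        """Update the threshold after one drafting round and return it."""
        if n_drafted == 0:
            return self.threshold
        rate = n_accepted / n_drafted
        # Exponential moving average of the observed acceptance rate.
        self.rate_ema = self.rate_decay * self.rate_ema + (1.0 - self.rate_decay) * rate
        # Below target: raise the threshold (stop earlier); otherwise lower it.
        if self.rate_ema < self.target_rate:
            proposal = self.threshold + self.step
        else:
            proposal = self.threshold - self.step
        proposal = min(1.0, max(0.0, proposal))
        # Smooth the threshold itself so a single round cannot swing it far.
        self.threshold = (
            self.threshold_decay * self.threshold
            + (1.0 - self.threshold_decay) * proposal
        )
        return self.threshold
```

Calling `update` once per drafting round keeps the threshold responsive over long generations; for short generations there may be too few rounds for it to settle, which is relevant to the sensitivity discussion in Section 4.4.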
4 Experimental Results
In all of the following results, AdaEDL refers to our proposed method, described in Section 3, with dynamically updated stopping thresholds as described by Algorithm 1. Max-Confidence-SPD refers to a speculative decoding system which implements the early draft-stopping strategy based on the token with the highest probability described in Section 2, which is further improved by using the same dynamic threshold strategy to have a fair comparison with AdaEDL. Base-SPD refers to vanilla speculative decoding with no early draft-stopping strategy employed, and Autoregressive refers to standard autoregressive decoding of LLMs, which is the baseline that speculative decoding systems attempt to improve on.
We evaluate the performance of AdaEDL across various datasets and tasks: Dolly-15k (creative writing) [11], WMT-19 (German-English translation) [13], and CNN-DM (summarization) [12]. The entire Dolly test dataset (708 samples), the entire WMT-19 test dataset (600 samples), and the first 1000 samples of the CNN-DM test dataset were used to ensure a large sample size. All token rate (TPS) numbers reported are calculated as the averages over all samples in each dataset. All experiments were performed on a single NVIDIA A100 with 80GB of GPU memory in FP32 precision. We acknowledge here that results may vary on different hardware which may lead to faster or slower inference of a draft or target model and in particular, their relative speed. We discuss this point in Appendix B.2.
We show in the following experiments that AdaEDL matches or exceeds the performance of the other decoding systems in the settings we test, while also proving less sensitive to the chosen hyperparameters and robust to the system’s settings.
4.1 Performance with fast finetuned draft models
In the following experiments, we use Llama2-7B as the target model with sampling temperature set to 0.7. The draft model, Llama2-Drafter-115M, is a 115M parameter model, distilled and finetuned using direct alignment [18] to closely match the target model distribution . As a result, even Base-SPD results in higher acceptance rates than one would observe from an off-the-shelf draft model.
Table 1 reports the average token rate of AdaEDL on each of the datasets and tasks mentioned above: Dolly-15k, WMT-19, and CNN-DM. We observe that AdaEDL consistently matches or beats both Max-Confidence-SPD and Base-SPD on all tasks. In particular, AdaEDL has a significant advantage in the open-ended CNN-DM summarization task.
Max DL = | CNN-DM | Dolly-15k | WMT-19 |
---|---|---|---|
Autoregressive | |||
Base-SPD | |||
Max-Confidence-SPD | |||
AdaEDL | |||

Max DL = | CNN-DM | Dolly-15k | WMT-19 |
---|---|---|---|
Autoregressive | |||
Base-SPD | |||
Max-Confidence-SPD | |||
AdaEDL | |||

Max DL = | CNN-DM | Dolly-15k | WMT-19 |
---|---|---|---|
Autoregressive | |||
Base-SPD | |||
Max-Confidence-SPD | |||
AdaEDL | |||
4.2 Performance of AdaEDL with expensive draft models
A smaller draft model may be faster at drafting, but sacrifices acceptance rate. Symmetrically, a larger draft model may lead to higher acceptance rates, but lower overall inference speed due to its own cost of inference.
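This trade-off can be made concrete with the idealized analysis from [1], which assumes a fixed per-token acceptance rate and a fixed draft-to-target cost ratio. The sketch below uses that simplified model only to illustrate the trend; the function and parameter names are chosen for this example, and measured token rates depend on hardware and implementation.

```python
def expected_tokens_per_round(acceptance_rate, draft_len):
    """Expected tokens emitted per drafting round under a fixed acceptance rate [1]."""
    if acceptance_rate >= 1.0:
        return draft_len + 1
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def expected_speedup(acceptance_rate, draft_len, cost_ratio):
    """Idealized speedup over autoregressive decoding.

    cost_ratio: time of one draft-model step divided by one target-model step.
    A round costs draft_len draft steps plus one target step.
    """
    tokens = expected_tokens_per_round(acceptance_rate, draft_len)
    return tokens / (draft_len * cost_ratio + 1.0)


# A cheap draft model with a good acceptance rate gives a speedup above 1 ...
cheap = expected_speedup(acceptance_rate=0.8, draft_len=4, cost_ratio=0.05)
# ... while an expensive draft model with a mediocre rate can fall below 1.
expensive = expected_speedup(acceptance_rate=0.6, draft_len=4, cost_ratio=0.5)
```

In this model the denominator grows with every drafted token, so with an expensive draft model each token that ends up rejected is costly; stopping the draft early is one way to limit that cost.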
4.2.1 Pythia
The Pythia family of models [19] consists of models of sizes ranging from 70M to 12B parameters. As a result, they represent an ideal set of models to test the performance of various decoding systems when the ratio of the target model size to the draft model size is varied - that is, when is varied.
Figure 3 sets Pythia-6.9B as the target model and considers the performance of Pythia-1B, Pythia-410M, Pythia-160M, and Pythia-70M as draft models for the CNN-DM (summarization) task. We demonstrate that adaptive draft length techniques enable us to use up to larger draft models in a scenario where otherwise, the maximum token rate achievable with Base-SPD would have been tokens per second (TPS). We see that when Pythia-160M or Pythia-70M is used as the draft model, autoregressive decoding outperforms all speculative decoding methods with a token rate of TPS. Without adaptive draft length techniques, the maximum token rate achievable through Base-SPD is TPS with Pythia-1B as a draft model for an speedup. However, when Pythia-1B is used to draft for Pythia-6.9B with AdaEDL, the system achieves a token rate of TPS for a speedup. Thus, AdaEDL enables speculative decoding in a scenario where autoregressive decoding would normally be the candidate method. This also opens up the possibility of finetuning larger draft models, which would presumably lead to higher acceptance rates without sacrificing performance if AdaEDL were to be enabled. We believe that this is a promising direction of investigation for future work.
4.2.2 TinyLlama
Motivated by the results in Section 4.2, Figure 4 shows the performance of various decoding methods when the target model is Llama2-7B and the draft model is a standard 1B model - TinyLlama-1B [10]. We see that in this case, AdaEDL and Max-Confidence-SPD increase the token rate by as compared to autoregressive decoding, while Base-SPD negatively impacts performance, reducing the token rate by .
4.3 Performance of AdaEDL across sampling temperatures
In cases where the target distribution is difficult to predict, such as when the chosen sampling temperature is high, we see that Base-SPD is only able to produce modest gains in token rate - sometimes even resulting in poorer performance than autoregressive decoding. In Figure 5 we see that as we increase the sampling temperature from to on the CNN-DM dataset, the token rate of Base-SPD drops (for draft length ), and ends up being lower than standard autoregressive decoding. On the other hand, even at high sampling temperatures, AdaEDL provides an boost in token rate over the autoregressive baseline. AdaEDL also consistently outperforms the other 3 decoding methods across lower sampling temperatures and across maximum draft length settings. We perform similar experiments on the Dolly-15k and WMT-19 datasets in Appendix C Figures 8(a), 8(b), and see similarly consistent performance from AdaEDL.
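One way to see why high sampling temperatures make drafting harder is that dividing logits by a larger temperature flattens the resulting distribution and raises its entropy. The sketch below illustrates only this flattening effect on a toy set of logits; it does not model acceptance rates, and the logit values are arbitrary.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.0]  # arbitrary toy logits
low_t = entropy(softmax(logits, 0.5))   # sharper distribution, lower entropy
high_t = entropy(softmax(logits, 1.0))  # flatter distribution, higher entropy
```

An entropy-based stopping rule sees this change directly, which is one plausible reason an adaptive criterion can cut drafting short in high-temperature settings where a fixed draft length keeps paying for tokens that are likely to be rejected.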
4.4 Controllability and sensitivity of AdaEDL to hyperparameters
Entropy factor ()
In all our experiments, we set in the equation . We see that this simple approximation described in Appendix B, when coupled with the dynamic stopping threshold strategy described in Algorithm 1, produces strong results. Further hyperparameter search may improve this value for a particular dataset, and we observe that values in the range typically work best. It may also be possible to dynamically update using the observed target model distribution as the system runs. We defer these investigations to future work.
Threshold update hyperparameters (, , , )
In all our experiments, we set , , , and , following [15], for the threshold update step for both AdaEDL and Max-Confidence-SPD.
Stopping threshold ()
The most significant hyperparameter in adaptive draft length methods is the early draft-stopping threshold, . In all our experiments, is updated dynamically according to the scheme described in Algorithm 1. Regardless, we find that both AdaEDL and Max-Confidence-SPD are sensitive to the initial choice of . We hypothesize that this may be due to the fact that many output generations are short in length, which may not result in enough drafting rounds for the system to converge to an optimal threshold. In Figure 6 we see that AdaEDL, even for sub-optimal choices, consistently outperforms Max-Confidence-SPD when its is also chosen sub-optimally. At the same time, AdaEDL performs better than this baseline if optimal are chosen for both methods as we see marked by the dashed line and in Section 4.1. We conduct experiments across datasets (CNN-DM, Dolly-15k, WMT-19), sampling temperatures (, , , ), and maximum draft length settings (, , ), and observe similar trends in Appendix D Figures 9(a), 9(b), 9(c), showing that AdaEDL can consistently boost token rate without the need for fine-grained hyperparameter search.
5 Conclusion
In this work, we present AdaEDL, an early stopping criterion for drafting in speculative decoding systems which uses the entropy of the draft model to estimate a lower bound on the current token’s acceptance rate. We show the efficacy of this new method across datasets, sampling temperatures, draft lengths, and choice of target and draft models, whether fine-tuned or off-the-shelf. AdaEDL boosts the performance of existing speculative decoding systems while also enabling efficient usage of much larger draft models which, if finetuned in future work, could potentially result in even more impressive gains in token rate. AdaEDL is training-free and parameter-free, is not dependent on a given dataset, and also offers a relaxed choice in hyperparameters, making it a simple, plug-and-play improvement to a variety of pre-existing speculative decoding LLM systems.
References
- Leviathan et al. [2023] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023. URL https://arxiv.org/abs/2211.17192.
- Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
- Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, 
Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, 
Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. 
The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- OpenAI et al. [2024] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel 
Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
- Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022. URL https://arxiv.org/abs/2212.08073.
- Rozière et al. [2024] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2024. URL https://arxiv.org/abs/2308.12950.
- Liu et al. [2023a] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents, 2023a. URL https://arxiv.org/abs/2308.03688.
- Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023b. URL https://arxiv.org/abs/2304.08485.
- Zhang et al. [2024a] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model, 2024a. URL https://arxiv.org/abs/2401.02385.
- Conover et al. [2023] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
- See et al. [2017] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. URL https://www.aclweb.org/anthology/P17-1099.
- [13] Wikimedia Foundation. Acl 2019 fourth conference on machine translation (wmt19), shared task: Machine translation of news. URL http://www.statmt.org/wmt19/translation-task.html.
- Kim et al. [2023] Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, and Kurt Keutzer. Speculative decoding with big little decoder, 2023. URL https://arxiv.org/abs/2302.07863.
- Zhang et al. [2024b] Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding, 2024b. URL https://arxiv.org/abs/2309.08168.
- Huang et al. [2024] Kaixuan Huang, Xudong Guo, and Mengdi Wang. Specdec++: Boosting speculative decoding via adaptive candidate lengths, 2024. URL https://arxiv.org/abs/2405.19715.
- Mamou et al. [2024] Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, and Roy Schwartz. Dynamic speculation lookahead accelerates speculative decoding of large language models, 2024. URL https://arxiv.org/abs/2405.04304.
- Goel et al. [2024] Raghavv Goel, Mukul Gagrani, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. Direct alignment of draft model for speculative decoding with chat-fine-tuned llms, 2024. URL https://arxiv.org/abs/2403.00858.
- Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373.
- Levin et al. [2008] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. 2008. URL https://api.semanticscholar.org/CorpusID:117035435.
- Rioul [2024] Olivier Rioul. A historical perspective on schützenberger-pinsker inequalities (extended version). Information Geometry, May 2024. ISSN 2511-249X. doi: 10.1007/s41884-024-00138-z. URL http://dx.doi.org/10.1007/s41884-024-00138-z.
- Arora et al. [2023] Kushal Arora, Timothy J. O’Donnell, Doina Precup, Jason Weston, and Jackie C. K. Cheung. The stable entropy hypothesis and entropy-aware decoding: An analysis and algorithm for robust natural language generation, 2023. URL https://arxiv.org/abs/2302.06784.
Appendix A Acceptance rate variance for standard speculative decoding systems
Across tasks such as summarization (CNN-DM, Figure 7(a)), translation (WMT-19, Figure 7(b)), and creative writing (Dolly-15k, Figure 7(c)), standard speculative decoding systems display a large variance in the number of tokens accepted per drafting round. This is observed at every draft length tested. The effect is particularly pronounced in the CNN-DM and Dolly-15k datasets, where the standard deviation of the number of accepted tokens is large relative to the maximum draft length, opening up room for significant optimization. Experiments use Llama2-7B as the target model and a 115M draft model fine-tuned via direct alignment with the target model distribution [18].
Appendix B Derivation of entropy-based draft-stopping criteria
Let $p$ and $q$ be the currently observed probability distributions of the draft and target model given some prefix $x_{<t}$. Following [1], the acceptance probability, $\beta$, of a token drafted from $p$, defined via rejection sampling, is
$$\beta = \sum_{x} \min\big(p(x),\, q(x)\big).$$
On discrete domains, the total variation distance between these distributions is defined in the following manner [20]:
$$D_{TV}(p, q) = \frac{1}{2} \sum_{x} \big|p(x) - q(x)\big|.$$
We see in [1] that $D_{TV}(p, q)$ is related to the acceptance probability $\beta$ as
$$\beta = 1 - D_{TV}(p, q).$$
Moreover, by Pinsker's inequality [21], we may relate the total variation distance to the Kullback-Leibler divergence,
$$D_{TV}(p, q) \le \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, q)},$$
giving us
$$\beta \ge 1 - \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, q)}.$$
Further, $D_{KL}(p \,\|\, q)$ relates to the cross-entropy $H(p, q) = -\sum_{x} p(x) \log q(x)$ via
$$D_{KL}(p \,\|\, q) = H(p, q) - H(p),$$
where $H(p) = -\sum_{x} p(x) \log p(x)$ is the entropy of the draft model distribution.
Note that, while drafting, we do not yet have access to the target model distribution $q$. We do know, however, that since $D_{KL}(p \,\|\, q) \ge 0$, we have that
$$H(p, q) \ge H(p).$$
In this work, we choose a linear approximation of $H(p, q)$ in terms of $H(p)$ via a positive factor $\gamma$,
$$H(p, q) \approx (1 + 2\gamma)\, H(p), \quad \text{i.e.,} \quad \tfrac{1}{2}\big(H(p, q) - H(p)\big) \approx \gamma\, H(p).$$
This is motivated by the observation that in LLM systems, most of the variation seen in the cross-entropy between the draft and target model occurs due to the high entropy of the draft model. LLMs suitable to be target models follow the stable entropy hypothesis [22], with reasonable generations lying in a narrow entropy band.
Substituting this approximation, we have that
$$\beta \gtrsim 1 - \sqrt{\gamma\, H(p)}.$$
Thus, we see that the value $1 - \sqrt{\gamma H(p)}$ acts as an approximate lower bound on the acceptance probability $\beta$. By stopping the drafting of a new token if our lower-bound estimate falls below a threshold $\lambda$, we attempt to ensure that the acceptance probability of the potential new token will be greater than this threshold. If the estimate does not meet our threshold, we choose not to draft the next token.
Thus, a draft-stopping criterion
$$1 - \sqrt{\gamma\, H(p)} < \lambda$$
implies that if drafting continues because $1 - \sqrt{\gamma H(p)} \ge \lambda$, then we have an approximate lower bound on the acceptance probability of the drafted token via $\lambda$, i.e.,
$$\beta \gtrsim 1 - \sqrt{\gamma\, H(p)} \ge \lambda.$$
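The derivation above can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes the bound takes the form 1 - sqrt(gamma * H(draft)), and the function names, the toy distributions, and the values `gamma=0.2` and `threshold=0.6` are all hypothetical choices made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def acceptance_probability(draft, target):
    """Exact acceptance probability under rejection sampling: sum of min(p, q)."""
    return sum(min(p, q) for p, q in zip(draft, target))

def total_variation(draft, target):
    """Total variation distance on a discrete domain."""
    return 0.5 * sum(abs(p - q) for p, q in zip(draft, target))

def kl_divergence(draft, target):
    """KL(draft || target); assumes target > 0 wherever draft > 0."""
    return sum(p * math.log(p / q) for p, q in zip(draft, target) if p > 0.0)

def entropy_lower_bound(draft, gamma):
    """Approximate lower bound 1 - sqrt(gamma * H(draft)); uses the draft only."""
    return 1.0 - math.sqrt(gamma * entropy(draft))

def should_stop_drafting(draft, gamma, threshold):
    """Stop drafting when the entropy-based estimate falls below the threshold."""
    return entropy_lower_bound(draft, gamma) < threshold

# Toy distributions over a 3-token vocabulary (illustrative values only).
draft = [0.7, 0.2, 0.1]
target = [0.6, 0.3, 0.1]

beta = acceptance_probability(draft, target)
tv = total_variation(draft, target)
pinsker_bound = 1.0 - math.sqrt(0.5 * kl_divergence(draft, target))

# A confident (low-entropy) draft continues; a uniform draft stops.
peaked = [0.98, 0.01, 0.01]
uniform = [1.0 / 3.0] * 3
continue_on_peaked = not should_stop_drafting(peaked, gamma=0.2, threshold=0.6)
stop_on_uniform = should_stop_drafting(uniform, gamma=0.2, threshold=0.6)
```

Only `should_stop_drafting` would run during drafting, since it needs the draft distribution alone; the other quantities require the target distribution and are shown purely to verify the identity and Pinsker's inequality after the fact.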
B.1 Computational cost
The cost of computing the entropy is $O(V)$ on a single thread, where $V$ is the size of the vocabulary. That said, this operation is highly parallelizable, since the terms of the sum are independent. Thus, the overhead of AdaEDL is at most $O(V)$ per drafted token, but may be significantly reduced if implemented efficiently.
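To make the linear cost concrete, the sketch below computes the entropy directly from draft logits in a constant number of passes over the vocabulary. It is an illustrative pure-Python version under the assumption that the draft model exposes raw logits; the function name is hypothetical, and a practical system would likely use a vectorized or fused kernel instead.

```python
import math

def entropy_from_logits(logits):
    """Entropy (in nats) of softmax(logits), in time linear in the vocabulary size."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    partition = sum(exps)
    # H = log Z - sum_i softmax_i * (logit_i - shift)
    weighted = sum(e * (z - shift) for e, z in zip(exps, logits))
    return math.log(partition) - weighted / partition

uniform_entropy = entropy_from_logits([0.0, 0.0, 0.0, 0.0])
peaked_entropy = entropy_from_logits([100.0, 0.0, 0.0])
```

Each term of the two sums depends on a single vocabulary entry, which is why the computation parallelizes well as a reduction.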
B.2 Impact of hardware chosen
An additional consideration is that the speedup achievable by a speculative decoding system depends on the cost of running the draft model, which depends on its size and the nature of the hardware it runs on. For example, a larger draft model may have a higher acceptance rate, but is also more expensive to run. An ideal system is one that balances these factors, taking into account the expected acceptance rate and computational cost incurred on a particular hardware. Future work that studies draft model cost across various processors would be valuable to designing such a system.
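The trade-off between acceptance rate and draft cost can be sketched with the expected-speedup analysis of [1]. This is a simplified model rather than a result of this work: it assumes each drafted token is accepted independently with a fixed rate, and the parameter names and numeric values below are hypothetical. The hardware dependence enters only through `cost_ratio`, the cost of one draft-model step relative to one target-model step.

```python
def expected_tokens_per_round(accept_rate, draft_len):
    """Expected tokens produced per drafting round under i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def expected_speedup(accept_rate, draft_len, cost_ratio):
    """Expected speedup over plain autoregressive decoding with the target model.

    One round costs draft_len draft steps plus one target step, measured in
    units of a single target-model step.
    """
    round_cost = draft_len * cost_ratio + 1.0
    return expected_tokens_per_round(accept_rate, draft_len) / round_cost

# A small, cheap draft model versus a larger, more accurate but costlier one.
small_draft = expected_speedup(accept_rate=0.8, draft_len=4, cost_ratio=0.05)
large_draft = expected_speedup(accept_rate=0.9, draft_len=4, cost_ratio=0.5)
```

With these illustrative numbers the cheaper draft model yields the larger expected speedup despite its lower acceptance rate, which is the balance the paragraph above describes.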
Appendix C Effect of target model sampling temperature
AdaEDL consistently outperforms the other three decoding methods across sampling temperatures. This trend holds across the maximum draft lengths tested and across datasets, as seen for the Dolly-15k dataset (Figure 8(a)), the WMT-19 dataset (Figure 8(b)), and the CNN-DM dataset (Figure 8(c)).
Appendix D Sensitivity to stopping thresholds
We see that AdaEDL is less sensitive to the choice of the stopping threshold, outperforming Max-Confidence-SPD even when a suboptimal threshold is chosen. This is reflected across temperatures and datasets, as seen in Figures 9(a), 9(b), and 9(c).