TensorOpera Router: A Multi-Model Router
for Efficient LLM Inference
Abstract
With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM exists to efficiently balance this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the best-performing expert based on each query's requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and reduces costs by up to 30%, while maintaining or enhancing model performance by up to 10%.
1 Introduction
Large Language Models (LLMs) have demonstrated remarkable performance across a diverse set of challenging domain-specific tasks [2]. However, no single LLM can outperform all others across every task and use case [21]. Recent works [11, 17, 7] highlight the urgent need for efficient tools that can unify the expertise of multiple LLMs, combining them into a single cohesive unit. Given the increased costs and latency associated with querying models hosted at different providers [4], it is critical for multi-LLM querying systems to efficiently route queries to the most appropriate LLM expert. This must be done while balancing query execution throughput, monetary costs, and model performance—a challenge we term the multi-LLM routing trilemma.
Our aim is to provide an empirical solution to this trilemma by showcasing the potential of a multi-LLM routing system that improves this balance. We propose an LLM routing system, called TensorOpera-Router (hereinafter referred to as TO-Router), to explore the feasibility of building a multi-LLM routing model that leverages the collective power of multiple LLM experts. TO-Router aims to efficiently, inexpensively, and accurately answer query prompts by selecting the most cost-effective and suitable LLM from a diverse set of expert models. Our contributions are as follows:
- We empirically demonstrate the promise of different routing methods developed through the TO-Router system in balancing query execution time, query cost, and model performance, leading to significant gains.
- We show that, on average, our routing system outperforms standalone model experts.
- We demonstrate that routing methods trained to learn the embedding query space outperform naive routing methods.
- We introduce a soft-label approach, based on the BERT similarity score, to train the proposed model routing methods.
- We present a routing method based on a pre-trained BERT model that exhibits the best performance.
2 Background & Related Work
LLM Routing. Depending on the mechanism that routing methods use to decide the most suitable LLM(s) to answer a given prompt, two distinct routing categories have recently been introduced: predictive (classification) routers, which do not generate LLM outputs in advance but instead predict the best LLM to handle a given prompt based on specific performance metrics [11, 17, 22], and cascading routers, which process a query request by executing it over a series or combination of LLMs [4] until specific quality criteria are met. To train predictive routers, different training methods have recently been introduced that leverage data augmentation techniques and human preference data [17] or existing benchmark datasets [21] to improve routing predictions. In this work, we also develop and evaluate predictive routing methods trained on standardized benchmark datasets to efficiently classify and direct query prompts to the best LLM expert.
Mixture-of-Experts. A typical MoE architecture [15] consists of a set of expert models trained to specialize in different data regions and a gating network model that determines the contribution of each expert to the final prediction. Recently, MoEs have also seen wide adoption in the LLM domain, where multiple MLP experts are integrated into encoder and decoder blocks to boost the training of extremely large networks [20, 13, 8]. Similar to these MoE approaches, LLM routing methods can be seen as a special case of an MoE architecture, where the predictive routing model is the gating mechanism and the pool of LLMs is the set of available experts.
Ensemble Learning. Model routing also bears similarities to ensemble machine learning [27] techniques, which seek to provide better predictive outcomes by combining the predictions of multiple models. A key distinction between routing and ensemble techniques, like bagging [3] and boosting [9], is that models participating in an ensemble are typically trained on the same dataset (as a whole or in subsets) and are therefore assumed to have similar expertise. The router, in contrast, predicts and retrieves the predictions from a varying set of LLM experts that have been trained on highly diverse data distributions.
3 TensorOpera Router System Overview
To effectively learn and deploy a multi-LLM routing model, a sequence of critical development phases needs to be executed, from data preparation to router model training, evaluation, and model deployment/serving. The end-to-end pipeline of the proposed TO-Router system, shown in Figure 1, facilitates these phases and in practice has substantially helped us to swiftly develop, prototype and deploy different model routing methods in real-world settings. (Source code to be released.)
Phase 1: Router Data Preparation. The generation of the training and testing dataset for the routing model is a multi-step process. First, we need to find the appropriate domain-specific (e.g., bio, coding, physical sciences) instruction datasets and the model experts to which we want the routing model to learn to route relevant query prompts. Thereafter, we perform a forward pass over each expert model (step 1) to collect the metrics required to train and test the routing model and create the expert prediction dataset (step 2). In this work, we collect the following metrics per instruction prompt: {negative log-likelihood, BERT similarity score (BERTSim), inference time in seconds, total input tokens, total output tokens}; for more details on these metrics, please see section 4.1. Once the expert prediction dataset is created, we select one of the collected metrics to generate soft labels (step 3) and prepare the final training and testing dataset for the routing model (step 4). In the current work, we use the BERTSim scores to create soft labels and train the routing expert model classifier. We use soft labels because we want the routing model to learn the ranking of the experts in terms of their prediction performance. To generate the soft labels of each expert model for each instruction record, we pass the selected metric (e.g., similarity score, log loss) through a softmax function with temperature. For instance, for the $i$-th instruction record, the softmax probability of expert (class) $j$ is given by: $p_{i,j} = \exp(s_{i,j}/T) / \sum_{k=1}^{K} \exp(s_{i,k}/T)$, where $K$ is the total number of experts, $T$ is the temperature value, and $s_i$ is the vector of metric scores. In our evaluation, we generate the experts' soft labels based on the BERT similarity scores with a temperature value of $T = 10$.
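The soft-label step can be illustrated with a minimal sketch. It assumes the conventional form of the temperature softmax, in which scores are divided by the temperature; the function name `soft_labels` and the three BERTSim scores are hypothetical and not taken from the TO-Router code base.

```python
import math

def soft_labels(scores, temperature=10.0):
    """Softmax with temperature over one record's per-expert metric scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical BERTSim scores of three experts for a single instruction record.
bertsim_scores = [0.62, 0.75, 0.70]
labels = soft_labels(bertsim_scores, temperature=10.0)
print(labels)
```

The resulting probabilities preserve the ranking of the experts, which is the property the router is meant to learn; a higher temperature flattens the distribution and a lower one sharpens it.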
Phase 2: Router Training. Once the router’s training and testing dataset is created, we pass the instruction records through the router’s embedding model, e.g., Bag-of-Words, TF-IDF, BERT or other small or large language models, to create their vectorized representation (step 5). Then, we use the generated embeddings to train the prompt-to-expert classifier (step 6), using non-parametric, supervised learning approaches (e.g., kNN), classical deep learning models (e.g., MLP) or more advanced language sequencing pre-trained models (e.g., BERT). We provide more information on these routing models in section 4.4.
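As an illustration of step 5, the sketch below builds a Bag-of-Words representation, the simplest of the embedding options listed above. The whitespace tokenizer, the helper names, and the example prompts are assumptions made for this sketch only.

```python
def fit_vocabulary(prompts):
    """Assign an index to every distinct token seen in the training prompts."""
    vocab = {}
    for prompt in prompts:
        for token in prompt.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def bag_of_words(prompt, vocab):
    """Count how often each vocabulary token occurs in the prompt."""
    vector = [0] * len(vocab)
    for token in prompt.lower().split():
        if token in vocab:  # tokens unseen during fitting are ignored
            vector[vocab[token]] += 1
    return vector

# Hypothetical training prompts and one new prompt to vectorize.
train_prompts = ["write a python function", "solve the math problem"]
vocab = fit_vocabulary(train_prompts)
vector = bag_of_words("write a math function", vocab)
print(vector)
```

The resulting count vectors are what a prompt-to-expert classifier such as the MLP would consume in step 6.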
Phase 3: Router Deployment. Once the final routing model is trained, it is deployed as a standalone endpoint on the platform (step 7), ready to receive user queries (through either a CLI or a web interface). Whenever a new user query is submitted, the router first tokenizes and encodes the text of the incoming query prompt using the tuned embedding model from Phase 2 (step 8). Subsequently, the router performs a forward pass over the trained/fine-tuned classification model (e.g., MLP, BERT) and predicts the most relevant expert model (step 9). Depending on which expert model the classification model predicts, the router selects the respective expert-prompt adaptor to submit and execute the query. Once query execution completes, the router receives the reply from the expert model and forwards it back to the end user (step 10). Throughout the router's deployment, the platform provides the monitoring capabilities needed to troubleshoot and tune the routing model, such as the number of requests, the queries' semantic context, the hit frequency of each expert model, and total costs.
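The request path of steps 8 to 10 can be sketched as follows. Everything here is a hypothetical stand-in: `toy_embed`, `toy_classify`, and the two adaptors only mimic the roles of the embedding model, the classifier, and the expert-prompt adaptors, and do not reflect the actual TO-Router interfaces.

```python
def route_query(prompt, embed, classify, adaptors):
    features = embed(prompt)              # step 8: encode the prompt
    scores = classify(features)           # step 9: expert name -> score
    expert = max(scores, key=scores.get)  # pick the top-ranked expert
    reply = adaptors[expert](prompt)      # the expert's adaptor executes the query
    return expert, reply                  # step 10: reply is returned to the user

# Hypothetical stand-ins for the embedding model, classifier and adaptors.
def toy_embed(prompt):
    words = prompt.lower().split()
    return [words.count("code"), words.count("math")]

def toy_classify(features):
    return {"code-expert": features[0], "math-expert": features[1]}

adaptors = {
    "code-expert": lambda p: "code reply to: " + p,
    "math-expert": lambda p: "math reply to: " + p,
}

expert, reply = route_query(
    "please solve this math task", toy_embed, toy_classify, adaptors
)
print(expert, "->", reply)
```

Keeping the adaptors in a dictionary keyed by expert name means that adding an expert only requires registering one more entry.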
4 Experiments
In this section we discuss the metrics, expert models, benchmark datasets and routing methods we considered for evaluating the TO-Router system.
4.1 Evaluation Criteria
All expert models and routing methods are evaluated on four dimensions: (1) total inference cost, (2) throughput, (3) BERT similarity score, and (4) negative log-likelihood (NLL).
Total Inference Cost. For any expert model, the total cost to execute a given test query is measured based on the input and output token costs. For a model $m$ prompted with a sequence of test queries that used a total of $N_{in}^{m}$ input tokens and generated a total of $N_{out}^{m}$ output tokens, with costs $c_{in}^{m}$ and $c_{out}^{m}$ per 1 million input and output tokens, respectively, the total cost for the entire test query sequence is measured by: $C_m = (N_{in}^{m} c_{in}^{m} + N_{out}^{m} c_{out}^{m}) / 10^{6}$. In the case of the routing methods, which did not use a single model to answer the sequence of test queries but routed different test queries to different expert models $m \in M$, the total cost is measured as: $C = \sum_{m \in M} C_m$, where the token counts of each expert cover only the queries routed to it. To measure the cost of querying standalone deployed expert models, we handpicked the price per million input and output tokens from different model providers. We provide the input and output token costs per model architecture in Table 2 in the Appendix section.
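A minimal sketch of this cost accounting is given below. The per-million-token prices are the Fox-1.6B and DeepSeek-8B entries of Table 2; the token counts and the function names are hypothetical.

```python
def query_cost(input_tokens, output_tokens, price_in, price_out):
    """Cost in dollars, with prices quoted per 1 million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def routed_cost(usage, prices):
    """Sum the cost over every expert that received queries.

    usage:  expert -> (input_tokens, output_tokens) for the queries routed to it
    prices: expert -> (price_in, price_out) per 1 million tokens
    """
    return sum(query_cost(*usage[e], *prices[e]) for e in usage)

# Prices from Table 2; the token counts are made up for illustration.
prices = {"Fox-1.6B": (0.20, 0.20), "DeepSeek-8B": (0.14, 0.28)}
usage = {"Fox-1.6B": (1_000_000, 500_000), "DeepSeek-8B": (2_000_000, 1_000_000)}
total = routed_cost(usage, prices)
print(round(total, 3))
```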
Throughput. To measure the query execution performance of an expert model and of the different routing methods over the entire test query set, we compute the throughput of each query as the ratio of the total output tokens $N_{out}^{m}(q)$ generated by model $m$ to the inference time in seconds, i.e., the time from query submission to query completion, $t^{m}(q)$. Specifically, the throughput for a single test query $q$ is measured as $\tau(q) = N_{out}^{m}(q) / t^{m}(q)$. For the entire set of test queries $Q$, the mean throughput is computed as: $\bar{\tau} = \frac{1}{|Q|} \sum_{q \in Q} \tau(q)$.
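The throughput metric amounts to a per-query ratio followed by an average, as in the sketch below. The record layout and the numbers are hypothetical.

```python
def mean_throughput(records):
    """records: one (output_tokens, inference_seconds) pair per test query."""
    per_query = [tokens / seconds for tokens, seconds in records]
    return sum(per_query) / len(per_query)

# Two hypothetical queries: 100 tokens in 1 s and 300 tokens in 2 s.
records = [(100, 1.0), (300, 2.0)]
throughput = mean_throughput(records)
print(throughput)
```

Note that this is the mean of per-query ratios, which generally differs from dividing the total tokens by the total time.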
BERTSim. Given that each expert model uses its own vocabulary and tokenizer, and to ensure an equitable comparison between the responses generated by each expert, we evaluate the vectorized text similarity between the ground truth and the predicted answer of an expert through the cosine similarity of their BERT embeddings. Such a vector representation allows for a soft measure of similarity [26]. We refer to this similarity score as BERTSim [26]. The cosine similarity of a reference (ground truth) vector $x$ and a candidate (predicted) vector $\hat{x}$ is computed as: $\mathrm{BERTSim}(x, \hat{x}) = \frac{x^{\top} \hat{x}}{\lVert x \rVert \, \lVert \hat{x} \rVert}$. For every expert model and routing method, we measure the BERTSim score across all test queries and compute the final BERTSim score as the mean of all scores.
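The cosine similarity underlying BERTSim can be sketched as follows. The vectors here are small hypothetical placeholders; in the evaluation they would be BERT embeddings of the reference and candidate answers.

```python
import math

def cosine_similarity(reference, candidate):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    return dot / (norm_ref * norm_cand)

# Hypothetical 2-dimensional embeddings.
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
partial = cosine_similarity([1.0, 1.0], [1.0, 0.0])
print(same, orthogonal, partial)
```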
Negative Log-Likelihood. We use the Negative Log-Likelihood (NLL) to measure the quality of the probabilistic predictions made by each expert model. Lower NLL values indicate that the model assigns higher probabilities to the true tokens and therefore reflect better performance. In principle, a single sequence's NLL is defined as:
$\mathrm{NLL} = -\sum_{t=1}^{n} \log P(y_t \mid x, y_{<t})$,
where $P(y_t \mid x, y_{<t})$ is the predicted probability of the $t$-th token in the sequence of length $n$ given the input sequence $x$ and the previous tokens $y_{<t}$. In our evaluation, we measure the mean NLL over the generated sequence of every expert model and routing method across all test queries.
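As a worked example of the definition, the sketch below sums the negative log-probabilities of a sequence's tokens and then averages over queries. It assumes the per-token probabilities are already available, and the probability values are hypothetical.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one sequence from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def mean_nll(sequences):
    """Average the sequence-level NLL over all test queries."""
    return sum(sequence_nll(s) for s in sequences) / len(sequences)

# Hypothetical per-token probabilities for two generated sequences.
confident = [0.9, 0.8, 0.95]
uncertain = [0.5, 0.5]
print(sequence_nll(confident), sequence_nll(uncertain))
print(mean_nll([confident, uncertain]))
```

A sequence whose tokens all receive probability 1 has an NLL of 0, and the value grows as the model becomes less certain about the tokens.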
4.2 Expert Models
We choose several representative models across different domains as the expert models to verify the effectiveness of our routing method in the TO-Router system. For the biomedical domain, we selected two variants of the Llama-3-8B (BioLlama-7B) [19] and Mistral-7B (BioMistral-7B) [16] models. In our evaluation, we refer to each model by the short name given in parentheses. Both models achieve excellent performance across many biomedical evaluation benchmarks. In the code domain, we select Meta's officially released Llama2-7B (CodeLlama-7B) [18] variant trained on code datasets. In the general instruction-following domain, we incorporate three instruction-tuned LLMs of different sizes, i.e., Fox-1.6B, a recently introduced powerful small language model, Mistral-7B-Instruct (MistralAI-7B) [12], and Qwen-7B-Instruct (Qwen-7B) [25]. Finally, for the math domain, we choose a strong reasoning model trained on large amounts of math documents, MathDeepSeek-7B-Instruct (MathDeepSeek-7B) [10]. For more details regarding model architecture and domain fine-tuning, please refer to section E in the Appendix.
4.3 Datasets
All the datasets listed here are widely used by LLM developers [23, 24, 12] to evaluate model performance in commonsense reasoning, coding, and medical domains. To generate the final training and testing data for the investigated routing methods, we gather all records from all datasets and perform a stratified 80% train, 20% test split per dataset.
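A per-dataset 80/20 split of the pooled records can be sketched as below. The record layout, the function name, and the toy data are assumptions; the seed of 42 follows the value stated in Appendix D.

```python
import random

def split_per_dataset(records, train_fraction=0.8, seed=42):
    """Split (dataset_name, prompt) records so each dataset keeps the same ratio."""
    rng = random.Random(seed)
    by_dataset = {}
    for name, prompt in records:
        by_dataset.setdefault(name, []).append((name, prompt))
    train, test = [], []
    for name in sorted(by_dataset):
        items = by_dataset[name][:]
        rng.shuffle(items)  # shuffle within the dataset before cutting
        cut = int(round(train_fraction * len(items)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Toy pool: 10 records from one dataset and 5 from another.
records = [("gsm8k", f"q{i}") for i in range(10)] + [("mbpp", f"p{i}") for i in range(5)]
train, test = split_per_dataset(records)
print(len(train), len(test))
```

Splitting inside each dataset keeps small datasets such as MBPP represented in both partitions.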
Ai2-ARC [5]. The Ai2-ARC dataset consists of 7,787 natural science questions designed for standardized tests. We use its challenge partition with 2,590 samples, which includes only those questions that were answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.
GSM8k [6]. GSM8k is a high-quality dataset of grade school-level math word problems, covering relatively simple math concepts with 7,473 training and 1,319 testing samples.
MBPP [1]. The MBPP dataset contains 974 basic programming problems suitable for entry-level programmers. It also includes text descriptions of the problems and test cases for functional correctness verification.
PubMedQA [14]. The PubMedQA dataset is a biomedical question-answering dataset designed for answering research questions with yes/no/maybe responses. It contains 1,000 manually labeled question-answer pairs for cross-validation and testing.
4.4 Methods
Below, we describe the predictive routing methods we used during our evaluation.
Random-Router. To evaluate the performance of a random router, for every test query we randomly pick an expert to execute the query. After performing this step for all test queries, we repeat the entire process 10 times. Let $E$ be the collection of all experts; we randomly select an expert from $E$ in each trial. Let $e_r$ denote the expert randomly selected in the $r$-th trial; then the entire random expert selection process can be represented as: $\{ e_r \sim \mathrm{Uniform}(E) \}_{r=1}^{10}$. Once the collection of random experts is assembled, we submit the test query to each expert and collect all measurements to compute the evaluation metrics.
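The random baseline can be sketched in a few lines. The fixed seed and the function name are assumptions made so that the sketch is reproducible.

```python
import random

def random_router(experts, num_queries, num_trials=10, seed=42):
    """Draw one uniformly random expert per test query, repeated over several trials."""
    rng = random.Random(seed)
    return [
        [rng.choice(experts) for _ in range(num_queries)]
        for _ in range(num_trials)
    ]

experts = ["Fox-1.6B", "Qwen-7B", "CodeLlama-7B"]
assignments = random_router(experts, num_queries=5)
print(assignments[0])
```

Each inner list is one trial's assignment of experts to the test queries; the evaluation metrics would then be computed from the experts selected for each query.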
kNN-Router. The kNN-Router first encodes all training queries using a sentence transformer. Then, for every test query $q$, it finds the closest training query $q'$ in terms of cosine similarity in the embedding space and subsequently executes the test query using the expert that exhibited the best performance for that most relevant training query. The best-performing expert is the expert whose BERTSim score is the highest among all of the training query's experts $E$:
$e^{*} = \arg\max_{e \in E} \mathrm{BERTSim}_{e}(q')$
A schematic flow of the 1NN-Router’s embedding similarity and expert selection is also shown in Figure 7. Given that we only need to find the most similar training query to a given test query, we subsequently refer to this method as 1NN-Router.
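The 1NN-Router's lookup can be sketched as follows. The embeddings are small hypothetical vectors standing in for sentence-transformer outputs, and the per-expert BERTSim scores are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def one_nn_route(test_embedding, train_embeddings, train_scores):
    """Return the best expert of the training query closest to the test query.

    train_scores[i] maps expert name -> BERTSim score on training query i.
    """
    similarities = [cosine(test_embedding, e) for e in train_embeddings]
    nearest = max(range(len(similarities)), key=similarities.__getitem__)
    scores = train_scores[nearest]
    return max(scores, key=scores.get)

# Two hypothetical training queries with their per-expert BERTSim scores.
train_embeddings = [[1.0, 0.0], [0.0, 1.0]]
train_scores = [
    {"Fox-1.6B": 0.8, "Qwen-7B": 0.6},
    {"Fox-1.6B": 0.5, "Qwen-7B": 0.7},
]
chosen = one_nn_route([0.1, 0.9], train_embeddings, train_scores)
print(chosen)
```

Because the method only memorizes the training queries, its routing quality depends entirely on how close the nearest stored query is to the new one.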
MLP-Router. For learning our predictive MLP-Router, we use a simple 2-layer perceptron:
$\hat{y} = \mathrm{softmax}\big(W_2 \, \mathrm{ReLU}(W_1 x + b_1) + b_2\big)$,
where $x$ is the vector representation of the query and $W_1, b_1, W_2, b_2$ are the learnable weights and biases.
To train the MLP model, we convert the training queries into their vector representation by fitting a Bag-of-Words model. To learn the ranking of experts in terms of prediction performance, we use cross-entropy loss on the scaled BERTSim scores. We use ReLU and softmax as the activation functions of the hidden and output layers, respectively.
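The forward pass and the soft-label loss described above can be sketched in plain Python. The tiny weight matrices, the input vector, and the soft label are hypothetical, and the sketch omits the training loop (the optimizer and hyperparameters are listed in Appendix D).

```python
import math

def affine(weights, bias, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer perceptron: ReLU hidden layer, softmax output over experts."""
    hidden = relu(affine(w1, b1, x))
    return softmax(affine(w2, b2, hidden))

def soft_cross_entropy(predicted, soft_label):
    """Cross-entropy between predicted expert probabilities and a soft label."""
    return -sum(t * math.log(p) for t, p in zip(soft_label, predicted))

# Hypothetical 3-dimensional Bag-of-Words input, 2 hidden units, 2 experts.
x = [1.0, 0.0, 2.0]
w1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0], [-0.5, 0.5]]
b2 = [0.0, 0.0]
probs = mlp_forward(x, w1, b1, w2, b2)
loss = soft_cross_entropy(probs, [0.6, 0.4])
print(probs, loss)
```

With soft labels the loss rewards matching the full distribution over experts rather than only the top-ranked one.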
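BERT-Router. To learn the BERT-Router, we performed full-parameter fine-tuning of a BERT model for sequence classification. We appended a classification head with a softmax activation function on top of BERT's final hidden layer outputs to map the BERT embeddings to the desired number of experts (classes):
$\hat{y} = \mathrm{softmax}(W h + b)$,
where $h$ is BERT's final hidden layer output for the query and $W, b$ are the parameters of the classification head.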
To fine-tune BERT, we first tokenize and encode all input training queries’ text sequences using the BERT tokenizer and then update the pre-trained BERT model weights for a small number of epochs using cross entropy loss. Similar to the MLP-Router model, we train BERT-Router using the soft labels created by the scaled BERTSim scores.
Zero-Router. Following the work of [11], we also evaluate the performance of the routing methods against the average performance of the available LLMs without any routing logic (lower bound), i.e., no-routing approach.
Optimal. We compare against two optimal cases (upper bounds): one refers to the optimal BERTSim performance per dataset (shown in Figure 2), and the other to the optimal performance recorded across all three evaluation dimensions (i.e., cost, throughput, and model performance, shown in Figure 4). In the former case, the optimal value is measured by averaging the best BERTSim score recorded for every test query by any expert. In the latter case, the optimal set of values is the minimum cost, maximum throughput and maximum performance recorded by any expert model or router method.
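The two reference points can be sketched from a table of per-query, per-expert BERTSim scores. The score matrix is hypothetical, and the Zero-Router is interpreted here as the mean of the per-expert averages.

```python
def zero_router_score(scores):
    """Mean of the per-expert average scores; scores[q][e] is expert e on query q."""
    num_experts = len(scores[0])
    per_expert_means = [
        sum(row[e] for row in scores) / len(scores) for e in range(num_experts)
    ]
    return sum(per_expert_means) / num_experts

def optimal_score(scores):
    """Average of the best score achieved by any expert on each query."""
    return sum(max(row) for row in scores) / len(scores)

# Hypothetical BERTSim scores: 2 test queries x 2 experts.
scores = [[0.5, 0.7], [0.9, 0.3]]
print(zero_router_score(scores), optimal_score(scores))
```

Any routing method evaluated on the same queries lies between these two values, since it selects exactly one expert's score per query.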
4.5 Evaluation
To systematically evaluate all investigated expert models in terms of query response times, we deployed each model on a machine equipped with 8 NVIDIA DGX H100 GPUs. Figures 2 and 3 show the BERTSim score and NLL value comparison between all routing and optimal methods.
The router vs. router comparison in Figures 2 and 3 shows that naive methods that do not learn the embedding space, such as the Random-Router or the 1NN-Router, can lead to suboptimal performance; cf. the 0.3 BERTSim score of the Random- and 1NN-Routers to the 0.4 and 0.45 of the MLP- and BERT-Routers in the Ai2-ARC dataset. Analogously, when it comes to training routing models that learn the embedding space, cf. BERT-Router to MLP-Router, more complex routing methods (i.e., BERT-Router) can lead to better outcomes and come closer to the optimal performance, especially in challenging domains like GSM8K, cf. BERT-Router's NLL value of 1.803 to MLP-Router's 2.286.
To conduct a more thorough evaluation between expert models and routing methods, in Table 1 we record all the numerical values collected throughout our experiments in terms of total monetary cost, query throughput, BERTSim score and NLL value. For every evaluation dimension, we also highlight with different colors the top-3 positions/rankings. The recorded values for the Zero-Router and the Optimal across all four dimensions are Zero-Router: {$0.161, 153.242, 0.707, 3.295} and Optimal: {$0.118, 214.925, 0.783, 2.326}; we do not report these values in the table to emphasize the ranking between routing methods and standalone models. Among the expert models, MistralAI-7B exhibits the worst cost, throughput and NLL, while the more recent small language model, Fox-1.6B, has the best cost, throughput and BERTSim score (Qwen-7B records the lowest NLL). However, independent of Fox's performance, the collective model power provided by the routing methods, especially by the BERT-Router, yields a higher BERTSim score than any standalone expert model while staying close to Fox-1.6B in cost and throughput.
Model / Router | Total Cost ($) | Throughput (tokens/sec) | BERTSim | NLL |
BioLlama-8B | $0.195 | 155.613 | 0.686 | 3.408 |
BioMistral-8B | $0.125 | 208.399 | 0.669 | 3.581 |
CodeLlama-7B | $0.156 | 102.993 | 0.694 | 3.299 |
Fox-1.6B | $0.118 | 214.925 | 0.761 | 2.958 |
MathDeepSeek-7B | $0.138 | 187.166 | 0.746 | 3.286 |
MistralAI-7B | $0.223 | 89.587 | 0.694 | 4.205 |
Qwen-7B | $0.164 | 114.008 | 0.698 | 2.326 |
Random-Router | $0.143 | 209.171 | 0.715 | 3.316 |
1NN-Router | $0.131 | 205.715 | 0.697 | 3.271 |
MLP-Router | $0.147 | 177.508 | 0.773 | 3.164 |
BERT-Router | $0.122 | 213.145 | 0.783 | 3.091 |
Using the BERT-Router as the reference routing method and the mean performance of all standalone model experts (i.e., the Zero-Router) as the baseline, the values above show that the BERT-Router reduces cost by roughly 24% ($0.161 to $0.122) and increases query inference throughput by close to 40% (153.242 to 213.145) compared to no routing at all. At the same time, the BERT-Router maintains or slightly enhances the average model performance, with an 11% higher BERT similarity score and a 6% lower NLL.
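The relative changes can be recomputed directly from the Zero-Router and BERT-Router values reported above. The sketch below is only an arithmetic check; the dictionary keys and the helper name are chosen for illustration.

```python
def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline` (0.10 means +10%)."""
    return (value - baseline) / baseline

# Values as reported in the text and Table 1.
zero_router = {"cost": 0.161, "throughput": 153.242, "bertsim": 0.707, "nll": 3.295}
bert_router = {"cost": 0.122, "throughput": 213.145, "bertsim": 0.783, "nll": 3.091}

changes = {
    metric: relative_change(zero_router[metric], bert_router[metric])
    for metric in zero_router
}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1%}")
```

Negative values for cost and NLL indicate improvements, since lower is better for both metrics.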
Finally, in Figure 4, we provide a 3D visualization of the optimization trilemma problem w.r.t. total monetary cost (x-axis), query throughput (y-axis) and model performance (z-axis). The Figure clearly shows that the BERT-Router method outperforms all other expert models and routing methods across all three evaluation criteria, while almost matching the optimal performance.
5 Unlock the Potential of Collaborative Routing: Edge-to-Cloud
So far, we have discussed routing methods that can route query prompts to the most suitable expert hosted on a cloud service. However, the bigger promise of a model router is that it can be deployed on edge devices to help decide whether a query prompt can be answered locally by a small model running on the edge or should be routed to an expert hosted on the cloud. Figure 5 shows such an architecture, where a model router is deployed locally on the edge device. The router is responsible for deciding whether a user's new request submitted to the edge device should be answered directly by a small model, such as a small language model (SLM) already running on the edge, by a larger model deployed on a cloud provider, or through a combination of the two. By applying this approach, we can significantly reduce querying and communication costs on the edge while maintaining overall model and query performance.
To provide further insight into how this edge-to-cloud collaborative approach could be realized, the heatmap in Figure 6 records the number of test queries answered by each expert model for every routing method. From the reported values, it is apparent that both the MLP-Router and the BERT-Router route most of the test queries investigated in our experimental analysis (section 4) to the Fox-1.6B small language model, which is similar to the behavior of the Optimal (oracle) approach. In contrast, approaches like the Random-Router and the 1NN-Router distribute the queries almost equally across all model experts.
If we assume that an SLM, like the Fox-1.6B model, is deployed on the edge, our analysis shows that by learning the embedding space of existing query prompts using an MLP or BERT-based routing approach, the majority of queries will be forwarded to the most suitable expert, which in this case is the SLM. As a result, deploying such routing methods on the edge enables edge devices to decide whether a query prompt should be routed to the local SLM or a cloud model expert, paving the way for the effective development of collaborative edge-to-cloud model routing techniques.
6 Conclusion
We present our multi-LLM routing system, called TO-Router, for the first time. With the TO-Router system, users can easily interact with multiple LLM expert models hosted on the same or different platform providers, without being limited to a single monolithic LLM system. By utilizing a routing method capable of learning the embedding space of query prompts, such as MLP- and BERT-based methods, users can benefit from significant cost savings (up to 30%), improved query response times (up to 40%), and enhanced model performance (up to 10%). As part of our immediate future plans, we aim to evaluate the feasibility of dynamically adding and removing model experts during the router’s endpoint deployment and test the routing efficacy of both small and large pre-trained language models.
References
- Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Beeching et al. [2023] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. Hugging Face, 2023.
- Breiman [1996] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- Chen et al. [2023] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
- Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Ding et al. [2024] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024.
- Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
- Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
- Hu et al. [2024] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system. arXiv preprint arXiv:2403.12031, 2024.
- Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Jin et al. [2019] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
- Jordan and Jacobs [1994] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
- Labrak et al. [2024] Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
- Ong et al. [2024] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665, 2024.
- Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Shnitzer et al. [2023] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.
- Srivatsa et al. [2024] KV Srivatsa, Kaushal Kumar Maurya, and Ekaterina Kochmar. Harnessing the power of multiple minds: Lessons learned from llm routing. arXiv preprint arXiv:2405.00467, 2024.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Yang et al. [2024] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671.
- Zhang et al. [2019] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
- Zhou [2012] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
Appendix A kNN-Router Diagram
Appendix B Model Pricing
Model Type | $ / 1M Input Tokens | $ / 1M Output Tokens |
DeepSeek-8B | $0.14 | $0.28 |
Fox-1.6B | $0.20 | $0.20 |
Llama-8B | $0.20 | $0.20 |
Mistral-8B | $0.25 | $0.25 |
Qwen-7B | $0.20 | $0.20 |
Appendix C Router Models Data Preparation
To generate the experts' soft labels used to train the MLP- and BERT-Router models, we used the BERT similarity scores and set the temperature value of the softmax function to 10, i.e., $T = 10$. To find the closest training query to a given test query in the case of the 1NN-Router, we compute the queries' embeddings using the sentence transformer library (https://www.sbert.net/docs/quickstart.html).
Appendix D Router Models Training Hyperparameters
The total number of experts is 7. The MLP-Router's hidden layer size is 256. The random seed for all experiments is set to 42. The applied optimizer for training both the MLP and BERT routers is Adam with weight decay, the learning rate is set to and , respectively. We also applied L2 norm regularization with . The batch size is set to 8 and the total number of training (MLP model) and fine-tuning (BERT model) epochs is set to 5. The BERT model for the router is bert-base-uncased. To counter the class/expert imbalance we observed while generating the training and testing datasets, i.e., an expert model might be more suitable to answer many more queries than other experts, we used a sample weighting function in which the weight of each sample is the inverse of the proportion of samples of its class in the entire training dataset. That is, the sample proportion for each class/expert $k$ across all $K$ experts is measured as $p_k = n_k / \sum_{j=1}^{K} n_j$, where $n_k$ is the number of training samples of class $k$, with the final weight value per training sample of class $k$ being equal to $1 / p_k$.
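The inverse-proportion weighting can be sketched as follows. It assumes that each training sample is assigned to the class of its best-scoring expert; the label list and the function name are hypothetical.

```python
from collections import Counter

def sample_weights(labels):
    """Weight each sample by the inverse of its class proportion in the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[label] for label in labels]

# Hypothetical best-expert labels for four training queries.
labels = ["Fox-1.6B", "Fox-1.6B", "Fox-1.6B", "Qwen-7B"]
weights = sample_weights(labels)
print(weights)
```

Under this scheme every class contributes the same total weight, so a rarely selected expert is not drowned out by a frequently selected one.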
Appendix E Expert Model Resources
Below, we provide details regarding the internal architecture and type of models we used as our expert models in this study. For every instructed model, if not otherwise specified, we set the maximum tokens generation length to 512, the temperature to 0.7, and the top-p parameter to 0.95.
- BioLlama-7B (https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B): This model is an advanced Llama-3-based model designed specifically for the biomedical domain. With policy optimization and a custom medical instruction dataset, it outperforms even the ChatGPT API. Following the recommended parameters, we set max new tokens to 256, temperature to 0.1 and top-p to 0.9.
- BioMistral-7B (https://huggingface.co/BioMistral/BioMistral-7B): This Mistral-based model, pre-trained using textual data from PubMed Central Open Access, is well-suited for medical domains and achieves performance comparable to the ChatGPT API across all medical evaluation benchmarks.
- CodeLlama-7B (https://huggingface.co/codellama/CodeLlama-7b-hf): This model adapts the Llama-2-7B model with a large collection of code datasets, incorporating an infilling training objective and long input context subsets.
- Fox-1.6B (https://huggingface.co/tensoropera/Fox-1-1.6B-Instruct-v0.1): Fox-1 is a decoder-only transformer-based small language model with 1.6B parameters, developed by TensorOpera AI. Fox-1-Instruct-v0.1 is an instruction-tuned version with an 8K native context length, fine-tuned with 5B tokens of instruction-following and multi-turn conversation data.
- Mistral-7B-Instruct (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2): This model is an officially released instruct fine-tuned version of the Mistral-7B-v0.2.
- Qwen-7B-Instruct (https://huggingface.co/Qwen/Qwen2-7B-Instruct): This model is an officially released instruct fine-tuned version of the Qwen2-7B.
- MathDeepSeek-7B (https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct): This model, initialized with DeepSeek-Coder-v1.5 7B, continues pre-training on math-related tokens sourced from the web, achieving impressive scores on the competition-level MATH benchmark. Following the recommended parameters, we set max new tokens to 512, top-k to 50 and top-p to 0.95.