TensorOpera Router: A Multi-Model Router
for Efficient LLM Inference

Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He
TensorOpera, Inc., Palo Alto, CA, USA

Abstract

With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response methods. Yet, no single LLM efficiently balances this trilemma. Some models are powerful but extremely costly, while others are fast and inexpensive but qualitatively inferior. To address this challenge, we present TO-Router, a non-monolithic LLM querying system that seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the highest-performing expert based on the query's requirements. Through extensive experiments, we demonstrate that, compared to standalone expert models, TO-Router improves query efficiency by up to 40% and reduces costs by up to 30%, while maintaining or enhancing model performance by up to 10%.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable performance across a diverse set of challenging domain-specific tasks [2]. However, no single LLM can outperform all others across every task and use case [21]. Recent works [11, 17, 7] highlight the urgent need for efficient tools that can unify the expertise of multiple LLMs, combining them into a single cohesive unit. Given the increased costs and latency associated with querying models hosted at different providers [4], it is critical for multi-LLM querying systems to efficiently route queries to the most appropriate LLM expert. This must be done while balancing query execution throughput, monetary costs, and model performance—a challenge we term the multi-LLM routing trilemma.

Our aim is to provide an empirical solution to this trilemma by showcasing the potential of a multi-LLM routing system that improves this balance. We propose an LLM routing system, called TensorOpera-Router (hereinafter referred to as TO-Router), to explore the feasibility of building a multi-LLM routing model that leverages the collective power of multiple LLM experts. TO-Router aims to efficiently, inexpensively, and accurately answer query prompts by selecting the most cost-effective and suitable LLM from a diverse set of expert models. Our contributions are as follows:

  • We empirically demonstrate the promise of different routing methods developed through the TO-Router system in balancing query execution time, query cost, and model performance, leading to significant gains.

  • We show that, on average, our routing system outperforms standalone model experts.

  • We demonstrate that routing methods trained to learn the embedding query space outperform naive routing methods.

  • We introduce a soft-label approach based on the BERT similarity score to train the proposed model routing methods.

  • We present a routing method based on a pre-trained BERT model that exhibits the best performance.

Figure 1: TO-Router system’s overview of router data preparation, router model training and deployment pipelines.

2 Background & Related Work

LLM Routing. Depending on the mechanism that routing methods use to decide the most suitable LLM(s) to answer a given prompt, two distinct routing categories have been recently introduced: predictive/classification routers, which do not generate LLM outputs in advance but instead predict the best LLM to handle a given prompt based on specific performance metrics [11, 17, 22], and cascading routers, which process a query request by executing it over a series or combination of LLMs [4] until specific quality criteria are met. To train predictive routers, different training methods have been recently introduced that leverage data augmentation techniques and human preference data [17] or existing benchmark datasets [21] to improve routing predictions. In this work, we too develop and evaluate predictive routing methods trained on standardized benchmark datasets to efficiently classify and direct query prompts to the best LLM expert.

Mixture-of-Experts. A typical MoE architecture [15] consists of a set of expert models trained to specialize in different data regions and a gating network model that determines the contribution of each expert to the final prediction. Recently, MoEs have witnessed wide adoption in the LLM domain as well, where multiple MLP experts are integrated into encoder and decoder blocks to boost the training of extremely large networks [20, 13, 8]. Similar to these MoE approaches, LLM routing methods can be seen as a special case of an MoE architecture, where the predictive routing model is the gating mechanism and the pool of LLMs is the set of available experts.

Ensemble Learning. Model routing also bears similarities with ensemble machine learning [27] techniques that seek to provide better predictive outcomes by combining the predictions of multiple models. A key distinction between routing and ensemble techniques, like bagging [3] and boosting [9], is that models participating in an ensemble are typically trained on the same dataset (in whole or in subsets) and are therefore assumed to have similar expertise. In contrast, the router predicts and retrieves the predictions from a varying set of LLM experts that have been trained on highly diverse data distributions.

3 TensorOpera Router System Overview

To effectively learn and deploy a multi-LLM routing model, a sequence of critical development phases needs to be executed, from data preparation to router model training and evaluation, and finally model deployment/serving. The end-to-end pipeline of the proposed TO-Router system, shown in Figure 1, facilitates these phases and in practice has substantially helped us to swiftly develop, prototype, and deploy different model routing methods in real-world settings. (Footnote 1: Source code to be released.)

Phase 1: Router Data Preparation. The generation of the training and testing dataset for the routing model is a multi-step process. First, we need to find the appropriate domain-specific (e.g., bio, coding, physical sciences) instruction datasets and the model experts to which we want the routing model to learn to route relevant query prompts. Thereafter, we perform a forward pass over each expert model (step 1) to collect the associated metrics required to train and test the routing model and create the expert prediction dataset (step 2). In this work, we collect the following metrics per instruction prompt: {negative log likelihood, BERT similarity score (BERTSim), inference time in seconds, total input tokens, total output tokens}; for more details on these metrics, please see Section 4.1. Once the expert prediction dataset is created, we select one of the collected metrics to generate soft labels (step 3) and prepare the final training and testing dataset for the routing model (step 4). In the current work, we use the BERTSim scores to create soft labels and train the routing expert model classifier. We use soft labels because we want the routing model to learn the ranking of the experts in terms of their prediction performance. To generate the soft labels of each expert model for each instruction record, we pass the selected metric (e.g., similarity score, log loss) through a softmax function with temperature.
For instance, for the $r$-th instruction record, the expert (class) softmax probability $\varphi_r$ is given by: $\varphi_r(\mathbf{x}; T) = \frac{\exp\left(\frac{x_i}{T}\right)}{\sum_{j=1}^{E} \exp\left(\frac{x_j}{T}\right)}$, where $E$ is the total number of experts, $T$ is the temperature value, and $\mathbf{x} = (x_1, x_2, \ldots, x_E)$ is the vector of metric scores. In our evaluation, we generate the experts' soft labels based on the BERT similarity scores and with a temperature value of $T = 10$.
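
The soft-label step can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `soft_labels` and the three BERTSim values are assumptions made for the example, and any rescaling of the scores before the softmax is not modeled.

```python
import math

def soft_labels(scores, temperature=10.0):
    """Softmax with temperature over one record's per-expert metric scores."""
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical BERTSim scores of three experts for a single instruction record.
bertsim = [0.76, 0.69, 0.70]

labels = soft_labels(bertsim, temperature=10.0)   # smooth labels
sharper = soft_labels(bertsim, temperature=0.01)  # nearly one-hot labels
```

With raw scores in [0, 1] and $T = 10$, the labels stay close to uniform while preserving the ranking of the experts; a smaller temperature concentrates the probability mass on the best-scoring expert.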

Phase 2: Router Training. Once the router's training and testing dataset is created, we pass the instruction records through the router's embedding model, e.g., Bag-of-Words, TF-IDF, BERT, or other small or large language models, to create their vectorized representation (step 5). Then, we use the generated embeddings to train the prompt-to-expert classifier (step 6), using non-parametric supervised learning approaches (e.g., kNN), classical deep learning models (e.g., MLP), or more advanced pre-trained language models (e.g., BERT). We provide more information on these routing models in Section 4.4.

Phase 3: Router Deployment. Once the final routing model is trained, it is deployed as a standalone endpoint on the platform (step 7), ready to receive user queries (through either the CLI or the web interface). Whenever a new user query is submitted, the router first tokenizes and encodes the text of the incoming query prompt using the tuned embedding model from Phase 2 (step 8). Subsequently, the router performs a forward pass over the trained/fine-tuned classification model (e.g., MLP, BERT) and predicts the most relevant expert model (step 9). Depending on which expert model the classification model predicts, the router selects the respective expert-prompt adaptor to submit and execute the query. Once query execution completes, the router receives the reply from the expert model and forwards it back to the end user (step 10). Throughout the router's deployment, the platform provides the monitoring capabilities needed to troubleshoot and tune the routing model, such as the number of requests, the semantic context of queries, the hit frequency of each expert model, and total costs.
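
The request path of steps 8 to 10 can be sketched as a small dispatch function. Everything named here is a hypothetical stand-in: `route_query`, the toy encoder and classifier, and the two adaptors only illustrate the control flow and do not correspond to the platform's actual interfaces.

```python
from typing import Callable, Dict, List

def route_query(prompt: str,
                encode: Callable[[str], List[float]],
                classify: Callable[[List[float]], str],
                adaptors: Dict[str, Callable[[str], str]]) -> str:
    features = encode(prompt)        # step 8: encode the incoming prompt
    expert = classify(features)      # step 9: predict the most relevant expert
    return adaptors[expert](prompt)  # step 10: execute and return the reply

# Toy stand-ins (assumptions): a real deployment would use the trained
# embedding model, the trained classifier, and real expert endpoints.
def toy_encode(prompt: str) -> List[float]:
    return [float("def " in prompt), float(len(prompt))]

def toy_classify(features: List[float]) -> str:
    return "CodeLlama-7B" if features[0] > 0 else "Fox-1.6B"

adaptors = {
    "CodeLlama-7B": lambda p: "[code expert] " + p,
    "Fox-1.6B": lambda p: "[general expert] " + p,
}

reply = route_query("def add(a, b): ...", toy_encode, toy_classify, adaptors)
```

Keeping one adaptor per expert isolates prompt formatting and provider-specific calls from the routing decision itself.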

4 Experiments

In this section we discuss the metrics, expert models, benchmark datasets and routing methods we considered for evaluating the TO-Router system.

4.1 Evaluation Criteria

All expert models and routing methods are evaluated on four dimensions: (1) total inference cost, (2) throughput, (3) BERT similarity score, and (4) negative log-likelihood (NLL).

Total Inference Cost. For any expert model, the total cost to execute a given set of test queries is measured from the input and output token costs. For a model $m$ that was prompted with a sequence of test queries using a total of $T_i$ input tokens and that generated a total of $T_o$ output tokens, with costs $c_i$ and $c_o$ per 1 million input and output tokens, respectively, the total cost for the entire test query sequence is: $C_m = \frac{T_i}{10^6} \cdot c_i + \frac{T_o}{10^6} \cdot c_o$. For routing methods, which do not use a single model to answer the sequence of test queries but route different test queries to different expert models $M$, the total cost is: $C_r = \sum_{m \in M} C_m$. To measure the querying cost of standalone deployed expert models, we took the price per million input and output tokens from different model providers.
We provide the input and output token costs per model architecture in Table 2 in the Appendix.
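
As a worked illustration of the two cost formulas, the sketch below computes $C_m$ per expert and sums them into $C_r$. The expert names, token counts, and prices are made-up values for the example, not the prices listed in the Appendix.

```python
def model_cost(input_tokens, output_tokens, cost_in_per_m, cost_out_per_m):
    """C_m: cost of one expert, with prices given per 1 million tokens."""
    return input_tokens / 1e6 * cost_in_per_m + output_tokens / 1e6 * cost_out_per_m

def router_cost(usage, prices):
    """C_r: sum of C_m over every expert that the router actually used."""
    return sum(model_cost(t_in, t_out, *prices[m])
               for m, (t_in, t_out) in usage.items())

# Hypothetical prices in $ per 1M (input, output) tokens.
prices = {"expert_a": (0.20, 0.20), "expert_b": (0.10, 0.30)}
# Hypothetical (input, output) token totals routed to each expert.
usage = {"expert_a": (500_000, 250_000), "expert_b": (1_000_000, 100_000)}

total = router_cost(usage, prices)  # 0.15 + 0.13 dollars
```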

Throughput. To measure the query execution performance of an expert model and of the different routing methods over the entire test query set, we compute the throughput of each query as the ratio of the total output tokens $T^o_m$ generated by model $m$ to the inference time in seconds $t^s_m$, i.e., the time from query submission to query completion. Specifically, the throughput for a single test query $i$ is measured as $\tau_i = \frac{T^o_m}{t^s_m}$. For the entire set of $N$ test queries, the mean throughput $\widetilde{\tau}$ is computed as: $\widetilde{\tau} = \frac{1}{N}\sum_{i}^{N} \tau_i$.
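
The mean throughput can be illustrated with a short sketch; the `(output_tokens, seconds)` pairs below are invented measurements used only to show the arithmetic.

```python
def mean_throughput(records):
    """Mean of per-query throughputs tau_i = output_tokens / inference_seconds."""
    per_query = [out_tokens / seconds for out_tokens, seconds in records]
    return sum(per_query) / len(per_query)

# Hypothetical (output_tokens, inference_seconds) measurements for three queries.
records = [(120, 0.5), (300, 2.0), (90, 0.75)]
tau_mean = mean_throughput(records)  # (240 + 150 + 120) / 3
```

Note that this averages the per-query ratios, as in the definition of $\widetilde{\tau}$, which generally differs from dividing the total number of output tokens by the total inference time.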

BERTSim. Given that each expert model uses its own vocabulary and tokenizer, and to ensure an equitable comparison between the responses generated by each expert, we evaluate the vectorized text similarity between the ground truth and the predicted answer of an expert through the cosine similarity of their BERT embeddings. Such a vector representation allows for a soft measure of similarity [26]. We refer to this similarity score as BERTSim [26]. The cosine similarity of a reference (ground truth) vector $\mathbf{x}_i$ and a candidate (predicted) vector $\widehat{\mathbf{x}}_j$ is computed as: $\frac{\mathbf{x}_i^{\top} \widehat{\mathbf{x}}_j}{\|\mathbf{x}_i\| \|\widehat{\mathbf{x}}_j\|}$. For every expert model and routing method, we measure the BERTSim score across all test queries and compute the final BERTSim score as the mean of all scores.
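
A minimal sketch of the similarity computation is given below. It assumes the reference and predicted answers have already been embedded; the two-dimensional vectors are toy placeholders for BERT embeddings, and the helper names are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between a reference and a candidate embedding."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def mean_similarity(pairs):
    """Final score: mean similarity over (reference, prediction) pairs."""
    return sum(cosine_similarity(x, y) for x, y in pairs) / len(pairs)

# Toy embedding pairs (assumed values, standing in for BERT embeddings).
pairs = [([1.0, 2.0], [2.0, 4.0]),   # same direction
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal
score = mean_similarity(pairs)
```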

Negative Log-Likelihood. We use the Negative Log-Likelihood (NLL) to measure the quality of the probabilistic predictions made by each expert model. Lower NLL values indicate that the model assigns higher probabilities to the true classes and therefore reflect better performance. In principle, a single sequence's NLL is defined as:

$\mathcal{L}_{\text{NLL}} = -\sum_{t=1}^{T} \log P(y_t \mid X, y_{1:t-1})$

where $P(y_t \mid X, y_{1:t-1})$ is the predicted probability of the $t$-th token in the sequence given the input sequence $X$ and the previous tokens $y_{1:t-1}$. In our evaluation, we measure the mean NLL over the generated sequence of every expert model and routing method across all test queries.
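
The NLL definition can be illustrated as follows. The sketch assumes the per-token probabilities are already available and averages the per-sequence sums across queries; whether the evaluation additionally normalizes by sequence length is not stated here, so that choice is an assumption of the example.

```python
import math

def sequence_nll(token_probs):
    """NLL of one sequence: negative sum of log token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def mean_nll(sequences):
    """Mean of the per-sequence NLL values across all test queries."""
    return sum(sequence_nll(s) for s in sequences) / len(sequences)

# Hypothetical per-token probabilities for two generated sequences.
sequences = [[0.5, 0.5], [1.0, 1.0]]
avg = mean_nll(sequences)
```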

4.2 Expert Models

We choose several representative models across different domains as the expert models to verify the effectiveness of our routing method in the TO-Router system. For the biomedical domain, we selected one variant of Llama-3-8B (BioLlama-7B [19]) and one variant of Mistral-7B (BioMistral-7B [16]). (Footnote 2: In our evaluation, we refer to each model using its name in bold font.) Both models achieve excellent performance across many biomedical evaluation benchmarks. In the code domain, we select Meta's officially released Llama2-7B variant trained on code datasets (CodeLlama-7B [18]). In the general instruction-following domain, we incorporate three instruction-tuned LLMs of different sizes, i.e., Fox-1.6B, a recently introduced powerful small language model, Mistral-7B-Instruct (MistralAI-7B [12]), and Qwen-7B-Instruct (Qwen-7B [25]). Finally, for the math domain, we choose a strong reasoning model trained on large amounts of math documents, MathDeepSeek-7B-Instruct (MathDeepSeek-7B [10]). For more details regarding model architectures and domain fine-tuning, please refer to Section E in the Appendix.

4.3 Datasets

All the datasets listed here are widely used by LLM developers [23, 24, 12] to evaluate model performance in commonsense reasoning, coding, and medical domains. To generate the final training and testing data for the investigated routing methods, we gather all records from all datasets and perform a stratified split of 80% train and 20% test per dataset.
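
A per-dataset 80/20 split can be sketched with the standard library alone. The record layout (a `dataset` key per record), the function name, and the fixed seed are assumptions for the example; the actual pipeline may use a different tool.

```python
import random
from collections import defaultdict

def stratified_split(records, train_fraction=0.8, seed=0):
    """Split records into train/test while keeping the ratio per dataset."""
    by_dataset = defaultdict(list)
    for rec in records:
        by_dataset[rec["dataset"]].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for name in sorted(by_dataset):
        group = by_dataset[name]
        rng.shuffle(group)                       # shuffle within each dataset
        cut = round(train_fraction * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy records: 10 from one dataset and 5 from another.
records = ([{"dataset": "GSM8k", "id": i} for i in range(10)]
           + [{"dataset": "MBPP", "id": i} for i in range(5)])
train, test = stratified_split(records)
```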

Ai2-ARC [5]. The Ai2-ARC dataset consists of 7,787 natural science questions designed for standardized tests. We use its challenge partition with 2,590 samples, which includes only those questions that were answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

GSM8k [6]. GSM8k is a high-quality dataset of grade school-level math word problems, covering relatively simple math concepts with 7,473 training and 1,319 testing samples.

MBPP [1]. The MBPP dataset contains 974 basic programming problems suitable for entry-level programmers. It also includes text descriptions of the problems and test cases for functional correctness verification.

PubMedQA [14]. The PubMedQA dataset is a biomedical question-answering dataset designed for answering research questions with yes/no/maybe responses. It contains 1,000 manually labeled question-answer pairs for cross-validation and testing.

4.4 Methods

Below, we describe the predictive routing methods we used during our evaluation.

Random-Router. To evaluate the performance of a random router, for every test query we randomly pick an expert to execute the query. After performing this step for all test queries, we repeat the entire process 10 times. Let $\mathbf{E} = (e_1, e_2, \ldots, e_N)$ be the collection of all experts; in each trial, we randomly select an expert from $\mathbf{E}$. Let $e_i^j$ denote the expert $i$ randomly selected in the $j$-th trial; then the entire random expert selection process can be represented as: $\{e_i^1, \ldots, e_i^{10}\}$. Once the collection of random experts is assembled, we submit the test query to each expert and collect all measurements to compute the evaluation metrics.
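
The random selection process can be sketched as below, under the reading that each of the 10 trials draws one expert uniformly at random for every test query. The function name and the seed are illustrative choices.

```python
import random

def random_router_trials(experts, num_queries, trials=10, seed=0):
    """assignments[j][q] is the expert drawn for query q in trial j."""
    rng = random.Random(seed)
    return [[rng.choice(experts) for _ in range(num_queries)]
            for _ in range(trials)]

experts = ["BioLlama-7B", "BioMistral-7B", "CodeLlama-7B", "Fox-1.6B",
           "MathDeepSeek-7B", "MistralAI-7B", "Qwen-7B"]
assignments = random_router_trials(experts, num_queries=4)
```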

kNN-Router. The kNN-Router first encodes all training queries $\mathbf{q_i} \in D^t$ using a sentence transformer. Then, for every test query $\mathbf{q_t}$, it finds its closest training query $\mathbf{q_i'}$ in terms of cosine similarity in the embedding space and subsequently executes the test query using the expert that exhibited the best performance for that most relevant training query. The best performing expert $\mathbf{e_i'}$ is the expert whose BERTSim score is the highest out of all the training query's experts, $q_i'(E)$:

$\mathbf{q_i'} = \arg\max_{i \in D^t} \left( \frac{\mathbf{q_i} \cdot \mathbf{q_t}}{\|\mathbf{q_i}\| \|\mathbf{q_t}\|} \right)$
$\mathbf{e_i'} = \arg\max_{j \in q_i'(E)} (BERTSim_j)$

Given that we only need to find the single most similar training query for a given test query, we subsequently refer to this method as 1NN-Router. A schematic flow of the 1NN-Router's embedding similarity and expert selection is also shown in Figure 7.
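
The nearest-neighbor lookup and the expert selection can be sketched as follows. The two-dimensional vectors stand in for sentence-transformer embeddings, and the per-query BERTSim dictionaries are invented values; only the two selection steps are meant to mirror the description above.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def one_nn_route(test_vec, train_vecs, train_bertsim):
    """Pick the best expert of the most similar training query."""
    # Step 1: most similar training query in the embedding space.
    nearest = max(range(len(train_vecs)),
                  key=lambda i: cosine(train_vecs[i], test_vec))
    # Step 2: expert with the highest BERTSim score on that training query.
    scores = train_bertsim[nearest]
    return max(scores, key=scores.get)

# Toy training embeddings and per-expert BERTSim scores (assumed values).
train_vecs = [[1.0, 0.0], [0.0, 1.0]]
train_bertsim = [{"Fox-1.6B": 0.80, "Qwen-7B": 0.70},
                 {"Fox-1.6B": 0.60, "CodeLlama-7B": 0.75}]

chosen = one_nn_route([0.9, 0.1], train_vecs, train_bertsim)
```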

MLP-Router. For learning our predictive MLP-Router, we use a simple 2-layer perceptron:

$y_k = \varphi\left( \sum_{j=1}^{m} w_{jk}^{(2)} \, \sigma\left( \sum_{i=1}^{n} w_{ij}^{(1)} x_i + b_j^{(1)} \right) + b_k^{(2)} \right)$

To train the MLP model, we convert the training queries into their vector representation by fitting a Bag-of-Words model. To learn the ranking of experts in terms of prediction performance, we use the cross-entropy loss on the scaled BERTSim scores. We use ReLU ($\sigma$) and softmax ($\varphi$) as the activation functions of the hidden and output layers, respectively.
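
The forward pass of the 2-layer perceptron and the cross-entropy against soft labels can be sketched in plain Python. The weights, the Bag-of-Words count vector, and the target soft label are toy values chosen for the example; a real implementation would typically rely on a deep learning framework and learn the weights by gradient descent.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, w1, b1, w2, b2):
    """w1[i][j]: input i -> hidden j; w2[j][k]: hidden j -> expert k."""
    hidden = relu([sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j]
                   for j in range(len(b1))])
    logits = [sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
              for k in range(len(b2))]
    return softmax(logits)

def soft_cross_entropy(pred, target):
    """Cross-entropy between predicted probabilities and a soft label."""
    return -sum(t * math.log(p) for t, p in zip(target, pred))

# Toy Bag-of-Words counts, weights, and soft label (all assumed values).
x = [1.0, 0.0, 2.0]
w1 = [[0.5, -1.0], [0.0, 0.3], [0.25, 0.5]]
b1 = [0.0, 0.1]
w2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]

y = mlp_forward(x, w1, b1, w2, b2)
loss = soft_cross_entropy(y, target=[0.6, 0.4])
```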

BERT-Router. To learn the BERT-Router, we performed full-parameter fine-tuning of a BERT model for sequence classification. We appended a classification head with a softmax activation function on top of BERT's final hidden layer outputs to map the BERT embeddings $H$ to the desired number of experts (classes):

$y = \text{softmax}(WH + b), \; H = \text{BERT}(X)$

To fine-tune BERT, we first tokenize and encode all input training queries' text sequences $X$ using the BERT tokenizer and then update the pre-trained BERT model weights for a small number of epochs using the cross-entropy loss. Similar to the MLP-Router model, we train BERT-Router using the soft labels created from the scaled BERTSim scores.

Zero-Router. Following the work of [11], we also evaluate the performance of the routing methods against the average performance of the available LLMs without any routing logic (lower bound), i.e., no-routing approach.

Optimal. We compare against two optimal cases (upper bounds): one refers to the optimal BERTSim performance per dataset (shown in Figure 2), and the other to the optimal performance recorded across all three evaluation dimensions (i.e., cost, throughput, and model performance; shown in Figure 4). In the former case, the optimal value is measured by averaging the best BERTSim score recorded for every test query by any expert. In the latter case, the optimal set of values is the minimum cost, maximum throughput, and maximum performance recorded by any expert model or router method.
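
The two reference points for BERTSim, the Zero-Router lower bound and the per-query Optimal upper bound, can be illustrated with a small sketch. The score table is invented, and the helper names are assumptions made for the example.

```python
def optimal_bertsim(per_query_scores):
    """Upper bound: average of the best score any expert reached per query."""
    return sum(max(s.values()) for s in per_query_scores) / len(per_query_scores)

def zero_router_bertsim(per_query_scores):
    """Lower bound: mean over experts of each expert's mean score."""
    experts = per_query_scores[0].keys()
    per_expert = [sum(s[e] for s in per_query_scores) / len(per_query_scores)
                  for e in experts]
    return sum(per_expert) / len(per_expert)

# Hypothetical BERTSim scores of two experts on two test queries.
scores = [{"expert_a": 0.8, "expert_b": 0.6},
          {"expert_a": 0.5, "expert_b": 0.7}]

upper = optimal_bertsim(scores)      # (0.8 + 0.7) / 2
lower = zero_router_bertsim(scores)  # (0.65 + 0.65) / 2
```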

4.5 Evaluation

To systematically evaluate all investigated expert models in terms of query response times, we deployed each model on a machine equipped with 8 NVIDIA DGX H100 GPUs. Figures 2 and 3 show the BERTSim score and NLL value comparison between all routing and optimal methods.

Figure 2: Router performance per dataset: BERT similarity score.
Figure 3: Router performance per dataset: Negative Log-Likelihood.

The router-to-router comparison in Figures 2 and 3 shows that naive methods that do not learn the embedding space, such as Random-Router or 1NN-Router, can lead to suboptimal performance; cf. the BERTSim score of 0.3 for the Random- and 1NN-Routers versus 0.4 and 0.45 for the MLP- and BERT-Routers on the Ai2-ARC dataset. Analogously, among routing models that learn the embedding space (cf. BERT-Router and MLP-Router), more complex routing methods (i.e., BERT-Router) can lead to better outcomes and come closer to the optimal performance, especially in challenging domains like GSM8k; cf. BERT-Router's NLL value of 1.803 versus MLP-Router's 2.286.

To conduct a more thorough evaluation between expert models and routing methods, in Table 1 we record all the numerical values collected throughout our experiments in terms of total monetary cost, query throughput, BERTSim score, and NLL value. For every evaluation dimension, we also highlight the top-3 positions/rankings with different colors. The recorded values for the Zero-Router and the Optimal across all four dimensions are Zero-Router: {$0.161, 153.242, 0.707, 3.295} and Optimal: {$0.118, 214.925, 0.783, 2.326}; we do not report these values in the table, in order to emphasize the ranking between routing methods and standalone models. Among the expert models, MistralAI-7B has the highest cost, the lowest throughput, and the highest NLL, while the more recent small language model, Fox-1.6B, has the lowest cost, the highest throughput, and the highest BERTSim score (Qwen-7B attains the lowest NLL). However, independent of Fox's performance, the collective model power provided by the routing methods, especially the BERT-Router method, yields a higher BERTSim score than any standalone expert model while remaining close to Fox-1.6B in cost and throughput.

Model / Router     Total Cost ($)   Throughput (tokens/s)   BERTSim   NLL
BioLlama-8B        0.195            155.613                 0.686     3.408
BioMistral-8B      0.125            208.399                 0.669     3.581
CodeLlama-7B       0.156            102.993                 0.694     3.299
Fox-1.6B           0.118            214.925                 0.761     2.958
MathDeepSeek-7B    0.138            187.166                 0.746     3.286
MistralAI-7B       0.223             89.587                 0.694     4.205
Qwen-7B            0.164            114.008                 0.698     2.326
Random-Router      0.143            209.171                 0.715     3.316
1NN-Router         0.131            205.715                 0.697     3.271
MLP-Router         0.147            177.508                 0.773     3.164
BERT-Router        0.122            213.145                 0.783     3.091

Table 1: Total querying cost, mean throughput, and cosine similarity between predicted and expected answers per model and router, considering all four benchmark datasets. Box coloring represents the following ranking column-wise: rank 1, rank 2, rank 3.
Figure 4: A holistic view of model performance, throughput and total querying cost for standalone deployed expert models and different routing methods.

Using the BERT-Router approach as the reference routing method and the mean performance of all standalone model experts (i.e., the Zero-Router) as the baseline, we find that the BERT-Router leads to close to a 30% cost reduction and a 40% increase in query inference throughput compared to no routing at all. At the same time, BERT-Router maintains or slightly enhances the mean model performance, with an 11% improvement in BERT similarity score and a 6% NLL reduction.
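
The relative changes can be recomputed from the aggregate Zero-Router and BERT-Router values reported in this section. The sketch below is a plain arithmetic check, not part of the original evaluation code, and the helper name is illustrative.

```python
def relative_change(baseline, value):
    """Signed relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# Aggregate values reported in this section: (cost $, throughput, BERTSim, NLL).
zero_router = (0.161, 153.242, 0.707, 3.295)
bert_router = (0.122, 213.145, 0.783, 3.091)

cost_pct, throughput_pct, bertsim_pct, nll_pct = (
    round(relative_change(b, v), 1) for b, v in zip(zero_router, bert_router)
)
```

Computed this way, throughput rises by about 39%, BERTSim by about 11%, and NLL falls by about 6%, while the aggregate cost falls by about 24%; the quoted cost figure of close to 30% may therefore rest on a different aggregation than the totals listed here.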

Finally, in Figure 4, we provide a 3D visualization of the optimization trilemma problem w.r.t. total monetary cost (x-axis), query throughput (y-axis) and model performance (z-axis). The Figure clearly shows that the BERT-Router method outperforms all other expert models and routing methods across all three evaluation criteria, while almost matching the optimal performance.

5 Unlock the Potential of Collaborative Routing: Edge-to-Cloud

So far, we have discussed routing methods that can route query prompts to the most suitable expert hosted on a cloud service. However, the bigger promise of a model router lies in deploying it effectively on edge devices, where it can help decide whether a query prompt can be answered locally by a small model running on the edge or should be routed to an expert hosted on the cloud. Figure 5 shows such an architecture, where a model router is deployed locally on the edge device. The router is responsible for deciding whether a user's new request submitted to the edge device should be answered directly by a small model, such as a small language model (SLM), which is already running on the edge, or by a larger model deployed on a cloud provider, or through a combination of the two. By applying this approach, we can significantly reduce querying and communication costs on the edge while maintaining overall model and query performance.

Figure 5: Answering queries locally on the edge through an SLM or proxying to the cloud.
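
The edge-side decision described above can be illustrated with a minimal sketch. It assumes the router produces one suitability score per expert and that the locally deployed SLM is Fox-1.6B; the `route_query` function, the `RoutingDecision` type, and the `local_margin` tolerance are hypothetical names introduced only for illustration and are not part of TO-Router.

```python
from dataclasses import dataclass
from typing import Dict

# Assumption: the small language model running on the edge device.
LOCAL_SLM = "Fox-1.6B"


@dataclass
class RoutingDecision:
    expert: str        # expert chosen by the router
    run_on_edge: bool  # True if the query is answered on the device


def route_query(expert_scores: Dict[str, float],
                local_margin: float = 0.0) -> RoutingDecision:
    """Keep the query on the edge unless a cloud expert scores clearly higher.

    expert_scores: router-predicted suitability per expert (higher is better).
    local_margin: hypothetical tolerance; the SLM is kept if its score is
                  within this margin of the best-scoring expert.
    """
    best_expert = max(expert_scores, key=expert_scores.get)
    local_score = expert_scores.get(LOCAL_SLM, float("-inf"))
    if local_score + local_margin >= expert_scores[best_expert]:
        return RoutingDecision(LOCAL_SLM, True)
    return RoutingDecision(best_expert, False)
```

A larger `local_margin` keeps more queries on the device at the possible expense of answer quality, which is one simple way to trade communication cost against model performance.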

To provide further insight into how this edge-to-cloud collaborative approach could materialize, the heatmap in Figure 6 records the number of test queries answered by each expert model under every routing method. The reported values show that both the MLP-Router and the BERT-Router route most of the test queries used in our experimental analysis (Section 4) to the Fox-1.6B small language model, which is similar to the behavior of the Optimal (oracle) approach. In contrast, approaches such as the Random-Router and the 1NN-Router distribute queries almost equally across all model experts.

If we assume that an SLM, like the Fox-1.6B model, is deployed on the edge, our analysis shows that by learning the embedding space of existing query prompts using an MLP or BERT-based routing approach, the majority of queries will be forwarded to the most suitable expert, which in this case is the SLM. As a result, deploying such routing methods on the edge enables edge devices to decide whether a query prompt should be routed to the local SLM or a cloud model expert, paving the way for the effective development of collaborative edge-to-cloud model routing techniques.

Figure 6: Number of test queries allocated to each model expert by each routing method.
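
As a rough illustration of how the per-method allocation counts behind such a heatmap can be tallied, the sketch below counts how many queries each routing method sends to each expert. The `allocation_counts` helper and the toy assignments are hypothetical and do not reproduce the values reported in Figure 6.

```python
from collections import Counter
from typing import Dict, List


def allocation_counts(assignments: Dict[str, List[str]],
                      experts: List[str]) -> Dict[str, List[int]]:
    """Return, per routing method, the number of queries sent to each expert."""
    table = {}
    for method, chosen in assignments.items():
        counts = Counter(chosen)
        # One column per expert, in the order given by `experts`.
        table[method] = [counts.get(e, 0) for e in experts]
    return table


# Toy, made-up assignments for four test queries (not the paper's data).
experts = ["Fox-1.6B", "Qwen-7B-Instruct", "MathDeepSeek-7B"]
assignments = {
    "BERT-Router": ["Fox-1.6B", "Fox-1.6B", "Fox-1.6B", "MathDeepSeek-7B"],
    "Random-Router": ["Fox-1.6B", "Qwen-7B-Instruct",
                      "MathDeepSeek-7B", "Qwen-7B-Instruct"],
}
table = allocation_counts(assignments, experts)
```

Each row of `table` corresponds to one routing method and each column to one expert, which matches the layout of a method-by-expert heatmap.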

6 Conclusion

We present our multi-LLM routing system, called TO-Router, for the first time. With the TO-Router system, users can easily interact with multiple LLM expert models hosted on the same or different platform providers, without being limited to a single monolithic LLM system. By utilizing a routing method capable of learning the embedding space of query prompts, such as MLP- and BERT-based methods, users can benefit from significant cost savings (up to 30%), improved query response times (up to 40%), and enhanced model performance (up to 10%). As part of our immediate future plans, we aim to evaluate the feasibility of dynamically adding and removing model experts during the router’s endpoint deployment and test the routing efficacy of both small and large pre-trained language models.

References

  • Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • Beeching et al. [2023] Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. Hugging Face, 2023.
  • Breiman [1996] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
  • Chen et al. [2023] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
  • Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Ding et al. [2024] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024.
  • Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  • Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
  • Hu et al. [2024] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system. arXiv preprint arXiv:2403.12031, 2024.
  • Jiang et al. [2023] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • Jin et al. [2019] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
  • Jordan and Jacobs [1994] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
  • Labrak et al. [2024] Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
  • Ong et al. [2024] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665, 2024.
  • Roziere et al. [2023] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  • Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Shnitzer et al. [2023] Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.
  • Srivatsa et al. [2024] KV Srivatsa, Kaushal Kumar Maurya, and Ekaterina Kochmar. Harnessing the power of multiple minds: Lessons learned from llm routing. arXiv preprint arXiv:2405.00467, 2024.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Yang et al. [2024] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671.
  • Zhang et al. [2019] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
  • Zhou [2012] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.

Appendix A kNN-Router Diagram

Figure 7: A flow diagram of the embedding similarity approach used by the 1NN-Router.
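
A minimal sketch of the nearest-neighbor lookup is given below. It assumes that query embeddings are already computed, that similarity is measured with cosine similarity, and that each training query is stored together with the expert that answered it best; the `cosine` and `route_1nn` helpers are hypothetical and only illustrate the general technique.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route_1nn(test_emb: Sequence[float],
              train: List[Tuple[Sequence[float], str]]) -> str:
    """Return the expert paired with the most similar training query."""
    _, best_expert = max(train, key=lambda item: cosine(test_emb, item[0]))
    return best_expert


# Toy 2-D embeddings paired with (assumed) best experts.
train = [([1.0, 0.0], "MathDeepSeek-7B"), ([0.0, 1.0], "BioMistral-7B")]
chosen = route_1nn([0.9, 0.1], train)
```

In practice the embeddings would come from a sentence encoder, and the linear scan would be replaced by an approximate nearest-neighbor index once the training set grows large.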

Appendix B Model Pricing

Model Type     $ / 1M Input Tokens    $ / 1M Output Tokens
DeepSeek-8B    $0.14                  $0.28
Fox-1.6B       $0.20                  $0.20
Llama-8B       $0.20                  $0.20
Mistral-8B     $0.25                  $0.25
Qwen-7B        $0.20                  $0.20
Table 2: Price per million input and output tokens for different types of model architectures.
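
To make the pricing concrete, the following sketch computes the cost of a single query from the prices in Table 2. It assumes that cost scales linearly with the number of input and output tokens; the `query_cost` helper is hypothetical, and the paper does not specify the exact procedure used to aggregate the total querying cost.

```python
# (input, output) price in USD per one million tokens, copied from Table 2.
PRICE_PER_MILLION = {
    "DeepSeek-8B": (0.14, 0.28),
    "Fox-1.6B": (0.20, 0.20),
    "Llama-8B": (0.20, 0.20),
    "Mistral-8B": (0.25, 0.25),
    "Qwen-7B": (0.20, 0.20),
}


def query_cost(model_type: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one query, assuming linear per-token pricing."""
    in_price, out_price = PRICE_PER_MILLION[model_type]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

For example, a query with 500 input tokens and 512 output tokens sent to a Fox-1.6B model type would cost (500 × 0.20 + 512 × 0.20) / 1,000,000 ≈ $0.0002 under this assumption.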

Appendix C Router Models Data Preparation

To generate the experts’ soft labels for training the MLP-Router and BERT-Router models, we used the BERT similarity scores and set the temperature of the softmax function to 10, i.e., $T=10$. To find the closest training query to a given test query in the case of the 1NN-Router, we compute the queries’ embeddings using the sentence transformer library (https://www.sbert.net/docs/quickstart.html).
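
The soft-label construction can be sketched as follows. The sketch assumes the conventional temperature-scaled softmax, in which each score is divided by $T$ before exponentiation; the paper does not spell out the exact form, and the `soft_labels` helper is hypothetical.

```python
import math
from typing import List


def soft_labels(scores: List[float], temperature: float = 10.0) -> List[float]:
    """Softmax over per-expert scores, assuming the form exp(s / T)."""
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy BERT similarity scores for three experts on one query (made up).
labels = soft_labels([0.9, 0.5, 0.1])
```

Under this form, a larger temperature flattens the resulting distribution, while the ranking of the experts is preserved.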

Appendix D Router Models Training Hyperparameters

The total number of experts is 7. The MLP-Router’s hidden layer size is 256. The random seed for all experiments is set to 42. Both the MLP and BERT routers are trained using Adam with weight decay, with the learning rate set to $5e-3$ and $5e-5$, respectively. We also applied L2 norm regularization with $\lambda = 1e-4$. The batch size is set to 8, and the number of training (MLP model) and fine-tuning (BERT model) epochs is set to 5. The BERT model for the router is bert-base-uncased. To counter the class/expert imbalance we observed while generating the training and testing datasets, i.e., an expert model might be suitable for answering many more queries than other experts, we used a sample weighting function in which the weight of each sample is inversely proportional to the number of samples of its class in the entire training dataset. That is, the weight for each class/expert $i$ across all experts $E$ is measured as $w_i = \tfrac{\sum_{j \in E} |D_j|}{|D_i|}, \forall i \in E$, where $|D_i|$ is the number of training samples of class $i$, with the final weight value per training sample being equal to $w_i = \tfrac{w_i}{\sum_{j \in E} |w_j|}, \forall i \in E$.
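
The two-step weighting above (inverse class frequency followed by normalization) can be sketched as follows. The `class_weights` helper is a hypothetical illustration that takes the list of expert labels in the training set and returns the normalized weight per class.

```python
from collections import Counter
from typing import Dict, List


def class_weights(labels: List[str]) -> Dict[str, float]:
    """Normalized inverse-frequency weight per class/expert."""
    counts = Counter(labels)                   # |D_i| for every class i
    total = sum(counts.values())               # sum over j of |D_j|
    raw = {c: total / n for c, n in counts.items()}
    norm = sum(abs(w) for w in raw.values())   # sum over j of |w_j|
    return {c: w / norm for c, w in raw.items()}


# Toy label set: expert "A" is three times as frequent as expert "B".
weights = class_weights(["A", "A", "A", "B"])
```

With three samples for "A" and one for "B", the raw weights are 4/3 and 4, which normalize to 0.25 and 0.75, so the rarer class receives the larger weight.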

Appendix E Expert Model Resources

Below, we provide details on the architecture and type of the models used as experts in this study. For every instruction-tuned model, unless otherwise specified, we set the maximum generation length to 512 tokens, the temperature to 0.7, and the top-p parameter to 0.95.

  • BioLlama-7B (https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B): This model is an advanced Llama-3-based model designed specifically for the biomedical domain. With policy optimization and a custom medical instruction dataset, it outperforms even the ChatGPT API. Following the recommended parameters, we set max new tokens to 256, temperature to 0.1, and top-p to 0.9.

  • BioMistral-7B (https://huggingface.co/BioMistral/BioMistral-7B): This Mistral-based model, pre-trained using textual data from PubMed Central Open Access, is well-suited for medical domains and achieves performance comparable to the ChatGPT API across all medical evaluation benchmarks.

  • CodeLlama-7B (https://huggingface.co/codellama/CodeLlama-7b-hf): This model adapts the Llama-2-7B model with a large collection of code datasets, incorporating an infilling training objective and long input context subsets.

  • Fox-1.6B (https://huggingface.co/tensoropera/Fox-1-1.6B-Instruct-v0.1): Fox-1 is a decoder-only transformer-based small language model with 1.6B parameters, developed by TensorOpera AI. Fox-1-Instruct-v0.1 is an instruction-tuned version with an 8K native context length, fine-tuned with 5B tokens of instruction-following and multi-turn conversation data.

  • Mistral-7B-Instruct (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2): This model is an officially released instruct fine-tuned version of the Mistral-7B-v0.2.

  • Qwen-7B-Instruct (https://huggingface.co/Qwen/Qwen2-7B-Instruct): This model is an officially released instruct fine-tuned version of the Qwen2-7B.

  • MathDeepSeek-7B (https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct): This model, initialized with DeepSeek-Coder-v1.5 7B, continues pre-training on math-related tokens sourced from the web, achieving impressive scores on the competition-level MATH benchmark. Following the recommended parameters, we set max new tokens to 512, top-k to 50, and top-p to 0.95.
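
The generation settings listed above can be summarized as a default configuration with per-expert overrides. The sketch below is only an illustration: the key names (`max_new_tokens`, `temperature`, `top_p`, `top_k`) follow common text-generation naming and are an assumption, as is the choice to let MathDeepSeek-7B inherit the default temperature, which the text does not state explicitly.

```python
from typing import Dict

# Defaults stated for every instruction-tuned expert.
DEFAULT_GENERATION: Dict[str, float] = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95,
}

# Per-expert overrides taken from the descriptions above.
OVERRIDES: Dict[str, Dict[str, float]] = {
    "BioLlama-7B": {"max_new_tokens": 256, "temperature": 0.1, "top_p": 0.9},
    # Assumption: temperature is not listed for this expert, so the default applies.
    "MathDeepSeek-7B": {"max_new_tokens": 512, "top_k": 50, "top_p": 0.95},
}


def generation_params(expert: str) -> Dict[str, float]:
    """Merge the default settings with any expert-specific overrides."""
    return {**DEFAULT_GENERATION, **OVERRIDES.get(expert, {})}
```

Keeping the overrides separate from the defaults makes it easy to see which experts deviate from the shared settings.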