Abstract
Free full text
Detecting selection in immunoglobulin sequences
Abstract
The ability to detect selection by analyzing mutation patterns in experimentally derived immunoglobulin (Ig) sequences is a critical part of many studies. Such techniques are useful not only for understanding the response to pathogens, but also to determine the role of antigen-driven selection in autoimmunity, B cell cancers and the diversification of pre-immune repertoires in certain species. Despite its importance, quantifying selection in experimentally derived sequences is fraught with difficulties. The necessary parameters for statistical tests (such as the expected frequency of replacement mutations in the absence of selection) are non-trivial to calculate, and results are not easily interpretable when analyzing more than a handful of sequences. We have developed a web server that implements our previously proposed Focused binomial test for detecting selection. Several features are integrated into the web site in order to facilitate analysis, including V(D)J germline segment identification with IMGT alignment, batch submission of sequences and integration of additional test statistics proposed by other groups. We also implement a Z-score-based statistic that increases the power of detecting selection while maintaining specificity, and further allows for the combined analysis of sequences from different germlines. The tool is freely available at http://clip.med.yale.edu/selection.
INTRODUCTION
During the course of an immune response, B cells that initially bind antigen with low affinity through their immunoglobulin (Ig) receptor are modified through cycles of somatic hypermutation and affinity-dependent selection to produce high-affinity memory and plasma cells. This affinity maturation is a critical component of T cell-dependent adaptive immune responses. It helps guard against rapidly mutating pathogens and underlies the basis for many vaccines. Somatic hypermutation is a process unique to B cells responding to antigen that results in a mutation rate that is 7–8 orders of magnitude above normal background, thus introducing about one point mutation per cell per division in the Ig receptor (1,2). Understanding the somatic hypermutation process also has applications far beyond pathogen responses. It has been found to occur in autoimmune responses, and in several proto-oncogenes (3). We have also demonstrated that somatic hypermutation can act genome-wide and thus represents a risk for genomic instability (4).
The ability to detect selection from mutated Ig sequences is a critical part of many studies. Current methods are based on comparing the observed frequency of replacement (i.e. non-synonymous) mutations to their expected frequency under the null hypothesis of no selection (5–8). Elevated levels indicate positive selection, while decreased levels indicate negative selection with significance determined by a binomial test. It is common to look for negative selection in the framework regions (FWRs), which provide the structural backbone of the receptor, and positive selection in the complementary determining regions (CDRs), where most contact residues for antigen binding are found. As the intrinsic biases of somatic hypermutation can give the appearance of selection (9), a significant challenge for these methods is calculating the expected frequency of replacements under the null hypothesis of no selection. We previously developed the Focused binomial test for detecting selection that improved upon existing methods by fully accounting for microsequence specificity and base substitution bias in somatic hypermutation (10). The Focused test also corrects for the decrease in specificity due to cross-talk in other methods by using a carefully derived null model of mutation.
Our web site implementing the Focused test has been increasingly used by a large number of groups since the initial publication of the method in 2008 (10). Here, we present an improved web server that includes many of the most requested features from users, such as V(D)J germline segment identification with IMGT alignment and batch submission of sequences. For comparison, we also integrate results from the previously proposed Local binomial (5) and multinomial (7) tests. Note that we have shown that the multinomial test is mathematically equivalent to a much simpler binomial test, referred to in this article as the Global binomial test. Along with these features, we have implemented an improved Z-score-based statistic that increases sensitivity and allows for the combined analysis of multiple-independent Ig sequences or clones.
METHODS
Preparing and submitting sequences
The input consists of the mutated sequences to be analyzed along with their associated germlines. In many cases, the experimental results consist only of the mutated sequences so the first step in preparing the data is to define the germline sequence so that individual mutations can be identified. There are numerous databases, such as IMGT/GENE-DB (11) and VBASE2 (12), that provide curated lists of Ig V(D)J germline gene segments. There are also several online tools that can infer the most likely germline rearrangements, including: SoDA (13), iHMMune-align (14) and IMGT/V-QUEST (15). Separate tests are carried out for detecting selection in the CDR and FWR. By default, the web server assumes that the sequences are aligned to conform with the IMGT unique numbering system. This allows standard definitions of CDR and FWR to be used. However, this is not required and users can choose a custom numbering to define these regions. All the sequences in a single input use the same CDR and FWR boundaries. The user may enter or upload their sequences in FASTA format. Multiple sequences sharing the same germline can be grouped together by placing the germline sequence first followed by the test sequence(s). Multiple groups of sequences sharing different germline sequences can also be placed in a single input file by placing an additional ‘>’ at the start of the header line in all the germline sequences. Users also have the option of having the germline sequence(s) appear after each set of mutated sequences. To expedite the steps of germline determination and IMGT alignment, we provide an option for users to input a set of mutated Ig sequences without associated germlines. These sequences are processed using SoDA (13) to identify the V(D)J germline gene segments and align the input and germline using IMGT numbering. Users are given the option to download the resulting FASTA-formatted alignment as a text file or directly proceed to the selection analysis.
Analyzing sequences for detecting selection
Calculating the observed number of mutations
By comparing the input sequence to its associated germline, our program identifies the mutations and determines the number of replacement (R) and silent (S) mutations. Each mutation is considered independently in its germline context when determining whether it is an R or S. R and S mutations that fall into CDR and FWR are tabulated individually using the boundaries indicated by the user. A checkbox is provided to indicate that sequences are clonally related. In this case, each group of sequences associated with a germline is analyzed as a single unit and only unique mutations are used in the analysis. That is, the same base substitution occurring at the same position in multiple sequences is counted only once.
Estimating the expected frequencies of mutations
Having computed the observed number of mutations, the next step is to compute the expected number of mutations under the null hypothesis of no selection. Expectations are computed independently for each germline sequence in the input as previously described (10). A significant advantage of the Focused test over previous methods is that our null model fully accounts for the effects of microsequence specificity (16) and also introduces the well-characterized transition bias of somatic hypermutation (17,18). Briefly, the expected number of R mutations in the CDR () is the sum of the product of an individual point mutation falling in the CDR and the probability the base substitution results in an R mutation [Equation (1)]:
where i is summed over all positions in the region (i.e. CDR or FWR) and b over all possible nucleotides ({A,C,T,G}). In this equation, is a vector containing the nucleic content of each position in the germline sequence, is the mutability index for position i in germline [as explained in (10)], Ma→b is the relative rate in which nucleotide a mutates to b (while Ma→a=0) and is an indicator function that is 1 in cases where a mutation in position i from a to b results in a replacement mutation and 0 otherwise.
The binomial framework for detecting selection
The Local binomial, multinomial and Focused binomial tests for selection all determine whether the observed number of R mutations (x), in either the CDR or FWR, is significantly different than the expected number () out of n observed mutations. The expected frequency (x/n) under the null hypothesis of no selection is denoted by p. For x/n≤p (an indication of negative selection), the significance of the test is calculated as the probability of observing x or fewer R mutations by adding half the probability density function (P) at x to the cumulative distribution function (F) at (x−1). The P-value of x/n>p (an indication of positive selection) is one minus that of negative selection.
where , and Θ(x) is 1 for x>0 and 0 elsewhere. Table 1 defines these parameters for each implemented test.
Table 1.
Test | x | n | p |
---|---|---|---|
Focused | RCDR | RCDR+SCDR+SFWR | |
Local | RCDR | RCDR+SCDR | |
Global | RCDR | RCDR+SCDR+RFWR+SFWR |
Equivalent tests for detecting selection in the FWR are obtained by swapping CDR and FWR in the definitions.
Using Z-scores to compute P-values
The standard way of calculating P-value (PBinom) described in the previous section is conservative in the sense that the resulting specificity is often greater than the cutoff used to obtain it. This limits sensitivity and is particularly a problem when the total number of mutations per sequence is relatively low (~10), as is common for many Ig data sets, but is still significant for dozens of mutations. To correct this problem, we have implemented a Z-score-based method for computing the P-value. The Z score is defined as follows: for a random variable xi corresponding to the x for sequence or clone i as defined in Table 1, the associated zi score (which is itself a random variable) is defined as:
whereμi=pini and are the mean and standard deviation of xi, respectively. In order to obtain a P-value from the zi, we use the normal approximation and get:
where erf is the Gauss error function; the reader is referred to Figure 1 to see graphically the different P-values definitions [normal approximation and the traditional way of defining it for a binomial distribution, Equation (2)]. We call this the Focused Z test (Pz) to distinguish it from the Focused binomial test (PBinom).
Detecting selection in groups of sequences
Our web server also implements a test for detecting selection in a group of independent sequences. This can be helpful to improve sensitivity if the sequences do not exhibit statistically significant selection when analyzed individually. Previous tools did not allow for such analysis since grouping P-values computed by Equation (2) from sequences with different p's and n's cannot be done for a fixed P-value cutoff in a simple way. However, this can be done using the Z-score approach, through the application of Stouffer's Z-score method (19), which has proven to be superior to alternatives (20). The resulting P value tests whether the collection of observed mutations came from binomial distributions with associated probabilities pi. To analyze G-independent sequences, Z is defined as the mean of the previously computed Z-scores zi, i1…G from Equation (3). Since the xi's in this equation are independent, the expected mean and variance of Z are 0 and , respectively. Using the normal approximation once again, the resulting P value for detecting selection is:
By default, this test is applied to compute P values for each group of sequences associated with a single germline, as well as for the set of all sequences provided in the input. In order to compare with the original Focused binomial test, a checkbox allows the use of PBinom for individual sequences. If the user indicates that sequences are clonally related then the set of clones is analyzed as a group (in which case, G will be the number of germlines provided in the input).
PERFORMANCE RESULTS
We have previously shown, using a simulation-based validation strategy, that the Focused binomial test provides the best trade-off between sensitivity and specificity compared with other available methods (10). Furthermore, we found that the Focused binomial test is able to detect selection in experimentally derived Ig sequences that have undergone affinity maturation, while maintaining good specificity on non-functional Ig sequences where selection is not a force (10). An independent study has corroborated these findings (21). As described below, we use a simulation-based validation strategy to evaluate the performance characteristics of our updated Z-score-based methods.
Simulation of somatic hypermutation and selection
Simulation enables us to produce synthetic sequence data with a prescribed number of somatic mutations subject to varying amounts of positive and/or negative selection pressures. The simulation is initiated with a single IMGT formatted Ig V germline sequence. Mutations are introduced one-by-one along the entire length of the sequence (excluding gaps) in two steps. First, the position is chosen stochastically based on the microsequence specificity of each nucleotide. Second, the particular substitution is determined accounting for transition bias. Selection (ξ) is implemented separately in the CDR and FWR as uniformly increasing or decreasing by a log factor the expected probability of the expected frequency of R mutations in each region:
where pregion is the expected frequency of R mutations defined by the Local test (Table 1) and πregion is the actual probability applied to the simulated sequences in the region of interest (i.e. CDR or FWR). For example, ξCDR values of −1, 0 and 1 yield synthetic data with negative, neutral and positive selection in the CDR, respectively.
Using Z-scores improves sensitivity
To evaluate performance, a synthetic data set of mutated sequences was generated. This data set included 10000 sequences with strong positive selection in the CDR (ξCDR=1), and 10000 sequences with neutral selection in the CDR (ξCDR=0), allowing us to evaluate sensitivity and specificity, respectively. All sequences included strong negative selection in the FWR (ξFWR=−1). Each simulation starts with the IgHV7–3 germline sequence and introduces between 5 and 35 mutations per sequence, stochastically determined to reflect the numbers seen in experimental data. Results starting with other germline segments are similar. Plotting the fraction of sequences predicted as being positively selected for varying alpha cut-off values in the first data set against the second produces a receiver operating characteristic (ROC) curve. The ROC curves (Figure 2) for both the Focused binomial test (yellow) and the Focused Z test (red) are comparable, confirming the validity of the Z-score-based method. However, the position on the ROC curve where the α cutoff is 0.05 for the Focused binomial test (yellow X) falls short of the expected 1−specificity of 0.05, indicated by the dashed line. Applying the Z-score-based method corrects for this discrepancy leading to an improvement in sensitivity of ~10%, as shown by the red X marking the same position on the ROC curve for the Focused Z test. The accuracy of the Focused Z test at α=0.05 is 0.72 while it is 0.68 for the Focused binomial test. Along with the Focused test, a checkbox on our web site also allows users to include results from the Local and Global (also known as the multinomial) tests (Table 1). However, it is important to note that these tests, as originally proposed, did not fully include for microsequence specificity or substitution biases when computing the expected frequencies of mutations, whereas our implementation accounts for these intrinsic biases. In addition, P-values for all the approaches are calculated using the Z-score-based method. Only results from the Focused test are output by default since this method exhibits better sensitivity than the Local test. Furthermore, we strongly caution against using the Global test because it fails to maintain the defined specificity [Figure 3 and (10)], but nevertheless make its output available for comparative studies.
Grouping independent sequences improves senstivity
To see how analyzing sequences as a group would effect performance, we randomly sample 10000 groups of G sequences and compute the PZ,G for each group. This is representative of testing for selection acting collectively on groups of independent Ig sequences. Using these sets of grouped sequences, we produced the ROC curves shown in Figure 2 (inset). It is evident that combining even small numbers of independent sequences can dramatically increase sensitivity without compromising specificity.
IMPLEMENTATION
The web interface makes use of PHP, JavaScript, CSS and AJAX technologies. All the statistics are computed in the back-end using R version 2.9.0 (22). The web site may be accessed using any modern web browser, such as Mozilla Firefox, Google Chrome, Safari and Internet Explorer.
DISCUSSION
The ability to analyze mutation patterns in Ig sequences and detect antigen-driven selection is critical to understanding adaptive immunity. We have presented a web site that makes available our previously published Focused test, along with the Local and Global tests based on statistics proposed by other groups (5–7). Consistent with previous results from our own group and others (10,21), a simulation-based validation found that the Focused test exhibited the best performance. The web site offers an integrated pipeline where users can carry out V(D)J identification with IMGT alignment using SoDA (13), quantify the mutational load in their sequences and analyze the mutations for evidence of positive and/or negative selection (Figure 4). The ability to carry out batch processing and analyze related sequences as a single group will be critical to gain insights from large-scale data sets generated by emerging techniques such as expression cloning (23) and deep sequencing of B cell repertoires (24).
FUNDING
National Institutes of Health (R03AI092379-01) (in part). Funding for open access charge: National Institutes of Health (R03AI092379-01).
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We would like to thank Thomas Kepler and Supriya Munshaw for providing us with SoDA.
REFERENCES
Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
Full text links
Read article at publisher's site: https://doi.org/10.1093/nar/gkr413
Read article for free, from open access legal sources, via Unpaywall: https://academic.oup.com/nar/article-pdf/39/suppl_2/W499/7630067/gkr413.pdf
Citations & impact
Impact metrics
Citations of article over time
Alternative metrics
Discover the attention surrounding your research
https://www.altmetric.com/details/101864831
Smart citations by scite.ai
Explore citation contexts and check if this article has been
supported or disputed.
https://scite.ai/reports/10.1093/nar/gkr413
Article citations
Partial RAG deficiency in humans induces dysregulated peripheral lymphocyte development and humoral tolerance defect with accumulation of T-bet+ B cells.
Nat Immunol, 23(8):1256-1272, 28 Jul 2022
Cited by: 14 articles | PMID: 35902638 | PMCID: PMC9355881
Human B cell lineages associated with germinal centers following influenza vaccination are measurably evolving.
Elife, 10:e70873, 17 Nov 2021
Cited by: 26 articles | PMID: 34787567 | PMCID: PMC8741214
Post-Transformation IGHV-IGHD-IGHJ Mutations in Chronic Lymphocytic Leukemia B Cells: Implications for Mutational Mechanisms and Impact on Clinical Course.
Front Oncol, 11:640731, 25 May 2021
Cited by: 12 articles | PMID: 34113563 | PMCID: PMC8186829
Combining phage display with SMRTbell next-generation sequencing for the rapid discovery of functional scFv fragments.
MAbs, 13(1):1864084, 01 Jan 2021
Cited by: 9 articles | PMID: 33382949 | PMCID: PMC7781620
Using B cell receptor lineage structures to predict affinity.
PLoS Comput Biol, 16(11):e1008391, 11 Nov 2020
Cited by: 8 articles | PMID: 33175831 | PMCID: PMC7682889
Go to all (59) article citations
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
IMGT(®) tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS.
Methods Mol Biol, 882:569-604, 01 Jan 2012
Cited by: 314 articles | PMID: 22665256
Quantifying selection in high-throughput Immunoglobulin sequencing data sets.
Nucleic Acids Res, 40(17):e134, 27 May 2012
Cited by: 129 articles | PMID: 22641856 | PMCID: PMC3458526
Integrating B cell lineage information into statistical tests for detecting selection in Ig sequences.
J Immunol, 192(3):867-874, 27 Dec 2013
Cited by: 25 articles | PMID: 24376267 | PMCID: PMC4363135
Affinity war: forging immunoglobulin repertoires.
Curr Opin Immunol, 57:32-39, 25 Jan 2019
Cited by: 7 articles | PMID: 30690255 | PMCID: PMC6511487
Review Free full text in Europe PMC
Funding
Funders who supported this work.
NIAID NIH HHS (4)
Grant ID: R03 AI092379-02
Grant ID: R03 AI092379-01
Grant ID: R03 AI092379
Grant ID: R03AI092379-01