Europe PMC

This website requires cookies, and the limited processing of your personal data in order to function. By using the site you are agreeing to this as outlined in our privacy notice and cookie policy.

Abstract 


Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution by using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.

Free full text 


Logo of pnasLink to Publisher's site
Proc Natl Acad Sci U S A. 2006 Apr 18; 103(16): 6263–6268.
Published online 2006 Apr 10. https://doi.org/10.1073/pnas.0508279103
PMCID: PMC1458866
PMID: 16606848

A Dirichlet process model for detecting positive selection in protein-coding DNA sequences

Abstract

Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution by using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.

Natural selection leaves a detectable signature in comparisons of protein-coding DNA sequences; a bias in the ratio of the rates of nonsynonymous and synonymous substitutions is unambiguous evidence for natural selection. Purifying selection causes the rate of nonsynonymous substitution to be smaller than the rate of synonymous substitution. In fact, the predominant pattern found in analysis of alignments of protein-coding DNA is that nonsynonymous substitutions have a lower rate than synonymous substitutions. This finding is consistent with natural selection acting to eliminate deleterious mutations that change the protein to a less functional form. However, natural selection can also act to increase the probability that a nonsynonymous mutation is fixed in the population. Positive, or direction, selection causes the relative rates of nonsynonymous to synonymous substitutions to be >1. Examples of positive selection have been found in many genes but perhaps most famously in human major histocompatibility complex (MHC) (1), HIV-1 envelope (env) gene (2), sperm lysins (3), and primate stomach lysozymes (4).

Although seemingly simple, detecting positive natural selection from alignments of protein-coding DNA is recognized as a formidable statistical problem. Many methods have been proposed to detect the footprint of natural selection, all of which are based on measuring the relative rates or numbers of nonsynonymous and synonymous substitutions (2, 520). Most of the methods assume a constant rate of nonsynonymous and synonymous substitutions across the sequence. These methods are ill-suited for detecting positive natural selection when only a few of the amino acid positions in a gene are under the influence of positive natural selection, with the others under purifying selection. Applying a method that assumes a constant rate of nonsynonymous change across the sequence potentially masks the signature of positive natural selection that is present at only a few positions in the alignment (21). In cases where such methods have been successful, many sites are typically under strong positive selection (e.g., MHC). More recently, Nielsen and Yang (2) developed a method that allows the rate of nonsynonymous substitution to vary across the sequence. The method of Nielsen and Yang has proven useful for detecting positive natural selection in sequences where only a few sites are under directional selection.

The Nielsen and Yang (2) approach assumes that the nonsynonymous/synonymous rate ratio (dN/dS ratio) at a site is a random variable drawn from some probability distribution. The goal is to estimate parameters of the model from alignments of protein-coding DNA and, by using these parameter estimates, identify amino acids under positive selection. Their model is quite complicated and contains many parameters to estimate. Some of the parameters account for the fact that the sequences are related to one another through some unknown phylogenetic tree. Other parameters account for biases in the substitution process, such as an increased rate of mutation for transitions. The remaining parameters, however, describe how dN/dS varies across the sequence. Nielsen and Yang showed not only that it is practical to efficiently estimate these parameters from alignments of protein-coding DNA sequences but also that one can use an empirical Bayesian approach to detect specific amino acid residues that are under the influence of positive natural selection.

How should variation in the rate of nonsynonymous substitution across a sequence be modeled? Population genetics theory is largely silent on this issue. For the most part, we lack information on the distribution of selection coefficients for new mutations and the demography of the populations under consideration, making it difficult to predict a distribution for the rate of nonsynonymous substitution (22, 23). The original Nielsen and Yang (2) approach assumed that a site could be in one of three categories, each of which differed in its dN/dS rate ratio. With probability p1, a site is in category 1, which has dN/dS = 0; with probability p2, the site is in category 2, which has dN/dS = 1; and with probability p3, the site is in a category that has dN/dS > 1 (p1 + p2 + p3 = 1). Nielsen and Yang (2) also considered a model that allowed dN/dS to vary continuously on the interval (0, 1); a gamma distribution, truncated on the interval (0. 1) was used to model amino acid sites that are acting neutrally or under the influence of purifying natural selection. Under this model, a site has dN/dS drawn from a truncated gamma distribution with probability p1 or has dN/dS > 1 with probability p2. Later, Yang et al. (12) systematically explored more ways to model variation in dN/dS across a sequence. They considered a total of 13 models. The “M10” model from Yang et al. (12), for example, assumes that, with probability p1, the dN/dS rate ratio is drawn from a beta distribution on the interval (0, 1) and, with probability p2, the dN/dS rate ratio is drawn from an offset gamma distribution. Even though many of the models considered in Nielsen and Yang (2) and in Yang et al. (12) describe dN/dS as varying continuously, in practice, these continuous distributions are discretized to allow the likelihood to be calculated (the likelihood is averaged over the different discrete categories). At least as currently implemented, then, all of these models describing how the rate of nonsynonymous substitution varies across the sequence are discrete.

We take a different approach to modeling variation in dN/dS across sites, allowing sites to be in one of a number of classes, with each class having a different dN/dS ratio. The prior probability distribution for the number of classes and the dN/dS for each class is described by a Dirichlet process prior. The Dirichlet process prior provides a flexible way to model situations in which the data elements are drawn from a mixture of simpler parametric distributions. For typical mixture models, the number of mixture components is assumed to be known, and determining the correct number of components a priori for a particular model is difficult. For the Dirichlet process prior, however, the number of mixture components is countably infinite, obviating the need to determine the correct number of mixture components. Here we apply the Dirichlet process to the problem of detecting positive selection in protein-coding DNA sequences. Instead of assuming that the dN/dS rate ratio for a site is drawn from a particular parametric distribution, as is the case for Nielsen and Yang (2) and Yang et al. (12), we assume that a site is assigned to a category with a particular dN/dS value. The number of selection categories and the dN/dS value for each category are both considered random variables in our model. The Dirichlet process has been used in one other case for an evolutionary problem: Lartillot and Philippe (24) used the Dirichlet process to model variation in the substitution process across alignments of amino acid sequences. Similarly, Pond and Frost (25) described a flexible discretization scheme that is able to fit a wide range of rate distributions. However, this scheme still requires that the maximum number of rate classes be specified a priori.

We estimate the parameters of the model in a Bayesian framework. Calculating the joint posterior probability distribution of the parameters involves summation over all possible phylogenetic trees relating the sequences and, for each tree, integration over all possible combinations of parameter values. We use Markov chain Monte Carlo (MCMC) to approximate the posterior probability distribution of the parameters and apply the method to six alignments of protein-coding DNA sequences (12, 17, 26, 27).

Results and Discussion

The Dirichlet process prior has two components: One is a parameter, usually called α, that influences the probability that data elements find themselves in the same cluster. In other words, the parameter controls the “clumpiness” of the process. The other component of a Dirichlet process prior is a probability distribution that describes the probability of the parameter assigned to each cluster. (The Dirichlet process prior is described in more detail in Materials and Methods.) We examined the robustness of the inferences of positive selection to three different choices for α, choosing α such that the prior mean of the number of components (k) was E(k) ≈ 1, E(k) = 5, and E(k) = 10; we did this to keep the number of selection classes to a manageable number because the likelihood calculations become too computationally cumbersome when k is large (e.g., k > 25). An alternative that we did not explore is to place a prior probability distribution on α. Often in problems using a Dirichlet process prior, a gamma hyperprior is placed on α.

Tables 116 compare the posterior and prior probability distributions for the number of mixture components (selection categories) for each of the six alignments. For all of the alignments we examined, there is very little posterior probability for k = 1 and k = 2, even when there may be a substantial amount of prior probability for those cases. The data are very difficult to explain with only a few selection classes.

Table 1.

Prior and posterior probabilities for the number of selection classes, k, for HIV-1 vif

kE(k) ≈ 1 E(k|X) = 3.04
E(k) = 5.00 E(K|X) = 6.33
E(k) = 10.00 E(k|X) = 10.21
f(k)f(k|X)f(k)f(k|X)f(k)f(k|X)
10.94340.00000.01390.00000.00000.0000
20.05500.00000.06480.00000.00040.0000
30.00150.96340.14390.03340.00260.0002
40.00000.03630.20400.10960.00960.0027
50.00000.00020.20860.19290.02560.0134
60.00000.00000.16470.22920.05280.0371
70.00000.00000.10500.19670.08790.0751
80.00000.00000.05560.12880.12160.1225
90.00000.00000.02510.06670.14320.1561
100.00000.00000.00980.02860.14600.1630
>100.00000.00000.00470.01430.38280.4085

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 2.

Prior and posterior probabilities for the number of selection classes, k, for HIV-1 pol

kE(k) ≈ 1 E(k|X) = 3.22
E(k) = 5.00 E(k|X) = 5.99
E(k) = 10.00 E(k|X) = 9.58
f(k)f(k|X)f(k)f(k|X)f(k)f(k|X)
10.92850.00000.01570.00000.00010.0000
20.06900.00060.06860.00000.00060.0000
30.00250.78370.14590.03130.00330.0006
40.00010.20960.20140.15180.01140.0055
50.00000.00620.20360.23160.02850.0252
60.00000.00010.16110.24670.05570.0607
70.00000.00000.10410.16880.08910.1018
80.00000.00000.05660.09790.11990.1404
90.00000.00000.02650.04670.13880.1615
100.00000.00000.01080.01610.14050.1599
>100.00000.00000.00570.00930.37970.3349

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 3.

Prior and posterior probabilities for the number of selection classes, k, for HIV-1 env

kE(k) ≈ 1 E(k|X) = 2.16
E(k) = 5.00 E(k|X) = 6.06
E(k) = 10.00 E(k|X) = 10.61
f(k)f(k|X)f(k)f(k|X)f(k)f(k|X)
10.95050.00000.01240.00000.00000.0000
20.04830.85320.06130.00390.00030.0000
30.00120.13500.14150.04300.00200.0001
40.00000.01170.20570.13280.00820.0014
50.00000.00010.21290.21470.02320.0085
60.00000.00000.16820.23440.05020.0265
70.00000.00000.10600.18200.08670.0600
80.00000.00000.05500.10690.12330.1024
90.00000.00000.02400.05100.14760.1399
100.00000.00000.00900.02210.15130.1635
>100.00000.00000.00390.00910.38440.4688

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 4.

Prior and posterior probabilities for the number of selection classes, k, for human influenza virus hemagglutinin

kE(k) ≈ 1 E(k|X) = 2.24
E(k) = 5.00 E(k|X) = 5.50
E(k) = 10.00 E(k|X) = 9.46
f(k)f(k|X)f(k)f(k|X)f(k)f(k|X)
10.93830.00000.01490.00000.00000.0000
20.05980.77440.06730.00480.00050.0000
30.00180.21460.14600.07750.00290.0020
40.00000.01070.20370.19630.01030.0062
50.00000.00030.20640.26700.02670.0244
60.00000.00000.16250.21240.05400.0580
70.00000.00000.10370.13550.08830.1156
80.00000.00000.05530.06200.12080.1564
90.00000.00000.02520.03010.14120.1720
100.00000.00000.01000.01040.14360.1571
>100.00000.00000.00500.00380.38200.2962

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 5.

Prior and posterior probabilities for the number of selection classes, k, for Japanese encephalitis virus env

kE(k) ≈ 1 E(k|X) = 2.01
E(k) = 5.00 E(k|X) = 3.58
E(k) = 10.00 E(k|X) = 5.44
f(k)f(k|X)f(k)f(k|X)f(k)f(k|X)
10.93440.00000.01490.00000.00010.0000
20.06350.98490.06690.18320.00060.0223
30.00210.01460.14450.34970.00320.1089
40.00000.00050.20170.26270.01100.1863
50.00000.00000.20520.13650.02810.2201
60.00000.00000.16270.05000.05560.2066
70.00000.00000.10490.01390.08970.1348
80.00000.00000.05680.00320.12130.0721
90.00000.00000.02630.00080.14060.0316
100.00000.00000.01060.00000.14210.0117
>100.00000.00000.00540.00000.37740.0057

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 6.

Prior and posterior probabilities for the number of selection classes, k, for vertebrate β-globin

kE(k) ≈ 1 E(k|X) = 3.03
E(k) = 5.00 E(k|X) = 5.65
E(k) = 10.00 E(k|X) = 8.95
f(k)f(k|X)f(k)f(k|X)f(k)f(k|X)
10.94620.00000.01320.00000.00000.0000
20.05250.00000.06300.00000.00040.0000
30.00140.97160.14220.05240.00240.0008
40.00000.02800.20390.18110.00920.0094
50.00000.00040.21000.26870.02500.0323
60.00000.00000.16640.23430.05240.0816
70.00000.00000.10600.14740.08810.1395
80.00000.00000.05600.07320.12270.1767
90.00000.00000.02500.03070.14490.1797
100.00000.00000.00970.00860.14770.1493
>100.00000.00000.00470.00350.38150.2276

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

We also examined the marginal posterior probability of ω = dN/dS for each site. Fig. 1 shows the probability distribution for four arbitrarily chosen sites from the HIV-1 env alignment. Three things are apparent: First, and as expected, different sites for the same data set have different probability distributions for ω. Second, the distribution for ω varies depending on the value of the parameter used in the analysis. The distributions were similar when the prior expectation of the number of selection classes was 5 or 10 but were different when the prior expectation was 1. In short, when the model attempts to explain the data with too few selection categories, the marginal posterior probability distribution for ω at a site becomes a compromise between the necessities for that site and others that are grouped into the same selection category. Third, and perhaps more importantly, the marginal posterior probability distribution of ω for a site can be used to directly calculate the probability that the site experiences positive selection. The probability that ω > 1 is the probability that the site is under positive selection. This quantity can be directly calculated from the output of the MCMC procedure as the fraction of the time the site had ω > 1. Fig. 2 shows the probability that each site is under positive selection for all of the analyses of this paper. There is strong concordance between the sites that we find to be under positive selection with the sites that Yang et al. (12) found to be under positive selection. Yang et al. (12) found that analysis of different probability distributions describing how ω varies across a sequence would typically result in a set of sites that always were found to be under positive selection and another set of sites that were inferred to be under positive selection for some models but not others. In general, the Dirichlet process prior picks out the sites that were consistently found to be under positive selection by Yang et al. (12) but assigns lower probabilities to these same sites of being under positive selection. For example, the Yang et al. (12) analysis of the HIV-1 env alignment consistently found sites 28, 66, and 87 to be under positive selection (with probabilities of 0.99 or greater), and sites 26 and 51 to be under positive selection with probability 0.99 under the M12 model. (The M12 model is a mixture of two normal distributions with a discrete category with ω = 0.) Our analysis of the HIV-1 env alignment finds sites 26, 28, 51, 66, 83, and 87 to be under positive selection, all having a probability of >0.95 and having ω > 1. Our analysis does not condition on the maximum likelihood values of the parameters (the tree, branch lengths, and substitution model parameters) as is the case of the Nielsen and Yang (2) approach. It is likely that the accommodation of uncertainty in the model parameters causes the probabilities of sites being in particular categories to be dampened relative to approaches that do not account for parameter uncertainty.

An external file that holds a picture, illustration, etc.
Object name is zpq0150618900001.jpg

The posterior probability density distribution of ω, the dN/dS rate ratio, for four sites of the HIV-1 env alignment. The lightly shaded portion of each distribution is the part that has ω > 1.

An external file that holds a picture, illustration, etc.
Object name is zpq0150618900002.jpg

The posterior probabilities of sites being under positive selection for each of the analyses of the six alignments of this study. The graphs are grouped by alignment, with each group consisting of three graphs. The top graph of each group has E(k) ≈ 1, the middle graph has E(k) = 5, and the bottom graph has E(k) = 10.

Table 7 lists sites that had a high probability of being under positive selection for all six genes. For the most part, the same sites are found to be under positive selection regardless of the value of the concentration parameter used in the analysis. For example, sites 26, 28, 51, 66, 83, and 87 of the HIV-1 env alignment were inferred to be under positive selection regardless of the value of α assumed in the analysis. Sites 24, 68, 69, and 76 had a probability >0.95 of being under positive selection when E(k) ≈ 1 but not when E(k) = 5 or E(k) = 10. However, the probability of those sites being under positive selection was just below the 0.95 threshold. (Sites 24, 68, 69, and 76 had probabilities ranging between 0.88 and 0.93 of having ω > 1 when the expected number of selection categories was set to 5 or 10.)

Table 7.

Sites potentially under positive selection

DataE(k)Sites with probability >0.95 of being under positive selection
Vertebrate β-globin1
5
10
Japanese encephalitis virus env1
5
10
Human influenza virus hemagglutinin1
5226, 135
10226, 135
HIV-1 env128, 66, 26, 87, 51, 83, 76, 69, 68, 24
528, 66, 26, 87, 83, 51
1028, 66, 26, 87, 83, 51
HIV-1 pol167, 347, 478, 779, 568, 761
567, 347, 779, 478, 3, 568
1067, 347, 779, 478, 3, 568
HIV-1 vif133, 167, 33, 127, 39, 109, 122, 47, 92, 37
533, 167, 127, 31, 37, 109, 39, 122, 92, 47, 63
1033, 127, 167, 31, 37, 109, 122, 39, 92, 47

Methods for detecting the presence of positive natural selection in protein-coding DNA have become an important tool in studies of molecular evolution. The recent advances that allow the nonsynonymous/synonymous rate ratio to vary across the sequence have opened up the possibility of detecting specific amino acid residues that are functionally important, displaying an elevated dN/dS rate ratio. The method we describe here represents an important extension of existing methods by allowing a more flexible description of how dN/dS varies across a sequence and by accounting for uncertainty in parameters of the model when making inferences of positive selection.

Materials and Methods

Data.

We assume an alignment of protein-coding DNA sequences is available. The alignment is contained in the matrix X = {xij}, where i = 1, 2, …, s and j = 1, 2, …, c; s is the number of sequences, and c is the number of codons in each sequence. The information at the jth site is contained in the vector xj = (x1j, x2j, …, xsj)′.

Phylogenetic Model.

The s species are related to one another through an unknown phylogeny, represented as a bifurcating tree. Possible phylogenies are denoted τ1, τ2, …, τB(s), where B(s) is the number of possible phylogenetic trees for s species. We assume a time-reversible model of nucleotide substitution (see below), which means that the phylogenetic trees can be arbitrarily rooted without changing the likelihood. Therefore, we assume unrooted phylogenetic trees throughout this study. The number of unrooted phylogenetic trees for s species is B(s) = (2s − 5)!/[2s−3(s − 3)!] (28). Each unrooted tree has 2s − 3 branches. Each branch on the tree has a length, ν. The set of branch lengths for the jth tree is denoted νj = (ν1, ν2, …, ν2s−3).

We assume that substitutions occur on the phylogenetic tree according to a continuous-time Markov process. The instantaneous rate of change from codon i to codon j is contained in a matrix, Q, the entries of which are given by

equation image

where ω is the nonsynonymous/synonymous rate ratio, κ is the transition/transversion rate ratio, and πj is the stationary frequency of codon j (2, 14, 15). The rate matrix, Q, is 61 × 61 in dimension; although there are 64 possible codon triplets, the three termination codons of the universal code are excluded from the state space, leaving the 61 sense codons as the states of the substitution process. The diagonal elements of the rate matrix Q are specified such that each row sums to 0. We rescale the matrix such that the average rate of synonymous substitution is 1; this means that the branch lengths are in terms of expected number of synonymous substitutions per codon site. This scaling is different from the usual for phylogenies, where the rate matrix is rescaled such that the branch lengths are in terms of expected number of substitutions per site. Finally, note that the substitution process is time-reversible because πiqij = πjqji for all ij. Practically speaking, this means that we can use unrooted phylogenies instead of rooted trees when calculating the likelihood on the tree (29). The transition probabilities of the process are the probability of finding the process in codon j conditioned on the process starting in codon i and run over a branch of length ν. The transition probabilities can be calculated as P(ν) = e.

Each of the c sites can be in one of k selection categories, where each category differs in its dN/dS value. The information on the category membership is contained in an association vector, z. For example, for the alignment of four sites given above, one possible assignment of sites to selection categories is z = (1, 1, 2, 3). Here, k = 3, and the first two sites are in the same category. The allocation vector describes a partition of the sites into different classes, each class with its own dN/dS value. The number of ways to partition a set of c sites into k classes is described by the Stirling numbers of the second kind (30):

equation image

The sites for the example alignment of c = 4 sites can be assigned to k = 3 categories in a total of S(4, 3) = 6 ways: z = (1, 1, 2, 3); z = (1, 2, 1, 3); z = (1, 2, 3, 1); z = (1, 2, 2, 3); z = (1, 2, 3, 2); and z = (1, 2, 3, 3). We label possible partitions of the sites into classes by using the restricted growth function notation of Stanton and White (31), where elements are sequentially numbered with the constraint that the index numbers for two sites are the same if they are found in the same category. The number of sites in the ith category is denoted ηi. For the example given above, η1 = 2, η2 = 1, and η3 = 1.

Likelihood.

Assuming that substitutions are independent across sites, the likelihood for an alignment can be calculated as the product of the site likelihoods:

equation image

We used the Felsenstein (29) pruning algorithm to calculate likelihoods for specific combinations of parameter values.

Statistical Model.

We estimate parameters in a Bayesian framework. All parameters of the model are considered random values with their own prior and posterior probability distributions. The joint posterior probability of all of the parameters of the model is

equation image

The posterior probability [f(· |X)] is equal to the product of the likelihood [L(·) or, equivalently, f(X| ·)] and the prior probability distribution [f(·)] of the parameters divided by the marginal likelihood [f(X)].

Bayesian analysis requires that a prior probability distribution be specified for the parameters. Here we assume that the parameters take the prior probability distributions shown in Table 8.

Table 8.

The prior probability distributions assigned to parameters of the phylogenetic model

ParameterSymbolPrior
Tree topoologyτAll trees have equal prior probability
Branch lengthsνBranch lengths are independent exponential (10) random variables
Codon frequenciesπFlat Dirichlet distribution
Transition/transversion ratioκRatio of two identically distributed exponential random variables
dN/dS rate ratioωRatio of two identically distributed exponential random variables
Category informationz,kDirichlet process
Dirichlet process parameterαFixed such that E(k) is small [E(k) ≈ 1, E(k) = 5, E(k) = 10]

The model, as described, is fairly typical of codon models implemented to date (32). However, the Dirichlet process prior is new to the problem of detecting positive natural selection and requires more explanation. The Dirichlet process prior is widely used in clustering problems as a nonparametric alternative, where the data elements are drawn from a mixture distribution with a countably infinite number of components (33, 34). [Ewens and Tavaré (35) describe the relationship between the Dirichlet process prior and the Ewen’s sampling formula from population genetics.] The Dirichlet process prior has two parameters: a concentration parameter, which, loosely speaking, determines the degree to which data elements are grouped together, and a base measure, G0(·), which describes the probability distribution of the parameter value associated with each cluster. The model can be described as follows: First, the number of classes, k, and an allocation variable, z, are randomly drawn from the probability distribution

equation image

Once the sites have been partitioned into k categories, a ω = dN/dS value is assigned to each category by drawing randomly from the following probability distribution k times, G0i) = f(ωi) = [1/(1 + ωi)2] which is the probability density of the ratio of two identically distributed exponential random variables. Here, ωi is the dN/dS rate ratio for the ith selection category. The probability of having k categories is obtained by summing over all possible partitions for k categories, and is

equation image

where nak is the absolute values of the Stirling numbers of the first kind. The expected number of categories is

equation image

Finally, the probability of finding two sites grouped together into the same cluster is f(zi = zj|α, c) = 1/(1 + α)].

MCMC.

We approximated posterior probabilities by using MCMC (36, 37). The general idea is to construct a Markov chain that has as its state space the parameters of the statistical model and a stationary distribution that is the posterior probability of the parameters. Samples drawn from the Markov chain when at stationarity are valid, albeit dependent, draws from the posterior probability distribution of the parameters (38). The fraction of the time the Markov chain visits any particular combination of parameter values is an approximation of the joint posterior probability of those parameters.

We wrote a computer program that implements the MCMC method for the phylogenetic model described above. For each iteration of the Markov chain, the program randomly picks a parameter to change, proposes a new value for the parameter, and then accepts or rejects the proposed state as the next state of the chain by using the Metropolis–Hastings formula. Most of the proposal mechanisms have been described elsewhere. We use the LOCAL mechanism of Larget and Simon (39) to simultaneously change the tree and branch lengths (τ, ν). We update the codon frequencies (π) by using a Dirichlet proposal mechanism and change the transition/transversion rate ratio (κ) and the category dN/dS values (ω) by using a rate multiplier [i.e., we multiply the old parameter value by eλ(u−½) to obtain a new value for the parameter, where u is a uniform(0, 1) random number and λ is a tuning parameter]. Finally, we update the allocation vector by using a Gibbs sampling algorithm for nonconjugate models proposed by Neal (40). We implement the method of Neal (40) as follows: First, we randomly choose a site, i, and remove it from the set of k selection classes currently in computer memory. If the site is in a selection class by itself (i.e., ηzi for the class was 1), then the entire class is deleted, and k is decreased by 1. Otherwise, ηzi for the selection class that the site is associated with is decreased by 1. We then construct five new (temporary auxiliary) classes by drawing the dN/dS value for each from the prior probability distribution. These auxiliary classes represent components that have no other sites associated with them. Gibbs sampling is performed to update site i but with respect to the distribution that includes the temporary auxiliary parameters. The site is assigned to the jth previously existing class with probability Zη−if(xij) and to the jth of five auxiliary classes with probability Z(α/5)f(xij), where Z is the appropriate normalizing constant. After this update, all dN/dS values not associated with a selection class are discarded.

All analyses were run for a total of 2 million update cycles. Chains were thinned, with samples taken every 100 cycles. Samples taken during the first 100,000 cycles were discarded as the burn-in phase. The MCMC procedure was repeated, resulting in two sets of samples for each gene. Convergence was assessed by using the program tracer (http://evolve.zoo.ox.ac.uk/software.html?id=tracer).

Data Analysis.

We analyzed six alignments of protein-coding DNA sequences. These alignments included: (i) β-globin gene sequences from s = 17 vertebrates (12); (ii) s = 23 env gene sequences from Japanese encephalitis virus (12, 26); (iii) s = 28 sequences of the HA1 domain of the hemagglutinin gene of human influenza virus A (17); (iv) s = 13 sequences of the HIV-1 env gene V3 region (27); (v) HIV-1 pol gene sequences from s = 23 isolates (12); and (vi) s = 29 HIV-1 vif gene sequences (12).

Acknowledgments

J.P.H. was supported by National Institutes of Health Grant GM-069801 and National Science Foundation Grants DEB-0445453 and CACTTOL-22035A.

Abbreviation

MCMCMarkov chain Monte Carlo.

Footnotes

Conflict of interest statement: No conflicts declared.

This paper was submitted directly (Track II) to the PNAS office.

References

1. Hughes A., Nei M. Nature. 1988;335:167–170. [Abstract] [Google Scholar]
2. Nielsen R., Yang Z. Genetics. 1998;148:929–936. [Europe PMC free article] [Abstract] [Google Scholar]
3. Lee Y. H., Ota T., Vaquier V. Mol. Biol. Evol. 1995;12:231–238. [Abstract] [Google Scholar]
4. Messier W., Stewart C. B. Nature. 1997;385:151–154. [Abstract] [Google Scholar]
5. Miyata T., Yasunaga T. J. Mol. Evol. 1980;16:23–36. [Abstract] [Google Scholar]
6. Nei M., Gojobori T. Mol. Biol. Evol. 1986;3:418–426. [Abstract] [Google Scholar]
7. Li W. H., Wu M., Luo C. L. Mol. Biol. Evol. 1985;2:150–174. [Abstract] [Google Scholar]
8. Li W. H. J. Mol. Evol. 1993;36:96–99. [Abstract] [Google Scholar]
9. Pamilo P., Bianchi N. O. Mol. Biol. Evol. 1993;10:271–281. [Abstract] [Google Scholar]
10. Comeron J. M. J. Mol. Evol. 1995;41:1152–1159. [Abstract] [Google Scholar]
11. Ina Y. J. Mol. Evol. 1995;40:190–226. [Abstract] [Google Scholar]
12. Yang Z., Nielsen R., Goldman N., Pedersen A. Mol. Biol. Evol. 2000;17:32–43. [Abstract] [Google Scholar]
13. Yang Z., Nielsen R. Genetics. 2000;155:431–449. [Europe PMC free article] [Abstract] [Google Scholar]
14. Goldman N., Yang Z. Mol. Biol. Evol. 1994;11:725–736. [Abstract] [Google Scholar]
15. Muse S. V., Gaut B. S. Mol. Biol. Evol. 1994;11:715–724. [Abstract] [Google Scholar]
16. Yang Z. Mol. Biol. Evol. 1998;15:568–573. [Abstract] [Google Scholar]
17. Fitch W. M., Bush R. M., Bender C. A., Cox N. J. Proc. Natl. Acad. Sci. USA. 1997;94:7712–7718. [Europe PMC free article] [Abstract] [Google Scholar]
18. Suzuki Y., Gojobori T. Mol. Biol. Evol. 1999;16:1315–1328. [Abstract] [Google Scholar]
19. Pond S. K., Muse S. V. Mol. Biol. Evol. 2005;22:2375–2385. [Abstract] [Google Scholar]
20. Pond S. L. K., Frost S. D. W. Mol. Biol. Evol. 2005;22:1208–1222. [Abstract] [Google Scholar]
21. Crandall K. A., Kelsey C. R., Imamichi H., Lane H. C., Salzman N. P. Mol. Biol. Evol. 1999;16:372–382. [Abstract] [Google Scholar]
22. Sawyer S., Hartl D. Genetics. 1992;132:1161–1176. [Europe PMC free article] [Abstract] [Google Scholar]
23. Nielsen R., Yang Z. Mol. Biol. Evol. 2003;20:1231–1239. [Abstract] [Google Scholar]
24. Lartillot N., Philippe H. Mol. Biol. Evol. 2004;21:1095–1109. [Abstract] [Google Scholar]
25. Pond S. L. K., Frost S. D. W. Mol. Biol. Evol. 2005;22:223–234. [Abstract] [Google Scholar]
26. Zanotto P. M., Kallas E. G., Souza R. F., Holmes E. C. Genetics. 1996;153:1077–1089. [Europe PMC free article] [Abstract] [Google Scholar]
27. Leitner T., Kumar S., Albert J. J. Virol. 1997;71:4761–4770. [Europe PMC free article] [Abstract] [Google Scholar]
28. Schröder E. Z. Math. Phys. 1870;15:361–376. [Google Scholar]
29. Felsenstein J. J. Mol. Evol. 1981;17:368–376. [Abstract] [Google Scholar]
30. Abramowitz M., Stegun I. A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover; 1972. [Google Scholar]
31. Stanton D., White D. Constructive Combinatorics. New York: Springer; 1986. [Google Scholar]
32. Huelsenbeck J. P., Dyer K. A. J. Mol. Evol. 2004;58:661–672. [Abstract] [Google Scholar]
33. Ferguson T. S. Ann. Stat. 1973;1:209–230. [Google Scholar]
34. Antoniak C. E. Ann. Stat. 1974;2:1152–1174. [Google Scholar]
35. Ewens W. J., Tavaré S. In: Encyclopedia of Statistical Science. Kotz S., Read C. B., Banks D. L., editors. New York: Wiley; 1998. pp. 230–234. [Google Scholar]
36. Metropolis N., Rosenbluth A. W., Rosenbluth M. N., Teller A. H., Teller E. J. Chem. Phys. 1953;21:1087–1092. [Google Scholar]
37. Hastings W. K. Biometrika. 1970;57:97–109. [Google Scholar]
38. Tierney L. In: Markov Chain Monte Carlo in Practice. Gilks W. R., Richardson S., Spiegelhalter D. J., editors. New York: Chapman & Hall; 1996. pp. 59–74. [Google Scholar]
39. Larget B., Simon D. L. Mol. Biol. Evol. 1999;16:750–759. [Google Scholar]
40. Neal R. M. J. Comput. Graph. Stat. 2000;9:249–265. [Google Scholar]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

Citations & impact 


Impact metrics

Jump to Citations

Citations of article over time

Alternative metrics

Altmetric item for https://www.altmetric.com/details/1365958
Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/1365958

Smart citations by scite.ai
Smart citations by scite.ai include citation statements extracted from the full text of the citing article. The number of the statements may be higher than the number of citations provided by EuropePMC if one paper cites another multiple times or lower if scite has not yet processed some of the citing articles.
Explore citation contexts and check if this article has been supported or disputed.
https://scite.ai/reports/10.1073/pnas.0508279103

Supporting
Mentioning
Contrasting
1
74
0

Article citations


Go to all (48) article citations

Funding 


Funders who supported this work.

NIGMS NIH HHS (2)