A Dirichlet process model for detecting positive selection in protein-coding DNA sequences.

Huelsenbeck JP; Jain S; Frost SW; Pond SL

doi:10.1073/pnas.0508279103


Affiliations

1. Section of Ecology, Behavior, and Evolution, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, USA.

Proceedings of the National Academy of Sciences of the United States of America, 10 Apr 2006, 103(16):6263-6268
https://doi.org/10.1073/pnas.0508279103 PMID: 16606848 PMCID: PMC1458866


Abstract

Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution by using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.


Proc Natl Acad Sci U S A. 2006 Apr 18; 103(16): 6263–6268.

Published online 2006 Apr 10. https://doi.org/10.1073/pnas.0508279103



John P. Huelsenbeck,*† Sonia Jain,‡ Simon W. D. Frost,§ and Sergei L. Kosakovsky Pond§



Natural selection leaves a detectable signature in comparisons of protein-coding DNA sequences; a bias in the ratio of the rates of nonsynonymous and synonymous substitutions is unambiguous evidence for natural selection. Purifying selection causes the rate of nonsynonymous substitution to be smaller than the rate of synonymous substitution. In fact, the predominant pattern found in analysis of alignments of protein-coding DNA is that nonsynonymous substitutions have a lower rate than synonymous substitutions. This finding is consistent with natural selection acting to eliminate deleterious mutations that change the protein to a less functional form. However, natural selection can also act to increase the probability that a nonsynonymous mutation is fixed in the population. Positive, or directional, selection causes the ratio of the nonsynonymous to the synonymous substitution rate to be >1. Examples of positive selection have been found in many genes but perhaps most famously in the human major histocompatibility complex (MHC) (1), the HIV-1 envelope (env) gene (2), sperm lysins (3), and primate stomach lysozymes (4).

Although seemingly simple, detecting positive natural selection from alignments of protein-coding DNA is recognized as a formidable statistical problem. Many methods have been proposed to detect the footprint of natural selection, all of which are based on measuring the relative rates or numbers of nonsynonymous and synonymous substitutions (2, 5–20). Most of the methods assume a constant rate of nonsynonymous and synonymous substitutions across the sequence. These methods are ill-suited for detecting positive natural selection when only a few of the amino acid positions in a gene are under the influence of positive natural selection, with the others under purifying selection. Applying a method that assumes a constant rate of nonsynonymous change across the sequence potentially masks the signature of positive natural selection that is present at only a few positions in the alignment (21). In cases where such methods have been successful, many sites are typically under strong positive selection (e.g., MHC). More recently, Nielsen and Yang (2) developed a method that allows the rate of nonsynonymous substitution to vary across the sequence. The method of Nielsen and Yang has proven useful for detecting positive natural selection in sequences where only a few sites are under directional selection.

The Nielsen and Yang (2) approach assumes that the nonsynonymous/synonymous rate ratio (d_N/d_S ratio) at a site is a random variable drawn from some probability distribution. The goal is to estimate parameters of the model from alignments of protein-coding DNA and, by using these parameter estimates, identify amino acids under positive selection. Their model is quite complicated and contains many parameters to estimate. Some of the parameters account for the fact that the sequences are related to one another through some unknown phylogenetic tree. Other parameters account for biases in the substitution process, such as an increased rate of mutation for transitions. The remaining parameters, however, describe how d_N/d_S varies across the sequence. Nielsen and Yang showed not only that it is practical to efficiently estimate these parameters from alignments of protein-coding DNA sequences but also that one can use an empirical Bayesian approach to detect specific amino acid residues that are under the influence of positive natural selection.

How should variation in the rate of nonsynonymous substitution across a sequence be modeled? Population genetics theory is largely silent on this issue. For the most part, we lack information on the distribution of selection coefficients for new mutations and the demography of the populations under consideration, making it difficult to predict a distribution for the rate of nonsynonymous substitution (22, 23). The original Nielsen and Yang (2) approach assumed that a site could be in one of three categories, each of which differed in its d_N/d_S rate ratio. With probability p₁, a site is in category 1, which has d_N/d_S = 0; with probability p₂, the site is in category 2, which has d_N/d_S = 1; and with probability p₃, the site is in a category that has d_N/d_S > 1 (p₁ + p₂ + p₃ = 1). Nielsen and Yang (2) also considered a model that allowed d_N/d_S to vary continuously on the interval (0, 1); a gamma distribution, truncated on the interval (0, 1), was used to model amino acid sites that are evolving neutrally or under the influence of purifying natural selection. Under this model, a site has d_N/d_S drawn from a truncated gamma distribution with probability p₁ or has d_N/d_S > 1 with probability p₂. Later, Yang et al. (12) systematically explored more ways to model variation in d_N/d_S across a sequence. They considered a total of 13 models. The “M10” model from Yang et al. (12), for example, assumes that, with probability p₁, the d_N/d_S rate ratio is drawn from a beta distribution on the interval (0, 1) and, with probability p₂, the d_N/d_S rate ratio is drawn from an offset gamma distribution. Even though many of the models considered in Nielsen and Yang (2) and in Yang et al. (12) describe d_N/d_S as varying continuously, in practice, these continuous distributions are discretized to allow the likelihood to be calculated (the likelihood is averaged over the different discrete categories). At least as currently implemented, then, all of these models describing how the rate of nonsynonymous substitution varies across the sequence are discrete.
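
The averaging over discrete categories described above can be illustrated with a minimal Python sketch. It assumes the per-category likelihoods of a site have already been computed elsewhere (in practice they would come from a codon substitution model evaluated on a phylogenetic tree); the function names and the numbers are hypothetical and are not taken from any of the cited implementations. The second function shows the kind of per-site posterior class probability that an empirical Bayesian approach would report.

```python
import math


def site_likelihood(class_probs, class_likelihoods):
    """Average the per-class likelihoods of one site over the discrete classes."""
    if not math.isclose(sum(class_probs), 1.0):
        raise ValueError("class probabilities must sum to 1")
    return sum(p * lik for p, lik in zip(class_probs, class_likelihoods))


def class_posteriors(class_probs, class_likelihoods):
    """Posterior probability that the site belongs to each class (Bayes' rule)."""
    total = site_likelihood(class_probs, class_likelihoods)
    return [p * lik / total for p, lik in zip(class_probs, class_likelihoods)]


# Hypothetical three-category example: d_N/d_S = 0, d_N/d_S = 1, d_N/d_S > 1.
p = [0.6, 0.3, 0.1]            # made-up mixing proportions p1, p2, p3
lik = [1e-6, 4e-6, 2e-5]       # made-up likelihoods of one site under each class

post = class_posteriors(p, lik)
```

With these made-up inputs, the site is assigned most of its posterior probability to the third category even though that category has the smallest prior weight, because the site's data are far more probable under it.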

We take a different approach to modeling variation in d_N/d_S across sites, allowing sites to be in one of a number of classes, with each class having a different d_N/d_S ratio. The prior probability distribution for the number of classes and the d_N/d_S for each class is described by a Dirichlet process prior. The Dirichlet process prior provides a flexible way to model situations in which the data elements are drawn from a mixture of simpler parametric distributions. For typical mixture models, the number of mixture components is assumed to be known, and determining the correct number of components a priori for a particular model is difficult. For the Dirichlet process prior, however, the number of mixture components is countably infinite, obviating the need to determine the correct number of mixture components. Here we apply the Dirichlet process to the problem of detecting positive selection in protein-coding DNA sequences. Instead of assuming that the d_N/d_S rate ratio for a site is drawn from a particular parametric distribution, as is the case for Nielsen and Yang (2) and Yang et al. (12), we assume that a site is assigned to a category with a particular d_N/d_S value. The number of selection categories and the d_N/d_S value for each category are both considered random variables in our model. The Dirichlet process has been used in one other case for an evolutionary problem: Lartillot and Philippe (24) used the Dirichlet process to model variation in the substitution process across alignments of amino acid sequences. Similarly, Pond and Frost (25) described a flexible discretization scheme that is able to fit a wide range of rate distributions. However, this scheme still requires that the maximum number of rate classes be specified a priori.
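
One common way to picture a draw from a Dirichlet process prior is the Chinese restaurant process, in which each site either joins an existing class with probability proportional to the class size or opens a new class with probability proportional to the concentration parameter. The Python sketch below simulates such a draw. It is an illustration only: the function names are hypothetical, and the exponential distribution used for the per-class d_N/d_S values is an assumption made for the example, not the base distribution specified by the authors.

```python
import random


def sample_crp_assignments(n_sites, alpha, rng):
    """Assign n_sites to classes by simulating a Chinese restaurant process."""
    assignments = []
    class_sizes = []
    for _ in range(n_sites):
        # Existing class j is chosen with weight equal to its current size;
        # a brand-new class is chosen with weight alpha.
        weights = class_sizes + [alpha]
        label = rng.choices(range(len(weights)), weights=weights)[0]
        if label == len(class_sizes):
            class_sizes.append(1)      # open a new class
        else:
            class_sizes[label] += 1    # join an existing class
        assignments.append(label)
    return assignments, class_sizes


def sample_class_omegas(n_classes, rng, rate=1.0):
    """Draw one d_N/d_S value per class from an ASSUMED exponential base distribution."""
    return [rng.expovariate(rate) for _ in range(n_classes)]


rng = random.Random(0)
assignments, class_sizes = sample_crp_assignments(100, 1.0, rng)
omegas = sample_class_omegas(len(class_sizes), rng)
```

Neither the number of classes nor their d_N/d_S values is fixed in advance here; both come out of the random draw, which is the property the text relies on.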

We estimate the parameters of the model in a Bayesian framework. Calculating the joint posterior probability distribution of the parameters involves summation over all possible phylogenetic trees relating the sequences and, for each tree, integration over all possible combinations of parameter values. We use Markov chain Monte Carlo (MCMC) to approximate the posterior probability distribution of the parameters and apply the method to six alignments of protein-coding DNA sequences (12, 17, 26, 27).

Results and Discussion

The Dirichlet process prior has two components: One is a parameter, usually called α, that influences the probability that data elements find themselves in the same cluster. In other words, the parameter controls the “clumpiness” of the process. The other component of a Dirichlet process prior is a probability distribution that describes the probability of the parameter assigned to each cluster. (The Dirichlet process prior is described in more detail in Materials and Methods.) We examined the robustness of the inferences of positive selection to three different choices for α, choosing α such that the prior mean of the number of components (k) was E(k) ≈ 1, E(k) = 5, and E(k) = 10; we did this to keep the number of selection classes to a manageable number because the likelihood calculations become too computationally cumbersome when k is large (e.g., k > 25). An alternative that we did not explore is to place a prior probability distribution on α. Often in problems using a Dirichlet process prior, a gamma hyperprior is placed on α.
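
The link between α and the prior mean number of classes can be sketched numerically. For n data elements, the standard result for the Dirichlet process is E(k) = Σ α/(α + i), summed over i = 0, …, n − 1, which exceeds 1 for any α > 0 when n > 1 and increases with α. The Python sketch below inverts this relationship by bisection; the function names and the alignment length of 200 sites are hypothetical and chosen only for illustration, and the authors' own procedure for choosing α may differ.

```python
def expected_num_classes(alpha, n_sites):
    """Prior mean number of classes for n_sites under a Dirichlet process."""
    return sum(alpha / (alpha + i) for i in range(n_sites))


def alpha_for_expected_classes(target_k, n_sites, tol=1e-10):
    """Find alpha such that E(k) equals target_k, by bisection."""
    if not 1.0 < target_k < n_sites:
        raise ValueError("target_k must lie strictly between 1 and n_sites")
    lo, hi = 1e-12, 1.0
    # E(k) increases with alpha, so grow the upper bound until it brackets the target.
    while expected_num_classes(hi, n_sites) < target_k:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_num_classes(mid, n_sites) < target_k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


n_sites = 200  # hypothetical number of sites
alpha_5 = alpha_for_expected_classes(5.0, n_sites)
alpha_10 = alpha_for_expected_classes(10.0, n_sites)
```

Because E(k) reaches exactly 1 only in the limit α → 0, a target such as E(k) ≈ 1 corresponds to a very small positive α rather than to an exact solution.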

Tables 1–6 compare the posterior and prior probability distributions for the number of mixture components (selection categories) for each of the six alignments. For all of the alignments we examined, there is very little posterior probability for k = 1 and k = 2, even when there may be a substantial amount of prior probability for those cases. The data are very difficult to explain with only a few selection classes.