MLGO: phylogeny reconstruction and ancestral inference from gene-order data
Abstract
Background
The rapid accumulation of whole-genome data has renewed interest in the study of using gene-order data for phylogenetic analyses and ancestral reconstruction. Current software and web servers typically do not support duplication and loss events along with rearrangements.
Results
MLGO (Maximum Likelihood for Gene-Order Analysis) is a web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO is based on likelihood computation and shows advantages over existing methods in terms of accuracy, scalability and flexibility.
Conclusions
To the best of our knowledge, it is the first web tool for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. The web tool is available from http://www.geneorder.org/server.php.
Background
As whole genomes are sequenced at increasing rates, using gene-order data (see endnote a) for phylogenetic analyses and ancestral reconstruction is attracting increasing interest. Comparative genomics, evolutionary biology, and cancer research all require tools to elucidate the history and consequences of large-scale genomic changes, such as rearrangements, duplications, and losses. However, using gene-order data has proved far more challenging than using sequence data, and numerous problems plague existing methods: oversimplified models, poor accuracy, poor scaling, lack of robustness, and lack of statistical assessment.
Genome rearrangement operations change the ordering of genes on chromosomes. An inversion operation (also called reversal) reverses both the order and orientation of a segment of a chromosome. A transposition is an operation that swaps two adjacent segments of a chromosome. In the case of multiple chromosomes, a translocation breaks a chromosome and reattaches a part to another chromosome, while a fusion joins two chromosomes and a fission splits one chromosome into two. Yancopoulos et al. [1] proposed a universal double-cut-and-join (DCJ) operation that accounts for all rearrangements used to date. None of these operations alters the gene content of genomes. In contrast, deletions (or losses) remove a segment of one or more contiguous genes from a chromosome, insertions introduce a segment of one or more contiguous genes from external sources into a chromosome, and duplications copy an existing segment of the genome and insert the copy into a chromosome. Finally, whole genome duplication (WGD) creates an additional copy of the entire genome of a species.
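The operations described above can be illustrated on a single chromosome stored as a list of signed integers. The following is a minimal sketch, not code from MLGO; the function names and the half-open index convention (`chrom[i:j]`) are assumptions made for illustration.

```python
def inversion(chrom, i, j):
    """Reverse the order and flip the orientation of the segment chrom[i:j]."""
    segment = [-g for g in reversed(chrom[i:j])]
    return chrom[:i] + segment + chrom[j:]


def transposition(chrom, i, j, k):
    """Swap the two adjacent segments chrom[i:j] and chrom[j:k]."""
    return chrom[:i] + chrom[j:k] + chrom[i:j] + chrom[k:]


def deletion(chrom, i, j):
    """Remove the contiguous segment chrom[i:j]; gene content changes."""
    return chrom[:i] + chrom[j:]


def duplication(chrom, i, j, pos):
    """Copy the segment chrom[i:j] and insert the copy before index pos."""
    return chrom[:pos] + chrom[i:j] + chrom[pos:]


genome = [1, 2, 3, 4, 5]
print(inversion(genome, 1, 4))         # [1, -4, -3, -2, 5]
print(transposition(genome, 0, 2, 5))  # [3, 4, 5, 1, 2]
print(deletion(genome, 1, 3))          # [1, 4, 5]
print(duplication(genome, 1, 3, 5))    # [1, 2, 3, 4, 5, 2, 3]
```

Inversion and transposition keep the same set of genes, whereas deletion and duplication change the gene content, which is the distinction the paragraph above draws.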
As phylogenies play a central role in biological research, over the past decade many methods were developed to reconstruct phylogenies from gene-order data. The first algorithm for phylogeny inference from gene-order data was BPAnalysis based on breakpoint distances [2]. Moret et al. [3] later extended this approach with GRAPPA by using inversion distances. While these methods were limited to unichromosomal genomes, Bourque and Pevzner [4] developed MGR to handle multichromosomal genomes. These approaches are parsimony-based: they solve the so-called Big Parsimony Problem (BPP) and all suffer from serious scalability issues. In contrast with parsimony-based methods, distance-based methods run in time polynomial in the number and size of genomes. Lin et al. [5] have demonstrated the accuracy and scalability of a distance-based method that uses NJ [6] and FastME [7] with an accurate distance estimator [8]. Instead of working directly with the evolutionary events of the model, one can also transform the problem into the familiar sequence-based reconstruction problem. Wang et al. [9] first proposed a parsimony-based approach, MPBE (Maximum Parsimony on Binary Encoding). Recently Hu et al. [10] developed MLBE, later refined by Lin et al. [11] with MLWD, both of which demonstrate that using maximum-likelihood approaches is the decisive factor in improving the modest accuracy of MPBE.
If the tree is fixed, then computing its parsimony score is known as the Small Parsimony Problem (SPP). Ancestral reconstruction has been studied through several optimization schemes for SPP on gene-order data—using adjacencies [12–15], using conserved intervals (Roci—Reconstruction of Conserved Intervals [16]), using multiple breakpoint graphs (MGRA [17]) and supporting whole-genome duplications [18,19], where continuous regions or complete ancestral genomes have been inferred.
Relatively few of these tools are offered through web servers. Lin et al. [20] developed a web-server version of MGR with new heuristics to speed up the original MGR algorithm, but the site is no longer accessible. Both Roci and MGRA (for ancestral reconstruction only) are offered through web servers, but neither can handle complex events such as gene insertions, deletions and duplications.
We present a new tool, MLGO, for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO relies on two methods we have developed: MLWD [11] for phylogenetic reconstruction and PMAG+ [21] for ancestral genome reconstruction. Our tool takes advantage of a binary encoding of gene-order data, supports a fairly general model of genomic evolution (rearrangements plus duplications, insertions, and losses of genomic regions), and fits within the framework of maximum likelihood. The results of extensive testing on both simulated and real data show that both MLWD and PMAG+ achieve high accuracy, scalability and flexibility, suggesting that MLGO is a suitable tool for large-scale analysis of high-resolution data. Furthermore, MLGO is deployed as a web service, providing the first web tool that is suitable for large-scale genomic analysis with a general model of evolution.
Implementation
MLGO preprocesses the gene-order data, configures the transition model, reconstructs a phylogeny, and finally solves the SPP on that phylogeny.
Terminology
Given a set of n genes labeled as {1,2,…,n}, gene-order data for a genome consists of lists of genes in the order in which they are placed along one or more chromosomes. Each gene is assigned an orientation that is either positive, written i, or negative, written −i. Two genes i and j form an adjacency (i,j) if i is immediately followed by j, or, equivalently, −j is immediately followed by −i. If gene k lies at one end of a linear chromosome, we let k be adjacent to an extremity o that marks the beginning or ending of the chromosome, written as (o,k) or (k,o), and called a telomere.
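These definitions can be made concrete with a short sketch that lists the adjacencies and telomeres of one chromosome. It is an illustration only: the use of 0 for the extremity o and the choice of canonical form (the smaller of (i,j) and (−j,−i)) are assumptions, chosen so that a chromosome and its reverse complement yield the same set.

```python
EXTREMITY = 0  # stands for the extremity "o"; assumes genes are labeled 1..n


def canonical(a, b):
    """Return one fixed representative of the equivalent pairs (a, b) and (-b, -a)."""
    return min((a, b), (-b, -a))


def adjacencies(chromosome, linear=True):
    """List the adjacencies (and telomeres, if linear) of a signed gene order."""
    genes = list(chromosome)
    if linear:
        # Cap both ends with the extremity so that telomeres appear as pairs.
        path = [EXTREMITY] + genes + [EXTREMITY]
        pairs = zip(path, path[1:])
    else:
        # Circular chromosome: the last gene is followed by the first.
        pairs = zip(genes, genes[1:] + genes[:1])
    return [canonical(a, b) for a, b in pairs]


forward = adjacencies([1, -3, 2])
reverse = adjacencies([-2, 3, -1])  # the same chromosome read from the other end
print(sorted(forward) == sorted(reverse))  # True
```

Because −0 equals 0, the same canonical form also identifies the telomere (k,o) with (o,−k).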
Phylogeny reconstruction
The data preprocessing and the configuration of the transition model follow the approach of MLWD [11]. Each adjacency that appears at least once in the collection of input genomes corresponds to a unique character position in the sequence, and the presence or absence of each adjacency in a given genome is coded by a 1 (presence) or a 0 (absence). Since our encodings are binary sequences, the parameters of the model are simply the transition probability from presence (1) to absence (0) and that from absence (0) to presence (1). Lin et al. [11] gave the following derivation for these parameters. A DCJ operation selects uniformly at random two adjacencies (or telomeres) and replaces them by two new adjacencies (or telomeres). Since a genome with n genes and O(1) chromosomes has n+O(1) adjacencies and telomeres, the transition probability from 1 to 0 is $\frac{2}{n+O(1)}$ under one DCJ operation; and since there are up to $\binom{2n+2}{2}$ possible adjacencies and telomeres, the transition probability from 0 to 1 is $\frac{2}{2n^2+O(n)}$. Thus the transition from 0 to 1 is roughly 2n times less likely than that from 1 to 0. Despite the restrictive assumption that all DCJ operations are equally likely, this result is in line with the observed bias in transitions of adjacencies given by Sankoff and Blanchette [22]: the probability of breaking a given ancestral adjacency is high while that of creating a particular adjacency along several lineages is low (a version of homoplasy for adjacencies). Finally, the encoding adds characters and a transition probability for the presence or absence of each unique gene. Because of duplicated genes, there is no one-to-one correspondence between genomes and the final encodings of multisets of genes, adjacencies, and telomeres. Once we have the binary sequences and transition parameters, we can reconstruct a phylogeny using maximum likelihood. Of the many implementations of this method, we chose RAxML [23] for its speed and its dedicated handling of binary sequences.
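The encoding step can be sketched as follows: one column per adjacency or telomere observed in any input genome, followed by one column per gene. This is a simplified illustration rather than the MLWD code; the helper names, the use of 0 for the extremity, the restriction to linear chromosomes, and the column ordering are all assumptions.

```python
def canonical(a, b):
    """One representative for the equivalent pairs (a, b) and (-b, -a)."""
    return min((a, b), (-b, -a))


def chromosome_adjacencies(genes):
    """Adjacencies and telomeres of a linear chromosome (0 marks the extremity)."""
    path = [0] + list(genes) + [0]
    return {canonical(a, b) for a, b in zip(path, path[1:])}


def encode(genomes):
    """Binary-encode genomes given as {name: [chromosome, ...]}.

    Returns (adjacency columns, gene columns, {name: binary string}).
    Presence is recorded once, so copy numbers of duplicated genes are lost.
    """
    adj_sets = {n: set().union(*(chromosome_adjacencies(c) for c in chroms))
                for n, chroms in genomes.items()}
    gene_sets = {n: {abs(g) for c in chroms for g in c}
                 for n, chroms in genomes.items()}
    adj_columns = sorted(set().union(*adj_sets.values()))
    gene_columns = sorted(set().union(*gene_sets.values()))
    encoded = {}
    for name in genomes:
        bits = [int(a in adj_sets[name]) for a in adj_columns]
        bits += [int(g in gene_sets[name]) for g in gene_columns]
        encoded[name] = "".join(map(str, bits))
    return adj_columns, gene_columns, encoded


genomes = {"A": [[1, 2, 3]], "B": [[1, -2, 3]], "C": [[1, 3]]}
adj_columns, gene_columns, encoded = encode(genomes)
for name, bits in encoded.items():
    print(name, bits)
```

In this toy input, genome C has lost gene 2, so its gene columns read 101 while its adjacency columns record the new adjacency between genes 1 and 3.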
Bootstrap support
A distinct advantage of using sequence encoding is the ability to use the bootstrap method to assess the robustness of the inferred phylogeny. Doing so directly with gene-order data is not possible, because a chromosome with n distinct genes presents a single character (the ordering) with 2^n × n! possible states (the first term is for the strandedness of each gene and the second for the possible permutations in the ordering). This single character is equivalent to an alignment with a single column, albeit one where each character can take any of a huge number of states; we cannot meaningfully resample a single character. The binary encoding effectively maps this single character into a high-dimensional binary vector, so that the standard phylogenetic bootstrap [24] can be used. While the evolution of a specific adjacency depends directly on several others, independence can be assumed if, once an adjacency is broken during evolution, it is not formed again. This is an analog of Dollo parsimony, but one that is very likely in rearrangement data due to the enormous state space [25].
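Once genomes are encoded as binary sequences, a bootstrap replicate is obtained by resampling columns with replacement. The sketch below shows only this resampling step, under the assumption that the encodings are equal-length strings keyed by genome name; the function names and the toy input are hypothetical, and the tree search on each replicate is not shown.

```python
import random


def bootstrap_replicate(encoded, rng):
    """Resample the columns of a binary alignment with replacement."""
    names = list(encoded)
    length = len(encoded[names[0]])
    # Draw `length` column indices; the same column may be drawn repeatedly.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(encoded[name][c] for c in columns) for name in names}


def bootstrap_replicates(encoded, n_replicates=100, seed=0):
    """Generate independent replicates from one seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_replicate(encoded, rng) for _ in range(n_replicates)]


encoded = {"A": "1100111", "B": "1011011", "C": "1000101"}
replicates = bootstrap_replicates(encoded, n_replicates=3)
for replicate in replicates:
    print(replicate)
```

In the standard bootstrap, a tree would be inferred from each replicate and the support of a bipartition reported as the fraction of replicate trees that contain it.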
Ancestral inference
Using the phylogeny thus computed, we then proceed to solve the SPP, now following the approach of Hu et al. [21]. The first step is to estimate ancestral gene contents from the contents of the input genomes. Our inference of ancestral contents relies on viewing genes and adjacencies as independent binary characters, as described for the encoding. Whether or not an ancestral genome contains a gene or an adjacency is determined by the conditional probability of the presence state of the gene or the adjacency, computed by the marginal probabilistic reconstruction method suggested by Yang et al. [26]. If this probability exceeds 50%, we conclude that the gene belongs to the genome. We extend this approach to compute the probability of observing each adjacency. We then reduce the adjacency assembly problem for any given ancestral genome to an instance of the Travelling Salesperson Problem (TSP), by representing genes as vertices and adjacencies as edges, and finally solve the TSP by using Concorde [27].
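The two steps described above, thresholding presence probabilities and assembling adjacencies into chromosomes, can be illustrated with a small sketch. Note that it substitutes a simple greedy selection for the TSP formulation solved with Concorde, so it is not the method used by MLGO; the probabilities are made-up inputs, the function names are hypothetical, and telomeres are ignored.

```python
def select_genes(gene_prob, threshold=0.5):
    """Keep genes whose marginal probability of presence exceeds the threshold."""
    return {g for g, p in gene_prob.items() if p > threshold}


def extremities(a, b):
    """Gene ends joined by adjacency (a, b): the right end of a, the left end of b."""
    right_of_a = (abs(a), "h" if a > 0 else "t")
    left_of_b = (abs(b), "t" if b > 0 else "h")
    return right_of_a, left_of_b


def select_adjacencies(adj_prob, genes, threshold=0.5):
    """Greedy stand-in for the TSP step: accept adjacencies by decreasing
    probability, using each gene end at most once and never closing a cycle."""
    parent = {g: g for g in genes}  # union-find over genes, to detect cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    used, chosen = set(), []
    for (a, b), p in sorted(adj_prob.items(), key=lambda kv: -kv[1]):
        if p <= threshold:
            break
        if abs(a) not in genes or abs(b) not in genes:
            continue
        end_a, end_b = extremities(a, b)
        if end_a in used or end_b in used:
            continue  # conflicts with a more probable adjacency
        root_a, root_b = find(abs(a)), find(abs(b))
        if root_a == root_b:
            continue  # would close a circular chromosome
        parent[root_a] = root_b
        used.update((end_a, end_b))
        chosen.append((a, b))
    return chosen


gene_prob = {1: 0.99, 2: 0.90, 3: 0.70, 4: 0.20}
adj_prob = {(1, 2): 0.95, (2, 3): 0.80, (2, -3): 0.60, (3, 1): 0.55, (1, 3): 0.30}
genes = select_genes(gene_prob)
print(sorted(genes))                        # [1, 2, 3]
print(select_adjacencies(adj_prob, genes))  # [(1, 2), (2, 3)]
```

A greedy pass can miss the best global assembly, which is one reason an exact TSP solver is attractive for this step.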
Results and discussion
MLGO is written in C++ and Perl as a web tool. Figure 1 shows a screenshot of the web interface for MLGO. The input format of the dataset is that used by GRAPPA and MGR: FASTA-like headers for the names of the genomes (> followed by an alphanumeric sequence followed by a newline), with each chromosome represented by a signed permutation of integers ending with a $ symbol and a newline character. Phylogenies are output as trees in Newick format.
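As an illustration of the input format just described, the sketch below parses such a file into a dictionary of genomes. It assumes one chromosome per line and is not the parser used by MLGO; the function name and the sample genomes are hypothetical.

```python
def parse_gene_orders(text):
    """Parse GRAPPA/MGR-style input into {genome name: [chromosome, ...]}."""
    genomes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new genome.
            current = line[1:].strip()
            genomes[current] = []
        else:
            if current is None:
                raise ValueError("chromosome line found before any genome header")
            # Chromosome line: signed integers terminated by a "$" symbol.
            genes = [int(token) for token in line.rstrip("$").split()]
            genomes[current].append(genes)
    return genomes


sample = """>GenomeA
1 2 3 $
4 5 $
>GenomeB
1 -3 -2 $
4 5 $
"""
print(parse_gene_orders(sample))
```

Each genome maps to a list of chromosomes, and each chromosome is a list of signed integers in the order given in the file.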
We used the genomes of 12 fully sequenced Drosophila species to demonstrate the performance of MLGO. Figure 2 shows the consensus phylogeny reconstructed by MLGO with the bootstrap support values obtained using 100 replicates. Compared to the study using sequence data published by Clark et al. [28], all major groups in those 12 Drosophila genomes were correctly identified with strong support (bootstrap value >90), except for one bipartition with moderate support, that between D. simulans, D. sechellia and the rest. The total running time for reconstructing the phylogeny of the 12 Drosophila species is less than 1 minute, while ancestral reconstruction adds less than 30 minutes. We also tested the performance of MLGO on 15 Metazoan genomes from the eGOB (Eukaryotic Gene Order Browser) database [29], and the reconstructed phylogeny shown in Figure 3 is fully consistent with existing studies [30,31].
Conclusion
As whole genomes are sequenced at increasing rates, using gene-order data for phylogenetic analyses and ancestral reconstruction is attracting increasing interest, especially coupled with the recent advances in identifying conserved synteny blocks among multiple species [32–34].
MLGO (Maximum Likelihood for Gene-Order Analysis) is the first web tool for likelihood-based inference of both the phylogeny and ancestral genomes. It provides fast and scalable analyses with bootstrap support of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications.
Availability and requirements
The web tool is available from http://www.geneorder.org/server.php.
Project name: MLGO
Project home page: http://www.geneorder.org/server.php
Operating system(s): Platform independent
Programming language: Perl
Other requirements: None
License: GNU
Restrictions for use by non-academics: None
Endnote
a We use the term “gene” as this is in fact a common form of syntenic blocks, but other kinds of markers could be used.
Acknowledgements
We thank Bernard Moret for helpful discussions. FH and JT were funded by NSF IIS 1161586 and an internal grant from Tianjin University, China. YL was supported by a fellowship of the Swiss National Science Foundation (grant no. 146708). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Footnotes
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
FH implemented the web server. YL contributed to the phylogeny reconstruction part with the help of FH and JT. FH and JT contributed to the ancestral inference part. JT provided advice and oversight of the project. All authors drafted, read and approved the final manuscript.
Contributor Information
Fei Hu, Email: moc.liamg@cyiefuh.
Yu Lin, Email: ude.dscu@082luy.
Jijun Tang, Email: ude.cs.sec@gnatj.
Articles from BMC Bioinformatics are provided here courtesy of BMC
Full text links
Read article at publisher's site: https://doi.org/10.1186/s12859-014-0354-6