Europe PMC

This website requires cookies, and the limited processing of your personal data in order to function. By using the site you are agreeing to this as outlined in our privacy notice and cookie policy.

Abstract 


The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and – in conjunction with recent advances in ancient DNA sequencing technology – studying human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical tress, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each describes the evolutionary history of a consecutive region of the genome. The current corresponding state of-the-art data structure, tree sequences, compresses these genealogical trees via edit operations when moving from one tree to the next along the genome instead of storing the full, often redundant, description for each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straight-forward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit , which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson’s f , the Fixation Index, Tajima’s D , pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art, and is thus up to 990 times faster. In conclusion, our proposed data structure compresses genealogical trees by storing shared subtrees only once, thereby enabling straight-forward memoization of intermediate results, yielding a substantial runtime reduction and a potentially more intuitive data representation over the state-of-the-art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increases scalability to ever-growing genomic datasets.

2012 ACM Subject Classification

Applied computing → Computational genomics; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics; Applied computing → Population genetics

Citations & impact 


This article has not been cited yet.

Impact metrics

Alternative metrics

Altmetric item for https://www.altmetric.com/details/163894014
Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/163894014