Memoization on Shared Subtrees Accelerates Computations on Genealogical Forests

Hübner L; Stamatakis A

doi:10.1101/2024.05.23.595533


Preprint from bioRxiv, 27 May 2024
https://doi.org/10.1101/2024.05.23.595533 PPR: PPR858318

Preprint

This article is a preprint. It may not have been peer reviewed.

Abstract

The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and – in conjunction with recent advances in ancient DNA sequencing technology – studying human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical trees, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each describes the evolutionary history of a consecutive region of the genome. The current corresponding state-of-the-art data structure, tree sequences, compresses these genealogical trees via edit operations when moving from one tree to the next along the genome instead of storing the full, often redundant, description for each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG, identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straightforward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit, which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson’s f, the Fixation Index, Tajima’s D, pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art, and is thus up to 990 times faster.
In conclusion, our proposed data structure compresses genealogical trees by storing shared subtrees only once, thereby enabling straightforward memoization of intermediate results, yielding a substantial runtime reduction and a potentially more intuitive data representation over the state-of-the-art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increase scalability to ever-growing genomic datasets.
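
The core idea of the abstract, storing identical subtrees once and reusing intermediate results across trees, can be illustrated with a small toy example. The Python sketch below is not gfkit's implementation or API; the class Forest, the function count_derived, and the nested-tuple tree encoding are hypothetical names and choices made only for illustration. It assumes unordered trees given as nested tuples of sample names, merges structurally identical subtrees into one DAG node, and memoizes a per-node count of samples that belong to a given sample set.

```python
from typing import Dict, List, Optional, Set, Tuple


class Forest:
    """Toy DAG: structurally identical subtrees share a single node id.

    Hypothetical illustration only; not the gfkit data structure.
    """

    def __init__(self) -> None:
        self._ids: Dict[tuple, int] = {}          # canonical key -> node id
        self.children: List[Tuple[int, ...]] = []  # node id -> child node ids
        self.leaf_name: List[Optional[str]] = []   # node id -> sample name (leaves only)

    def add(self, tree) -> int:
        """Insert a tree (sample name or tuple of subtrees); return its node id."""
        if isinstance(tree, str):
            kids: Tuple[int, ...] = ()
            key = ("leaf", tree)
        else:
            # Sorting child ids makes the key independent of child order.
            kids = tuple(sorted(self.add(child) for child in tree))
            key = ("inner", kids)
        if key not in self._ids:
            # First time this subtree is seen: create a new DAG node.
            self._ids[key] = len(self.children)
            self.children.append(kids)
            self.leaf_name.append(tree if isinstance(tree, str) else None)
        return self._ids[key]


def count_derived(forest: Forest, node: int, derived: Set[str],
                  memo: Dict[int, int]) -> int:
    """Count samples below `node` that are in `derived`, memoized per DAG node."""
    if node in memo:
        return memo[node]
    name = forest.leaf_name[node]
    if name is not None:
        result = 1 if name in derived else 0
    else:
        result = sum(count_derived(forest, c, derived, memo)
                     for c in forest.children[node])
    memo[node] = result
    return result


# Two trees over the same four samples; they share the subtree ("a", "b").
forest = Forest()
root1 = forest.add((("a", "b"), ("c", "d")))
root2 = forest.add(((("a", "b"), "c"), "d"))

memo: Dict[int, int] = {}  # one memo table reused across both trees
count1 = count_derived(forest, root1, {"a", "c"}, memo)
count2 = count_derived(forest, root2, {"a", "c"}, memo)
```

Stored separately, the two example trees would need 14 nodes in total; the toy forest holds 9, because the four leaves and the subtree ("a", "b") are shared. When the second root is queried, the result for the shared subtree is read from the memo table rather than recomputed, which is the effect the abstract attributes to genealogical forests.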

2012 ACM Subject Classification

Applied computing → Computational genomics; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics; Applied computing → Population genetics

Full text links

Read article at publisher's site: https://doi.org/10.1101/2024.05.23.595533

Read article for free, from open access legal sources, via Unpaywall: https://www.biorxiv.org/content/biorxiv/early/2024/05/27/2024.05.23.595533.full.pdf

