Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm

Firtina, Can; Kim, Jeremie S.; Alser, Mohammed; Cali, Damla Senol; Cicek, A. Ercument; Alkan, Can; Mutlu, Onur

doi:10.1093/bioinformatics/btaa179

arXiv:1902.04341 (q-bio)

[Submitted on 12 Feb 2019 (v1), last revised 7 Mar 2020 (this version, v2)]

Abstract: Long reads produced by third-generation sequencing technologies are used to construct an assembly (i.e., the subject's genome), which is further used in downstream genome analysis. Unfortunately, long reads have high sequencing error rates, and a large proportion of the base pairs in these reads are incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing, or fixing, errors in the assembly using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, assembly polishing algorithms can only polish an assembly using reads from a certain sequencing technology, or can only polish a small assembly. Such technology dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms to use all available read sets and 2) split a large genome into small chunks to polish it. We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts.
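
To make the decoding step named in the abstract more concrete, the following is a minimal sketch of textbook Viterbi decoding on a toy two-state HMM. It is not Apollo's implementation: the state names ("M" for match, "I" for insertion), all probabilities, and the `viterbi` function itself are assumptions chosen purely for illustration, and Apollo's actual pHMM structure and training procedure are described in the paper, not here.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model (all numbers are illustrative assumptions).
states = ("M", "I")
start_p = {"M": 0.9, "I": 0.1}
trans_p = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.8, "I": 0.2}}
emit_p = {
    "M": {"A": 0.94, "C": 0.02, "G": 0.02, "T": 0.02},  # match state expects 'A'
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # insertion emits uniformly
}

log_prob, path = viterbi("AAC", states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

In this toy model the unexpected base "C" is best explained by the insertion state, so the decoded path for "AAC" is M, M, I. In a polishing setting the probabilities would presumably come from training on read-to-assembly alignments rather than being set by hand as they are here.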

Comments:	9 pages, 1 figure. Accepted in Bioinformatics
Subjects:	Genomics (q-bio.GN); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Cite as:	arXiv:1902.04341 [q-bio.GN]
	(or arXiv:1902.04341v2 [q-bio.GN] for this version)
	https://doi.org/10.48550/arXiv.1902.04341
Journal reference:	Bioinformatics. 2020 Jun 1;36(12):3669-3679
Related DOI:	https://doi.org/10.1093/bioinformatics/btaa179

Submission history

From: Can Firtina [view email]
[v1] Tue, 12 Feb 2019 11:45:55 UTC (3,690 KB)
[v2] Sat, 7 Mar 2020 23:31:34 UTC (3,824 KB)
