DNA annotation: Difference between revisions
Ira Leviton (talk | contribs) m Fixed PMC parameters in citations. Category:CS1 maint: PMC format |
m Fix typo. |
||
(25 intermediate revisions by 15 users not shown) | |||
Line 1: | Line 1: | ||
{{short description|The process of describing the structure and function of |
{{short description|The process of describing the structure and function of a genome}} |
||
[[File:Porphyra umbilicalis chloroplast genome visualized with Chloroplot.png|thumb|370px|A visualization of ''[[Porphyra umbilicalis]]'' [[chloroplast]] genome annotation ([[GenBank]] accession: [https://www.ncbi.nlm.nih.gov/nuccore/MF385003.1 MF385003.1]) made with [https://irscope.shinyapps.io/chloroplot/ Chloroplot].<ref name="zheng2020">{{cite journal | vauthors = Zheng S, Poczai P, Hyvönen J, Tang J, Amiryousefi A | title = Chloroplot: An Online Program for the Versatile Plotting of Organelle Genomes | journal = Frontiers in Genetics | volume = 11 | issue = 576124 | pages = 576124 | date = 2020 | pmid = 33101394 | doi = 10.3389/fgene.2020.576124 | pmc = 7545089 | doi-access = free }}</ref> The number of genes, the genome length, and the [[GC content]] are placed in the middle black circle. The outer gray circle shows GC content in every section of the genome. All individual genes are placed on the outermost circle according to their position in the genome, their [[transcription (biology)|transcription]] direction and their length; they are color-coded based on the cellular function or component they are part of. Arrows show the transcription direction, which is clockwise for the inner genes and anticlockwise for the outer genes.]] |
|||
{{More citations needed|date=November 2010}} |
|||
In [[molecular biology]] and [[genetics]], '''DNA annotation''' or '''genome annotation''' is the process of describing the structure and function of the components of a [[genome]],<ref name="dominguez2018">{{cite journal | vauthors = Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, Amselem J, Bouri L, Bocs S, Klopp C, Gibrat JF, Vlasova A, Leskosek BL, Soler L, Binzer-Panchal M, Lantz H | display-authors = 6 | title = Ten steps to get started in Genome Assembly and Annotation | journal = F1000Research | volume = 7 | issue = 148 | date = 5 February 2018 | page = 148 | pmid = 29568489 | doi = 10.12688/f1000research.13598.1 | pmc = 5850084 | doi-access = free }}</ref> by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate.<ref name=Stein2001>{{cite journal | vauthors = Stein L | title = Genome annotation: from sequence to biology | journal = Nature Reviews. Genetics | volume = 2 | issue = 7 | pages = 493–503 | date = July 2001 | pmid = 11433356 | doi = 10.1038/35080529 | s2cid = 12044602 }}</ref> Among other things, it identifies the locations of [[gene]]s and all the [[coding region]]s in a genome and determines what those genes do.<ref name="davisdef">{{cite web | vauthors = Davis CP |title=Medical Definition of Genome annotation |url=https://www.medicinenet.com/genome_annotation/definition.htm |website=MedicineNet |access-date=17 April 2023 |archive-url=https://web.archive.org/web/20230209021747/https://www.medicinenet.com/genome_annotation/definition.htm |archive-date=9 February 2023 |date=29 March 2021}}</ref> |
|||
[[File:Porphyra umbilicalis chloroplast genome visualized with Chloroplot.png|thumb|370px|A visualization of ''[[Porphyra umbilicalis]]'' [[chloroplast]] genome annotation ([[GenBank]] accession: [https://www.ncbi.nlm.nih.gov/nuccore/MF385003.1 MF385003.1]) made with [https://irscope.shinyapps.io/chloroplot/ Chloroplot].<ref name="zheng2020">{{cite journal |last1=Zheng |first1=S. |last2=Poczai |first2=P. |last3=Hyvönen |first3=J. |last4=Tang |first4=J. |last5=Amiryousefi |first5=A |title=Chloroplot: An Online Program for the Versatile Plotting of Organelle Genomes |journal=Frontiers in Genetics |date=2020 |volume=11 |issue=576124 |doi=10.3389/fgene.2020.576124 |pmid=33101394 |url=https://www.frontiersin.org/articles/10.3389/fgene.2020.576124/full |access-date=17 April 2023}}</ref> The number of genes, the genome length, and the [[GC content]] are placed in the middle black circle. The outer gray circle shows GC content in every section of the genome. All individual genes are placed on the outermost circle according to their position in the genome, their [[transcription (biology)|transcription]] direction and their length; they are color-coded based on the cellular function or component they are part of. Arrows show the transcription direction, which is clockwise for the inner genes and anticlockwise for the outer genes.]] |
|||
In [[molecular biology]] and [[genetics]], '''DNA annotation''' or '''genome annotation''' is the process of describing the structure and function of the [[DNA]] [[nucleic acid sequence|sequences]] contained in a [[genome]],<ref name="dominguez2018">{{cite journal |last1=Dominguez Del Angel |first1=V. |last2=Hjerde |first2=E. |last3=Sterck |first3=L. |last4=Capella-Gutierrez |first4=S. |last5=Notredame |first5=C. |last6=Vinnere Petterson |first6=O. |last7=Amselem |first7=J. |last8=Bouri |first8=L. |last9=Bocs |first9=S. |last10=Klopp |first10=C. |last11=Gribat |first11=J. F. |last12=Vlasova |first12=A. |last13=Leskosek |first13=B. L. |last14=Soler |first14=L. |last15=Binzer-Panchal |first15=M. |last16=Lantz |first16=H. |title=Ten steps to get started in Genome Assembly and Annotation |journal=F1000Research |date=5 February 2018 |volume=7 |issue=148 |pages=1-25 |doi=10.12688/f1000research.13598.1 |pmid=29568489 |url=https://f1000research.com/articles/7-148 |access-date=17 April 2023}}</ref> by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate.<ref name=Stein2001>{{cite journal |
|||
| last = Stein |
|||
| first = L. |
|||
| year = 2001 |
|||
| title = Genome annotation: from sequence to biology |
|||
| journal = [[Nature Reviews Genetics]] |
|||
| volume = 2 |
|||
| pages = 493–503 |
|||
| doi = 10.1038/35080529 |
|||
| pmid = 11433356 |
|||
| issue = 7 |
|||
| s2cid = 12044602 |
|||
}}</ref> Among other things, it identifies the locations of [[gene]]s and all the [[coding region]]s in a genome and determines what those genes do.<ref name="davisdef">{{cite web |last1=Davis |first1=C. P. |title=Medical Definition of Genome annotation |url=https://www.medicinenet.com/genome_annotation/definition.htm |website=MedicineNet |access-date=17 April 2023 |archive-url=http://web.archive.org/web/20230209021747/https://www.medicinenet.com/genome_annotation/definition.htm |archive-date=9 February 2023 |date=29 March 2021}}</ref> |
|||
Annotation is performed after a genome is [[DNA sequencing|sequenced]] and [[sequence assembly|assembled]], and is a necessary step in genome analysis before the sequence is deposited in a [[List of biological databases#Genome databases|database]] and described in a published article. Although a description of individual genes and their products or functions is sufficient to count as an annotation, the depth of analysis reported in the literature for different genomes varies widely, with some reports including additional information that goes beyond a simple annotation.<ref name="koonin2003">{{cite book | vauthors = Koonin E, Galperin MY |title=Sequence — Evolution — Function |date=2003 |publisher=Springer US |pages=193–226 |edition=1st | doi = 10.1007/978-1-4757-3783-7_6 | isbn = 978-1-4757-3783-7 |chapter=Genome Annotation and Analysis }}</ref> Furthermore, due to the size and complexity of sequenced genomes, DNA annotation is not performed manually, but is instead automated by computational means. However, the conclusions drawn from the obtained results require manual expert analysis.<ref name="mishra2021">{{cite book | vauthors = Mishra P, Maurya R, Avashthi H, Mittal S, Chandra M, Ramteke PW | veditors = Singh DB, Pathak RK |title=Bioinformatics: Methods and Applications |date=2021 |publisher=Elsevier Science |pages=49–66 |edition=1st | doi = 10.1016/B978-0-323-89775-4.00013-4 |chapter=Genome assembly and annotation | isbn = 9780323897754 }}</ref> |
||
DNA annotation is classified into two categories: ''structural annotation'', which identifies and demarcates elements in a genome, and ''functional annotation'', which assigns functions to these elements.<ref name="bright2009">{{cite journal | vauthors = Bright LA, Burgess SC, Chowdhary B, Swiderski CE, McCarthy FM | title = Structural and functional-annotation of an equine whole genome oligoarray | journal = BMC Bioinformatics | volume = 10 | issue = Suppl 11 | pages = S8 | date = October 2009 | pmid = 19811692 | doi = 10.1186/1471-2105-10-S11-S8 | pmc = 3226197 | doi-access = free }}</ref> This is not the only way in which it has been categorized, as several alternatives, such as dimension-based<ref name="reed2006">{{cite journal | vauthors = Reed JL, Famili I, Thiele I, Palsson BO | title = Towards multidimensional genome annotation | journal = Nature Reviews. Genetics | volume = 7 | issue = 2 | pages = 130–141 | date = February 2006 | pmid = 16418748 | doi = 10.1038/nrg1769 | s2cid = 13107786 }}</ref> and level-based classifications,<ref name="Stein2001" /> have also been proposed. |
||
Educational materials on some aspects of biological annotation from the 2006 [[Gene Ontology]] annotation camp and similar events are available at the Gene Ontology website.<ref name="gores">{{cite web|title=GO Teaching Resources|url=http://www.geneontology.org/GO.teaching.resources.shtml|access-date=21 September 2006|archive-url=https://web.archive.org/web/20061010053534/http://www.geneontology.org/GO.teaching.resources.shtml|archive-date=10 October 2006|url-status=dead}}</ref> |
|||
==History== |
||
The first generation of genome annotators used local ''[[ab initio]]'' methods, which are based solely on the information that can be extracted from the DNA sequence on a local scale, that is, one [[open reading frame]] (ORF) at a time.<ref name="abril2019">{{cite book | vauthors = Abril JF, Castellano S | veditors = Ranganathan S, Nakai K, Schonbach C, Gribskov M |title=Encyclopedia of Bioinformatics and Computational Biology |date=2019 |publisher=Elsevier Science |isbn=978-0-12-811432-2 |pages=195–209 |edition=1st |chapter=Genome Annotation | doi = 10.1016/B978-0-12-809633-8.20226-4 | s2cid = 226248103 }}</ref><ref name="tatusova2016">{{cite journal | vauthors = Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J | display-authors = 6 | title = NCBI prokaryotic genome annotation pipeline | journal = Nucleic Acids Research | volume = 44 | issue = 14 | pages = 6614–6624 | date = August 2016 | pmid = 27342282 | doi = 10.1093/nar/gkw569 | pmc = 5001611 }}</ref> They arose from the need to handle the enormous amount of data produced by the [[Maxam–Gilbert sequencing|Maxam-Gilbert]] and [[Sanger sequencing|Sanger]] [[DNA sequencing]] techniques developed in the late 1970s. The first software used to analyze sequencing [[Read (biology)|reads]] was the [[Staden Package]], created by Rodger Staden in 1977.<ref name="staden1977">{{cite journal | vauthors = Staden R | title = Sequence data handling by computer | journal = Nucleic Acids Research | volume = 4 | issue = 11 | pages = 4037–4051 | date = November 1977 | pmid = 593900 | doi = 10.1093/nar/4.11.4037 | pmc = 343220 }}</ref> It performed several tasks related to annotation, such as [[nucleobase|base]] and [[genetic code|codon]] counts. 
In fact, codon usage was the main strategy used by several early [[protein coding sequence]] (CDS) prediction methods,<ref name="staden1982">{{cite journal | vauthors = Staden R, McLachlan AD | title = Codon preference and its use in identifying protein coding regions in long DNA sequences | journal = Nucleic Acids Research | volume = 10 | issue = 1 | pages = 141–156 | date = January 1982 | pmid = 7063399 | doi = 10.1093/nar/10.1.141 | pmc = 326122 }}</ref><ref name="gribskov1984">{{cite journal | vauthors = Gribskov M, Devereux J, Burgess RR | title = The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression | journal = Nucleic Acids Research | volume = 12 | issue = 1 Pt 2 | pages = 539–549 | date = January 1984 | pmid = 6694906 | doi = 10.1093/nar/12.1part2.539 | pmc = 321069 }}</ref><ref name="fickett1996">{{cite journal | vauthors = Fickett JW | title = Finding genes by computer: the state of the art | journal = Trends in Genetics | volume = 12 | issue = 8 | pages = 316–320 | date = August 1996 | pmid = 8783942 | doi = 10.1016/0168-9525(96)10038-X }}</ref> based on the assumption that the most [[translation (biology)|translated]] regions in a genome contain codons with the most abundant corresponding [[tRNA]]s (the molecules responsible for carrying [[amino acid]]s to the [[ribosome]] during protein synthesis) allowing a more efficient translation.<ref name="grosjean1982">{{cite journal | vauthors = Grosjean H, Fiers W | title = Preferential codon usage in prokaryotic genes: the optimal codon-anticodon interaction energy and the selective codon usage in efficiently expressed genes | journal = Gene | volume = 18 | issue = 3 | pages = 199–209 | date = June 1982 | pmid = 6751939 | doi = 10.1016/0378-1119(82)90157-3 }}</ref> This was also known to be the case for [[codon usage bias|synonymous codon]]s, which are often present in proteins expressed at a lower level.<ref name="gribskov1984" /><ref 
name="grantham1980">{{cite journal | vauthors = Grantham R, Gautier C, Gouy M, Mercier R, Pavé A | title = Codon catalog usage and the genome hypothesis | journal = Nucleic Acids Research | volume = 8 | issue = 1 | pages = r49–r62 | date = January 1980 | pmid = 6986610 | doi = 10.1093/nar/8.1.197-c | pmc = 327256 }}</ref> |
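The base counts, codon counts and codon-preference idea described above can be illustrated with a minimal Python sketch. It is not the Staden Package's actual code; the function names and the set of "preferred" codons are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_bases(seq: str) -> Counter:
    """Count how often each base occurs in the sequence."""
    return Counter(seq.upper())


def count_codons(seq: str, frame: int = 0) -> Counter:
    """Count complete, non-overlapping codons in the given reading frame (0, 1 or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def preferred_fraction(seq: str, frame: int, preferred: set) -> float:
    """Fraction of codons in this frame that belong to the preferred set."""
    codons = count_codons(seq, frame)
    total = sum(codons.values())
    if total == 0:
        return 0.0
    return sum(n for codon, n in codons.items() if codon in preferred) / total


def best_frame(seq: str, preferred: set) -> int:
    """Return the forward reading frame richest in preferred codons."""
    return max(range(3), key=lambda f: preferred_fraction(seq, f, preferred))


# Hypothetical set of codons assumed to be preferred in highly expressed genes.
PREFERRED = {"ATG", "GCT"}
example = "ATGGCTGCTGCTTAA"
print(count_bases(example))
print(count_codons(example))
print(best_frame(example, PREFERRED))
```

A real codon-preference method would estimate the preferred codons from a reference set of genes and score sliding windows in all six frames; the sketch only compares the three forward frames of one short sequence.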
||
The advent of complete genomes in the 1990s (the first one being the genome of ''[[Haemophilus influenzae]]'' sequenced in 1995) introduced a second generation of annotators. Just like in the previous generation, they performed annotation through ''ab initio'' methods, but now applied on a genome-wide scale.<ref name="abril2019" /><ref name="tatusova2016" /> [[Markov model]]s are the driving force behind many algorithms used within annotators of this generation;<ref name="lukashin1998">{{cite journal | vauthors = Lukashin AV, Borodovsky M | title = GeneMark.hmm: new solutions for gene finding | journal = Nucleic Acids Research | volume = 26 | issue = 4 | pages = 1107–1115 | date = February 1998 | pmid = 9461475 | doi = 10.1093/nar/26.4.1107 | pmc = 147337 }}</ref><ref name="salzberg1998">{{cite journal | vauthors = Salzberg SL, Delcher AL, Kasif S, White O | title = Microbial gene identification using interpolated Markov models | journal = Nucleic Acids Research | volume = 26 | issue = 2 | pages = 544–548 | date = January 1998 | pmid = 9421513 | doi = 10.1093/nar/26.2.544 | pmc = 147303 }}</ref> these models can be thought of as [[directed graph]]s where nodes represent different genomic signals (such as [[transcription (biology)|transcription]] and [[translation (biology)|translation]] start sites) connected by arrows representing the scanning of the sequence. 
To ensure a Markov model detects a genomic signal, it must first be trained on a series of known genomic signals.<ref name="soh2012">{{cite book | vauthors = Soh J, Gordon PM, Sensen CW |title=Genome Annotation |date=4 September 2012 |publisher=Chapman and Hall/CRC |location=New York |doi=10.1201/b12682 |isbn=9780429064012 |url=https://www.taylorfrancis.com/books/mono/10.1201/b12682/genome-annotation-jung-soh-christoph-sensen-paul-gordon |access-date=18 April 2023 |archive-url=https://web.archive.org/web/20230418022925/https://www.taylorfrancis.com/books/mono/10.1201/b12682/genome-annotation-jung-soh-christoph-sensen-paul-gordon |archive-date=18 April 2023 }}</ref> The output of Markov models in the context of annotation includes the probabilities of every kind of genomic element in every single part of the genome, and an accurate Markov model will assign high probabilities to correct annotations and low probabilities to the incorrect ones.<ref name="brent2005">{{cite journal | vauthors = Brent MR | title = Genome annotation past, present, and future: how to define an ORF at each locus | journal = Genome Research | volume = 15 | issue = 12 | pages = 1777–1786 | date = December 2005 | pmid = 16339376 | doi = 10.1101/gr.3866105 | doi-access = free }}</ref> |
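The train-then-score idea can be sketched with a first-order Markov chain in Python. This is a deliberate simplification, not the hidden or interpolated Markov models used by GeneMark.hmm or Glimmer; the function names, the pseudocount and the toy training sequences are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along the sequence."""
    seq = seq.upper()
    return sum(
        math.log(model[a][b])
        for a, b in zip(seq, seq[1:])
        if a in model and b in model[a]
    )


def log_odds(seq, signal_model, background_model):
    """Positive values mean the sequence looks more like the signal than the background."""
    return log_likelihood(seq, signal_model) - log_likelihood(seq, background_model)


# Toy training data (made up): "signal" examples versus "background" examples.
signal = train_markov(["GCGCGCGCGC", "GCGCGGCGCG"])
background = train_markov(["ATATATATAT", "TTAATTAATA"])
print(log_odds("GCGCGC", signal, background))
print(log_odds("ATATAT", signal, background))
```

A model trained this way assigns a higher score to sequences that resemble its training examples, which is the behaviour the paragraph above describes for an accurate model.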
||
[[File:Genome Annotation Timeline.svg|thumb|600px|A release timeline of genome annotators. The dotted boxes indicate the four different generations of genome annotators and their most representative characteristics: the first generation (blue), in which annotators used ''ab initio'' methods at a local scale; the second generation (red), with genome-wide ''ab initio'' methods; the third generation (green), characterized by a combination of ''ab initio'' methods and homology-based annotation; and the fourth generation (orange), in which the identification of non-coding regions of DNA and population-level studies represented by the pangenome began.]] |
|||
As more sequenced genomes became available in the early and mid-2000s, coupled with the numerous protein sequences that were obtained experimentally, genome annotators began employing homology-based methods, launching the third generation of genome annotation. These new methods allowed annotators not only to infer genomic elements through statistical means (as in previous generations) but also to compare the sequence being annotated with other existing, validated sequences. These so-called combiner annotators, which perform both ''ab initio'' and homology-based annotation, require fast [[sequence alignment|alignment]] algorithms to identify regions of [[sequence homology|homology]].<ref name="dominguez2018" /><ref name="abril2019" /><ref name="tatusova2016" /> |
||
In the late 2000s, genome annotation shifted its attention towards identifying [[non-coding region]]s in DNA, which was achieved thanks to the appearance of methods to analyze [[DNA binding site|transcription factor binding sites]], [[DNA methylation]] sites, [[chromatin]] structure, and other [[RNA]] and [[regulatory region]] analysis techniques. Other genome annotators also began to focus on population-level studies represented by the [[pangenome]]; by doing so, for instance, annotation pipelines ensure that core genes of a [[clade]] are also found in new genomes of the same clade. Both annotation strategies constitute the fourth generation of genome annotators.<ref name="abril2019" /><ref name="tatusova2016" /> |
||
By the 2010s, the genome sequences of more than a thousand human individuals (through the [[1000 Genomes Project]]) and several [[model organisms]] had become available. As such, genome annotation remains a major challenge for scientists investigating the [[human genome|human]] and other genomes.<ref name='encode2012plosGuide'>{{cite journal | vauthors = ((ENCODE Project Consortium)) | title = A user's guide to the encyclopedia of DNA elements (ENCODE) | journal = PLOS Biology | volume = 9 | issue = 4 | pages = e1001046 | date = April 2011 | pmid = 21526222 | pmc = 3079585 | doi = 10.1371/journal.pbio.1001046 | editor-link = Peter Becker (biologist) | veditors = Becker PB | doi-access = free }} {{open access}}</ref><ref name='1001genomes2012'>{{cite journal | vauthors = Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA | display-authors = 6 | title = An integrated map of genetic variation from 1,092 human genomes | journal = Nature | volume = 491 | issue = 7422 | pages = 56–65 | date = November 2012 | pmid = 23128226 | pmc = 3498066 | doi = 10.1038/nature11632 | bibcode = 2012Natur.491...56T }}</ref> |
||
| edition = 2nd |
|||
| publisher = Wiley-Blackwell |
|||
| isbn = 9780470085851 |
|||
| last = Pevsner |
|||
| first = Jonathan |
|||
| title = Bioinformatics and functional genomics |
|||
| location = Hoboken, N.J |
|||
| year = 2009 |
|||
| url-access = registration |
|||
| url = https://archive.org/details/bioinformaticsfu00pevs_0 |
|||
}}</ref> Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts "fit together".<ref name='encode2012'>{{Cite journal | last1 = Dunham | first1 = I. | last2 = Bernstein | first2 = A. | last3 = Birney | first3 = S. F. | last4 = Dunham | first4 = P. J. | last5 = Green | first5 = C. A. | last6 = Gunter | first6 = F. | last7 = Snyder | first7 = C. B. | last8 = Frietze | first8 = S. | last9 = Harrow | first9 = J. | last10 = Kaul | doi = 10.1038/nature11247 | first10 = R. | last11 = Khatun | first11 = J. | last12 = Lajoie | first12 = B. R. | last13 = Landt | first13 = S. G. | last14 = Lee | first14 = B. K. | last15 = Pauli | first15 = F. | last16 = Rosenbloom | first16 = K. R. | last17 = Sabo | first17 = P. | last18 = Safi | first18 = A. | last19 = Sanyal | first19 = A. | last20 = Shoresh | first20 = N. | last21 = Simon | first21 = J. M. | last22 = Song | first22 = L. | last23 = Trinklein | first23 = N. D. | last24 = Altshuler | first24 = R. C. | last25 = Birney | first25 = E. | last26 = Brown | first26 = J. B. | last27 = Cheng | first27 = C. | last28 = Djebali | first28 = S. | last29 = Dong | first29 = X. | last30 = Dunham | first30 = I. | display-authors = 29 | title = An integrated encyclopedia of DNA elements in the human genome | journal = Nature | volume = 489 | issue = 7414 | pages = 57–74 | year = 2012 | pmid = 22955616| pmc = 3439153| bibcode = 2012Natur.489...57T }}</ref> |
|||
==Structural annotation== |
||
[[File:Structural Annotation Flowchart.svg|thumb|350px|Generalized flowchart of a structural genome annotation pipeline. First, the [[repeated sequence|repetitive regions]] of an [[genome assembly|assembled]] [[genome]] are masked by using a repeat library. Then, optionally, the masked sequence is aligned with all the available evidence ([[expressed sequence tag|EST]]s, [[RNA]]s, and [[protein]]s) of the organism being annotated. In [[eukaryotic]] genomes, [[RNA splicing|splice sites]] must be identified. Finally, the [[coding sequence|coding]] and [[noncoding DNA|noncoding]] sequences contained in the genome are predicted with the help of databases of known DNA, RNA and protein sequences, as well as other supporting information.]] |
|||
Structural annotation describes the precise location of the different elements in a genome, such as [[open reading frame]]s (ORFs), [[coding sequence]]s (CDS), [[exon]]s, [[intron]]s, [[Repeated sequence (DNA)|repeat]]s, [[RNA splicing|splice sites]], [[Regulatory sequence|regulatory motifs]], [[start codon|start]] and [[stop codon|stop]] [[genetic code|codons]], and [[promoter (genetics)|promoters]].<ref name="mishra2021" /><ref name="kahl2015">{{cite book |last1=Kahl |first1=Günter |title=The dictionary of genomics, transcriptomics and proteomics |date=2015 |publisher=Wiley |location=Weinheim |isbn=9783527678679 |edition=Fifth |url=https://onlinelibrary.wiley.com/doi/book/10.1002/9783527678679 |access-date=24 April 2023 |archive-url=https://web.archive.org/web/20220804080320/https://onlinelibrary.wiley.com/doi/book/10.1002/9783527678679 |archive-date=4 August 2022}}</ref> The main steps of structural annotation are: |
|||
Structural annotation describes the precise location of the different elements in a genome, such as [[open reading frame]]s (ORFs), [[coding sequence]]s (CDS), [[exon]]s, [[intron]]s, [[Repeated sequence (DNA)|repeat]]s, [[RNA splicing|splice sites]], [[Regulatory sequence|regulatory motifs]], [[start codon|start]] and [[stop codon|stop]] [[genetic code|codons]], and [[promoter (genetics)|promoters]].<ref name="mishra2021" /><ref name="kahl2015">{{cite book | vauthors = Kahl G |title=The dictionary of genomics, transcriptomics and proteomics |date=2015 |publisher=Wiley |location=Weinheim |doi=10.1002/9783527678679 |isbn=9783527678679 |edition=Fifth |url=https://onlinelibrary.wiley.com/doi/book/10.1002/9783527678679 |access-date=24 April 2023 |archive-url=https://web.archive.org/web/20220804080320/https://onlinelibrary.wiley.com/doi/book/10.1002/9783527678679 |archive-date=4 August 2022}}</ref> The main steps of structural annotation are: |
|||
# Repeat identification and masking. |
||
Line 55: | Line 32: | ||
===Repeat identification and masking=== |
||
The first step of structural annotation is the identification and masking of [[repeated sequence (DNA)|repeat]]s, which include low-complexity sequences (such as AGAGAGAG, or homopolymeric segments like TTTTTTTTT) and [[transposon]]s (which are larger elements with several copies across the genome).<ref name="dominguez2018" /><ref name="yandell2012"/> Repeats are a major component of both prokaryotic and eukaryotic genomes; for instance, repeats make up between 0% and over 42% of prokaryotic genomes<ref name="treangen2009">{{cite journal | vauthors = Treangen TJ, Abraham AL, Touchon M, Rocha EP | title = Genesis, effects and fates of repeats in prokaryotic genomes | journal = FEMS Microbiology Reviews | volume = 33 | issue = 3 | pages = 539–571 | date = May 2009 | pmid = 19396957 | doi = 10.1111/j.1574-6976.2009.00169.x | doi-access = free }}</ref> and three quarters of the [[human genome]] is composed of repetitive elements.<ref name="liehr2021">{{cite journal | vauthors = Liehr T | title = Repetitive Elements in Humans | journal = International Journal of Molecular Sciences | volume = 22 | issue = 4 | date = February 2021 | page = 2072 | pmid = 33669810 | pmc = 7922087 | doi = 10.3390/ijms22042072 | doi-access = free }}</ref> |
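As an illustration of masking, the following Python sketch hard-masks simple low-complexity runs such as the AGAGAGAG and TTTTTTTTT examples above by replacing them with <code>N</code>. It is a toy regular-expression approach with assumed parameters (<code>min_copies</code>, <code>max_unit</code>), not the algorithm of any particular masking tool, and it does not detect transposons.

```python
import re


def mask_low_complexity(seq: str, min_copies: int = 4, max_unit: int = 2,
                        mask_char: str = "N") -> str:
    """Replace tandem runs of a short unit (1..max_unit bases) with mask_char.

    A run is masked when the unit occurs at least min_copies times in a row.
    """
    # Lazy unit match followed by (min_copies - 1) or more back-referenced copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq.upper())


def masked_fraction(masked_seq: str, mask_char: str = "N") -> float:
    """Fraction of positions that were masked."""
    return masked_seq.count(mask_char) / len(masked_seq) if masked_seq else 0.0


raw = "CCGA" + "TTTTTTTTT" + "GCA" + "AGAGAGAG" + "TC"
masked = mask_low_complexity(raw)
print(masked)
print(masked_fraction(masked))
```

Masking keeps the sequence length unchanged, so coordinates of downstream gene predictions still refer to the original assembly.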
||
Identifying repeats is difficult for two main reasons: they are poorly conserved, and their boundaries are not clearly defined. Because of this, repeat libraries must be built for the genome of interest, which can be accomplished with one of the following methods:<ref name="yandell2012"/><ref name="berman2007">{{cite journal | vauthors = Bergman CM, Quesneville H | title = Discovering and detecting transposable elements in genome sequences | journal = Briefings in Bioinformatics | volume = 8 | issue = 6 | pages = 382–392 | date = November 2007 | pmid = 17932080 | doi = 10.1093/bib/bbm048 | doi-access = free }}</ref> |
||
* '''''De novo'' methods'''. Repeats are identified by detecting and grouping pairs of sequences at different locations whose similarity is above a minimum threshold of [[conserved sequence|sequence conservation]] in a self-genome comparison, thus requiring no prior information about repeat structure or sequences. The disadvantage of these methods is that they can identify any repeated sequence, not just transposons, and may include conserved [[coding sequence]]s (CDS), making careful post-processing an indispensable step to remove these sequences. They may also leave out related regions that have degraded over time and may group elements that have no connection in their evolutionary history.<ref name="alexander2010">{{cite journal | vauthors = Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB | title = Annotating non-coding regions of the genome | journal = Nature Reviews. Genetics | volume = 11 | issue = 8 | pages = 559–571 | date = August 2010 | pmid = 20628352 | doi = 10.1038/nrg2814 | s2cid = 6617359 }}</ref> |
||
* '''Homology-based methods'''. Repeats are identified by similarity ([[sequence homology|homology]]) of known repeats stored in a curated database. These methods are more likely to find real transposons, even in lower quantities, when compared with ''de novo'' methods, but are biased towards previously identified families. |
||
* '''Structure-based methods'''. Repeats are identified based on models of their structure, rather than repetition or similarity. They are capable of identifying real transposons (just like the homology-based ones), but are not biased by known elements. However, they are highly specific to each class of repeat, and, as such, are less universally applicable. |
||
* '''Comparative genomic methods'''. Repeats are identified as disruptions of one or more sequences in a [[multiple sequence alignment]] produced by large [[insertion (genetics)|insertion]] regions. Although this strategy avoids the problem of poorly defined boundaries that exists in other methods, it is highly dependent on assembly quality and the level of activity of transposons in the genomes in question. |
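The self-comparison idea behind ''de novo'' methods can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the algorithm of any particular tool: it only groups exact k-mer matches (a 100% identity threshold), whereas real tools tolerate mismatches and gaps. The function name <code>repeated_kmers</code> and the parameters <code>k</code> and <code>min_copies</code> are hypothetical.

```python
from collections import defaultdict


def repeated_kmers(genome: str, k: int = 8, min_copies: int = 2) -> dict:
    """Return every k-mer that occurs at min_copies or more positions.

    Simplifying assumption: only exact matches are grouped, on the forward
    strand only. No prior repeat library is needed, which is the defining
    property of de novo approaches.
    """
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    # Keep only k-mers seen at several distinct locations.
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_copies}


if __name__ == "__main__":
    # "GATTACA" appears twice, separated by "CC".
    print(repeated_kmers("GATTACACCGATTACA", k=6))
```

As the list above notes, such an approach reports any repeated sequence, including conserved coding sequences, so its output would still need filtering.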
||
After the repetitive regions in a genome have been identified, they are masked. ''Masking'' means replacing the letters of the [[nucleotide]]s (A, C, G, or T) with other letters. By doing so, these regions will be marked as repetitive and downstream analyses will treat them accordingly. Repetitive regions may produce performance issues if they are not masked, and may even produce false evidence for gene annotation (for example, treating an [[open reading frame]] (ORF) in a transposon as an [[exon]]).<ref name="yandell2012" /> Depending on the letters used for replacement, masking can be classified as soft or hard: in ''soft masking'', repetitive regions are indicated with lowercase letters (a, c, g, or t), whereas in ''hard masking'', the letters of these regions are replaced with N's. For example, soft masking can be used to exclude word matches and avoid initiating an [[sequence alignment|alignment]] in those regions, while hard masking does the same and can additionally exclude masked regions from alignment scores.<ref name="edgar2010">{{cite journal | vauthors = Edgar RC | title = Search and clustering orders of magnitude faster than BLAST | journal = Bioinformatics | volume = 26 | issue = 19 | pages = 2460–2461 | date = October 2010 | pmid = 20709691 | doi = 10.1093/bioinformatics/btq461 | doi-access = free }}</ref><ref name="edgarnd">{{cite web | vauthors = Edgar R |title=Sequence masking |url=https://drive5.com/usearch/manual/masking.html |website=drive5.com |access-date=25 April 2023 |archive-url=https://web.archive.org/web/20200203200125/https://drive5.com/usearch/manual/masking.html |archive-date=3 February 2020 |language=en}}</ref> |
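The difference between soft and hard masking can be shown with a short Python sketch. The function names and the interval convention (0-based, end-exclusive) are assumptions made for illustration and do not correspond to any specific masking program.

```python
def soft_mask(seq: str, intervals) -> str:
    """Lowercase the bases inside each (start, end) interval (0-based, end-exclusive)."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq: str, intervals) -> str:
    """Replace the bases inside each (start, end) interval with 'N'."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


if __name__ == "__main__":
    repeats = [(2, 5)]  # hypothetical repeat coordinates
    print(soft_mask("ACGTACGT", repeats))  # sequence is still recoverable
    print(hard_mask("ACGTACGT", repeats))  # original bases are discarded
```

Soft masking is reversible because the original bases are preserved in lowercase, whereas hard masking discards them.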
||
===Evidence alignment=== |
||
The next step after genome masking usually involves aligning all available transcript and protein evidence with the analyzed genome, that is, aligning all known [[expressed sequence tag]]s (ESTs), [[RNA]]s and [[protein]]s of the organism being annotated with the genome.<ref name="ejigu2020"/> Although it is optional, it can improve gene sequence elucidation because RNAs and proteins are direct products of coding sequences.<ref name="soh2012" /> |
||
If [[RNA-Seq]] data is available, it may be used to annotate and quantify all of the genes and their [[protein isoform|isoforms]] located in the corresponding genome, providing not only their locations, but also their rates of expression.<ref name="garber2011">{{cite journal | vauthors = Garber M, Grabherr MG, Guttman M, Trapnell C | title = Computational methods for transcriptome annotation and quantification using RNA-seq | journal = Nature Methods | volume = 8 | issue = 6 | pages = 469–477 | date = June 2011 | pmid = 21623353 | doi = 10.1038/nmeth.1613 | s2cid = 205419756 }}</ref> However, transcripts provide insufficient information for gene prediction because they might be unobtainable from some genes, they may encode [[operon]]s of more than one gene, and their start and stop codons cannot be determined due to [[ribosomal frameshift|frameshifts]] and [[initiation factor|translation initiation factors]].<ref name="soh2012" /> To solve this problem, [[proteogenomics]] based approaches are employed, which utilize information from expressed proteins often derived from [[mass spectrometry]].<ref name="Gupta07">{{cite journal | vauthors = Gupta N, Tanner S, Jaitly N, Adkins JN, Lipton M, Edwards R, Romine M, Osterman A, Bafna V, Smith RD, Pevzner PA | display-authors = 6 | title = Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation | journal = Genome Research | volume = 17 | issue = 9 | pages = 1362–1377 | date = September 2007 | pmid = 17690205 | pmc = 1950905 | doi = 10.1101/gr.6427907 }}</ref> |
||
===Splice identification=== |
||
Annotation of [[eukaryotic]] genomes has an extra layer of difficulty due to [[RNA splicing]], a [[post-transcriptional modification|post-transcriptional process]] in which [[intron]]s (non-coding regions) are removed and [[exon]]s (coding regions) are joined.<ref name="kahl2015" /> Therefore, eukaryotic [[coding sequence]]s (CDS) are discontinuous, and, to ensure their proper identification, intronic regions must be filtered out. To do so, annotation pipelines must find the exon-intron boundaries, and multiple methodologies have been developed for this purpose. One solution is to use known exon boundaries for alignment; for instance, many introns begin with GT and end with AG.<ref name="ejigu2020"/> This approach, however, cannot detect novel boundaries, so alternatives like [[machine learning]] algorithms exist that are trained on known exon boundaries and [[phred quality score|quality information]] to predict new ones.<ref name="debona2008">{{cite journal | vauthors = De Bona F, Ossowski S, Schneeberger K, Rätsch G | title = Optimal spliced alignments of short sequence reads | journal = Bioinformatics | volume = 24 | issue = 16 | pages = i174–i180 | date = August 2008 | pmid = 18689821 | doi = 10.1093/bioinformatics/btn300 | doi-access = free }}</ref> Predictors of new exon boundaries usually require efficient data-compression and alignment algorithms, but they are prone to failure at boundaries located in regions with low [[coverage (genetics)|sequence coverage]] or high error rates produced during sequencing.<ref name="trapnell2009">{{cite journal | vauthors = Trapnell C, Pachter L, Salzberg SL | title = TopHat: discovering splice junctions with RNA-Seq | journal = Bioinformatics | volume = 25 | issue = 9 | pages = 1105–1111 | date = May 2009 | pmid = 19289445 | pmc = 2672628 | doi = 10.1093/bioinformatics/btp120 }}</ref><ref name="krizanovic2018">{{cite journal | vauthors = Križanovic K, Echchiki A, Roux J, Šikic M | title = Evaluation of tools for long read RNA-seq splice-aware alignment | journal = Bioinformatics | volume = 34 | issue = 5 | pages = 748–754 | date = March 2018 | pmid = 29069314 | pmc = 6192213 | doi = 10.1093/bioinformatics/btx668 }}</ref> |
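The GT...AG rule mentioned above can be turned into a naive candidate-intron scan, sketched here in Python. This is an illustration only: the function name and the <code>min_len</code> and <code>max_len</code> bounds are hypothetical, and real splice-site predictors score much richer signals than these two dinucleotides.

```python
def candidate_introns(seq: str, min_len: int = 4, max_len: int = 50):
    """List (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive. Every GT/AG pairing within
    the length bounds is reported, so most hits on real data would be
    false positives that need further evidence.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]


if __name__ == "__main__":
    gene = "ATGCCGTAAAAAGCCTAA"  # toy sequence with one GT...AG span
    for start, end in candidate_introns(gene):
        print(start, end, gene[start:end], "spliced:", gene[:start] + gene[end:])
```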
||
===Feature prediction=== |
||
{{further|Gene prediction}} |
||
A genome is divided into [[coding sequence|coding]] and [[noncoding DNA|noncoding]] regions, and the last step of structural annotation consists of identifying these features within the genome. In fact, the primary task in genome annotation is [[gene prediction]], which is why numerous methods have been developed for this purpose.<ref name="soh2012" /> Gene prediction is a misleading term, as most gene predictors only identify [[coding sequence]]s (CDS) and do not report [[untranslated region]]s (UTRs); for this reason, CDS prediction has been proposed as a more accurate term.<ref name="yandell2012" /> CDS predictors detect genome features through methods called ''sensors'', which include ''signal sensors'' that identify functional site signals such as [[promoter (genetics)|promoters]] and [[polyadenylation|polyA sites]], and ''content sensors'' that classify DNA sequences into coding and noncoding content.<ref name="mchardy2017">{{cite book | vauthors = McHardy AC, Kloetgen A | title = Bioinformatics | chapter = Finding Genes in Genome Sequence | veditors = Keith JM | series = Methods in Molecular Biology |date=2017 | volume = 1525 |publisher=Springer |location=New York |isbn=978-1-4939-6622-6 |edition=Second | pages=271–291 | doi = 10.1007/978-1-4939-6622-6_11 | pmid = 27896725 }}</ref> Whereas [[prokaryotic]] CDS predictors mostly deal with [[open reading frames]] (ORFs), which are segments of DNA between the [[start codon|start]] and [[stop codon|stop]] [[genetic code|codons]], [[eukaryotic]] CDS predictors are faced with a more difficult problem because of the complex organization of eukaryotic genes.<ref name="Stein2001" /> CDS prediction methods can be classified into three broad categories:<ref name="dominguez2018" /><ref name="ejigu2020" /> |
|||
* '''''Ab initio'' methods''' (also called statistical, intrinsic, or de novo). CDS prediction is based solely on the information that can be extracted from the DNA sequence. They rely on statistical methods such as the [[hidden Markov model]] (HMM). Some methods employ two or more genomes to infer local mutation rates and patterns along the genome.<ref name="brentguigo2004">{{cite journal | vauthors = Brent MR, Guigó R | title = Recent advances in gene structure prediction | journal = Current Opinion in Structural Biology | volume = 14 | issue = 3 | pages = 264–272 | date = June 2004 | pmid = 15193305 | doi = 10.1016/j.sbi.2004.05.007 }}</ref> |
||
* '''Homology-based methods''' (also called empirical, evidence-driven, or extrinsic). CDS prediction is based on similarity to known sequences. Specifically, it performs alignments of the analyzed sequence with [[expressed sequence tag]]s (ESTs), [[complementary DNA]] (cDNA), or [[protein]] sequences. |
||
* '''Combiners'''. CDS prediction is done by a combination of both methods mentioned above. |
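For the prokaryotic case described above, where an ORF is a segment between a start and a stop codon, a minimal Python sketch of an ORF scan might look as follows. It is an assumption-laden illustration rather than a real gene predictor: it scans only the forward strand, accepts only ATG as a start codon, uses the standard stop codons, and the name <code>find_orfs</code> and the parameter <code>min_codons</code> are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end) spans of ATG...stop ORFs in the three forward frames.

    Coordinates are 0-based and end-exclusive, and the stop codon is
    included in the span. The reverse strand is not scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

A scan of this kind behaves like a crude combination of a signal sensor (start and stop codons) with a length filter; ''ab initio'' predictors replace it with statistical models such as HMMs.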
||
==Functional annotation== |
|||
Whereas [[prokaryotic]] CDS predictors mostly deal with [[open reading frames]] (ORFs), which are segments of DNA between the [[start codon|start]] and [[stop codon|stop]] [[genetic code|codons]],<ref name="Stein2001" /> [[eukaryotic]] CDS predictors are faced with a more difficult problem because of the complex organization of eukaryotic genes. To perform their task, eukaryotic CDS predictors employ methods called sensors that detect specific elements in the genome. There are two types of sensors:<ref name="mchardy2017">{{cite book |last1=McHardy |first1=A. C. |last2=Kloetgen |first2=A. |editor1-last=Keith |editor1-first=J. M. |title=Bioinformatics. Volume I, Data, sequence analysis and evolution |date=2017 |publisher=Springer |location=New York |isbn=978-1-4939-6622-6 |edition=Second |url=https://link.springer.com/book/10.1007/978-1-4939-6622-6 | pages=271-291 |archive-url=https://web.archive.org/web/20220121035511/https://link.springer.com/book/10.1007/978-1-4939-6622-6 |archive-date=21 January 2022 |chapter=Finding Genes in Genome Sequence}}</ref> |
|||
Functional annotation assigns functions to the genomic elements found by structural annotation,<ref name="bright2009" /> by relating them to biological processes such as the [[cell cycle]], [[cell death]], [[developmental biology|development]], [[metabolism]], etc.<ref name="Stein2001" /> It may also be used as an additional quality check by identifying elements that may have been annotated by error.<ref name="dominguez2018" /> |
|||
===Coding sequence function prediction=== |
|||
* '''Signal sensors'''. Identify functional site signals such as [[promoter (genetics)|promoters]] and [[polyadenylation|polyA sites]]. |
|||
* '''Content sensors'''. Classify DNA sequences into coding and noncoding content. |
|||
{{see also|Gene Ontology|Protein function prediction}} |
|||
[[File:Matrilin Complex GO ancestor chart.jpg|thumb|350px|An example [[Gene Ontology]] (GO) ancestor chart organized as a [[directed acyclic graph]] taken from [https://www.ebi.ac.uk/QuickGO/term/GO:0120216 QuickGO].<ref name="binns2009">{{cite journal | vauthors = Binns D, Dimmer E, Huntley R, Barrell D, O'Donovan C, Apweiler R | title = QuickGO: a web-based tool for Gene Ontology searching | journal = Bioinformatics | volume = 25 | issue = 22 | pages = 3045–3046 | date = November 2009 | pmid = 19744993 | pmc = 2773257 | doi = 10.1093/bioinformatics/btp536 }}</ref> It shows the molecular functions, biological processes, and cellular components in which the [[matrilin|matrilin complex]], a component of the [[extracellular matrix]], is involved. Every box is an ontology term that falls into one of the three GO categories and is color-coded respectively. Ontology terms are related to each other through specific qualifiers (such as "is a", "part of", etc.), which are represented by different kinds of arrows.]] |
|||
[[noncoding DNA|Noncoding sequence]]s (ncDNA) are those that do not code for proteins. The annotation of some of them (such as repeats and introns) is done in previous stages of the process (as outlined above). Other noncoding sequences, such as pseudogenes, segmental duplications, binding sites and RNA genes, require special procedures for their identification.<ref name="alexander2010" /> |
|||
Functional annotation of genes requires a controlled vocabulary (or ontology) to name the predicted functional features. However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups. As such, a standardized controlled vocabulary must be employed, the most comprehensive of which is the [[Gene Ontology]] (GO). It classifies functional properties into one of three categories (molecular function, biological process, and cellular component) and organizes them in a [[directed acyclic graph]], in which every node is a particular function, and every edge (or arrow) between two nodes indicates a parent-child or subcategory-category relationship.<ref name="vu2021">{{cite journal | vauthors = Vu TT, Jung J | title = Protein function prediction with gene ontology: from traditional to deep learning models | journal = PeerJ | volume = 9 | pages = e12019 | date = 2021 | pmid = 34513334 | pmc = 8395570 | doi = 10.7717/peerj.12019 | doi-access = free }}</ref><ref name="saxena2021" /> As of 2020, GO is the most widely used controlled vocabulary for functional annotation of genes, followed by the MIPS Functional Catalog (FunCat).<ref name="zhao2020">{{cite journal | vauthors = Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G | title = A Literature Review of Gene Function Prediction by Modeling Gene Ontology | journal = Frontiers in Genetics | volume = 11 | pages = 400 | date = 2020 | pmid = 32391061 | pmc = 7193026 | doi = 10.3389/fgene.2020.00400 | doi-access = free }}</ref> |
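Because the ontology is a directed acyclic graph, annotating an element with a term implicitly associates it with all of that term's ancestors. The Python sketch below shows this traversal on a toy graph; the term labels are hypothetical placeholders, not real GO identifiers, and the <code>parents</code> mapping is an assumed in-memory representation.

```python
def ancestors(term: str, parents: dict) -> set:
    """Collect every ancestor of `term` in a DAG given child -> parents lists."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a term can be reached via several paths
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Toy ontology: "leaf" has two parents that share one root.
    toy_parents = {
        "leaf": ["mid_1", "mid_2"],
        "mid_1": ["root"],
        "mid_2": ["root"],
        "root": [],
    }
    print(sorted(ancestors("leaf", toy_parents)))
```

The <code>seen</code> set is what distinguishes a DAG traversal from a tree traversal: a node with several parents may be reachable by more than one path, and it should be visited only once.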
|||
|||
Some conventional methods for functional annotation are [[sequence homology|homology]]-based and rely on local [[sequence alignment|alignment]] search tools.<ref name="vu2021"/> Their premise is that high sequence conservation between two genomic elements implies that their function is conserved as well. Pairs of homologous sequences that appeared through [[Sequence homology#Paralogy|paralogy]], [[Sequence homology#Orthology|orthology]], or [[Sequence homology#Xenology|xenology]] usually perform a similar function. However, orthologous sequences should be treated with caution for two reasons: (1) they might have different names depending on when they were originally annotated, and (2) they may not perform the same functional role in two different organisms. Annotators often refer to an [[convergent evolution|analogous]] sequence when no paralogy, orthology or xenology is found.<ref name="soh2012" /> Homology-based methods have several drawbacks, such as errors in the database, low sensitivity/specificity, inability to distinguish between paralogy and homology,<ref name="sasson2006">{{cite journal | vauthors = Sasson O, Kaplan N, Linial M | title = Functional annotation prediction: all for one and one for all | journal = Protein Science | volume = 15 | issue = 6 | pages = 1557–1562 | date = June 2006 | pmid = 16672244 | pmc = 2242553 | doi = 10.1110/ps.062185706 }}</ref> artificially high scores due to the presence of low complexity regions, and significant variation within a protein family.<ref name="sinha2020">{{cite journal | vauthors = Sinha S, Lynn AM, Desai DK | title = Implementation of homology based and non-homology based computational methods for the identification and annotation of orphan enzymes: using Mycobacterium tuberculosis H37Rv as a case study | journal = BMC Bioinformatics | volume = 21 | issue = 1 | pages = 466 | date = October 2020 | pmid = 33076816 | pmc = 574302 | doi = 10.1186/s12859-020-03794-x | doi-access = free }}</ref>
|||
Functional annotation can be performed through probabilistic methods. The distribution of [[hydrophilic]] and [[hydrophobic]] [[amino acid]]s indicates whether a protein is located in a solution or membrane. Specific [[sequence motif]]s provide information on [[posttranslational modifications]] and final location of any given protein.<ref name="soh2012" /> Probabilistic methods may be paired with a controlled vocabulary, such as GO; for example, [[protein-protein interaction]] (PPI) networks usually place proteins with similar functions close to each other.<ref name="letovsky2003">{{cite journal | vauthors = Letovsky S, Kasif S | title = Predicting protein function from protein/protein interaction data: a probabilistic approach | journal = Bioinformatics | volume = 19 | issue = Suppl 1 | pages = i197–i204 | date = 2003 | pmid = 12855458 | doi = 10.1093/bioinformatics/btg1026 | doi-access = free }}</ref> |
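The use of hydrophobic and hydrophilic residue distribution can be illustrated with a sliding-window hydropathy scan in Python. The source does not name a specific scale; this sketch assumes the Kyte–Doolittle hydropathy scale as one common choice, and the window size and threshold defaults are illustrative parameters rather than recommended settings.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not specified by the text above).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(protein: str, window: int = 19, threshold: float = 1.6):
    """Return start indices of windows whose mean hydropathy exceeds threshold.

    Windows with a high average are candidate membrane-spanning segments.
    Only the 20 standard one-letter amino acid codes are supported.
    """
    protein = protein.upper()
    hits = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if mean > threshold:
            hits.append(i)
    return hits


if __name__ == "__main__":
    toy = "DDDDD" + "L" * 19 + "KKKKK"  # hydrophobic core flanked by charged residues
    print(hydrophobic_windows(toy))
```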
|||
[[Machine learning]] methods are also used to generate functional annotations for novel proteins based on GO terms. Generally, they consist of constructing a [[binary classifier]] for each GO term; these classifiers are then combined to make predictions on individual GO terms (forming a [[multiclass classifier]]), for which confidence scores are later obtained. The [[support vector machine]] (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as [[k-nearest neighbors]] (kNN) and [[convolutional neural network]] (CNN), have also been employed.<ref name="vu2021" />
|||
Binary or multiclass classification methods for functional annotation generally produce less accurate results because they do not take into account the interrelations between GO terms. More advanced methods that consider these interrelations do so by either a flat or hierarchical approach, which are distinguished by the fact that the former does not take into account the ontology structure, while the latter does. Some of these methods compress the GO terms by [[matrix factorization]] or by [[hash function|hashing]], thus boosting their performance.<ref name="zhao2020" /> |
|||
===Noncoding sequence function prediction=== |
|||
[[noncoding DNA|Noncoding sequence]]s (ncDNA) are those that do not code for proteins. They include elements such as pseudogenes, segmental duplications, binding sites and RNA genes.<ref name="alexander2010" /> |
|||
[[Pseudogene]]s are mutated copies of protein-coding genes that lost their coding function due to a disruption in their [[open reading frame]] (ORF), making them [[translation (biology)|untranslatable]].<ref name="alexander2010" /> They may be identified using one of the following two methods:<ref name="dainat2021">{{cite book | vauthors = Dainat J, Pontarotti P |title=Pseudogenes |chapter=Methods to Identify and Study the Evolution of Pseudogenes Using a Phylogenetic Approach | veditors = Poliseno L |series=Methods in Molecular Biology |date=2021 |volume=2324 |publisher=Springer |location=New York |isbn=978-1-0716-1503-4 |pages=21–34 |edition=Second | doi = 10.1007/978-1-0716-1503-4_2 |pmid=34165706 |s2cid=235625288 |chapter-url=https://hal.archives-ouvertes.fr/hal-03389474/file/2.%20Dainat%20chapter_230520LP_JD.pdf }}</ref> |
|||
* '''Homology-based method'''. Pseudogenes are identified by searching for sequences that are similar to functional genes but contain mutations that disrupt their ORF. This method cannot determine the evolutionary relationship between a pseudogene and its parent gene, nor the time elapsed since the event happened. |
||
* '''Phylogeny-based method'''. Pseudogenes are identified by means of a phylogenetic analysis. First, a species tree of the species of interest and a phylogenetic tree of the gene (or gene family) of interest are constructed. The two are then compared to identify a species that has lost the gene. Next, within the genome of the species where the gene was not found, a sequence is searched for that is orthologous to the gene identified in the closest species. Finally, if this orthologous sequence has a disruption in its ORF (and it meets other criteria, such as [[RNA-Seq]] data analysis, [[Ka/Ks ratio|dN/dS ratio]], etc.), the sequence is considered a pseudogene. |
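Both methods above rely on detecting a disrupted ORF. A minimal Python sketch of such a check is given below; it assumes the candidate sequence has already been aligned and trimmed to the expected coding span of the parent gene, and the function name and the specific criteria (frame-breaking length, missing ATG start, missing or premature stop) are illustrative assumptions rather than a complete pseudogene test.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_is_disrupted(cds: str) -> bool:
    """Return True if the sequence cannot be read as one intact ORF."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return True  # too short, or an indel has broken the reading frame
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return True  # start or terminal stop codon has been lost
    # Any stop codon before the last position truncates the protein.
    return any(codon in STOP_CODONS for codon in codons[1:-1])


if __name__ == "__main__":
    print(orf_is_disrupted("ATGAAATAG"))     # intact toy ORF
    print(orf_is_disrupted("ATGTAAAAATAG"))  # premature stop codon
    print(orf_is_disrupted("ATGAATAG"))      # frameshift (length not a multiple of 3)
```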
||
[[Segmental duplication]]s are DNA segments of more than 1000 base pairs that are repeated in the genome with more than 90% sequence identity. Two strategies used for their identification are WGAC and WSSD:<ref name="numanagic2018">{{cite journal | vauthors = Numanagic I, Gökkaya AS, Zhang L, Berger B, Alkan C, Hach F | title = Fast characterization of segmental duplications in genome assemblies | journal = Bioinformatics | volume = 34 | issue = 17 | pages = i706–i714 | date = September 2018 | pmid = 30423092 | pmc = 6129265 | doi = 10.1093/bioinformatics/bty586 }}</ref> |
||
* '''Whole-Genome Assembly Comparison''' (WGAC). It aligns the entire genome to itself in order to identify repeated sequences after filtering out common repeats; it does not require having the original reads used for the assembly. |
||
* '''Whole-genome Shotgun Sequence Detection''' (WSSD). It aligns the original reads with the assembled genome and searches for regions with a higher read depth than the average, which usually are signals of duplication. Segmental duplications identified by this method but not by WGAC are likely collapsed duplications, which means that they were mistakenly aligned to the same region.<ref name="hartasanchez2018">{{cite journal | vauthors = Hartasánchez DA, Brasó-Vives M, Heredia-Genestar JM, Pybus M, Navarro A | title = Effect of Collapsed Duplications on Diversity Estimates: What to Expect | journal = Genome Biology and Evolution | volume = 10 | issue = 11 | pages = 2899–2905 | date = November 2018 | pmid = 30364947 | pmc = 6239678 | doi = 10.1093/gbe/evy223 }}</ref> |
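The read-depth idea behind WSSD can be sketched in Python as a scan for runs of positions whose depth exceeds a multiple of the genome-wide mean. The function name and the <code>fold</code> and <code>min_len</code> parameters are hypothetical, and the per-position depth list is assumed to have been computed beforehand from the read alignments.

```python
def high_depth_regions(depth, fold: float = 1.5, min_len: int = 3):
    """Return (start, end) runs where depth > fold * mean depth.

    Coordinates are 0-based and end-exclusive; runs shorter than min_len
    are discarded as likely noise.
    """
    if not depth:
        return []
    threshold = fold * (sum(depth) / len(depth))
    regions = []
    start = None
    for i, d in enumerate(depth):
        if d > threshold:
            if start is None:
                start = i  # a new high-depth run begins
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(depth) - start >= min_len:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    toy_depth = [10, 10, 10, 30, 30, 30, 30, 10, 10, 10]
    print(high_depth_regions(toy_depth))
```

Note that the duplicated region itself raises the mean, so a robust implementation would more likely use a median or a depth estimate from known single-copy regions; the mean is used here only to mirror the description above.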
||
[[DNA binding site]]s are regions in the genome sequence that bind to and interact with specific proteins. They play an important role in [[DNA replication]] and [[DNA repair|repair]], [[transcriptional regulation]], and [[viral disease|viral infection]]. Binding site prediction involves the use of one of the following two methods:<ref name="si2015">{{cite journal | vauthors = Si J, Zhao R, Wu R | title = An overview of the prediction of protein DNA-binding sites | journal = International Journal of Molecular Sciences | volume = 16 | issue = 3 | pages = 5194–5215 | date = March 2015 | pmid = 25756377 | pmc = 4394471 | doi = 10.3390/ijms16035194 | doi-access = free }}</ref> |
||
* '''Sequence similarity based methods'''. They identify binding sites by searching for homologous sequences with known DNA binding sites, or by aligning such sequences with query proteins. Their performance is usually low because the DNA binding sequences are less [[conserved sequence|conserved]]. |
||
* '''Structure based methods'''. They employ the three-dimensional structural information of proteins to predict the locations of DNA binding sites. |
||
[[Noncoding RNA]] (ncRNA), produced by RNA genes, is a type of RNA that is not translated into a protein. It includes molecules such as [[tRNA]], [[rRNA]], [[snoRNA]], and [[microRNA]], as well as noncoding [[mRNA]]-like transcripts. ''Ab initio'' prediction of RNA genes in a single genome often yields inaccurate results (with an exception being miRNA), so multi-genome comparative methods are used instead. These methods are specifically concerned with the secondary structures of ncRNA, as they are conserved in related species even when their sequence is not. Therefore, by performing a multiple sequence alignment, more useful information can be obtained for their prediction. Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, due to presence of a large number of repeats and pseudogenes.<ref name="griffiths2007">{{cite journal | vauthors = Griffiths-Jones S | title = Annotating noncoding RNA genes | journal = Annual Review of Genomics and Human Genetics | volume = 8 | pages = 279–298 | date = 2007 | pmid = 17506659 | doi = 10.1146/annurev.genom.8.080706.092419 }}</ref> |
||
===Databases for structural annotation=== |
|||
{{also|List of biological databases}} |
|||
Both homology-based and ab initio structural annotation methods require supporting data stored in databases. The former use this data for alignment, while the latter use it for training and validation. The following table contains some databases often used in structural annotation pipelines.<ref name="ejigu2020" /> |
|||
{| class="wikitable" |
|||
|+ |
|||
|- |
|||
! Name !! Used for !! Link |
|||
|- |
|||
| [[GenBank]]<ref name="sayers2019">{{cite journal |last1=Sayers |first1=EW |last2=Cavanaugh |first2=M |last3=Clark |first3=K |last4=Ostell |first4=J |last5=Pruitt |first5=KD |last6=Karsch-Mizrachi |first6=I |title=GenBank. |journal=Nucleic acids research |date=8 January 2019 |volume=47 |issue=D1 |pages=D94-D99 |doi=10.1093/nar/gky989 |pmid=30365038 |url=https://academic.oup.com/nar/article/47/D1/D94/5144964 |access-date=25 April 2023 |pmc=6323954}}</ref> || [[Nucleotide]] sequences || https://www.ncbi.nlm.nih.gov/genbank/ |
|||
|- |
|||
| [[European Nucleotide Archive]] (ENA)<ref name="brooksbank2014">{{cite journal |last1=Brooksbank |first1=C |last2=Bergman |first2=MT |last3=Apweiler |first3=R |last4=Birney |first4=E |last5=Thornton |first5=J |title=The European Bioinformatics Institute's data resources 2014. |journal=Nucleic acids research |date=January 2014 |volume=42 |issue=Database issue |pages=D18-25 |doi=10.1093/nar/gkt1206 |pmid=24271396 |url=https://academic.oup.com/nar/article/42/D1/D18/1060862 |access-date=25 April 2023 |pmc=3964968}}</ref> || [[Nucleotide]] sequences || https://www.ebi.ac.uk/ena/browser/ |
|||
|- |
|||
| [[DNA Data Bank of Japan]] (DDBJ)<ref name="kodama2023">{{cite journal |last1=Kodama |first1=Y |last2=Mashima |first2=J |last3=Kosuge |first3=T |last4=Kaminuma |first4=E |last5=Ogasawara |first5=O |last6=Okubo |first6=K |last7=Nakamura |first7=Y |last8=Takagi |first8=T |title=DNA Data Bank of Japan: 30th anniversary. |journal=Nucleic acids research |date=4 January 2018 |volume=46 |issue=D1 |pages=D30-D35 |doi=10.1093/nar/gkx926 |pmid=29040613 |url=https://academic.oup.com/nar/article/46/D1/D30/4429162 |access-date=25 April 2023 |pmc=5753283}}</ref> || [[Nucleotide]] sequences || http://www.ddbj.nig.ac.jp/ |
|||
|- |
|||
| [[UniProt]]<ref name="uniprot2019">{{cite journal |last1=UniProt Consortium |title=UniProt: a worldwide hub of protein knowledge. |journal=Nucleic acids research |date=8 January 2019 |volume=47 |issue=D1 |pages=D506-D515 |doi=10.1093/nar/gky1049 |pmid=30395287 |url=https://academic.oup.com/nar/article/47/D1/D506/5160987 |access-date=25 April 2023 |pmc=6323992}}</ref> || [[Protein]] sequences || https://www.uniprot.org/ |
|||
|- |
|||
| [[InterPro]]<ref name="mitchell2019">{{cite journal |last1=Mitchell |first1=AL |last2=Attwood |first2=TK |last3=Babbitt |first3=PC |last4=Blum |first4=M |last5=Bork |first5=P |last6=Bridge |first6=A |last7=Brown |first7=SD |last8=Chang |first8=HY |last9=El-Gebali |first9=S |last10=Fraser |first10=MI |last11=Gough |first11=J |last12=Haft |first12=DR |last13=Huang |first13=H |last14=Letunic |first14=I |last15=Lopez |first15=R |last16=Luciani |first16=A |last17=Madeira |first17=F |last18=Marchler-Bauer |first18=A |last19=Mi |first19=H |last20=Natale |first20=DA |last21=Necci |first21=M |last22=Nuka |first22=G |last23=Orengo |first23=C |last24=Pandurangan |first24=AP |last25=Paysan-Lafosse |first25=T |last26=Pesseat |first26=S |last27=Potter |first27=SC |last28=Qureshi |first28=MA |last29=Rawlings |first29=ND |last30=Redaschi |first30=N |last31=Richardson |first31=LJ |last32=Rivoire |first32=C |last33=Salazar |first33=GA |last34=Sangrador-Vegas |first34=A |last35=Sigrist |first35=CJA |last36=Sillitoe |first36=I |last37=Sutton |first37=GG |last38=Thanki |first38=N |last39=Thomas |first39=PD |last40=Tosatto |first40=SCE |last41=Yong |first41=SY |last42=Finn |first42=RD |title=InterPro in 2019: improving coverage, classification and access to protein sequence annotations. |journal=Nucleic acids research |date=8 January 2019 |volume=47 |issue=D1 |pages=D351-D360 |doi=10.1093/nar/gky1100 |pmid=30398656 |url=https://academic.oup.com/nar/article/47/D1/D351/5162469 |access-date=25 April 2023 |pmc=6323941}}</ref> || [[Protein]] sequences || http://www.ebi.ac.uk/interpro/ |
|||
|- |
|||
| [[NONCODE]]<ref name="fang2018">{{cite journal |last1=Fang |first1=S |last2=Zhang |first2=L |last3=Guo |first3=J |last4=Niu |first4=Y |last5=Wu |first5=Y |last6=Li |first6=H |last7=Zhao |first7=L |last8=Li |first8=X |last9=Teng |first9=X |last10=Sun |first10=X |last11=Sun |first11=L |last12=Zhang |first12=MQ |last13=Chen |first13=R |last14=Zhao |first14=Y |title=NONCODEV5: a comprehensive annotation database for long non-coding RNAs. |journal=Nucleic acids research |date=4 January 2018 |volume=46 |issue=D1 |pages=D308-D314 |doi=10.1093/nar/gkx1107 |pmid=29140524 |url=https://academic.oup.com/nar/article/46/D1/D308/4616876 |access-date=25 April 2023 |pmc=5753287}}</ref> || [[Long noncoding RNA]] sequences || http://www.noncode.org/ |
|||
|- |
|||
| [[Pseudogene (database)|Pseudogene.org]]<ref name="karro2007">{{cite journal |last1=Karro |first1=JE |last2=Yan |first2=Y |last3=Zheng |first3=D |last4=Zhang |first4=Z |last5=Carriero |first5=N |last6=Cayting |first6=P |last7=Harrison |first7=P |last8=Gerstein |first8=M |title=Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. |journal=Nucleic acids research |date=January 2007 |volume=35 |issue=Database issue |pages=D55-60 |doi=10.1093/nar/gkl851 |pmid=17099229 |url=https://academic.oup.com/nar/article/35/suppl_1/D55/1100825 |access-date=25 April 2023 |pmc=1669708}}</ref> || [[Pseudogene]] sequences || http://www.pseudogene.org/
|||
|- |
|||
| Dfam<ref name="hubley2016">{{cite journal |last1=Hubley |first1=R |last2=Finn |first2=RD |last3=Clements |first3=J |last4=Eddy |first4=SR |last5=Jones |first5=TA |last6=Bao |first6=W |last7=Smit |first7=AF |last8=Wheeler |first8=TJ |title=The Dfam database of repetitive DNA families. |journal=Nucleic acids research |date=4 January 2016 |volume=44 |issue=D1 |pages=D81-9 |doi=10.1093/nar/gkv1272 |pmid=26612867 |url=https://academic.oup.com/nar/article/44/D1/D81/2503084 |access-date=25 April 2023 |pmc=4702899}}</ref> || [[Transposon]] sequences || https://www.dfam.org/home |
|||
|- |
|||
| [[miRBase]]<ref name="kozomara2019">{{cite journal |last1=Kozomara |first1=A |last2=Birgaoanu |first2=M |last3=Griffiths-Jones |first3=S |title=miRBase: from microRNA sequences to function. |journal=Nucleic acids research |date=8 January 2019 |volume=47 |issue=D1 |pages=D155-D162 |doi=10.1093/nar/gky1141 |pmid=30423142 |url=https://academic.oup.com/nar/article/47/D1/D155/5179337 |access-date=25 April 2023 |pmc=6323917}}</ref> || [[MicroRNA]] sequences || http://www.mirbase.org/ |
|||
|} |
|||
==Visualization== |
||
[[File:GBK File Snapshot.svg|thumb|A snapshot of an annotated GBK file created with Prokka.<ref name="seemann2014" /> It shows the components (features) of a small portion of ''[[Candidatus Carsonella ruddii]]'''s genome, including their positions (structural annotation) and inferred functions (functional annotation).]] |
|||
===File formats=== |
||
Visualization of annotations in a [[genome browser]] requires a descriptive output file, which should describe the [[intron]]-[[exon]] structures of each annotation, their start and stop [[codons]], UTRs and alternative transcripts, and ideally should include information about the [[sequence alignment]]s and [[gene prediction]]s that support each gene model. Some commonly used formats for describing annotations are GenBank, [[GFF3]], GTF, [[BED (file format)|BED]] and EMBL.<ref name="yandell2012">{{cite journal | vauthors = Yandell M, Ence D | title = A beginner's guide to eukaryotic genome annotation | journal = Nature Reviews. Genetics | volume = 13 | issue = 5 | pages = 329–342 | date = April 2012 | pmid = 22510764 | doi = 10.1038/nrg3174 | s2cid = 3352427 }}</ref> Some of these formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.<ref name="dominguez2018"/> |
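As a minimal sketch of how one of these formats is laid out, the Python snippet below parses a single GFF3 feature line into its nine tab-separated columns. The example record and the <code>parse_gff3_line</code> helper are hypothetical illustrations and do not come from any particular annotation tool.

```python
from urllib.parse import unquote

# The nine GFF3 columns, in the order the format defines them.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated tag=value pairs (URL-escaped).
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[unquote(tag)] = unquote(value)
    record["attributes"] = attributes
    return record


# Made-up record, for illustration only.
example = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=hypothetical_gene"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```

A genome browser reading such a file would use the coordinates and strand to place the feature on a track, and the attributes to label it.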
||
===Genome browsers=== |
||
{{main|Genome browser}} |
||
Genomic browsers are software products that simplify the analysis and visualization of large genomic sequence and annotation data to gain biological insight, via a graphical interface.<ref name="valeev2013">{{cite journal | vauthors = Valeev T, Yevshin I, Kolpakov F |title=BioUML Genome Browser |journal=Virtual Biology |date=2013 |volume=1 |issue=1 |pages=15 |doi=10.12704/vb/e8|doi-access=free }}</ref><ref name="ejigu2020">{{cite journal | vauthors = Ejigu GF, Jung J | title = Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing | journal = Biology | volume = 9 | issue = 9 | pages = 295 | date = September 2020 | pmid = 32962098 | doi = 10.3390/biology9090295 | pmc = 7565776 | doi-access = free }}</ref><ref name="szot2017">{{cite journal | vauthors = Szot PS, Yang A, Wang X, Parsania C, Röhm U, Wong KH, Ho JW | title = PBrowse: a web-based platform for real-time collaborative exploration of genomic data | journal = Nucleic Acids Research | volume = 45 | issue = 9 | pages = e67 | date = May 2017 | pmid = 28100700 | doi = 10.1093/nar/gkw1358 | pmc = 5605237 }}</ref> |
||
Genomic browsers can be divided into '''web-based genomic browsers''' and '''stand-alone genomic browsers'''. The former use information from databases and can be classified into ''multiple-species'' (integrate sequence and annotations of multiple organisms and promote cross-species comparative analysis) and ''species-specific'' (focus on one organism and the annotations for particular species). The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.<ref name="wang2013">{{cite journal | vauthors = Wang J, Kong L, Gao G, Luo J | title = A brief introduction to web-based genome browsers | journal = Briefings in Bioinformatics | volume = 14 | issue = 2 | pages = 131–143 | date = March 2013 | pmid = 22764121 | doi = 10.1093/bib/bbs029 | doi-access = free }}</ref><ref name="soh2012"/> |
|||
===Comparative visualization of genomes=== |
|||
[[File:Viruses-07-02761-g005.png|thumb|A linear comparative genome visualization of several [[type species]] of phylogenetically related [[virus|viral]] [[Family (biology)|families]] and [[genus|genera]]. Functional annotations of proteins are displayed in distinct colors and homologies in different tones.]] |
|||
[[Comparative genomics]] aims to identify similarities and differences in genomic features, as well as to examine evolutionary relationships between organisms.<ref name="jung2019">{{cite journal | vauthors = Jung J, Kim JI, Yi G | title = geneCo: a visualized comparative genomic method to analyze multiple genome structures | journal = Bioinformatics | volume = 35 | issue = 24 | pages = 5303–5305 | date = December 2019 | pmid = 31350879 | doi = 10.1093/bioinformatics/btz596 | pmc = 6954651 }}</ref> Visualization tools capable of illustrating the comparative behavior between two or more genomes are essential for this approach, and can be classified into three categories based on the representation of the relationships between the compared genomes:<ref name="soh2012" /> |
|||
* '''Dot plots:''' This scheme can only show the alignment of two genomes. One genome is represented along the horizontal axis and the other along the vertical axis, and the dots in the plot represent the genomic elements that are similar between the two.
|||
* '''Linear representation:''' This representation uses multiple linear tracks to represent multiple genomes and their features, where a "track" refers to a specific type of genomic feature at a genomic location.
|||
* '''Circular representation:''' This representation facilitates comparison of whole microbial or viral genomes. In this visualization mode, concentric circles and arcs are used to represent genomic sections. |
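The dot plot idea can be sketched in a few lines of Python. This is a toy illustration that assumes exact matching of short words (k-mers) between two made-up sequences; the function names are hypothetical, and real comparative tools work from alignments and handle far larger genomes.

```python
def dot_plot(seq_x, seq_y, k=3):
    """Return (i, j) pairs where the k-mer starting at seq_x[i] equals the one at seq_y[j]."""
    # Index every k-mer of the horizontal sequence by its start position.
    positions = {}
    for i in range(len(seq_x) - k + 1):
        positions.setdefault(seq_x[i:i + k], []).append(i)
    dots = []
    for j in range(len(seq_y) - k + 1):
        for i in positions.get(seq_y[j:j + k], []):
            dots.append((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: columns follow seq_x, rows follow seq_y."""
    marked = set(dots)
    return "\n".join(
        "".join("*" if (i, j) in marked else "." for i in range(len(seq_x)))
        for j in range(len(seq_y))
    )


# Two made-up sequences that share the segment "GATTACA".
x = "TTGATTACAGG"
y = "CGATTACATC"
dots = dot_plot(x, y)
print(render(x, y, dots))
```

The shared segment appears as a diagonal run of dots, which is how conserved regions are usually recognized in this kind of plot.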
|||
==Quality control== |
|||
The quality of the [[sequence assembly]] influences the quality of the annotation, so it is important to assess assembly quality before performing the subsequent annotation steps.<ref name="ejigu2020" /> Three metrics have been used to quantify the quality of a genome annotation: [[Precision and recall#Recall|recall]], [[Precision and recall#Precision|precision]] and [[Accuracy and precision#In multiclass classification|accuracy]]. These measures, however, are not explicitly used in annotation projects, but rather in discussions of prediction accuracy.<ref name="ouzounis2002">{{cite journal | vauthors = Ouzounis CA, Karp PD | title = The past, present and future of genome-wide re-annotation | journal = Genome Biology | volume = 3 | issue = 2 | pages = COMMENT2001 | date = 2002 | pmid = 11864365 | doi = 10.1186/gb-2002-3-2-comment2001 | pmc = 139008 | doi-access = free }}</ref>
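As an illustration of these three metrics, the sketch below computes them at the nucleotide level by comparing a predicted annotation with a reference. The coordinates are invented, and classifying every base as either annotated or not is a simplifying assumption rather than a method prescribed by the cited sources.

```python
def annotated_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_metrics(reference, predicted, genome_length):
    """Return (recall, precision, accuracy) of a prediction against a reference."""
    ref = annotated_positions(reference)
    pred = annotated_positions(predicted)
    tp = len(ref & pred)                  # annotated in both
    fp = len(pred - ref)                  # predicted only
    fn = len(ref - pred)                  # missed by the prediction
    tn = genome_length - tp - fp - fn     # annotated in neither
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / genome_length
    return recall, precision, accuracy


# Invented coordinates on a hypothetical 1,000-base genome.
reference = [(101, 200), (301, 400)]
predicted = [(151, 250), (301, 400)]
recall, precision, accuracy = annotation_metrics(reference, predicted, 1000)
print(recall, precision, accuracy)
```

The sketch assumes that both annotations are non-empty; otherwise the recall or precision denominator would be zero.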
|||
Community annotation approaches are useful for quality control and standardization in genome annotation. An annotation jamboree that took place in 2002 led to the creation of the annotation standards used by the Sanger Institute's Human and Vertebrate Analysis Project (HAVANA).<ref name="webSI">{{cite web |title=Manual Annotation - Wellcome Sanger Institute |url=https://www.sanger.ac.uk/project/manual-annotation/ |website=www.sanger.ac.uk |access-date=28 Mar 2023 |archive-url=https://web.archive.org/web/20230202174747/https://www.sanger.ac.uk/project/manual-annotation/ |archive-date=2 Feb 2023}}</ref><ref name="brent2005" />
|||
===Re-annotation=== |
|||
Annotation projects often rely on previous annotations of an organism's genome; however, these older annotations may contain errors that can propagate to new annotations. As new genome analysis technologies are developed and richer databases become available, the annotation of some older genomes may be updated. This process, known as re-annotation, can provide users with new information about the genome, including details about genes and protein functions. Re-annotation is therefore a useful approach in quality control.<ref name="ouzounis2002" /><ref name="siezen2010">{{cite journal | vauthors = Siezen RJ, van Hijum SA | title = Genome (re-)annotation and open-source annotation pipelines | journal = Microbial Biotechnology | volume = 3 | issue = 4 | pages = 362–369 | date = July 2010 | pmid = 21255336 | doi = 10.1111/j.1751-7915.2010.00191.x | pmc = 3815804 }}</ref>
|||
==Community annotation== |
||
Community annotation consists in the engagement of a community (both scientific and nonscientific) in genome annotation projects. It can be classified into the following six categories:<ref name="loveland2012">{{cite journal | vauthors = Loveland JE, Gilbert JG, Griffiths E, Harrow JL | title = Community gene annotation in practice | journal = Database | volume = 2012 | issue = 2012 | pages = bas009 | date = 2012 | pmid = 22434843 | doi = 10.1093/database/bas009 | pmc = 3308165 }}</ref><ref name="Stein2001" /> |
||
* '''Factory model:''' Annotation is performed by a completely automated pipeline. |
||
* '''Museum model:''' [[Biocuration|Manual curation]] by experts is involved to interpret the results of an annotation project. |
||
* '''Cottage industry model:''' Annotation is decentralized and is the result of the effort from different part-time curators. |
||
* '''Party or jamboree model:''' Consists of a short intensive workshop with leading curators from the community. It was first used in the ''[[Drosophila melanogaster]]'' genome annotation project.<ref name="hartl2000">{{cite journal | vauthors = Hartl DL | title = Fly meets shotgun: shotgun wins | journal = Nature Genetics | volume = 24 | issue = 4 | pages = 327–328 | date = April 2000 | pmid = 10742085 | doi = 10.1038/74125 | s2cid = 5354139 }}</ref> |
||
* '''Blessed annotator:''' A variation of the museum model, applied in the [https://www.genome.gov/17515708#:~:text=The%20Knockout%20Mouse%20Project%20%28KOMP%29%20is%20a%20trans-National,%28See%3A%20The%20NIH%20Knockout%20Mouse%20Project%20Website%20%5Bnih.gov%5D%29/ Knockout Mouse Project (KOMP)], in which curators go through a training period prior to annotation, and are then given access to annotation tools to continue their work. |
||
* '''Gatekeeper approach:''' It is a combination of the jamboree and cottage industry models. It begins with an annotation workshop, followed by a decentralized collaboration to extend and refine the initial annotation. It has been used for multiple species data. |
||
A community annotation is said to be ''supervised'' when a coordinator manages the project by requesting the annotation of specific items from a select number of experts. When, on the other hand, anyone can join a project and coordination is decentralized, it is called ''unsupervised'' community annotation. Supervised community annotation is short-lived and limited to the duration of the event, whereas the unsupervised counterpart does not have this limitation. However, the latter has been less successful than the former, presumably due to a lack of time, motivation, incentive and/or communication.<ref name="mazumder2010">{{cite journal | vauthors = Mazumder R, Natale DA, Julio JA, Yeh LS, Wu CH | title = Community annotation in biology | journal = Biology Direct | volume = 5 | issue = 1 | pages = 12 | date = February 2010 | pmid = 20167071 | doi = 10.1186/1745-6150-5-12 | pmc = 2834641 | doi-access = free }}</ref>
||
Wikipedia has multiple WikiProjects aimed at improving annotation. The [[Wikipedia:WikiProject Molecular Biology/Genetics/Gene Wiki|Gene WikiProject]], for instance, operates a [[internet bot|bot]] that harvests gene data from research databases and creates gene [[Wikipedia:Stub|stubs]] on that basis.<ref name=Huss2008>{{cite journal | vauthors = Huss JW, Orozco C, Goodale J, Wu C, Batalov S, Vickers TJ, Valafar F, Su AI | display-authors = 6 | title = A gene wiki for community annotation of gene function | journal = PLOS Biology | volume = 6 | issue = 7 | pages = e175 | date = July 2008 | pmid = 18613750 | pmc = 2443188 | doi = 10.1371/journal.pbio.0060175 | doi-access = free }}</ref> The [[Wikipedia:WikiProject Molecular Biology/RNA|RNA WikiProject]] seeks to write articles that describe individual RNAs and RNA families in an accessible way.<ref name="daub2008">{{cite journal | vauthors = Daub J, Gardner PP, Tate J, Ramsköld D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S, Bateman A | display-authors = 6 | title = The RNA WikiProject: community annotation of RNA families | journal = RNA | volume = 14 | issue = 12 | pages = 2462–2464 | date = December 2008 | pmid = 18945806 | doi = 10.1261/rna.1200508 | pmc = 2590952 }}</ref>
|||
==Applications== |
||
===Disease diagnosis=== |
||
Researchers use Gene Ontology (GO) to establish disease-gene relationships, as GO helps identify novel genes and the alterations in their expression, distribution and function under different sets of conditions, such as diseased versus healthy.<ref name="saxena2021">{{cite book | vauthors = Saxena R, Bishnoi R, Singla D | chapter = Gene Ontology: application and importance in functional annotation of the genomic data | veditors = Singh B, Pathak RK |title=Bioinformatics : methods and applications |date=2021 |publisher=Academic Press |location=London |isbn=978-0-323-89775-4 |pages=145–157 |doi = 10.1016/B978-0-323-89775-4.00015-8 }}</ref>
||
Databases of these disease-gene relationships for different organisms have been created, such as Plant-Pathogen Ontology,<ref name="cooper2016">{{cite book | vauthors = Cooper L, Jaiswal P | title = Plant Bioinformatics | chapter = The Plant Ontology: A Tool for Plant Genomics | veditors = Edwards D | series = Methods in Molecular Biology |date=2016 | volume = 1374 |publisher=Humana Press |location=Totowa, N.J. |isbn=978-1-4939-3167-5 |pages=89–114 |edition=2nd | doi = 10.1007/978-1-4939-3167-5_5 | pmid = 26519402 }}</ref> Plant-Associated Microbe Gene Ontology<ref name="torto2009">{{cite journal | vauthors = Torto-Alalibo T, Collmer CW, Gwinn-Giglio M | title = The Plant-Associated Microbe Gene Ontology (PAMGO) Consortium: community development of new Gene Ontology terms describing biological processes involved in microbe-host interactions | journal = BMC Microbiology | volume = 9 | issue = Suppl 1 | pages = S1 | date = February 2009 | pmid = 19278549 | doi = 10.1186/1471-2180-9-S1-S1 | pmc = 2654661 | doi-access = free }}</ref> or DisGeNET.<ref name="piñero2020">{{cite journal | vauthors = Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI | title = The DisGeNET knowledge platform for disease genomics: 2019 update | journal = Nucleic Acids Research | volume = 48 | issue = D1 | pages = D845–D855 | date = January 2020 | pmid = 31680165 | doi = 10.1093/nar/gkz1021 | pmc = 7145631 }}</ref> Others have been implemented within pre-existing databases, such as the Rat Disease Ontology in the Rat Genome Database.<ref name="thomas2016">{{cite journal | vauthors = Hayman GT, Laulederkind SJ, Smith JR, Wang SJ, Petri V, Nigam R, Tutaj M, De Pons J, Dwinell MR, Shimoyama M | display-authors = 6 | title = The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database | journal = Database | volume = 2016 | date = 2016 | pages = baw034 | pmid = 27009807 | doi = 10.1093/database/baw034 | pmc = 4805243 }}</ref>
||
===Bioremediation=== |
||
A great diversity of [[catabolic]] [[enzymes]] involved in [[hydrocarbon]] degradation by some bacterial strains are encoded by genes located in their [[mobile genetic elements]] (MGEs). The study of these elements is of great importance in the field of bioremediation, since the inoculation of wild or genetically modified strains carrying these MGEs has recently been pursued as a means of acquiring these hydrocarbon degradation capacities.<ref name="top2002">{{cite journal | vauthors = Top EM, Springael D, Boon N | title = Catabolic mobile genetic elements and their potential use in bioaugmentation of polluted soils and waters | journal = FEMS Microbiology Ecology | volume = 42 | issue = 2 | pages = 199–208 | date = November 2002 | pmid = 19709279 | doi = 10.1111/j.1574-6941.2002.tb01009.x | s2cid = 15173391 | doi-access = free | hdl = 1854/LU-348539 | hdl-access = free }}</ref>
||
In 2013, Phale et al.<ref name="phale2013">{{cite journal | vauthors = Phale PS, Paliwal V, Raju SC, Modak A, Purohit HJ | title = Genome Sequence of Naphthalene-Degrading Soil Bacterium Pseudomonas putida CSV86 | journal = Genome Announcements | volume = 1 | issue = 1 | pages = 234–235 | date = January 2013 | pmid = 23469351 | doi = 10.1128/genomeA.00234-12 | pmc = 3587945 }}</ref> published the genome annotation of a strain of ''[[Pseudomonas putida]]'' (CSV86), a bacterium known for its preference for [[naphthalene]] and other [[aromatic compounds]] over [[glucose]] as a carbon and energy source.
||
To find the MGEs of this bacterium, its genome was annotated using RAST and the [https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ NCBI Prokaryotic Genome Annotation Pipeline] (PGAP), and nine mobile elements were identified with the [https://www-is.biotoul.fr/ Insertion Sequence (IS) Finder] database. This analysis located the upper pathway genes of naphthalene degradation<ref name="trivedi2016">{{cite journal | vauthors = Trivedi VD, Jangir PK, Sharma R, Phale PS | title = Insights into functional and evolutionary analysis of carbaryl metabolic pathway from Pseudomonas sp. strain C5pp | journal = Scientific Reports | volume = 6 | issue = 1 | pages = 38430 | date = December 2016 | pmid = 27924916 | doi = 10.1038/srep38430 | pmc = 5141477 | bibcode = 2016NatSR...638430T }}</ref> right next to the [[genes]] encoding tRNA-Gly and integrase, identified the genes encoding enzymes involved in the degradation of [[salicylate]], [[benzoate]], [[4-hydroxybenzoate]], [[phenylacetic acid]] and hydroxyphenyl acetic acid, and recognized an [[operon]] involved in glucose transport in the strain.
||
[[Gene Ontology]] analysis is of great importance in functional annotation, and specifically in bioremediation it can be applied to know the relationships between the genes of some microorganisms with their functions and their role in the remediation of certain contaminants.<ref name=" |
[[Gene Ontology]] analysis is of great importance in functional annotation; in bioremediation specifically, it can be used to relate the genes of microorganisms to their functions and to their role in the remediation of certain contaminants. This approach was used, for example, in the characterization of [[Halomonas|''Halomonas zincidurans'']] strain B6(T), a bacterium with thirty-one genes encoding resistance to [[heavy metals]], especially zinc,<ref name="huo2014">{{cite journal | vauthors = Huo YY, Li ZY, Cheng H, Wang CS, Xu XW | title = High quality draft genome sequence of the heavy metal resistant bacterium Halomonas zincidurans type strain B6(T) | journal = Standards in Genomic Sciences | volume = 9 | issue = 30 | pages = 30 | date = 2014 | pmid = 25945155 | doi = 10.1186/1944-3277-9-30 | pmc = 4286145 | doi-access = free }}</ref> and of [[Stenotrophomonas|''Stenotrophomonas sp.'']] DDT-1, a strain capable of using [[DDT]] as its sole carbon and energy source.<ref name="pan2016">{{cite journal | vauthors = Pan X, Lin D, Zheng Y, Zhang Q, Yin Y, Cai L, Fang H, Yu Y | display-authors = 6 | title = Biodegradation of DDT by Stenotrophomonas sp. DDT-1: Characterization and genome functional analysis | journal = Scientific Reports | volume = 6 | issue = 1 | pages = 21332 | date = February 2016 | pmid = 26888254 | doi = 10.1038/srep21332 | pmc = 4758049 | bibcode = 2016NatSR...621332P }}</ref>

==Software==

Genes in a [[Eukaryote|eukaryotic]] genome can be annotated using various annotation tools<ref name="gaas2022">{{Citation |title=GAAS |date=2022-04-13 |url=https://github.com/NBISweden/GAAS/blob/07bd49a1623c0e13f77b1747b124d6d9385b67b7/annotation/knowledge/annotation_tools_genome.md |publisher=NBIS -- National Bioinformatics Infrastructure Sweden |access-date=2022-04-25}}</ref> such as FINDER.<ref name="pmid33879057">{{cite journal | vauthors = Banerjee S, Bhandary P, Woodhouse M, Sen TZ, Wise RP, Andorf CM | title = FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences | journal = BMC Bioinformatics | volume = 22 | issue = 1 | pages = 205 | date = April 2021 | pmid = 33879057 | pmc = 8056616 | doi = 10.1186/s12859-021-04120-9 | doi-access = free }}</ref> Some modern annotation [[Pipeline (computing)|pipelines]], such as MOSGA,<ref name="martin2021">{{cite journal | vauthors = Martin R, Hackl T, Hattab G, Fischer MG, Heider D | title = MOSGA: Modular Open-Source Genome Annotator | journal = Bioinformatics | volume = 36 | issue = 22–23 | pages = 5514–5515 | date = April 2021 | pmid = 33258916 | doi = 10.1093/bioinformatics/btaa1003 | hdl-access = free | hdl = 21.11116/0000-0006-FED4-D | veditors = Birol I }}</ref><ref name="martin2021w">{{Cite web | vauthors = Martin R |title=MOSGA |url=https://mosga.mathematik.uni-marburg.de |access-date=2022-04-25 |website=mosga.mathematik.uni-marburg.de |language=en}}</ref> offer a user-friendly web interface and software containerization. Modern annotation pipelines for [[prokaryote|prokaryotic]] genomes include Bakta,<ref name="schwengers2021">{{cite journal | vauthors = Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A | title = Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification | journal = Microbial Genomics | volume = 7 | issue = 11 | date = November 2021 | pmid = 34739369 | pmc = 8743544 | doi = 10.1099/mgen.0.000685 | doi-access = free }}</ref> Prokka<ref name="seemann2014">{{cite journal | vauthors = Seemann T | title = Prokka: rapid prokaryotic genome annotation | journal = Bioinformatics | volume = 30 | issue = 14 | pages = 2068–2069 | date = July 2014 | pmid = 24642063 | doi = 10.1093/bioinformatics/btu153 }}</ref> and PGAP.<ref name="li2021">{{cite journal | vauthors = Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F | display-authors = 6 | title = RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation | journal = Nucleic Acids Research | volume = 49 | issue = D1 | pages = D1020–D1028 | date = January 2021 | pmid = 33270901 | pmc = 7779008 | doi = 10.1093/nar/gkaa1105 }}</ref>

The [https://www.bioontology.org/ National Center for Biomedical Ontology] develops tools for automated annotation<ref name="ncbo2023">{{Cite web |title=NCBO Annotator |url=https://ncbo.bioontology.org/annotator-service |access-date=2023-02-08 |website=ncbo.bioontology.org}}</ref> of database records based on the textual descriptions of those records.

As a general method, [[dcGO]]<ref name="pmid23161684">{{cite journal | vauthors = Fang H, Gough J | title = DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more | journal = Nucleic Acids Research | volume = 41 | issue = Database issue | pages = D536–D544 | date = January 2013 | pmid = 23161684 | pmc = 3531119 | doi = 10.1093/nar/gks1080 }}</ref> has an automated procedure for statistically inferring associations between ontology terms and [[protein domain]]s or combinations of domains from the existing gene/protein-level annotations.

A variety of software tools have been developed that allow scientists to view and share genome annotations, such as [http://www.yandell-lab.org/software/maker.html MAKER].

Line 223: | Line 182: | ||
* [[WormBase]]

== References ==

{{Reflist}}

{{Use dmy dates|date=April 2017}}
Latest revision as of 15:22, 9 May 2024
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome,[2] by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate.[3] Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.[4]
Annotation is performed after a genome is sequenced and assembled, and is a necessary step in genome analysis before the sequence is deposited in a database and described in a published article. Although describing individual genes and their products or functions is sufficient for a description to be considered an annotation, the depth of analysis reported in the literature for different genomes varies widely, with some reports including additional information that goes beyond a simple annotation.[5] Furthermore, due to the size and complexity of sequenced genomes, DNA annotation is not performed manually, but is instead automated by computational means. However, the conclusions drawn from the obtained results require manual expert analysis.[6]
DNA annotation is classified into two categories: structural annotation, which identifies and demarcates elements in a genome, and functional annotation, which assigns functions to these elements.[7] This is not the only way in which it has been categorized, as several alternatives, such as dimension-based[8] and level-based classifications,[3] have also been proposed.
History
The first generation of genome annotators used local ab initio methods, which are based solely on the information that can be extracted from the DNA sequence on a local scale, that is, one open reading frame (ORF) at a time.[9][10] They appeared as a necessity to handle the enormous amount of data produced by the Maxam-Gilbert and Sanger DNA sequencing techniques developed in the late 1970s. The first software used to analyze sequencing reads was the Staden Package, created by Rodger Staden in 1977.[11] It performed several tasks related to annotation, such as base and codon counts. In fact, codon usage was the main strategy used by several early protein coding sequence (CDS) prediction methods,[12][13][14] based on the assumption that the most translated regions in a genome contain codons with the most abundant corresponding tRNAs (the molecules responsible for carrying amino acids to the ribosome during protein synthesis), allowing a more efficient translation.[15] This was also known to be the case for synonymous codons, which are often present in proteins expressed at a lower level.[13][16]
The advent of complete genomes in the 1990s (the first one being the genome of Haemophilus influenzae sequenced in 1995) introduced a second generation of annotators. Just like in the previous generation, they performed annotation through ab initio methods, but now applied on a genome-wide scale.[9][10] Markov models are the driving force behind many algorithms used within annotators of this generation;[17][18] these models can be thought of as directed graphs where nodes represent different genomic signals (such as transcription and translation start sites) connected by arrows representing the scanning of the sequence. To ensure a Markov model detects a genomic signal, it must first be trained on a series of known genomic signals.[19] The output of Markov models in the context of annotation includes the probabilities of every kind of genomic element in every single part of the genome, and an accurate Markov model will assign high probabilities to correct annotations and low probabilities to the incorrect ones.[20]
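As a rough illustration of the training and scoring idea described above, the following Python sketch trains two first-order Markov chains (a deliberate simplification of the models used by real annotators) and scores a sequence by its log-odds under the two models. The function names, the pseudocount, and the toy training sequences are assumptions made for this example only.

```python
import math

BASES = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | base)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in BASES:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in BASES}
    return model

def log_odds(seq, signal_model, background_model):
    """Sum of log ratios; positive values favour the signal model."""
    return sum(
        math.log(signal_model[a][b] / background_model[a][b])
        for a, b in zip(seq, seq[1:])
    )

# Toy training data (hypothetical, not real genomic signals).
signal = train_markov(["GCGCGCGC"])
background = train_markov(["ATATATAT"])
print(log_odds("GCGC", signal, background))
print(log_odds("ATAT", signal, background))
```

A sequence resembling the training signal receives a positive score and one resembling the background a negative score, which mirrors the high and low probabilities mentioned in the text.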
As more sequenced genomes became available in the early and mid-2000s, coupled with the numerous protein sequences that were obtained experimentally, genome annotators began employing homology-based methods, launching the third generation of genome annotation. These new methods allowed annotators not only to infer genomic elements through statistical means (as in previous generations) but also to compare the sequence being annotated with other already existing and validated sequences. These so-called combiner annotators, which perform both ab initio and homology-based annotation, require fast alignment algorithms to identify regions of homology.[2][9][10]
In the late 2000s, genome annotation shifted its attention towards identifying non-coding regions in DNA, which was achieved thanks to the appearance of methods to analyze transcription factor binding sites, DNA methylation sites, chromatin structure, and other RNA and regulatory region analysis techniques. Other genome annotators also began to focus on population-level studies represented by the pangenome; by doing so, for instance, annotation pipelines ensure that core genes of a clade are also found in new genomes of the same clade. Both annotation strategies constitute the fourth generation of genome annotators.[9][10]
By the 2010s, the genome sequences of more than a thousand human individuals (through the 1000 Genomes Project) and several model organisms became available. As such, genome annotation remains a major challenge for scientists investigating the human and other genomes.[21][22]
Structural annotation
Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters.[6][23] The main steps of structural annotation are:
- Repeat identification and masking.
- Evidence alignment (optional).
- Splice identification (only in eukaryotes).
- Feature prediction (coding and noncoding sequences).
Repeat identification and masking
The first step of structural annotation consists in the identification and masking of repeats, which include low-complexity sequences (such as AGAGAGAG, or homopolymeric segments like TTTTTTTTT), and transposons (which are larger elements with several copies across the genome).[2][24] Repeats are a major component of both prokaryotic and eukaryotic genomes; for instance, between 0% and over 42% of prokaryotic genomes consist of repeats[25] and three quarters of the human genome are composed of repetitive elements.[26]
Identifying repeats is difficult for two main reasons: they are poorly conserved, and their boundaries are not clearly defined. Because of this, repeat libraries must be built for the genome of interest, which can be accomplished with one of the following methods:[24][27]
- De novo methods. Repeats are identified by detecting and grouping pairs of sequences at different locations whose similarity is above a minimum threshold of sequence conservation in a self-genome comparison, thus requiring no prior information about repeat structure or sequences. The disadvantage of these methods is that they can identify any repeated sequence, not just transposons, and may include conserved coding sequences (CDS), making careful post-processing an indispensable step to remove these sequences. They may also leave out related regions that have degraded over time and may group elements that have no connection in their evolutionary history.[28]
- Homology-based methods. Repeats are identified by similarity (homology) to known repeats stored in a curated database. These methods are more likely to find real transposons, even in lower quantities, when compared with de novo methods, but are biased towards previously identified families.
- Structure-based methods. Repeats are identified based on models of their structure, rather than repetition or similarity. They are capable of identifying real transposons (just like the homology-based ones), but are not biased by known elements. However, they are highly specific to each class of repeat, and, as such, are less universally applicable.
- Comparative genomic methods. Repeats are identified as disruptions of one or more sequences in a multiple sequence alignment produced by large insertion regions. Although this strategy avoids the poorly defined boundary problem that exists in other methods, it is highly dependent on assembly quality and the level of activity of transposons in the genomes in question.
After the repetitive regions in a genome have been identified, they are masked. Masking means replacing the letters of the nucleotides (A, C, G, or T) with other letters. By doing so, these regions will be marked as repetitive and downstream analyses will treat them accordingly. Repetitive regions may produce performance issues if they are not masked, and may even produce false evidence for gene annotation (for example, treating an open reading frame (ORF) in a transposon as an exon).[24] Depending on the letters used for replacement, masking can be classified as soft or hard: in soft masking, repetitive regions are indicated with lowercase letters (a, c, g, or t), whereas in hard masking, the letters of these regions are replaced with N's. This way, for example, soft masking can be used to exclude word matches and avoid initiating an alignment in those regions, and hard masking, apart from all of this, can also exclude masked regions from alignment scores.[29][30]
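The difference between soft and hard masking can be illustrated with a short Python sketch. It assumes that repeat coordinates have already been found by some other tool and are given as 0-based, half-open intervals; the function names and the example sequence are hypothetical.

```python
def soft_mask(seq, intervals):
    """Lowercase the bases inside each (start, end) interval; end is exclusive."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)

def hard_mask(seq, intervals):
    """Replace the bases inside each (start, end) interval with N."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)

sequence = "ACGTAGAGAGACGT"
repeats = [(4, 10)]  # the low-complexity AGAGAG stretch
print(soft_mask(sequence, repeats))  # ACGTagagagACGT
print(hard_mask(sequence, repeats))  # ACGTNNNNNNACGT
```

Soft masking keeps the original bases recoverable, whereas hard masking discards them, which is why the two forms are treated differently by downstream alignment steps.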
Evidence alignment
The next step after genome masking usually involves aligning all available transcript and protein evidence with the analyzed genome, that is, aligning all known expressed sequence tags (ESTs), RNAs and proteins of the organism being annotated with the genome.[31] Although it is optional, it can improve gene sequence elucidation because RNAs and proteins are direct products of coding sequences.[19]
If RNA-Seq data is available, it may be used to annotate and quantify all of the genes and their isoforms located in the corresponding genome, providing not only their locations, but also their rates of expression.[32] However, transcripts provide insufficient information for gene prediction because they might be unobtainable from some genes, they may encode operons of more than one gene, and their start and stop codons cannot be determined due to frameshifts and translation initiation factors.[19] To solve this problem, proteogenomics-based approaches are employed, which utilize information from expressed proteins often derived from mass spectrometry.[33]
Splice identification
Annotation of eukaryotic genomes has an extra layer of difficulty due to RNA splicing, a post-transcriptional process in which introns (non-coding regions) are removed and exons (coding regions) are joined.[23] Therefore, eukaryotic coding sequences (CDS) are discontinuous, and, to ensure their proper identification, intronic regions must be filtered. To do so, annotation pipelines must find the exon-intron boundaries, and multiple methodologies have been developed for this purpose. One solution is to use known exon boundaries for alignment; for instance, many introns begin with GT and end with AG.[31] This approach, however, cannot detect novel boundaries, so alternatives like machine learning algorithms exist that are trained on known exon boundaries and quality information to predict new ones.[34] Predictors of new exon boundaries usually require efficient data-compression and alignment algorithms, but they are prone to failure in boundaries located in regions with low sequence coverage or high error rates produced during sequencing.[35][36]
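The GT...AG rule mentioned above can be sketched as a naive scan for candidate introns. This is only an illustration of the boundary pattern, not a splice-site predictor: it reports every GT/AG pairing above a minimum length and will therefore produce many false positives on real sequence. The function name and the `min_len` parameter are assumptions of this example.

```python
def candidate_introns(seq, min_len=4):
    """Return (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and half-open, so seq[start:end] is the candidate.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]

dna = "CCGTAAAGCC"
for start, end in candidate_introns(dna):
    print(start, end, dna[start:end])  # 2 8 GTAAAG
```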
Feature prediction
A genome is divided into coding and noncoding regions, and the last step of structural annotation consists in identifying these features within the genome. In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose.[19] Gene prediction is a misleading term, as most gene predictors only identify coding sequences (CDS) and do not report untranslated regions (UTRs); for this reason, CDS prediction has been proposed as a more accurate term.[24] CDS predictors detect genome features through methods called sensors, which include signal sensors that identify functional site signals such as promoters and polyA sites, and content sensors that classify DNA sequences into coding and noncoding content.[37] Whereas prokaryotic CDS predictors mostly deal with open reading frames (ORFs), which are segments of DNA between the start and stop codons, eukaryotic CDS predictors are faced with a more difficult problem because of the complex organization of eukaryotic genes.[3] CDS prediction methods can be classified into three broad categories:[2][31]
- Ab initio methods (also called statistical, intrinsic, or de novo). CDS prediction is based solely on the information that can be extracted from the DNA sequence. They rely on statistical methods such as the hidden Markov model (HMM). Some methods employ two or more genomes to infer local mutation rates and patterns along the genome.[38]
- Homology-based methods (also called empirical, evidence-driven, or extrinsic). CDS prediction is based on similarity to known sequences. Specifically, it performs alignments of the analyzed sequence with expressed sequence tags (ESTs), complementary DNA (cDNA), or protein sequences.
- Combiners. CDS prediction is done by a combination of both methods mentioned above.
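The notion of an ORF as the segment between a start and a stop codon can be made concrete with a minimal Python sketch. It scans only the three forward-strand reading frames, uses ATG as the sole start codon, and ignores everything a real CDS predictor would model (reverse strand, alternative starts, statistical scoring). The function name and the `min_codons` threshold are assumptions of this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) of forward-strand ORFs, stop codon included.

    Coordinates are 0-based and half-open.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs

dna = "CCATGAAATAGGG"
print(find_orfs(dna))  # [(2, 11)], i.e. ATGAAATAG
```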
Functional annotation
Functional annotation assigns functions to the genomic elements found by structural annotation,[7] by relating them to biological processes such as the cell cycle, cell death, development, metabolism, etc.[3] It may also be used as an additional quality check by identifying elements that may have been annotated in error.[2]
Coding sequence function prediction
Functional annotation of genes requires a controlled vocabulary (or ontology) to name the predicted functional features. However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups. As such, a standardized controlled vocabulary must be employed, the most comprehensive of which is the Gene Ontology (GO). It classifies functional properties into one of three categories (molecular function, biological process, and cellular component) and organizes them in a directed acyclic graph, in which every node is a particular function, and every edge (or arrow) between two nodes indicates a parent-child or subcategory-category relationship.[40][41] As of 2020, GO is the most widely used controlled vocabulary for functional annotation of genes, followed by the MIPS Functional Catalog (FunCat).[42]
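The directed acyclic graph structure has a practical consequence: a gene annotated with a specific term is implicitly annotated with all of that term's ancestors. The Python sketch below shows this propagation on a toy graph. The term labels A to D are placeholders, not real GO identifiers, and the `parents` dictionary is an assumed in-memory representation.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term in a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy DAG: D has two parents (B and C), which share the root A.
toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"]}
print(sorted(ancestors("D", toy_parents)))  # ['A', 'B', 'C']
```

Because a term may have more than one parent, the traversal tracks visited nodes so that shared ancestors are reported only once.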
Some conventional methods for functional annotation are homology-based and rely on local alignment search tools.[40] Their premise is that high sequence conservation between two genomic elements implies that their function is conserved as well. Pairs of homologous sequences that appeared through paralogy, orthology, or xenology usually perform a similar function. However, orthologous sequences should be treated with caution for two reasons: (1) they might have different names depending on when they were originally annotated, and (2) they may not perform the same functional role in two different organisms. Annotators often refer to an analogous sequence when no paralogy, orthology or xenology was found.[19] Homology-based methods have several drawbacks, such as errors in the database, low sensitivity/specificity, inability to distinguish between paralogy and homology,[43] artificially high scores due to the presence of low complexity regions, and significant variation within a protein family.[44]
Functional annotation can be performed through probabilistic methods. The distribution of hydrophilic and hydrophobic amino acids indicates whether a protein is located in a solution or membrane. Specific sequence motifs provide information on posttranslational modifications and final location of any given protein.[19] Probabilistic methods may be paired with a controlled vocabulary, such as GO; for example, protein-protein interaction (PPI) networks usually place proteins with similar functions close to each other.[45]
Machine learning methods are also used to generate functional annotations for novel proteins based on GO terms. Generally, they consist in constructing a binary classifier for each GO term, which are then joined to make predictions on individual GO terms (forming a multiclass classifier) for which confidence scores are later obtained. The support vector machine (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as k-nearest neighbors (kNN) and convolutional neural network (CNN), have also been employed.[40]
Binary or multiclass classification methods for functional annotation generally produce less accurate results because they do not take into account the interrelations between GO terms. More advanced methods that consider these interrelations do so by either a flat or hierarchical approach, which are distinguished by the fact that the former does not take into account the ontology structure, while the latter does. Some of these methods compress the GO terms by matrix factorization or by hashing, thus boosting their performance.[42]
Noncoding sequence function prediction
Noncoding sequences (ncDNA) are those that do not code for proteins. They include elements such as pseudogenes, segmental duplications, binding sites and RNA genes.[28]
Pseudogenes are mutated copies of protein-coding genes that lost their coding function due to a disruption in their open reading frame (ORF), making them untranslatable.[28] They may be identified using one of the following two methods:[46]
- Homology-based method. Pseudogenes are identified by searching sequences that are similar to functional genes but contain mutations that produce a disruption in their ORF. This method cannot determine the evolutionary relationship between a pseudogene and its parent gene nor the elapsed time since the event happened.
- Phylogeny-based method. Pseudogenes are identified by means of a phylogenetic analysis. First, a species tree of the species of interest and a phylogenetic tree of the gene (or gene family) of interest are constructed. The two are then compared to identify a species that has lost the gene. Next, within the genome of the species where the gene was not found, a sequence is searched that is orthologous to the gene identified in the closest species. Finally, if this orthologous sequence has a disruption in its ORF (and it meets other criteria, such as RNA-Seq data analysis, dN/dS ratio, etc.), it means that the sequence is indeed a pseudogene.
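Both methods above ultimately test whether a candidate sequence has a disrupted ORF. A minimal Python sketch of that single check follows; it looks only for in-frame premature stop codons and for a length that is not a multiple of three, and it assumes the candidate has already been located by a homology or orthology search. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def premature_stops(cds):
    """Return indices of in-frame stop codons before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return [idx for idx, codon in enumerate(codons[:-1]) if codon in STOPS]

def has_disrupted_orf(cds):
    """True if the length suggests a frameshift or a premature stop exists."""
    return len(cds) % 3 != 0 or bool(premature_stops(cds))

print(premature_stops("ATGTAAGGGTGA"))    # [1]
print(has_disrupted_orf("ATGAAAGGGTGA"))  # False
```

A positive result here is only one line of evidence; as noted in the list above, additional criteria such as expression data or the dN/dS ratio are normally required.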
Segmental duplications are DNA segments of more than 1000 base pairs that are repeated in the genome with more than 90% sequence identity. Two strategies used for their identification are WGAC and WSSD:[47]
- Whole-Genome Assembly Comparison (WGAC). It aligns the entire genome to itself in order to identify repeated sequences after filtering out common repeats; it does not require having the original reads used for the assembly.
- Whole-genome Shotgun Sequence Detection (WSSD). It aligns the original reads with the assembled genome and searches for regions with a higher read depth than the average, which usually are signals of duplication. Segmental duplications identified by this method but not by WGAC are likely collapsed duplications, which means that they were mistakenly aligned to the same region.[48]
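The read-depth idea behind WSSD can be sketched in a few lines of Python. The example assumes that per-base read depths are already available as a list of numbers, and it flags non-overlapping windows whose mean depth exceeds a multiple of the genome-wide mean. The window size and the `fold` threshold are illustrative assumptions, not values taken from the WSSD method.

```python
def high_depth_windows(depths, window=3, fold=1.5):
    """Return (start, end) windows whose mean depth exceeds fold * global mean."""
    mean_depth = sum(depths) / len(depths)
    hits = []
    for start in range(0, len(depths) - window + 1, window):
        chunk = depths[start:start + window]
        if sum(chunk) / window > fold * mean_depth:
            hits.append((start, start + window))
    return hits

# Toy depth profile: the middle region is covered three times as deeply.
depths = [10, 10, 10, 30, 30, 30, 10, 10, 10]
print(high_depth_windows(depths))  # [(3, 6)]
```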
DNA binding sites are regions in the genome sequence that bind to and interact with specific proteins. They play an important role in DNA replication and repair, transcriptional regulation, and viral infection. Binding site prediction involves the use of one of the following two methods:[49]
- Sequence similarity-based methods. They identify homologous sequences with known DNA binding sites, or align them with query proteins. Their performance is usually low because the DNA binding sequences are less conserved.
- Structure-based methods. They employ the three-dimensional structural information of proteins to predict the locations of DNA binding sites.
Noncoding RNA (ncRNA), produced by RNA genes, is a type of RNA that is not translated into a protein. It includes molecules such as tRNA, rRNA, snoRNA, and microRNA, as well as noncoding mRNA-like transcripts. Ab initio prediction of RNA genes in a single genome often yields inaccurate results (with an exception being miRNA), so multi-genome comparative methods are used instead. These methods are specifically concerned with the secondary structures of ncRNA, as they are conserved in related species even when their sequence is not. Therefore, by performing a multiple sequence alignment, more useful information can be obtained for their prediction. Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, due to the presence of a large number of repeats and pseudogenes.[50]
Visualization
File formats
Visualization of annotations in a genome browser requires a descriptive output file, which should describe the intron-exon structures of each annotation, their start and stop codons, UTRs and alternative transcripts, and ideally should include information about the sequence alignments and gene predictions that support each gene model. Some commonly used formats for describing annotations are GenBank, GFF3, GTF, BED and EMBL.[24] Some of these formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.[2]
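To give a sense of what such a file contains, the sketch below parses a single feature line in GFF3, a tab-separated format with nine columns (seqid, source, type, start, end, score, strand, phase, attributes). It is a minimal reader for illustration: it does not handle comment lines, directives, or percent-encoded characters, and the example record is invented.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF3 feature line must have 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = columns
    attributes = dict(
        item.split("=", 1) for item in attrs.split(";") if item
    )
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }

# Invented example record.
record = parse_gff3_line("chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=demo")
print(record["type"], record["start"], record["end"], record["attributes"]["ID"])
```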
Genome browsers
Genomic browsers are software products that, via a graphical interface, simplify the analysis and visualization of large genomic sequence and annotation data to gain biological insight.[52][31][53]
Genomic browsers can be divided into web-based genomic browsers and stand-alone genomic browsers. The former use information from databases and can be classified into multiple-species (integrate sequence and annotations of multiple organisms and promote cross-species comparative analysis) and species-specific (focus on one organism and the annotations for particular species). The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.[54][19]
Comparative visualization of genomes
Comparative genomics aims to identify similarities and differences in genomic features, as well as to examine evolutionary relationships between organisms.[55] Visualization tools capable of illustrating the comparative behavior between two or more genomes are essential for this approach, and can be classified into three categories based on the representation of the relationships between the compared genomes:[19]
- Dot Plots: This scheme can only show the alignment of two genomes. One genome is represented along the horizontal axis and the other along the vertical axis, and the dots in the plot represent the genomic elements that are similar between the two.
- Linear representation: This representation uses multiple linear tracks to represent multiple genomes and their features, where a "track" refers to a specific type of genomic feature at a genomic location.
- Circular representation: This representation facilitates comparison of whole microbial or viral genomes. In this visualization mode, concentric circles and arcs are used to represent genomic sections.
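The dot plot is the simplest of the three representations to compute. The Python sketch below returns the coordinates of the dots by matching exact words of length k between two sequences; real tools usually work on alignments or annotated features rather than exact k-mers, so the matching rule, the function name, and the value of `k` are assumptions of this example.

```python
def dot_plot(seq_a, seq_b, k=3):
    """Return (i, j) pairs where seq_a[i:i+k] equals seq_b[j:j+k]."""
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    return [
        (i, j)
        for i in range(len(seq_a) - k + 1)
        for j in index.get(seq_a[i:i + k], [])
    ]

# Consecutive dots on a diagonal indicate a shared, collinear region.
print(dot_plot("ACGTAC", "CGTA"))  # [(1, 0), (2, 1)]
```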
Quality control
The quality of the sequence assembly influences the quality of the annotation, so it is important to assess assembly quality before performing the subsequent annotation steps.[31] In order to quantify the quality of a genome annotation, three metrics have been used: recall, precision and accuracy. These measures, however, are not explicitly used in annotation projects, but rather in discussions of prediction accuracy.[56]
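The three metrics can be computed at the nucleotide level when a trusted reference annotation is available. The Python sketch below assumes that both the predicted and the reference annotations are given as sets of annotated positions; this representation and the function name are choices made for the example, and other evaluation levels (exon or gene level) are equally common.

```python
def annotation_metrics(predicted, reference, genome_length):
    """Nucleotide-level precision, recall and accuracy from position sets."""
    tp = len(predicted & reference)      # annotated in both
    fp = len(predicted - reference)      # predicted only
    fn = len(reference - predicted)      # missed by the prediction
    tn = genome_length - tp - fp - fn    # annotated in neither
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / genome_length
    return precision, recall, accuracy

predicted = set(range(0, 8))    # positions 0..7 predicted as coding
reference = set(range(4, 12))   # positions 4..11 truly coding
print(annotation_metrics(predicted, reference, genome_length=20))  # (0.5, 0.5, 0.6)
```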
Community annotation approaches are effective techniques for quality control and standardization in genome annotation. An annotation jamboree that took place in 2002 led to the creation of the annotation standards used by the Sanger Institute's Human and Vertebrate Analysis Project (HAVANA).[57][20]
Re-annotation
Annotation projects often rely on previous annotations of an organism's genome; however, these older annotations may contain errors that can propagate to new annotations. As new genome analysis technologies are developed and richer databases become available, the annotation of some older genomes may be updated. This process, known as re-annotation, can provide users with new information about the genome, including details about genes and protein functions. Re-annotation is therefore a useful approach in quality control.[56][58]
Community annotation
Community annotation consists of engaging a community (both scientific and nonscientific) in genome annotation projects. It can be classified into the following six categories:[59][3]
- Factory model: Annotation is performed by a completely automated pipeline.
- Museum model: Manual curation by experts is involved to interpret the results of an annotation project.
- Cottage industry model: Annotation is decentralized and is the result of the effort from different part-time curators.
- Party or jamboree model: Consists of a short intensive workshop with leading curators from the community. It was first used in the Drosophila melanogaster genome annotation project.[60]
- Blessed annotator: A variation of the museum model, applied in the Knockout Mouse Project (KOMP), in which curators go through a training period prior to annotation, and are then given access to annotation tools to continue their work.
- Gatekeeper approach: A combination of the jamboree and cottage industry models. It begins with an annotation workshop, followed by a decentralized collaboration to extend and refine the initial annotation. It has been used for data from multiple species.
A community annotation is said to be supervised when there is a coordinator who manages the project by requesting the annotation of specific items from a select number of experts. When anyone can enter a project and coordination is accomplished in a decentralized manner, it is called unsupervised community annotation. Supervised community annotation is short-lived and limited to the duration of the event, whereas the unsupervised counterpart does not have this limitation. However, the latter has been less successful than the former, presumably due to a lack of time, motivation, incentive and/or communication.[61]
Wikipedia has multiple WikiProjects aimed at improving annotation. The Gene WikiProject, for instance, operates a bot that harvests gene data from research databases and creates gene stubs on that basis.[62] The RNA WikiProject seeks to write articles that describe individual RNAs and RNA families in an accessible way.[63]
Applications
Disease diagnosis
Gene Ontology (GO) is used by researchers to establish disease-gene relationships, as GO helps in the identification of novel genes and of alterations in their expression, distribution and function under different sets of conditions, such as diseased versus healthy.[41] Databases of these disease-gene relationships in different organisms have been created, such as Plant-Pathogen Ontology,[64] Plant-Associated Microbe Gene Ontology[65] and DisGeNET.[66] Others have been implemented in pre-existing databases, such as the Rat Disease Ontology in the Rat Genome Database.[67]
Bioremediation
A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs). The study of these elements is of great importance in the field of bioremediation, since the inoculation of wild or genetically modified strains carrying these MGEs has recently been sought as a way to acquire these hydrocarbon degradation capacities.[68] In 2013, Phale et al.[69] published the genome annotation of a strain of Pseudomonas putida (CSV86), a bacterium known for its preference for naphthalene and other aromatic compounds over glucose as a carbon and energy source. To find the MGEs of this bacterium, its genome was annotated using RAST and the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), and nine mobile elements were identified with the Insertion Sequence (IS) Finder database. This analysis led to the localization of the upper pathway genes of naphthalene degradation,[70] right next to the genes encoding tRNA-Gly and integrase, as well as the identification of the genes encoding enzymes involved in the degradation of salicylate, benzoate, 4-hydroxybenzoate, phenylacetic acid and hydroxyphenylacetic acid, and the recognition of an operon involved in glucose transport in the strain.
Gene Ontology analysis is of great importance in functional annotation. In bioremediation specifically, it can be applied to determine the relationships between the genes of some microorganisms and their functions, as well as their role in the remediation of certain contaminants. This was the approach used in the investigation and identification of Halomonas zincidurans strain B6(T), a bacterium with thirty-one genes encoding resistance to heavy metals, especially zinc,[71] and Stenotrophomonas sp. DDT-1, a strain capable of using DDT as its sole carbon and energy source,[72] to mention a few examples.
Software
Genes in a eukaryotic genome can be annotated using various annotation tools,[73] such as FINDER.[74] Modern annotation pipelines, such as MOSGA, can offer a user-friendly web interface and software containerization.[75][76] Modern annotation pipelines for prokaryotic genomes include Bakta,[77] Prokka[51] and PGAP.[78]
The National Center for Biomedical Ontology develops tools for automated annotation[79] of database records based on the textual descriptions of those records.
As a general method, dcGO[80] has an automated procedure for statistically inferring associations between ontology terms and protein domains or combinations of domains from the existing gene/protein-level annotations.
A variety of software tools, such as MAKER, have been developed that allow scientists to view and share genome annotations.
Genome annotation is an active area of investigation involving a number of organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. The following is an alphabetical listing of ongoing projects relevant to genome annotation:
- Encyclopedia of DNA Elements (ENCODE)
- Ensembl
- Entrez Gene
- FlyBase
- GENCODE
- Gene Ontology Consortium
- GeneRIF
- Mouse Genome Informatics
- RefSeq
- UniProt
- Vertebrate and Genome Annotation Project (Vega)
- WormBase
References
- ^ Zheng S, Poczai P, Hyvönen J, Tang J, Amiryousefi A (2020). "Chloroplot: An Online Program for the Versatile Plotting of Organelle Genomes". Frontiers in Genetics. 11 (576124): 576124. doi:10.3389/fgene.2020.576124. PMC 7545089. PMID 33101394.
- ^ a b c d e f Dominguez Del Angel V, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Vinnere Pettersson O, et al. (5 February 2018). "Ten steps to get started in Genome Assembly and Annotation". F1000Research. 7 (148): 148. doi:10.12688/f1000research.13598.1. PMC 5850084. PMID 29568489.
- ^ a b c d e Stein L (July 2001). "Genome annotation: from sequence to biology". Nature Reviews. Genetics. 2 (7): 493–503. doi:10.1038/35080529. PMID 11433356. S2CID 12044602.
- ^ Davis CP (29 March 2021). "Medical Definition of Genome annotation". MedicineNet. Archived from the original on 9 February 2023. Retrieved 17 April 2023.
- ^ Koonin E, Galperin MY (2003). "Genome Annotation and Analysis". Sequence — Evolution — Function (1st ed.). Springer US. pp. 193–226. doi:10.1007/978-1-4757-3783-7_6. ISBN 978-1-4757-3783-7.
- ^ a b Mishra P, Maurya R, Avashthi H, Mittal S, Chandra M, Ramteke PW (2021). "Genome assembly and annotation". In Singh DB, Pathak RK (eds.). Bioinformatics: Methods and Applications (1st ed.). Elsevier Science. pp. 49–66. doi:10.1016/B978-0-323-89775-4.00013-4. ISBN 9780323897754.
- ^ a b Bright LA, Burgess SC, Chowdhary B, Swiderski CE, McCarthy FM (October 2009). "Structural and functional-annotation of an equine whole genome oligoarray". BMC Bioinformatics. 10 (Suppl 11): S8. doi:10.1186/1471-2105-10-S11-S8. PMC 3226197. PMID 19811692.
- ^ Reed JL, Famili I, Thiele I, Palsson BO (February 2006). "Towards multidimensional genome annotation". Nature Reviews. Genetics. 7 (2): 130–141. doi:10.1038/nrg1769. PMID 16418748. S2CID 13107786.
- ^ a b c d Abril JF, Castellano S (2019). "Genome Annotation". In Ranganathan S, Nakai K, Schonbach C, Gribskov M (eds.). Encyclopedia of Bioinformatics and Computational Biology (1st ed.). Elsevier Science. pp. 195–209. doi:10.1016/B978-0-12-809633-8.20226-4. ISBN 978-0-12-811432-2. S2CID 226248103.
- ^ a b c d Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, et al. (August 2016). "NCBI prokaryotic genome annotation pipeline". Nucleic Acids Research. 44 (14): 6614–6624. doi:10.1093/nar/gkw569. PMC 5001611. PMID 27342282.
- ^ Staden R (November 1977). "Sequence data handling by computer". Nucleic Acids Research. 4 (11): 4037–4051. doi:10.1093/nar/4.11.4037. PMC 343220. PMID 593900.
- ^ Staden R, McLachlan AD (January 1982). "Codon preference and its use in identifying protein coding regions in long DNA sequences". Nucleic Acids Research. 10 (1): 141–156. doi:10.1093/nar/10.1.141. PMC 326122. PMID 7063399.
- ^ a b Gribskov M, Devereux J, Burgess RR (January 1984). "The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression". Nucleic Acids Research. 12 (1 Pt 2): 539–549. doi:10.1093/nar/12.1part2.539. PMC 321069. PMID 6694906.
- ^ Fickett JW (August 1996). "Finding genes by computer: the state of the art". Trends in Genetics. 12 (8): 316–320. doi:10.1016/0168-9525(96)10038-X. PMID 8783942.
- ^ Grosjean H, Fiers W (June 1982). "Preferential codon usage in prokaryotic genes: the optimal codon-anticodon interaction energy and the selective codon usage in efficiently expressed genes". Gene. 18 (3): 199–209. doi:10.1016/0378-1119(82)90157-3. PMID 6751939.
- ^ Grantham R, Gautier C, Gouy M, Mercier R, Pavé A (January 1980). "Codon catalog usage and the genome hypothesis". Nucleic Acids Research. 8 (1): r49–r62. doi:10.1093/nar/8.1.197-c. PMC 327256. PMID 6986610.
- ^ Lukashin AV, Borodovsky M (February 1998). "GeneMark.hmm: new solutions for gene finding". Nucleic Acids Research. 26 (4): 1107–1115. doi:10.1093/nar/26.4.1107. PMC 147337. PMID 9461475.
- ^ Salzberg SL, Delcher AL, Kasif S, White O (January 1998). "Microbial gene identification using interpolated Markov models". Nucleic Acids Research. 26 (2): 544–548. doi:10.1093/nar/26.2.544. PMC 147303. PMID 9421513.
- ^ a b c d e f g h Soh J, Gordon PM, Sensen CW (4 September 2012). Genome Annotation. New York: Chapman and Hall/CRC. doi:10.1201/b12682. ISBN 9780429064012. Archived from the original on 18 April 2023. Retrieved 18 April 2023.
- ^ a b Brent MR (December 2005). "Genome annotation past, present, and future: how to define an ORF at each locus". Genome Research. 15 (12): 1777–1786. doi:10.1101/gr.3866105. PMID 16339376.
- ^ ENCODE Project Consortium (April 2011). Becker PB (ed.). "A user's guide to the encyclopedia of DNA elements (ENCODE)". PLOS Biology. 9 (4): e1001046. doi:10.1371/journal.pbio.1001046. PMC 3079585. PMID 21526222.
- ^ Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, et al. (November 2012). "An integrated map of genetic variation from 1,092 human genomes". Nature. 491 (7422): 56–65. Bibcode:2012Natur.491...56T. doi:10.1038/nature11632. PMC 3498066. PMID 23128226.
- ^ a b Kahl G (2015). The dictionary of genomics, transcriptomics and proteomics (Fifth ed.). Weinheim: Wiley. doi:10.1002/9783527678679. ISBN 9783527678679. Archived from the original on 4 August 2022. Retrieved 24 April 2023.
- ^ a b c d e Yandell M, Ence D (April 2012). "A beginner's guide to eukaryotic genome annotation". Nature Reviews. Genetics. 13 (5): 329–342. doi:10.1038/nrg3174. PMID 22510764. S2CID 3352427.
- ^ Treangen TJ, Abraham AL, Touchon M, Rocha EP (May 2009). "Genesis, effects and fates of repeats in prokaryotic genomes". FEMS Microbiology Reviews. 33 (3): 539–571. doi:10.1111/j.1574-6976.2009.00169.x. PMID 19396957.
- ^ Liehr T (February 2021). "Repetitive Elements in Humans". International Journal of Molecular Sciences. 22 (4): 2072. doi:10.3390/ijms22042072. PMC 7922087. PMID 33669810.
- ^ Bergman CM, Quesneville H (November 2007). "Discovering and detecting transposable elements in genome sequences". Briefings in Bioinformatics. 8 (6): 382–392. doi:10.1093/bib/bbm048. PMID 17932080.
- ^ a b c Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB (August 2010). "Annotating non-coding regions of the genome". Nature Reviews. Genetics. 11 (8): 559–571. doi:10.1038/nrg2814. PMID 20628352. S2CID 6617359.
- ^ Edgar RC (October 2010). "Search and clustering orders of magnitude faster than BLAST". Bioinformatics. 26 (19): 2460–2461. doi:10.1093/bioinformatics/btq461. PMID 20709691.
- ^ Edgar R. "Sequence masking". drive5.com. Archived from the original on 3 February 2020. Retrieved 25 April 2023.
- ^ a b c d e Ejigu GF, Jung J (September 2020). "Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing". Biology. 9 (9): 295. doi:10.3390/biology9090295. PMC 7565776. PMID 32962098.
- ^ Garber M, Grabherr MG, Guttman M, Trapnell C (June 2011). "Computational methods for transcriptome annotation and quantification using RNA-seq". Nature Methods. 8 (6): 469–477. doi:10.1038/nmeth.1613. PMID 21623353. S2CID 205419756.
- ^ Gupta N, Tanner S, Jaitly N, Adkins JN, Lipton M, Edwards R, et al. (September 2007). "Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation". Genome Research. 17 (9): 1362–1377. doi:10.1101/gr.6427907. PMC 1950905. PMID 17690205.
- ^ De Bona F, Ossowski S, Schneeberger K, Rätsch G (August 2008). "Optimal spliced alignments of short sequence reads". Bioinformatics. 24 (16): i174–i180. doi:10.1093/bioinformatics/btn300. PMID 18689821.
- ^ Trapnell C, Pachter L, Salzberg SL (May 2009). "TopHat: discovering splice junctions with RNA-Seq". Bioinformatics. 25 (9): 1105–1111. doi:10.1093/bioinformatics/btp120. PMC 2672628. PMID 19289445.
- ^ Križanovic K, Echchiki A, Roux J, Šikic M (March 2018). "Evaluation of tools for long read RNA-seq splice-aware alignment". Bioinformatics. 34 (5): 748–754. doi:10.1093/bioinformatics/btx668. PMC 6192213. PMID 29069314.
- ^ McHardy AC, Kloetgen A (2017). "Finding Genes in Genome Sequence". In Keith JM (ed.). Bioinformatics. Methods in Molecular Biology. Vol. 1525 (Second ed.). New York: Springer. pp. 271–291. doi:10.1007/978-1-4939-6622-6_11. ISBN 978-1-4939-6622-6. PMID 27896725.
- ^ Brent MR, Guigó R (June 2004). "Recent advances in gene structure prediction". Current Opinion in Structural Biology. 14 (3): 264–272. doi:10.1016/j.sbi.2004.05.007. PMID 15193305.
- ^ Binns D, Dimmer E, Huntley R, Barrell D, O'Donovan C, Apweiler R (November 2009). "QuickGO: a web-based tool for Gene Ontology searching". Bioinformatics. 25 (22): 3045–3046. doi:10.1093/bioinformatics/btp536. PMC 2773257. PMID 19744993.
- ^ a b c Vu TT, Jung J (2021). "Protein function prediction with gene ontology: from traditional to deep learning models". PeerJ. 9: e12019. doi:10.7717/peerj.12019. PMC 8395570. PMID 34513334.
- ^ a b Saxena R, Bishnoi R, Singla D (2021). "Gene Ontology: application and importance in functional annotation of the genomic data". In Singh B, Pathak RK (eds.). Bioinformatics : methods and applications. London: Academic Press. pp. 145–157. doi:10.1016/B978-0-323-89775-4.00015-8. ISBN 978-0-323-89775-4.
- ^ a b Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G (2020). "A Literature Review of Gene Function Prediction by Modeling Gene Ontology". Frontiers in Genetics. 11: 400. doi:10.3389/fgene.2020.00400. PMC 7193026. PMID 32391061.
- ^ Sasson O, Kaplan N, Linial M (June 2006). "Functional annotation prediction: all for one and one for all". Protein Science. 15 (6): 1557–1562. doi:10.1110/ps.062185706. PMC 2242553. PMID 16672244.
- ^ Sinha S, Lynn AM, Desai DK (October 2020). "Implementation of homology based and non-homology based computational methods for the identification and annotation of orphan enzymes: using Mycobacterium tuberculosis H37Rv as a case study". BMC Bioinformatics. 21 (1): 466. doi:10.1186/s12859-020-03794-x. PMC 574302. PMID 33076816.
- ^ Letovsky S, Kasif S (2003). "Predicting protein function from protein/protein interaction data: a probabilistic approach". Bioinformatics. 19 (Suppl 1): i197–i204. doi:10.1093/bioinformatics/btg1026. PMID 12855458.
- ^ Dainat J, Pontarotti P (2021). "Methods to Identify and Study the Evolution of Pseudogenes Using a Phylogenetic Approach" (PDF). In Poliseno L (ed.). Pseudogenes. Methods in Molecular Biology. Vol. 2324 (Second ed.). New York: Springer. pp. 21–34. doi:10.1007/978-1-0716-1503-4_2. ISBN 978-1-0716-1503-4. PMID 34165706. S2CID 235625288.
- ^ Numanagic I, Gökkaya AS, Zhang L, Berger B, Alkan C, Hach F (September 2018). "Fast characterization of segmental duplications in genome assemblies". Bioinformatics. 34 (17): i706–i714. doi:10.1093/bioinformatics/bty586. PMC 6129265. PMID 30423092.
- ^ Hartasánchez DA, Brasó-Vives M, Heredia-Genestar JM, Pybus M, Navarro A (November 2018). "Effect of Collapsed Duplications on Diversity Estimates: What to Expect". Genome Biology and Evolution. 10 (11): 2899–2905. doi:10.1093/gbe/evy223. PMC 6239678. PMID 30364947.
- ^ Si J, Zhao R, Wu R (March 2015). "An overview of the prediction of protein DNA-binding sites". International Journal of Molecular Sciences. 16 (3): 5194–5215. doi:10.3390/ijms16035194. PMC 4394471. PMID 25756377.
- ^ Griffiths-Jones S (2007). "Annotating noncoding RNA genes". Annual Review of Genomics and Human Genetics. 8: 279–298. doi:10.1146/annurev.genom.8.080706.092419. PMID 17506659.
- ^ a b Seemann T (July 2014). "Prokka: rapid prokaryotic genome annotation". Bioinformatics. 30 (14): 2068–2069. doi:10.1093/bioinformatics/btu153. PMID 24642063.
- ^ Valeev T, Yevshin I, Kolpakov F (2013). "BioUML Genome Browser". Virtual Biology. 1 (1): 15. doi:10.12704/vb/e8.
- ^ Szot PS, Yang A, Wang X, Parsania C, Röhm U, Wong KH, Ho JW (May 2017). "PBrowse: a web-based platform for real-time collaborative exploration of genomic data". Nucleic Acids Research. 45 (9): e67. doi:10.1093/nar/gkw1358. PMC 5605237. PMID 28100700.
- ^ Wang J, Kong L, Gao G, Luo J (March 2013). "A brief introduction to web-based genome browsers". Briefings in Bioinformatics. 14 (2): 131–143. doi:10.1093/bib/bbs029. PMID 22764121.
- ^ Jung J, Kim JI, Yi G (December 2019). "geneCo: a visualized comparative genomic method to analyze multiple genome structures". Bioinformatics. 35 (24): 5303–5305. doi:10.1093/bioinformatics/btz596. PMC 6954651. PMID 31350879.
- ^ a b Ouzounis CA, Karp PD (2002). "The past, present and future of genome-wide re-annotation". Genome Biology. 3 (2): COMMENT2001. doi:10.1186/gb-2002-3-2-comment2001. PMC 139008. PMID 11864365.
- ^ "Manual Annotation - Wellcome Sanger Institute". www.sanger.ac.uk. Archived from the original on 2 February 2023. Retrieved 28 March 2023.
- ^ Siezen RJ, van Hijum SA (July 2010). "Genome (re-)annotation and open-source annotation pipelines". Microbial Biotechnology. 3 (4): 362–369. doi:10.1111/j.1751-7915.2010.00191.x. PMC 3815804. PMID 21255336.
- ^ Loveland JE, Gilbert JG, Griffiths E, Harrow JL (2012). "Community gene annotation in practice". Database. 2012 (2012): bas009. doi:10.1093/database/bas009. PMC 3308165. PMID 22434843.
- ^ Hartl DL (April 2000). "Fly meets shotgun: shotgun wins". Nature Genetics. 24 (4): 327–328. doi:10.1038/74125. PMID 10742085. S2CID 5354139.
- ^ Mazumder R, Natale DA, Julio JA, Yeh LS, Wu CH (February 2010). "Community annotation in biology". Biology Direct. 5 (1): 12. doi:10.1186/1745-6150-5-12. PMC 2834641. PMID 20167071.
- ^ Huss JW, Orozco C, Goodale J, Wu C, Batalov S, Vickers TJ, et al. (July 2008). "A gene wiki for community annotation of gene function". PLOS Biology. 6 (7): e175. doi:10.1371/journal.pbio.0060175. PMC 2443188. PMID 18613750.
- ^ Daub J, Gardner PP, Tate J, Ramsköld D, Manske M, Scott WG, et al. (December 2008). "The RNA WikiProject: community annotation of RNA families". RNA. 14 (12): 2462–2464. doi:10.1261/rna.1200508. PMC 2590952. PMID 18945806.
- ^ Cooper L, Jaiswal P (2016). "The Plant Ontology: A Tool for Plant Genomics". In Edwards D (ed.). Plant Bioinformatics. Methods in Molecular Biology. Vol. 1374 (2nd ed.). Totowa, N.J.: Humana Press. pp. 89–114. doi:10.1007/978-1-4939-3167-5_5. ISBN 978-1-4939-3167-5. PMID 26519402.
- ^ Torto-Alalibo T, Collmer CW, Gwinn-Giglio M (February 2009). "The Plant-Associated Microbe Gene Ontology (PAMGO) Consortium: community development of new Gene Ontology terms describing biological processes involved in microbe-host interactions". BMC Microbiology. 9 (Suppl 1): S1. doi:10.1186/1471-2180-9-S1-S1. PMC 2654661. PMID 19278549.
- ^ Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI (January 2020). "The DisGeNET knowledge platform for disease genomics: 2019 update". Nucleic Acids Research. 48 (D1): D845–D855. doi:10.1093/nar/gkz1021. PMC 7145631. PMID 31680165.
- ^ Hayman GT, Laulederkind SJ, Smith JR, Wang SJ, Petri V, Nigam R, et al. (2016). "The Disease Portals, disease-gene annotation and the RGD disease ontology at the Rat Genome Database". Database. 2016: baw034. doi:10.1093/database/baw034. PMC 4805243. PMID 27009807.
- ^ Top EM, Springael D, Boon N (November 2002). "Catabolic mobile genetic elements and their potential use in bioaugmentation of polluted soils and waters". FEMS Microbiology Ecology. 42 (2): 199–208. doi:10.1111/j.1574-6941.2002.tb01009.x. hdl:1854/LU-348539. PMID 19709279. S2CID 15173391.
- ^ Phale PS, Paliwal V, Raju SC, Modak A, Purohit HJ (January 2013). "Genome Sequence of Naphthalene-Degrading Soil Bacterium Pseudomonas putida CSV86". Genome Announcements. 1 (1): 234–235. doi:10.1128/genomeA.00234-12. PMC 3587945. PMID 23469351.
- ^ Trivedi VD, Jangir PK, Sharma R, Phale PS (December 2016). "Insights into functional and evolutionary analysis of carbaryl metabolic pathway from Pseudomonas sp. strain C5pp". Scientific Reports. 6 (1): 38430. Bibcode:2016NatSR...638430T. doi:10.1038/srep38430. PMC 5141477. PMID 27924916.
- ^ Huo YY, Li ZY, Cheng H, Wang CS, Xu XW (2014). "High quality draft genome sequence of the heavy metal resistant bacterium Halomonas zincidurans type strain B6(T)". Standards in Genomic Sciences. 9 (30): 30. doi:10.1186/1944-3277-9-30. PMC 4286145. PMID 25945155.
- ^ Pan X, Lin D, Zheng Y, Zhang Q, Yin Y, Cai L, et al. (February 2016). "Biodegradation of DDT by Stenotrophomonas sp. DDT-1: Characterization and genome functional analysis". Scientific Reports. 6 (1): 21332. Bibcode:2016NatSR...621332P. doi:10.1038/srep21332. PMC 4758049. PMID 26888254.
- ^ GAAS, NBIS -- National Bioinformatics Infrastructure Sweden, 13 April 2022, retrieved 25 April 2022
- ^ Banerjee S, Bhandary P, Woodhouse M, Sen TZ, Wise RP, Andorf CM (April 2021). "FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences". BMC Bioinformatics. 22 (1): 205. doi:10.1186/s12859-021-04120-9. PMC 8056616. PMID 33879057.
- ^ Martin R, Hackl T, Hattab G, Fischer MG, Heider D (April 2021). Birol I (ed.). "MOSGA: Modular Open-Source Genome Annotator". Bioinformatics. 36 (22–23): 5514–5515. doi:10.1093/bioinformatics/btaa1003. hdl:21.11116/0000-0006-FED4-D. PMID 33258916.
- ^ Martin R. "MOSGA". mosga.mathematik.uni-marburg.de. Retrieved 25 April 2022.
- ^ Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A (November 2021). "Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification". Microbial Genomics. 7 (11). doi:10.1099/mgen.0.000685. PMC 8743544. PMID 34739369.
- ^ Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, et al. (January 2021). "RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation". Nucleic Acids Research. 49 (D1): D1020–D1028. doi:10.1093/nar/gkaa1105. PMC 7779008. PMID 33270901.
- ^ "NCBO Annotator". ncbo.bioontology.org. Retrieved 8 February 2023.
- ^ Fang H, Gough J (January 2013). "DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more". Nucleic Acids Research. 41 (Database issue): D536–D544. doi:10.1093/nar/gks1080. PMC 3531119. PMID 23161684.