Software release notes for the NCBI Eukaryotic Genome Annotation Pipeline
The software used for the NCBI annotation pipelines is under active development. This page provides a list of the major changes incorporated in releases of the Eukaryotic Genome Annotation Pipeline software.
Version 10.3
Release date: June 20 2024
Process
- Automated computation of the maximum allowed intron length and application of that value across multiple alignment tools
- Computation of normalized gene expression counts for all RNA-seq datasets aligned as part of the annotation pipeline
- Introduction of Diamond alignment tool to generate protein-to-protein alignments
- Ortholog calculation for an expanded set of arthropod genomes
- Various bug fixes and improvements
- Updates to versions of third party software and data:
- RepeatMasker v4.1.5
- Rfam v14.10
- tRNAscan-SE v2.0.12
- BUSCO v5.7.1
- samtools v1.20
Reporting
- Addition of a new downloadable file to the FTP site, in connection with the addition of normalized gene expression data:
- *_normalized_gene_expression_counts.txt.gz: tab-delimited text file with counts of normalized RNA-Seq reads mapped to each gene
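A minimal sketch of loading such a gzip-compressed, tab-delimited counts file in Python. The column layout is not described here, so the reader takes column names from the file's own header row. The skipping of '#' lines and the example text, including its column names, are assumptions made for illustration only.

```python
import csv
import gzip
import io

# Made-up example content; real column names and values may differ.
EXAMPLE_TEXT = (
    "#hypothetical comment line\n"
    "GeneID\tSRR0000001\tSRR0000002\n"
    "111\t12.5\t0\n"
)


def read_tab_delimited(handle):
    """Return the rows of a tab-delimited text handle as a list of dicts.

    Column names come from the file's own header row. Lines starting with
    '#' are skipped on the assumption that they are comments.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


def read_counts_file(path):
    """Read a gzip-compressed, tab-delimited counts file from disk."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return read_tab_delimited(handle)


rows = read_tab_delimited(io.StringIO(EXAMPLE_TEXT))
print(rows[0]["SRR0000001"])
```

Values are returned as strings; converting them to numbers is left to the caller, because the exact columns present in a given file are not assumed.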
Version 10.2
Release date: September 6 2023
Process
- Assignment of Gene Ontology terms to annotated proteins using InterProScan
- Improvements in the handling of cross-species RNA-Seq alignments with STAR
- Calculation of expression per RNA-Seq run and per gene using Subread featureCounts software
- Improved filtering of PacBio and ONT RNA alignments used for model generation
- Incremental improvements to internal processing and performance
Reporting
- Addition of new downloadable files to our FTP site, in connection with the features above
- Gene Ontology annotation of the annotated genes in GO Annotation File (GAF) format. See files *_gene_ontology.gaf.gz
- Addition of featureCounts output files. These files provide the counts of reads per RNA-Seq run per gene, for all RNA-Seq runs used in the annotation, and some metadata:
- *_gene_expression_counts.txt.gz: tab-delimited text file with counts of RNA-Seq reads mapped to each gene
- *_rnaseq_runs.txt: tab-delimited text file containing information about RNA-Seq runs used for gene expression analyses
- *_rnaseq_alignment_summary.txt files: tab-delimited text file containing information about assignment of the aligned reads to genes
- Addition of RNA-Seq coverage graph in UCSC bigWig format, for each SRA run aligned to the genome. See *_graph.bw in the RNASeq_coverage_graphs directory.
Version 10.1
Release date: December 14 2022
Process
- Better identification and removal of chimeric alignments by STAR for more accurate predictions of paralogous genes
- Trimming of low-entropy terminal exons identified by minimap2
- Revised annotation of RefSeq NM_/NR_ features with large inserts (e.g. an Alu repeat found in the transcript and not the genome) to use a single exon rather than two abutting exons
- Improvements for PAR annotation and gene placement when annotating multiple assemblies (e.g. human GRCh38.p14 and CHM13_T2Tv2.0)
- Added support for annotation of human GenBank assemblies using curated RefSeq data, available under genomes/all/pilot
- Incremental improvements to internal processing and performance
Reporting
- New nomenclature for annotations. Starting with this release, annotations are named after the assembly accession and the date on which the annotation was started. For example, the annotation for assembly GCF_016801865.2, started in December 2022, is named GCF_016801865.2-RS_2022_12.
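The naming rule above can be sketched in Python as follows. The function name is hypothetical. The "-RS_" infix and the year_month layout are copied from the single example given, and the zero-padded two-digit month is an assumption. Only GCF_ (RefSeq) accessions are accepted, because the naming of annotations on other accession types is not described here.

```python
import re
from datetime import date

# RefSeq assembly accession: "GCF_", nine digits, a dot and a version number.
REFSEQ_ACCESSION_RE = re.compile(r"^GCF_\d{9}\.\d+$")


def annotation_name(assembly_accession, start_date):
    """Build an annotation name from an accession and the annotation start date.

    Hypothetical helper: the "-RS_YYYY_MM" pattern is inferred from the one
    example in the release notes, and month zero-padding is an assumption.
    """
    if not REFSEQ_ACCESSION_RE.match(assembly_accession):
        raise ValueError(f"not a RefSeq assembly accession: {assembly_accession!r}")
    return f"{assembly_accession}-RS_{start_date.year:04d}_{start_date.month:02d}"


print(annotation_name("GCF_016801865.2", date(2022, 12, 14)))
```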
Version 10.0
Release date: June 14 2022
Process
- Aligner change for RNA-Seq reads from Splign to STAR (Dobin A, et al. Bioinformatics 2013, 29(1):15-21)
- Upgrade of RFAM library to RFAM 14.6, for the prediction of small non-coding RNAs (rRNAs, snRNAs and snoRNAs)
- Incremental improvements to internal processing and performance
Version 9.0
Release date: June 8 2021
Process
- Addition of a module for the refinement of transcription start sites with Cap analysis gene expression (CAGE) data. (Applied only in the annotation of species with public CAGE data in SRA.)
Reporting
- Addition of cap and/or polyA sites information on genomic and transcript records, when experimental support is available (CAGE for cap and RNA-Seq for polyA sites).
- On genomic records, cap and polyA_site evidence are in the /experiment field of the .gbk files, as
/experiment="COORDINATES: polyA evidence [ECO:0006239]"
or
/experiment="COORDINATES: cap analysis [ECO:0007248] and polyA evidence [ECO:0006239]"
- On transcript records, cap evidence is represented as misc_features and polyA as polyA_site features. See for example XM_027966739.2:
misc_feature 1
/gene="D2HGDH"
/experiment="COORDINATES: cap analysis [ECO:0007248]"
/note="transcription start site"
[...]
polyA_site 2524
/gene="D2HGDH"
/experiment="COORDINATES: polyA evidence [ECO:0006239]"
- The cap and polyA sites are present in column 9 of the GFF3 file.
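As an illustration, the /experiment values shown above can be scanned for their evidence codes with a short Python sketch. The function names are hypothetical. The two ECO identifiers are the ones quoted above, and the mapping of each code to "cap" or "polyA" follows those examples only.

```python
import re

# Evidence codes as they appear in the /experiment examples above.
CAP_ECO = "ECO:0007248"
POLYA_ECO = "ECO:0006239"

ECO_RE = re.compile(r"ECO:\d{7}")


def eco_codes(experiment_value):
    """Return every ECO identifier found in an /experiment qualifier value."""
    return ECO_RE.findall(experiment_value)


def evidence_kinds(experiment_value):
    """Report whether cap and/or polyA evidence codes are present."""
    codes = set(eco_codes(experiment_value))
    return {"cap": CAP_ECO in codes, "polyA": POLYA_ECO in codes}


value = "COORDINATES: cap analysis [ECO:0007248] and polyA evidence [ECO:0006239]"
print(evidence_kinds(value))
```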
Version 8.6
Release date: February 24 2021
Process
- Change in masking of genome repeats prior to alignments of transcripts and protein evidence:
- Use of WindowMasker for all organisms but human and mouse
- For human and mouse, switched RepeatMasker to using Dfam HMMs rather than RepBase libraries
- Normalization of the 5' and 3'-UTR ends of RefSeq model transcripts (with XM or XR prefix) with the curated RefSeq transcripts (NM or NR prefix) of the same gene with the same terminal exon
Reporting
- Addition to the web and XML annotation reports of:
- BUSCO results, calculated on the annotated gene set using the longest protein from each gene
- Per-run alignment statistics of long RNA-Seq reads, generated with long-read sequencing technologies such as PacBio or Oxford Nanopore
- Removal from the FTP site of files reporting genomic spans masked by RepeatMasker (*rm.out.gz files)
Version 8.5
Release date: July 9 2020
Process
- Upgrade of minimap2 to version 2.17, for aligning SRA long read transcriptomes
- Upgrade of tRNAscan-SE to version 2.0.4, for prediction of tRNAs
- Incremental improvements to internal processing and performance
Version 8.4
Release date: March 17 2020
Process
- Improvement in the naming process for fish genes. We have switched to primarily applying gene symbols and names from zebrafish, which are mostly provided by the Zebrafish Information Network (ZFIN), instead of human, to other fish orthologs. The end result is more ortholog connections, and better nomenclature.
Version 8.3
Release date: November 25 2019
Process
- Aligner change for SRA long read transcriptomes (PacBio IsoSeq, Oxford Nanopore technologies, etc.) from Splign to Minimap2 (Li H. Bioinformatics 2018, 34(18):3094-3100)
- Incremental improvements to internal processing and performance
Reporting
- Addition of annotated transcripts in BAM format to the files available for download
- Files for the annotated assemblies now available under genomes/refseq. Files in genomes/Genus_species will be archived on February 1, 2020 as announced December 5, 2019
Version 8.2
Release date: March 8 2019
Process
- Upgrade of RepeatMasker to version 4.0.8 and RepBase-20181026
- Incremental improvements to internal processing and performance
Version 8.1
Release date: June 21 2018
Process
- Incremental improvements to internal processing and performance
Version 8.0
Release date: November 20 2017
Process
- Addition of a module to the pipeline to annotate small non-coding RNAs (rRNAs, snRNAs and snoRNAs), using cmsearch from the Infernal package and RFAM 12.0 HMMs for eukaryotes (Nawrocki EP, et al. Nucleic Acids Research 2015, 43(Database issue):D130-7).
Reporting
- Changes in the web annotation reports. These result in higher consistency with the NCBI GFFs and other downloadable files. Note that web reports for annotations executed with software older than version 8.0 were not updated to the new format.
- Features annotated on organelles are now included in the 'Gene and Feature statistics' section
- Changes in the break-down of reported features:
- Immunoglobulin/T-cell receptor gene segments are reported separately from protein-coding genes.
- Pseudogenes are reported as two categories, transcribed and non-transcribed pseudogenes.
Version 7.4
Release date: April 19 2017
Process
- Incremental improvements to internal processing and performance
Reporting
- In compliance with an NCBI-wide change, gi numbers are no longer included in FASTA and GenBank format files (.fa, .mfa, .gbk and .gbs) provided on our FTP site.
- In the RNA-Seq alignments section of the annotation reports, reporting of the 'Percent of aligned reads with introns' instead of the 'Percent spliced reads'. The 'Percent of aligned reads with introns' is the proportion of reads with a spliced alignment out of all aligned reads.
- In the RNA-Seq alignments section of the annotation reports, correction in the calculation of the 'Percent aligned reads'. In some reports generated prior to version 7.4, the denominator included the count of reads from a small number of SRA runs that were not used in the annotation.
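The two statistics described above can be written out as a short Python sketch. The function names and the dictionary keys are hypothetical. The exact numerator of 'Percent aligned reads' is not stated here, so the second function assumes it is the number of aligned reads summed over the runs used in the annotation.

```python
def percent_aligned_reads_with_introns(spliced_aligned_reads, aligned_reads):
    """Proportion of aligned reads that have a spliced alignment, in percent."""
    if aligned_reads <= 0:
        raise ValueError("aligned_reads must be positive")
    if not 0 <= spliced_aligned_reads <= aligned_reads:
        raise ValueError("spliced_aligned_reads must be between 0 and aligned_reads")
    return 100.0 * spliced_aligned_reads / aligned_reads


def percent_aligned_reads(runs):
    """Percent of reads aligned, counting only runs used in the annotation.

    Each run is a dict with the hypothetical keys 'used_in_annotation',
    'total_reads' and 'aligned_reads'. Runs that were not used are left out
    of both the numerator and the denominator.
    """
    used_runs = [run for run in runs if run["used_in_annotation"]]
    total_reads = sum(run["total_reads"] for run in used_runs)
    if total_reads == 0:
        raise ValueError("no reads in the runs used for annotation")
    aligned = sum(run["aligned_reads"] for run in used_runs)
    return 100.0 * aligned / total_reads


example_runs = [
    {"used_in_annotation": True, "total_reads": 200, "aligned_reads": 150},
    {"used_in_annotation": False, "total_reads": 800, "aligned_reads": 0},
]
print(percent_aligned_reads(example_runs))
```

Including the unused run in the denominator would lower the result for the same aligned reads, which is the kind of discrepancy the correction addresses.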
Version 7.3
Release date: February 9 2017
Process
- Improvements in the alignment process for curated RefSeq sequences in masked regions of the genome
- Improvements in the global alignment process of protein evidence to the genome
- Incremental improvements to internal processing and performance
Reporting
- In the eukaryotic annotation status page, addition of links to the Genome Data Viewer (GDV) for genomes assembled to the level of chromosomes
- In the RNA-Seq alignments section of the annotation reports, addition of publications associated with RNA-Seq data
Version 7.2
Release date: September 27 2016
Process
- Added option to include in the final annotation Gnomon models with up to 99% ab initio sequence and no BlastP hit. This option may be used for annotating organisms that are distant from reference genomes and for which little long-ranging same-species or cross-species primary evidence is publicly available and aligns to the genome (e.g. some invertebrates or fungi).
- Refinements to pairwise orthology calculations to be more conservative when there are multiple paralogs and no supporting synteny information
- Incremental improvements to internal processing
Reporting
- Changes to GFF3 files. ncRNA features are now represented in the type field (column 3) with specific SO terms associated with their ncRNA_class (lnc_RNA, SRP_RNA, snRNA, RNase_MRP_RNA, etc). The "ncrna_class" attribute is no longer provided in the attributes field (column 9).
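A minimal sketch of tallying ncRNA features by the type field (column 3) of a GFF3 file. The set of types holds only the four SO terms named above and would need to be extended for other classes. The function name and the example lines are made up for illustration.

```python
from collections import Counter

# Only the SO terms named in the release note; extend as needed.
NCRNA_TYPES = frozenset({"lnc_RNA", "SRP_RNA", "snRNA", "RNase_MRP_RNA"})

# Made-up GFF3 lines for illustration.
EXAMPLE_LINES = [
    "##gff-version 3",
    "chr1\tGnomon\tlnc_RNA\t100\t900\t.\t+\t.\tID=rna1",
    "chr1\tGnomon\tmRNA\t2000\t5000\t.\t-\t.\tID=rna2",
    "chr1\tcmsearch\tsnRNA\t7000\t7100\t.\t+\t.\tID=rna3",
]


def count_ncrna_types(gff3_lines, ncrna_types=NCRNA_TYPES):
    """Count features whose column 3 value is one of the given ncRNA types."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] in ncrna_types:
            counts[columns[2]] += 1
    return counts


print(count_ncrna_types(EXAMPLE_LINES))
```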
Version 7.1
Release date: June 8 2016
Process
- Upgrade of RepeatMasker to version 4.0.6, along with RepBase Update 20150807 and RM database version 20150807
- Incremental improvements to internal processing
Version 7.0
Release date: April 8 2016
Process
- Execution of the annotation process on top-level sequences (chromosomes, and unplaced and unlocalized scaffolds) instead of scaffolds. This change improves the annotation of features spanning gaps between adjacent scaffolds. For the near future, SNP annotation will remain on scaffolds.
- Assignment of unique GeneIDs to tRNAs annotated at different locations. Note that tRNAs with the same anticodon are assigned the same Gene symbol. This change increases consistency with other gene types.
- Bug fix in the handling of coding models with a high proportion of ab initio sequence (>50%)
- Restriction in the generation of alternative variants for alternate loci units. If a gene with a known RefSeq transcript (NM_ or NR_ prefix) is placed on an alternate locus, no alternate variant model (XM_ or XR_ prefix) is created for the gene on this alternate locus. Given sufficient evidence, alternative variants for genes with a known RefSeq will continue to be generated on the primary assembly unit. This change will affect the annotation of alternate loci units in human and mouse.
- Incremental improvements to internal processing and annotation consistency
Reporting
- In Nucleotide:
- GenBank, Graphics and ASN views of RefSeq placed scaffolds no longer show any annotation (see for example NW_001594469.1)
- ASN view of RefSeq chromosomes now includes the annotation.
- On the FTP site (see for example the recent re-annotation of platypus)
- GFF files are now only provided for top-level sequences.
- Files in the CHR_* directories for nuclear chromosomes no longer include annotation on placed scaffolds.
- Masked spans (masking_coordinates.gz) are now in top-level coordinates.
- Comparisons of current to previous annotation (comparison directory) are now in top-level coordinates.
Version 6.5
Release date: November 23 2015
Process
- Due to low usage of the STS (Sequence Tagged Sites) placement information on annotated sequences, the process that maps STSs has been discontinued. STS annotation will not be produced for new RefSeq sequences, but will remain available for sequences last annotated before November 20, 2015.
- Better handling of stranded RNA-seq reads
- Incremental improvements to internal processing and annotation consistency
Reporting
- Addition of a section to the HTML annotation reports, "Comparison of current and previous annotations", for organisms that are re-annotated (see this example). This new section indicates how much of the annotation on each assembly has changed between the current and the previous annotation releases and provides links to downloadable full reports. The full reports (in tabular and Genome Workbench formats) are on our FTP site and contain the mappings of current to previous genes and transcripts. Summary counts by category of change are available in the XML annotation report, annotation_report.xml file (<AnnotationComparison> section), also in the FTP directory.
- Addition to the annotation_report.xml <RnaseqAlignReport> section of the <Stranded> tag to the individual SRA runs that were generated with a strand-specific isolation technique
- Changes to GFF3-formatted files:
- Transcript features for model RefSeqs now contain the attribute "model_evidence" in column 9, listing the source and number of supporting evidence and percent coverage by RNA-Seq samples, similar to reporting in the flatfile format.
- GFF3 output has been changed to only use small gaps (1-2 bp) (aka micro-introns) to correct for frameshifts, even if the RefSeq product has an insertion. Earlier files from software releases 6.3 and 6.4 used small overlaps to represent insertions according to INSDC specifications, but these overlaps weren’t compatible with some external software.
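To illustrate the two representations described above, the following Python sketch measures the spacing between consecutive exons of one transcript and labels each junction. The function names are hypothetical. Coordinates are assumed to be 1-based and inclusive, as in GFF3 columns 4 and 5, and the 2 bp limit for a micro-intron is taken from the "1-2 bp" wording above.

```python
def junction_gaps(exons):
    """Return the number of genomic bases between consecutive exons.

    exons is a list of (start, end) pairs, 1-based and inclusive. A negative
    value means the two exons overlap by that many bases.
    """
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def classify_gap(gap, max_micro_intron=2):
    """Label a junction by the spacing between the two exons."""
    if gap < 0:
        return "overlap"        # representation used by releases 6.3 and 6.4
    if gap == 0:
        return "abutting"
    if gap <= max_micro_intron:
        return "micro-intron"   # small gap used to correct a frameshift
    return "intron"


gaps = junction_gaps([(1, 100), (103, 200), (500, 650)])
print([classify_gap(gap) for gap in gaps])
```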
Version 6.4
Release date: July 22 2015
Process
- Improvement in the RNA-Seq alignment process. Prior to alignment to the genome, SRA runs are now evaluated for strandedness and reads of stranded runs are aligned in the sense orientation only. Unstranded runs are aligned in both orientations and logic to determine the best strand is applied downstream as before.
- Incremental improvements to internal processing and annotation consistency
Reporting
- Changes to GFF3-formatted files. Genes in the GFF files for the final annotation now contain the attribute "gene_biotype" in column 9, making explicit whether a gene is coding, non-coding, pseudogene, etc. See more details in the GFF3 documentation.
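A minimal sketch of reading the gene_biotype attribute from column 9 of a GFF3 line. The attribute parsing follows the general GFF3 convention of semicolon-separated tag=value pairs with URL-escaped values. The function names, the example line and its biotype value are made up for illustration.

```python
from urllib.parse import unquote

# Made-up GFF3 gene line for illustration.
EXAMPLE_GENE_LINE = (
    "chr1\tGnomon\tgene\t100\t900\t.\t+\t.\t"
    "ID=gene-LOC1;Name=LOC1;gene_biotype=protein_coding"
)


def parse_attributes(column9):
    """Split a GFF3 attributes field into a dict of tag -> unescaped value."""
    attributes = {}
    for field in column9.strip().split(";"):
        tag, separator, value = field.partition("=")
        if separator:
            attributes[tag.strip()] = unquote(value)
    return attributes


def gene_biotype(gff3_line):
    """Return the gene_biotype of a GFF3 feature line, or None if absent."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return None
    return parse_attributes(columns[8]).get("gene_biotype")


print(gene_biotype(EXAMPLE_GENE_LINE))
```

The function does not filter on the feature type in column 3, so it returns the attribute for any line that carries it.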
Version 6.3
Release date: April 21 2015
Process
- Improvement in the annotation of model proteins containing selenocysteine residues (see for example XM_012546481.1)
- Selenocysteine residues are now represented with a "U" (instead of a code-breaking "X") in protein sequences.
- Titles of selenocysteine-containing proteins are no longer prefixed with "LOW QUALITY PROTEIN" unless the protein contains corrections for the genome.
- Transcripts and annotation of the parent genomic sequences contain a /transl_except that explicitly provides the location of the selenocysteine residue in the sequence.
- Refinement in the logic that weighs alignments of same-species transcript versus cross-species validated RefSeq proteins to favor same-species transcripts. This change results in a smaller number of models with frameshifts or code-breaks.
- Improvement of models bordering assembly gaps
- Better handling of alignments of protein evidence affected by assembly gaps
- Generation of alternative variants of gap-filled models, if alternative variants are supported by the evidence and if the gap-filled portion is identical in all variants
- Trimming of UTRs in gap-filled portion of a transcript if shorter than 100 bases
Reporting
- Change in the reporting of RNA-Seq alignment statistics in the "Short read transcript alignments" section of the annotation reports. Raw counts of aligned and spliced reads are estimates and are subject to small variations (within 1%) from run to run, therefore only percentages rounded to the nearest integer are now reported.
Version 6.2
Release date: December 3 2014
Process
- Improvements to alignments and model generation algorithms
- Exclusion of low-entropy RNA-Seq reads from the set of reads aligned to the genome
Reporting
- Addition of a section to the annotation reports, "Alignment of the annotated proteins to a set of high-quality proteins", providing the counts of annotated proteins with BlastP hits against a database of high-confidence proteins (e.g. UniProtKB/Swiss-Prot), at several coverage thresholds. For comparison purposes the data is also provided for a selection of related organisms that were recently annotated.
- Bug fix in the calculation of the number of RNA-Seq reads aligned to the genome presented in the "Short read transcript alignments" section of the annotation reports. Statistics in reports pre-dating the 6.2 release may be off by a few percent.
- Modification of the representation of multi-interval non-trans-spliced tRNA features in GFF3 files. Each multi-interval non-trans-spliced tRNA feature is now represented by a single feature (line) of type tRNA and multiple nested features of type exon (one for each interval).
- Modification of the representation of transcripts with indels compared to the genome in GFF3 files. Insertions in transcripts within the coding region are now represented by a small overlap between the two halves of a split exon, and deletions within the coding region are represented by very short introns between the two halves of an exon. This allows software to properly interpret the reading frame. Note that the conceptual sequence of the feature can still differ from the transcript or protein sequence because of mismatches, gaps, and when overlapping genome sequence does not match the sequence of an insertion.
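The tRNA representation described above (one tRNA line with nested exon lines) can be read back into per-tRNA interval lists with a sketch like the following. It assumes that exon lines point to their tRNA through the standard GFF3 Parent attribute. The function name, identifiers and coordinates are made up for illustration.

```python
from collections import defaultdict

# Made-up GFF3 lines: one two-interval tRNA with its nested exons.
EXAMPLE_TRNA_LINES = [
    "chr1\ttRNAscan-SE\ttRNA\t100\t190\t.\t+\t.\tID=rna-trna1",
    "chr1\ttRNAscan-SE\texon\t100\t137\t.\t+\t.\tID=exon-1;Parent=rna-trna1",
    "chr1\ttRNAscan-SE\texon\t155\t190\t.\t+\t.\tID=exon-2;Parent=rna-trna1",
]


def trna_exon_intervals(gff3_lines):
    """Map each tRNA ID to the sorted (start, end) intervals of its exons."""
    trna_ids = []
    exons_by_parent = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        if columns[2] == "tRNA" and "ID" in attributes:
            trna_ids.append(attributes["ID"])
        elif columns[2] == "exon" and "Parent" in attributes:
            interval = (int(columns[3]), int(columns[4]))
            for parent in attributes["Parent"].split(","):
                exons_by_parent[parent].append(interval)
    return {trna_id: sorted(exons_by_parent[trna_id]) for trna_id in trna_ids}


print(trna_exon_intervals(EXAMPLE_TRNA_LINES))
```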
Version 6.1
Release date: August 4 2014
Process
- Addition of a post tRNAscan-SE filter to limit probable noise in tRNA predictions
- Bug fix in the unique hit exon coverage track displayed in Gene, which had caused reads with multiple placements to be included
Reporting
- In the "Short read transcript alignments" section of the annotation reports, addition of the alignment statistics per RNA-Seq SRA run, in addition to the alignment statistics per sample
Version 6.0
Release date: April 17 2014
Process
- For model RefSeqs extending into assembly gaps, construction of transcript (XM_/XR_) and protein (XP_) products using a combination of genomic and transcript sequence (RefSeq, INSDC or TSA) to compensate for missing genomic sequence.
- Improvements to identification of orthologs compared to a reference taxon, including more robust analysis of protein BLAST alignments. These changes result in more ortholog calls, especially for more distantly related taxa, with lower false-match rates. The results are used for gene naming, and are reported in Gene.
- Redesign of the code for categorizing genes by type (protein-coding, pseudogene, non-coding) and assigning names to genes and products (transcript and protein RefSeqs). These changes allow for more automation and higher throughput, as well as improve the identification of pseudogenes and low-quality protein-coding genes.
- Change in the naming of model RefSeq variants and isoforms to use the same isoform name for multiple variants that differ only in the UTRs, and to use the same variant and isoform names for equivalent model RefSeqs annotated on multiple assemblies.
Reporting
- For model RefSeqs extending into assembly gaps, addition to the nucleotide records of the source of the model spans. For example, XM_007659754.1 is a model with three exons annotated on genomic sequence AAPN01287557.1 and was allowed to extend at the 5-prime end into an assembly gap based on the alignment of transcript JQ350810.1. The flat file for this record contains the following three indicators of the origin of the model:
A comment:
An assembly gap attribute:
A PRIMARY block providing the spans of the RefSeq model on the genomic or transcript (primary) sequence:
- For model RefSeqs extending into assembly gaps, annotation of the genomic mRNA and CDS features with partial features (< or > in the flatfile view), either at internal intervals or at the 5-prime or 3-prime end, to indicate the location of the missing sequence.
- Addition of a structured comment of RefSeq attributes to the nucleotide and protein records of model RefSeqs with ab initio span(s) and/or with corrections (see XM_007529441.1 for example). The comment indicates the following, as appropriate for each model:
- Ab initio span(s): % bases not supported by evidence and produced by the ab initio component of Gnomon
- Frameshift(s): number of indels corrected
- Internal stop codon(s): number of genomic stop codons corrected
- Assembly gap(s): number of transcript bases added to fill a genome assembly gap (see above)
- Addition of keyword "corrected model" to models with frameshifts, internal stop codons or assembly gaps; and keyword "includes ab initio" to models with ab initio spans.
- Addition to the annotation reports of the number of model RefSeqs with genomic gaps filled with transcript sequence.
- Change in the annotation reports for the calculation of the number of corrected model RefSeqs. The new count, "model RefSeq with major corrections", includes all model RefSeq proteins with major corrections (CDSs with correction for internal stop-codons, frameshifts or internal gaps).
- Changes to GFF3-formatted files:
- Incorporation of the start_range and end_range attributes from the GVF specification to indicate partial features. The GFF3 specification currently does not include any formal mechanism to indicate partial features, so these attributes are borrowed from GVF with non-official (lowercase) tags. In NCBI's annotation files, presence of a start_range attribute can simply be interpreted as column 4 is partial, and an end_range attribute as column 5 is partial, regardless of strand, without further analysis of the tag value. Further details about the attributes are available in the GVF specifications.
- Reduced usage of URL escaping in attribute values.
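The interpretation of start_range and end_range given above (presence of the tag is enough, regardless of strand or tag value) can be sketched in Python as follows. The function name and the example line, including its attribute values, are made up for illustration.

```python
# Made-up GFF3 line for a feature that is partial at its column 4 coordinate.
EXAMPLE_PARTIAL_LINE = (
    "chr1\tGnomon\tmRNA\t100\t900\t.\t-\t.\t"
    "ID=rna-X1;Parent=gene-LOC1;partial=true;start_range=.,100"
)


def partial_ends(gff3_line):
    """Return (column 4 is partial, column 5 is partial) for a GFF3 line.

    Only the presence of the start_range and end_range tags is checked;
    the tag values and the strand are deliberately ignored.
    """
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    tags = {field.partition("=")[0].strip() for field in columns[8].split(";")}
    return ("start_range" in tags, "end_range" in tags)


print(partial_ends(EXAMPLE_PARTIAL_LINE))
```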
Version 5.2
Release date: November 19 2013
Process
- Exclusion of spans in protein alignments from use by gene prediction if the spans contain an intron with much lower RNA-Seq support than the rest of the alignment.
- Classification of model RefSeqs (XR_) for predicted non-coding genes as ncRNA of type lncRNA rather than misc_RNA.
- Improvements to RNA-Seq filtering criteria in regions of alternative splicing.
- Improvements to model predictions in regions of closely-spaced or overlapping genes.
- Improvements to the assembly-assembly alignment process, used for tracking genes across assemblies.
- Performance improvements.
Reporting
- Production of a report with each annotation run summarizing the features annotated and the alignments used for gene prediction. This report is available in HTML (see URL in the README_CURRENT_RELEASE file) and in XML on the FTP site.
- Change in the format of the README_CURRENT_RELEASE file distributed on the FTP site.
- Phase-out of the production of RefSeq scaffold BLAST databases. Top-level (chromosomes, unplaced and unlocalized scaffolds) BLAST databases are now the default on the organism-specific BLAST pages.
- Increased stringency for the CpG islands displayed in Map Viewer. Only islands meeting the "strict" definition of 500bp or more in length, 50% or higher in GC content and 0.60 or higher observed CpG / expected CpG are now shown in the CpG island map.
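The three "strict" thresholds above can be checked for a candidate sequence with a short Python sketch. How the observed/expected CpG ratio is computed is not stated here. The sketch assumes the commonly used ratio (number of CpG x length) / (number of C x number of G), and the function name is hypothetical.

```python
def is_strict_cpg_island(sequence, min_length=500, min_gc=0.50, min_obs_exp=0.60):
    """Test a sequence against the 'strict' CpG island thresholds.

    Assumption: observed/expected CpG = (CpG count * length) / (C count * G count).
    """
    seq = sequence.upper()
    length = len(seq)
    if length < min_length:
        return False
    c_count = seq.count("C")
    g_count = seq.count("G")
    if (c_count + g_count) / length < min_gc:
        return False
    if c_count == 0 or g_count == 0:
        return False  # the ratio is undefined without both C and G
    cpg_count = seq.count("CG")
    obs_exp = cpg_count * length / (c_count * g_count)
    return obs_exp >= min_obs_exp


print(is_strict_cpg_island("CG" * 250))
```

A sequence that is GC-rich but has few CpG dinucleotides, such as a run of C followed by a run of G, fails the third threshold even though it passes the first two.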
Version 5.1
Release date: July 19 2013
Process
- Exclusion of spans in EST or mRNA alignments from use by gene prediction if the spans contain an intron with much lower RNA-Seq support than the rest of the alignment.
- Allowed co-existence of known RefSeq (NM/NR/NP_ accessions) and model RefSeq (XM/XR/XP_ accessions) on the same gene, resulting in an increase in the number of alternate variants for organisms with a large amount of evidence (e.g. RNA-Seq).
Version 5.0
Release date: April 11 2013
Process
- Addition of a process to align RNA-Seq short reads from SRA to the genome.
- Incorporation of RNA-Seq alignments in gene prediction.
- Performance improvements.
Reporting
- Production of RNA-Seq coverage graphs and intron feature tracks.
- Addition of BioSamples in the annotated features' evidence support summary on the model RefSeq records.
Version 4.1
Release date: January 8 2013
Process
- Classification of model RefSeqs (XR_) for predicted non-coding genes as misc_RNA.
- Performance improvements.
Reporting
- Addition of a /note on RNA and CDS features describing differences between the annotation product and the genome.
- Addition of the BioProject ID on model RefSeq records (XM/XR/XP_).
Version 4.0
Release date: May 21 2012
Process
- For some genomes, addition of ab initio predictions to the model RefSeq set if these have high-quality BLAST hits to known proteins.
- Improvements to the assembly-assembly alignment process, used for tracking genes across assemblies.
- Improvements to the alignment of genomic sequence to the genome. Alignments with long gaps are now split in the Map Viewer display.
- Performance improvements.
Reporting
- Addition of annotation files in GFF3 format to the FTP site.
- Addition of BLAST databases of top-level molecules (chromosomes, unplaced and unlocalized scaffolds) to the set of BLAST databases displayed in the organism-specific BLAST pages.