Europe PMC

Abstract 


With the aim of providing a resource for the functional and evolutionary study of plant transcription factors (TFs), we updated the plant TF database PlantTFDB to version 3.0 (http://planttfdb.cbi.pku.edu.cn). After refining the TF classification pipeline, we systematically identified 129 288 TFs from 83 species, of which 67 species have genome sequences, covering the main lineages of green plants. Besides the abundant annotation provided in the previous version, we generated more annotations for identified TFs, including expression, regulation, interaction, conserved elements, phenotype information, expert-curated descriptions derived from UniProt, TAIR and NCBI GeneRIF, as well as references to provide clues for functional studies of TFs. To help identify evolutionary relationships among identified TFs, we assigned 69 450 TFs to 3924 orthologous groups, and constructed 9217 phylogenetic trees for TFs within the same families or the same orthologous groups. In addition, we set up a TF prediction server in this version for users to identify TFs from their own sequences.

Nucleic Acids Res. 2014 Jan 1; 42(Database issue): D1182–D1187.
Published online 2013 Oct 29. https://doi.org/10.1093/nar/gkt1016
PMCID: PMC3965000
PMID: 24174544

PlantTFDB 3.0: a portal for the functional and evolutionary study of plant transcription factors

INTRODUCTION

Transcription factors (TFs) play key roles in plant development and stress response by temporally and spatially regulating the transcription of their target genes. TFs are usually classified into different families based on their DNA-binding domains (DBDs). In 2000, Riechmann et al. (1) made the first attempt at a genome-wide analysis of TFs in Arabidopsis thaliana, soon after its whole genome sequence became available. In the following years, several databases dedicated to the identification and annotation of plant TFs became publicly available, either for multiple species, such as PlnTFDB (2), PlanTAPDB (3), GRASSIUS (4), LegumeTFDB (5), DATFAP (6) and TreeTFDB (7), or for individual organisms, such as AGRIS (8), RARTF (9), TOBFAC (10), SoyDB (11) and wDBTF (12). During the past 8 years, we have constructed three species-specific TF databases, DATF (13), DRTF (14) and DPTF (15), for the model organisms Arabidopsis, rice and poplar, as well as a comprehensive plant TF database (PlantTFDB) (16,17). The databases we constructed received >10 million hits per year and were widely used for the functional and evolutionary study of plant TFs, as well as for the prediction and annotation of TFs in newly sequenced genomes.

To meet requirements from our user community, we updated PlantTFDB to version 3.0 (http://planttfdb.cbi.pku.edu.cn/). In comparison with the previous two versions, PlantTFDB 3.0 covers more species and more TFs identified by the refined family assignment rules and improved prediction pipeline. In addition, new types of annotations were added, and phylogenetic trees and orthologous groups (OGs) were re-constructed. Finally, an online TF prediction server was set up (Table 1).

Table 1.

Comparison among the three versions of PlantTFDB

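| PlantTFDB | Version 1.0 | Version 2.0 | Version 3.0 |
|---|---|---|---|
| Species | 22 | 49 | 83 |
| Species with genome sequences | 5 | 28 | 67 |
| Species without genome sequences | 17 | 21 | 16 |
| TF family | 64 | 58 | 58 |
| TF number | 26 402 | 53 574 | 129 288 |
| Annotation | | | |
|     Expert-curated description | No | No | Yes |
|     Expression | Yes | Yes | Yes |
|     Regulation | No | No | Yes |
|     Interaction | No | No | Yes |
|     Phenotype | No | No | Yes |
|     Reference | Yes | Yes | Yes |
| Orthologous group | Yes | Yes | Yes |
| Phylogenetic tree | | | |
|     Family | No | Yes | Yes |
|     Orthologous group | No | No | Yes |
| Web service | No | Yes | No |
| TF prediction server | No | No | Yes |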

We believe that PlantTFDB 3.0 provides users with complete TF datasets, comprehensive annotations and useful analysis tools.

MATERIALS AND METHODS

Figure 1 shows the main steps in the construction of PlantTFDB 3.0, including data integration, TF classification, TF annotation and construction of orthologous groups.

Figure 1. The flowchart for construction of PlantTFDB 3.0.

Sequence data

We downloaded protein sequences of 67 species with genome sequences from the Joint Genome Institute (JGI) and several other institutions engaged in plant genome sequencing and annotation projects (Supplementary Table S1). For 16 species without genome sequences, we downloaded their expressed sequence tag sequences from UniGene (18) and PlantGDB-assembled unique transcripts from PlantGDB (19), and then built a reference proteome for each species (Supplementary Table S2) using a previously established pipeline (17).

Family assignment rules

TFs are usually classified into different families based on their DBDs. We used auxiliary and forbidden domains to distinguish complicated TF families with multiple signature domains. After a comprehensive literature review, we improved the family assignment rules described in the previous version (17) and arranged several families into superfamilies (Figure 2). We removed the forbidden domain Glyco_hydro_14 of the BES1 family, as recent studies demonstrated that BES1 family proteins with this domain also showed TF activity (20).

Figure 2. Refined family assignment rules used for TF identification and assignment. Green ellipses represent TF families and red rectangles represent DBDs. Blue rectangles denote auxiliary domains and purple rectangles denote forbidden domains. Green solid lines link families to DBDs or auxiliary domains, and the number ‘1’ or ‘2’ indicates the number of DBDs. Red dashed lines link families to forbidden domains. Families belonging to the same superfamily are arranged within rectangles or rhombi.
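The rule structure described above (a required number of DBD copies, optional auxiliary domains and forbidden domains) can be sketched as a small rule matcher. This is a minimal illustration rather than the PlantTFDB implementation: the `FamilyRule` class, the `assign_family` function and the domain name `BES1_DBD` are hypothetical, and only the BES1 family name and the `Glyco_hydro_14` domain come from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FamilyRule:
    family: str
    min_dbd: dict                        # DBD name -> minimum number of copies
    auxiliary: frozenset = frozenset()   # domains that must also be present
    forbidden: frozenset = frozenset()   # domains that exclude the family


def assign_family(domains, rules):
    """Return the first family whose rule matches the domain list, else None."""
    counts = Counter(domains)
    for rule in rules:
        has_dbds = all(counts[d] >= n for d, n in rule.min_dbd.items())
        has_aux = all(counts[a] > 0 for a in rule.auxiliary)
        hits_forbidden = any(counts[f] > 0 for f in rule.forbidden)
        if has_dbds and has_aux and not hits_forbidden:
            return rule.family
    return None


# Hypothetical encoding of the BES1 rule before and after the refinement:
# the earlier rule treats Glyco_hydro_14 as forbidden, the refined one does not.
old_bes1 = FamilyRule("BES1", {"BES1_DBD": 1},
                      forbidden=frozenset({"Glyco_hydro_14"}))
new_bes1 = FamilyRule("BES1", {"BES1_DBD": 1})

protein_domains = ["BES1_DBD", "Glyco_hydro_14"]
before = assign_family(protein_domains, [old_bes1])  # None
after = assign_family(protein_domains, [new_bes1])   # "BES1"
```

In this sketch rules are tried in order, so a rule requiring two DBD copies would need to be listed before a rule requiring one copy of the same DBD.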

Prediction pipeline

We refined the TF prediction pipeline by updating the hidden Markov model (HMM) profiles used to identify TFs and adjusting their thresholds. We downloaded the latest version of the HMM profiles from Pfam (version 27.0) (21) for most signature domains and built our own HMM profiles for the remaining domains that did not have Pfam HMM profiles available. We used HMMER 3.0 (22) to identify TFs and assigned them to different families according to the family assignment rules described earlier.
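As an illustration of the thresholding step, the sketch below filters per-domain hits from the tabular output that HMMER 3 writes with `--domtblout` (for `hmmscan`, column 1 is the profile name, column 4 the sequence name and column 14 the per-domain bit score). The function name, the toy domain names and all cutoff values are assumptions for the example; the thresholds actually used in PlantTFDB are not given in the text.

```python
DEFAULT_MIN_SCORE = 20.0  # hypothetical fallback cutoff, not a PlantTFDB value


def filter_domain_hits(domtblout_text, min_score_by_domain):
    """Map each protein to the list of domains whose per-domain score passes its cutoff."""
    domains_by_protein = {}
    for line in domtblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split(maxsplit=22)  # the last column is a free-text description
        if len(cols) < 22:
            continue  # not a complete domain-table row
        domain, protein = cols[0], cols[3]
        domain_score = float(cols[13])
        cutoff = min_score_by_domain.get(domain, DEFAULT_MIN_SCORE)
        if domain_score >= cutoff:
            domains_by_protein.setdefault(protein, []).append(domain)
    return domains_by_protein


# Toy rows in domtblout column order (profile and protein names are made up).
SAMPLE = """\
# toy hmmscan --domtblout rows
DBD_X - 46 prot1 - 300 1e-20 70.1 0.1 1 2 1e-12 2e-12 40.5 0.0 1 46 10 55 9 56 0.95 toy domain X
DBD_X - 46 prot1 - 300 1e-20 70.1 0.1 2 2 1e-08 2e-08 29.6 0.0 1 46 70 115 69 116 0.93 toy domain X
DBD_Y - 60 prot2 - 150 1e-03 12.0 0.0 1 1 1e-03 1e-03 12.0 0.0 1 60 5 64 4 65 0.80 toy domain Y
"""

hits = filter_domain_hits(SAMPLE, {"DBD_X": 25.0})
# prot1 keeps both DBD_X copies; prot2 falls below the default cutoff.
```

Keeping one list entry per domain copy preserves the copy number, which matters for families defined by one versus two DBDs.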

Annotation pipeline

We used a pipeline comprising several packages to annotate identified TFs. Domain structure and GO annotation were predicted by InterProScan (version 4.8) (23). Cross-links to well-known resources were assigned from the best BLAST hits with a maximum E-value of 1e-10. Nuclear localization signals were predicted by PredictNLS (24). Other information, such as expert-curated descriptions, expression, regulation, conserved elements and references, was collected from the corresponding databases. Multiple sequence alignments (MSAs) for DBDs were constructed by an HMM-guided method, and MSAs for full-length protein sequences were inferred by T-coffee (version 9.03) (25). Family trees across 83 species were inferred by FastTree (version 2.1.3) (26) with 100 resamplings. Family trees within each species were inferred by MrBayes (version 3.2.1) (27) based on the Dayhoff model for 50 000 generations. The Help page (http://planttfdb.cbi.pku.edu.cn/help_info.php#tfinfo) gives more detailed information on datasets and parameter settings.
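The cross-linking step can be sketched as a best-hit selection over BLAST tabular output (the 12-column `-outfmt 6` layout, where columns 11 and 12 hold the E-value and bit score). The 1e-10 cutoff is taken from the text; the function name, the sequence identifiers and the tie-breaking rule (lowest E-value, then highest bit score) are assumptions made for this example.

```python
def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query: best subject} among hits with E-value <= max_evalue."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is too weak to support a cross-link
        rank = (evalue, -bitscore)  # smaller is better
        if query not in best or rank < best[query][0]:
            best[query] = (rank, subject)
    return {query: subject for query, (_, subject) in best.items()}


# Toy rows: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ROWS = [
    ("tf1", "subjB", "60.0", "180", "72", "2", "5", "184", "3", "182", "1e-20", "90"),
    ("tf1", "subjA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-50", "180"),
    ("tf2", "subjC", "40.0", "90", "54", "1", "1", "90", "1", "90", "1e-5", "45"),
]
SAMPLE_BLAST = "\n".join("\t".join(row) for row in ROWS)

links = best_hits(SAMPLE_BLAST)
# tf1 links to subjA; tf2 gets no cross-link because 1e-5 exceeds the cutoff.
```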

Orthologous groups

Orthologous groups were inferred with the following steps, implemented as a pipeline following that of PLAZA (Figure 3) (28).

Figure 3. The pipeline for construction of orthologous groups.

First, we selected a representative gene model for each locus from 67 species with genome sequences and filtered out proteins if their lengths were <50 aa. Then we classified these proteins into clusters by TribeMCL (29). After that, proteins within the same cluster were assigned into orthologous groups by OrthoMCL (30). For TFs in the same orthologous group, MSAs were constructed by T-coffee and phylogenetic trees were inferred by MrBayes (27) with the same parameters described earlier.
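The first step of this pipeline, choosing one gene model per locus and dropping proteins shorter than 50 aa, can be sketched as follows. The text does not say how the representative gene model was chosen, so picking the longest protein per locus is an assumption of this example, as are the function name and the toy records.

```python
def representative_proteins(records, min_length=50):
    """Keep one protein per locus (the longest, by assumption) and drop short ones.

    records: iterable of (locus_id, protein_id, sequence) tuples.
    Returns {locus_id: (protein_id, sequence)}.
    """
    longest = {}
    for locus_id, protein_id, sequence in records:
        current = longest.get(locus_id)
        if current is None or len(sequence) > len(current[1]):
            longest[locus_id] = (protein_id, sequence)
    # The 50 aa threshold comes from the text: proteins with length < 50 are removed.
    return {
        locus_id: entry
        for locus_id, entry in longest.items()
        if len(entry[1]) >= min_length
    }


toy_records = [
    ("locus1", "locus1.1", "M" * 120),
    ("locus1", "locus1.2", "M" * 300),  # longer isoform, kept as representative
    ("locus2", "locus2.1", "M" * 49),   # shorter than 50 aa, filtered out
    ("locus3", "locus3.1", "M" * 50),   # exactly 50 aa, kept
]
selected = representative_proteins(toy_records)
```

The resulting protein set would then be passed to the clustering tools named above (TribeMCL, then OrthoMCL), which are not reproduced here.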

RESULTS AND DISCUSSION

Genomic TF repertoires of green plants

Using the refined TF prediction pipeline, we identified 129 288 TFs (116 585 loci) from 2 691 496 proteins (2 437 666 loci) of 83 species (Table 2, Supplementary Tables S3 and S4).

Table 2.

Average number of TFs in different taxonomic lineages summarized from 67 species with genome sequences

| Lineage | Species | Gene | TF (%) | Family |
|---|---|---|---|---|
| Chlorophyta | 10 | 10 550 | 141 (1.34) | 35 |
| Bryophyta (a) | 1 | 32 273 | 1079 (3.34) | 53 |
| Lycopodiophyta (b) | 1 | 22 271 | 665 (2.99) | 54 |
| Coniferophyta (c) | 1 | 71 158 | 1851 (2.60) | 55 |
| Basal Magnoliophyta (d) | 1 | 26 846 | 900 (3.35) | 58 |
| Monocot | 15 | 34 017 | 1701 (5.00) | 58 |
| Eudicot | 38 | 34 798 | 1861 (5.35) | 58 |

(a) Physcomitrella patens.

(b) Selaginella moellendorffii.

(c) Picea abies.

(d) Amborella trichopoda.
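The percentages in Table 2 appear to be the average TF count divided by the average gene count for each lineage. The short check below recomputes them from the tabulated values; the variable and function names are ours, and the numbers are copied from the table.

```python
# (lineage, species, average genes, average TFs, percentage as printed in Table 2)
TABLE2 = [
    ("Chlorophyta", 10, 10550, 141, "1.34"),
    ("Bryophyta", 1, 32273, 1079, "3.34"),
    ("Lycopodiophyta", 1, 22271, 665, "2.99"),
    ("Coniferophyta", 1, 71158, 1851, "2.60"),
    ("Basal Magnoliophyta", 1, 26846, 900, "3.35"),
    ("Monocot", 15, 34017, 1701, "5.00"),
    ("Eudicot", 38, 34798, 1861, "5.35"),
]


def tf_percentage(tf_count, gene_count):
    """Share of genes that are TFs, in percent."""
    return 100.0 * tf_count / gene_count


recomputed = {
    lineage: f"{tf_percentage(tfs, genes):.2f}"
    for lineage, _, genes, tfs, _ in TABLE2
}
mismatches = [row[0] for row in TABLE2 if recomputed[row[0]] != row[4]]
total_species = sum(row[1] for row in TABLE2)
```

With these inputs every recomputed value matches the printed one, and the species counts sum to the 67 species with genome sequences.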

The increased number of species with genome sequences and the availability of a conifer genome (31) allowed us to show the genomic TF repertoires across green plants for the first time (Table 2, Supplementary Table S3). Compared with green algae, land plants show a large increase in the number of TF families, the number of TFs and the percentage of TFs in the genome, which might correlate with the morphological complexity of land plants (32).

Comprehensive annotations for TFs

A database of well-annotated TFs may provide users with rich information as well as insightful clues for further study. In an attempt to construct a comprehensive knowledgebase for plant TFs, we collected expert-curated description, expression, regulation, mutation and phenotype data from various public resources and annotated the TFs identified in PlantTFDB 3.0 (Table 3), in addition to the abundant annotations provided in the previous two versions (16,17). We added related references for TFs by integrating information from Entrez Gene (33), UniProtKB (34) and GeneRIF (33) with references we mined ourselves.

Table 3.

Summary of annotations for TFs in PlantTFDB 3.0

Type (a) | Species | TF | Entry
Expert-curated description2221286649
Expression
    UniGene4444 86245 239
    Microarray1415 42431 975
 Plant ontology56850174 162
Regulation
 Binding site/matrix24541729
 ChIP-chip/ChIP-seq15475
 microRNA12843
 Hormone1417803
Interaction109923101
Conserved element2370963 859
Phenotype24704147 684
Reference59500420 255

(a) New types of annotations in this version are marked in bold.

Evolutionarily conserved elements may work as transcriptional regulatory elements (35,36). Therefore, we collected these elements, which were identified based on the genome alignments of 9 crucifers (36) and 20 angiosperm plants (37), and added them to the current version, in addition to the functional genomic annotations described earlier.

Orthologs usually have similar functions and are widely used to explore the functions of poorly studied proteins. To help users infer the functions of poorly studied TFs, we constructed MSAs and phylogenetic trees within the same family across 83 species, based on conserved DBDs. We further assigned 69 450 TFs to 3924 orthologous groups and constructed a phylogenetic tree for each orthologous group. As an aid to deciphering their evolutionary relationships, we also built trees for individual TF families within the same species. Hyperlinks to TF pages were added to the tree branches so that users can browse them conveniently. The MSAs and phylogenetic trees in PlantTFDB 3.0 can be freely downloaded for further analyses. Direct links to TFs of A. thaliana, the best-studied model plant and the best-annotated species in PlantTFDB 3.0, were also generated for all TFs in other species.

TF prediction server

In recent years, the TF classification rules we constructed have been widely used to annotate TFs of newly sequenced genomes (38,39). In this regard, we set up a TF prediction server (http://planttfdb.cbi.pku.edu.cn/prediction.php) for users to identify TFs from their own protein sequences. As A. thaliana is the best-annotated species in PlantTFDB 3.0, links to the best hits in A. thaliana are provided for predicted TFs. Currently, users can upload up to 100 sequences and obtain results within a minute from our server.
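Because the server accepts at most 100 sequences per submission, a larger protein set has to be split before upload. The sketch below only shows this local batching step on FASTA text; the function name is hypothetical, and it makes no assumption about how the server itself is called.

```python
def fasta_batches(fasta_text, batch_size=100):
    """Split FASTA text into chunks holding at most batch_size sequences each."""
    records, current = [], []
    for line in fasta_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(">") and current:
            records.append("\n".join(current))  # close the previous record
            current = []
        if line.startswith(">") or current:
            current.append(line)  # text before the first header is skipped
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[start:start + batch_size]) + "\n"
        for start in range(0, len(records), batch_size)
    ]


# Toy input: 250 short placeholder sequences.
toy_fasta = "".join(f">seq{i}\nMKTAYIAKQR\n" for i in range(250))
batches = fasta_batches(toy_fasta)
batch_sizes = [batch.count(">") for batch in batches]  # [100, 100, 50]
```

Each element of `batches` is a FASTA string that stays within the stated 100-sequence limit.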

Further direction

We have updated PlantTFDB to version 3.0, which provides TF repertoires across the main lineages of green plants. The knowledge we collected and the OGs and phylogenetic trees we inferred are useful resources for further exploration of the physiological functions and evolutionary relationships of TFs. We will continue to work on this project to refine the family assignment rules and the prediction pipeline, and to collect more types of useful information for identified TFs in the future.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Natural Science Foundation of China [31071160, 31171242]; China High-Tech Program [2006AA02Z334, 2012AA020409]; China National Key Basic Research Program [2011CBA01102]; China National Outstanding Youth Talents Program; China National Science and Technology Infrastructure Program [2009FY120100]. Funding for open access charge: State Key Laboratory of Protein and Plant Gene Research.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors thank Joint Genome Institute for the genome annotation of four unpublished species, AGD for Amborella trichopoda, ICGC for Citrus clementina and AGI for three rice species. They also thank their users for their suggestions and comments. They specially thank Ying Dillaha for her language editing of the manuscript.

REFERENCES

1. Riechmann J, Heard J, Martin G, Reuber L, Keddie J, Adam L, Pineda O, Ratcliffe O, Samaha R, Creelman R. Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 2000;290:2105–2110.
2. Pérez-Rodríguez P, Riaño-Pachón DM, Corrêa LGG, Rensing SA, Kersten B, Mueller-Roeber B. PlnTFDB: updated content and new features of the plant transcription factor database. Nucleic Acids Res. 2010;38:D822–D827.
3. Richardt S, Lang D, Reski R, Frank W, Rensing SA. PlanTAPDB, a phylogeny-based resource of plant transcription-associated proteins. Plant Physiol. 2007;143:1452–1466.
4. Yilmaz A, Nishiyama MY, Jr, Fuentes BG, Souza GM, Janies D, Gray J, Grotewold E. GRASSIUS: a platform for comparative regulatory genomics across the grasses. Plant Physiol. 2009;149:171–180.
5. Mochida K, Yoshida T, Sakurai T, Yamaguchi-Shinozaki K, Shinozaki K, Tran LS. LegumeTFDB: an integrative database of Glycine max, Lotus japonicus and Medicago truncatula transcription factors. Bioinformatics. 2010;26:290–291.
6. Fredslund J. DATFAP: a database of primers and homology alignments for transcription factors from 13 plant species. BMC Genomics. 2008;9:140.
7. Mochida K, Yoshida T, Sakurai T, Yamaguchi-Shinozaki K, Shinozaki K, Tran L-SP. TreeTFDB: An Integrative Database of the Transcription Factors from Six Economically Important Tree Crops for Functional Predictions and Comparative and Functional Genomics. DNA Res. 2013;20:151–162.
8. Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L, Grotewold E. AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Res. 2011;39:D1118–D1122.
9. Iida K, Seki M, Sakurai T, Satou M, Akiyama K, Toyoda T, Konagaya A, Shinozaki K. RARTF: database and tools for complete sets of Arabidopsis transcription factors. DNA Res. 2005;12:247–256.
10. Rushton PJ, Bokowiec MT, Laudeman TW, Brannock JF, Chen X, Timko MP. TOBFAC: the database of tobacco transcription factors. BMC Bioinformatics. 2008;9:53.
11. Wang Z, Libault M, Joshi T, Valliyodan B, Nguyen H, Xu D, Stacey G, Cheng J. SoyDB: a knowledge database of soybean transcription factors. BMC Plant Biol. 2010;10:14.
12. Romeuf I, Tessier D, Dardevet M, Branlard G, Charmet G, Ravel C. wDBTF: an integrated database resource for studying wheat transcription factor families. BMC Genomics. 2010;11:185.
13. Guo A, He K, Liu D, Bai S, Gu X, Wei L, Luo J. DATF: a database of Arabidopsis transcription factors. Bioinformatics. 2005;21:2568–2569.
14. Gao G, Zhong Y, Guo A, Zhu Q, Tang W, Zheng W, Gu X, Wei L, Luo J. DRTF: a database of rice transcription factors. Bioinformatics. 2006;22:1286–1287.
15. Zhu QH, Guo AY, Gao G, Zhong YF, Xu M, Huang M, Luo J. DPTF: a database of poplar transcription factors. Bioinformatics. 2007;23:1307–1308.
16. Guo AY, Chen X, Gao G, Zhang H, Zhu QH, Liu XC, Zhong YF, Gu X, He K, Luo J. PlantTFDB: a comprehensive plant transcription factor database. Nucleic Acids Res. 2008;36:D966–D969.
17. Zhang H, Jin J, Tang L, Zhao Y, Gu X, Gao G, Luo J. PlantTFDB 2.0: update and improvement of the comprehensive plant transcription factor database. Nucleic Acids Res. 2011;39:D1114–D1117.
18. Acland A, Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, Bolton E, Bryant SH, Canese K, Church DM. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2013;41:D8–D20.
19. Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ, Lushbough C, Brendel V. PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 2008;36:D959–D965.
20. Reinhold H, Soyk S, Šimková K, Hostettler C, Marafino J, Mainiero S, Vaughan CK, Monroe JD, Zeeman SC. β-Amylase–Like Proteins Function as Transcription Factors in Arabidopsis, Controlling Shoot Growth and Development. Plant Cell. 2011;23:1391–1403.
21. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–D301.
22. Eddy S. HMMER User's Guide: Biological sequence analysis using profile hidden Markov models. 2010 (http://hmmer.org) (18 October 2013, date last accessed).
23. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012;40:D306–D312.
24. Cokol M, Nair R, Rost B. Finding nuclear localization signals. EMBO Rep. 2000;1:411–415.
25. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000;302:205–217.
26. Price MN, Dehal PS, Arkin AP. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:e9490.
27. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574.
28. Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K. Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 2012;158:590–600.
29. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584.
30. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189.
31. Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin Y-C, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A. The Norway spruce genome sequence and conifer genome evolution. Nature. 2013;497:579–584.
32. Lang D, Weiche B, Timmerhaus G, Richardt S, Riaño-Pachón DM, Corrêa LG, Reski R, Mueller-Roeber B, Rensing SA. Genome-wide phylogenetic comparative analysis of plant transcriptional regulation: a timeline of loss, gain, expansion, and correlation with complexity. Genome Biol. Evol. 2010;2:488–503.
33. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2011;39:D52–D57.
34. Consortium U. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013;41:D43–D47.
35. Baxter L, Jironkin A, Hickman R, Moore J, Barrington C, Krusche P, Dyer NP, Buchanan-Wollaston V, Tiskin A, Beynon J. Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants. Plant Cell. 2012;24:3949–3965.
36. Haudry A, Platts AE, Vello E, Hoen DR, Leclercq M, Williamson RJ, Forczek E, Joly-Lopez Z, Steffen JG, Hazzouri KM. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat. Genet. 2013;45:891–898.
37. Hupalo D, Kern AD. Conservation and functional element discovery in 20 angiosperm plant genomes. Mol. Biol. Evol. 2013;30:1729–1744.
38. Shulaev V, Sargent DJ, Crowhurst RN, Mockler TC, Folkerts O, Delcher AL, Jaiswal P, Mockaitis K, Liston A, Mane SP, et al. The genome of woodland strawberry (Fragaria vesca). Nat. Genet. 2011;43:109–116.
39. Jia J, Zhao S, Kong X, Li Y, Zhao G, He W, Appels R, Pfeifer M, Tao Y, Zhang X, et al. Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature. 2013;496:91–95.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
