BioJava: an open-source framework for bioinformatics in 2012.

Bioinformatics. 2012 Oct 15; 28(20): 2693–2695.

Published online 2012 Aug 9. https://doi.org/10.1093/bioinformatics/bts494

PMCID: PMC3467744

PMID: 22877863

Andreas Prlić,¹,* Andrew Yates,² Spencer E. Bliven,³ Peter W. Rose,¹ Julius Jacobsen,² Peter V. Troshin,⁴ Mark Chapman,⁵ Jianjiong Gao,⁶ Chuan Hock Koh,⁷ Sylvain Foisy,⁸ Richard Holland,⁹ Gediminas Rimša,¹⁰ Michael L. Heuer,¹¹ H. Brandstätter–Müller,¹² Philip E. Bourne,¹³ and Scooter Willis¹⁴

¹San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA, ²European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, ³Bioinformatics Program, University of California San Diego, La Jolla, CA 92093, USA, ⁴College of Life Sciences, University of Dundee, Dundee DD1 5EH, UK, ⁵Department of Computer Science, University of Wisconsin-Madison, WI 53706, USA, ⁶Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA, ⁷NUS Graduate School for Integrative Sciences and Engineering, Singapore 117597, Singapore, ⁸Genetics and Genomics Medicine of Inflammation, Montreal Heart Institute, Montreal, Quebec, Canada H1T 1C8, ⁹Eagle Genomics Ltd, Babraham Research Campus, Cambridge CB22 3AT, UK, ¹⁰Faculty of Mathematics and Informatics, Vilnius University, LT-03225 Vilnius, Lithuania, ¹¹Harbinger Partners, Inc., St Paul, MN 55127, USA, ¹²University of Applied Sciences Upper Austria, 4232 Hagenberg, Austria, ¹³Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA 92093, USA and ¹⁴Genomics Core, Scripps Florida, Jupiter, FL 33458, USA

Abstract

Motivation: BioJava is an open-source project for processing of biological data in the Java programming language. We have recently released a new version (3.0.5), which is a major update to the code base that greatly extends its functionality.

Results: BioJava now consists of several independent modules that provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detection of protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats using a biologically meaningful data model.

Availability: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.6 or higher. All inquiries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists

Contact: andreas.prlic@gmail.com

1 INTRODUCTION

BioJava is an established open-source project driven by an active developer community (Holland et al., 2008). It provides a framework for processing commonly used biological data and has seen contributions from more than 60 developers in the 12 years since its creation. The supported data range in scope from DNA and protein sequence information up to the level of 3D protein structures. BioJava provides various file parsers, data models and algorithms that facilitate working with the standard data formats and enable rapid application development and analysis.

The project is hosted by the Open Bioinformatics Foundation (OBF, http://www.open-bio.org), which provides the source code repository, bug tracking database and email mailing lists. The OBF also supports projects such as BioPerl (Stajich et al., 2002), BioPython (Cock et al., 2009), BioRuby (Goto et al., 2010), EMBOSS (Rice et al., 2000) and others.

2 METHODS

Over the last 2 years, large parts of the original code base have been rewritten. BioJava 3 is a clear departure from the version 1 series. It now consists of several independent modules built using Maven (http://maven.apache.org). The original code has been moved into a separate biojava-legacy project, which is still available for backwards compatibility. In the following, we describe several of the new modules and highlight some of the new features that are included in the latest version of BioJava.

2.1 Core module

The core module provides classes to model nucleotide and amino acid sequences and their inherent relationships. Emphasis was placed on Java class and method names that are familiar to biologists and that, for computer scientists, give a concrete representation of the steps in going from a gene sequence to a protein sequence.

BioJava 3 leverages recent innovations in Java. A sequence is defined as a generic interface, which allows the framework to provide utilities, such as multiple ways of storing data, that can be applied to any sequence. To improve the framework’s usability for biologists, we also define specific classes for common types of sequences, such as DNA and proteins. One area that highlights this work is the translation engine, which allows the interconversion of DNA, RNA and amino acid sequences. The engine can handle details such as choosing the codon table, converting start codons to a methionine, trimming stop codons, specifying the reading frame and handling ambiguous sequences (‘R’ for purines, for example). The user can manually override the defaults for any of these.
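The kind of work a translation engine performs can be illustrated with a minimal, self-contained sketch. It does not use the BioJava API; the class and method names are hypothetical, the codon table is hard-coded to the standard genetic code, and any codon containing an ambiguity symbol is simply reported as ‘X’. Alternative start codons are not converted, which is a simplification compared with the engine described above.

```java
import java.util.Locale;

// Hypothetical illustration only; not the BioJava translation engine.
class TranslationSketch {
    // Bases in the conventional T, C, A, G order used to index the codon table.
    private static final String BASES = "TCAG";
    // Standard genetic code, one letter per codon; '*' marks a stop codon.
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /**
     * Translates DNA or RNA into a protein string.
     * frame is 1, 2 or 3; trimStop removes one trailing stop codon.
     */
    static String translate(String nucleotides, int frame, boolean trimStop) {
        if (frame < 1 || frame > 3) {
            throw new IllegalArgumentException("frame must be 1, 2 or 3");
        }
        // Treat RNA like DNA by mapping U to T.
        String seq = nucleotides.toUpperCase(Locale.ROOT).replace('U', 'T');
        StringBuilder protein = new StringBuilder();
        for (int i = frame - 1; i + 3 <= seq.length(); i += 3) {
            int first = BASES.indexOf(seq.charAt(i));
            int second = BASES.indexOf(seq.charAt(i + 1));
            int third = BASES.indexOf(seq.charAt(i + 2));
            if (first < 0 || second < 0 || third < 0) {
                protein.append('X'); // ambiguous or unknown base in this codon
            } else {
                protein.append(AMINO_ACIDS.charAt(16 * first + 4 * second + third));
            }
        }
        int last = protein.length() - 1;
        if (trimStop && last >= 0 && protein.charAt(last) == '*') {
            protein.setLength(last);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", 1, true)); // prints MA
        System.out.println(translate("ATGGCCTAA", 2, true)); // prints WP
    }
}
```

Exposing the reading frame and stop-codon trimming as parameters mirrors the idea that every default can be overridden by the caller.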

The storage of sequences is designed to minimize memory usage for large collections using a ‘proxy’ storage concept. Various proxy implementations are provided which can store sequences in memory, fetch sequences on demand from a web service such as UniProt or read sequences from a FASTA file as needed. The latter two approaches save memory by not loading sequence data until it is referenced in the application. This concept can be extended to handle very large genomic datasets, such as NCBI GenBank or a proprietary database.
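The proxy idea can be sketched in a few lines of plain Java. The interface and class names below are hypothetical and are not the BioJava types; the loader stands in for whatever actually fetches the data (a web service call or a read from a FASTA file). The point shown is only that the sequence is not loaded until it is first referenced.

```java
import java.util.function.Supplier;

// Hypothetical illustration of a 'proxy' storage concept; not the BioJava API.
class LazySequenceSketch {
    /** Minimal view of a sequence store. */
    interface SequenceProxy {
        String getSequence();
        int getLength();
    }

    /** Holds the full sequence in memory from the start. */
    static final class InMemoryProxy implements SequenceProxy {
        private final String sequence;
        InMemoryProxy(String sequence) { this.sequence = sequence; }
        public String getSequence() { return sequence; }
        public int getLength() { return sequence.length(); }
    }

    /** Defers loading until the sequence is first requested, then caches it. */
    static final class LazyProxy implements SequenceProxy {
        private final Supplier<String> loader;
        private String cached;
        private int loadCount = 0;

        LazyProxy(Supplier<String> loader) { this.loader = loader; }

        public synchronized String getSequence() {
            if (cached == null) {
                cached = loader.get(); // e.g. a file read or a remote fetch
                loadCount++;
            }
            return cached;
        }
        public int getLength() { return getSequence().length(); }
        synchronized int getLoadCount() { return loadCount; }
    }

    public static void main(String[] args) {
        LazyProxy proxy = new LazyProxy(() -> "ACGTACGT");
        System.out.println(proxy.getLoadCount()); // nothing loaded yet
        System.out.println(proxy.getLength());    // triggers the load
        System.out.println(proxy.getLoadCount()); // loaded exactly once
    }
}
```

Because both implementations share one interface, code that consumes sequences does not need to know whether the data are already in memory.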

2.2 Protein structure modules

The protein structure modules provide tools for representing and manipulating 3D biomolecular structures, with a particular focus on protein structure comparison. They contain Java ports of the FATCAT algorithm (Ye and Godzik, 2003) for flexible and rigid body alignment, a version of the standard Combinatorial Extension (CE) algorithm (Shindyalov and Bourne, 1998) as well as a new version of CE that can detect circular permutations in proteins (Bliven and Prlić, 2012). These algorithms are used to provide the RCSB Protein Data Bank (PDB) (Rose et al., 2011) Protein Comparison Tool as well as systematic comparisons of all proteins in the PDB on a weekly basis (Prlić et al., 2010).

Parsers for PDB and mmCIF file formats (Bernstein et al., 1977; Fitzgerald et al., 2006) allow the loading of structure data into a reusable data model. Notably, this feature is used by the SIFTS project to map between UniProt sequences and PDB structures (Velankar et al., 2005). Information from the RCSB PDB can be dynamically fetched without the need to manually download data. For visualization, an interface to the 3D viewer Jmol (Hanson, 2010; http://www.jmol.org/) is provided. Work is underway for better interaction with the RCSB PDB viewers (Moreland et al., 2005).

2.3 Genome and sequencing modules

The genome module focuses on creating gene sequence objects from the core module by parsing GTF files generated by GeneMark (Besemer and Borodovsky, 2005), GFF2 files generated by GeneID (Blanco and Abril, 2009) and GFF3 files generated by Glimmer (Kelley et al., 2011). The gene sequences can then be written out in GFF3 format for import into GMOD (Stein et al., 2002). A separate sequencing module provides memory-efficient, low-level, streaming I/O support for several common variants of the FASTQ file format from next-generation sequencers (Cock et al., 2010).
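Streaming FASTQ input can be sketched as follows. This is not the BioJava sequencing module; the names are hypothetical, records are assumed to occupy exactly four lines each (no line wrapping), and qualities are assumed to use the Sanger variant with an ASCII offset of 33. Each record is handed to a callback as soon as it is read, so memory use does not grow with file size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical illustration of streaming FASTQ parsing; not the BioJava API.
class FastqStreamSketch {
    static final class FastqRecord {
        final String id;
        final String sequence;
        final String quality;
        FastqRecord(String id, String sequence, String quality) {
            this.id = id;
            this.sequence = sequence;
            this.quality = quality;
        }
    }

    /** Reads four-line records one at a time and returns how many were seen. */
    static int stream(Reader in, Consumer<FastqRecord> handler) {
        try (BufferedReader reader = new BufferedReader(in)) {
            int count = 0;
            String header;
            while ((header = reader.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // tolerate blank lines between records
                }
                String sequence = reader.readLine();
                String separator = reader.readLine();
                String quality = reader.readLine();
                boolean valid = header.startsWith("@") && sequence != null
                    && separator != null && separator.startsWith("+")
                    && quality != null && quality.length() == sequence.length();
                if (!valid) {
                    throw new IllegalArgumentException("Malformed FASTQ record near: " + header);
                }
                handler.accept(new FastqRecord(header.substring(1), sequence, quality));
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Mean Phred score, assuming the Sanger encoding (ASCII offset 33). */
    static double meanPhred(String quality) {
        if (quality.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (int i = 0; i < quality.length(); i++) {
            sum += quality.charAt(i) - 33;
        }
        return (double) sum / quality.length();
    }

    public static void main(String[] args) {
        String data = "@read1\nACGT\n+\nIIII\n@read2\nGG\n+read2\n!!\n";
        stream(new StringReader(data),
               r -> System.out.println(r.id + " " + meanPhred(r.quality)));
    }
}
```

Other FASTQ variants use different quality offsets or scales, so the offset would need to become a parameter to support them.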

2.4 Alignment module

The alignment module supplies standard algorithms for sequence alignment and establishes a foundation to perform progressive multiple sequence alignments. For pairwise alignments, an implementation of the Needleman–Wunsch algorithm computes the optimal global alignment (Needleman and Wunsch, 1970) and the Smith–Waterman algorithm calculates local alignments (Smith and Waterman, 1981). In addition to these standard pairwise algorithms, the module includes the Guan–Uberbacher algorithm to perform global sequence alignment efficiently using only linear memory (Guan and Uberbacher, 1996). This routine also lets the user specify predefined anchors that will be included in the resulting alignment. Any of the pairwise routines can also be used to perform progressive multiple sequence alignment. Both pairwise and multiple sequence alignments can be written in standard alignment formats for further processing or visualization.
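For readers unfamiliar with global alignment, the following is a minimal sketch of the Needleman–Wunsch dynamic programme. It is not the BioJava implementation: the class name is hypothetical, scoring uses a single match score, a single mismatch score and a linear gap penalty rather than a substitution matrix with affine gaps, and the full score matrix is kept in memory (quadratic, unlike the linear-memory Guan–Uberbacher routine).

```java
// Hypothetical illustration of Needleman-Wunsch; not the BioJava API.
class NeedlemanWunschSketch {
    static final class Result {
        final String alignedA;
        final String alignedB;
        final int score;
        Result(String alignedA, String alignedB, int score) {
            this.alignedA = alignedA;
            this.alignedB = alignedB;
            this.score = score;
        }
    }

    /** Global alignment with simple match/mismatch scores and a linear gap penalty. */
    static Result align(String a, String b, int match, int mismatch, int gap) {
        int n = a.length();
        int m = b.length();
        int[][] score = new int[n + 1][m + 1];
        // Aligning a prefix against the empty string costs one gap per residue.
        for (int i = 1; i <= n; i++) score[i][0] = i * gap;
        for (int j = 1; j <= m; j++) score[0][j] = j * gap;
        // Fill the matrix: each cell is the best of substitution or a gap in either row.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = score[i - 1][j - 1] + sub;
                int up = score[i - 1][j] + gap;
                int left = score[i][j - 1] + gap;
                score[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        // Trace back from the bottom-right corner to recover one optimal alignment.
        StringBuilder rowA = new StringBuilder();
        StringBuilder rowB = new StringBuilder();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                if (score[i][j] == score[i - 1][j - 1] + sub) {
                    rowA.append(a.charAt(--i));
                    rowB.append(b.charAt(--j));
                    continue;
                }
            }
            if (i > 0 && score[i][j] == score[i - 1][j] + gap) {
                rowA.append(a.charAt(--i));
                rowB.append('-');
            } else {
                rowA.append('-');
                rowB.append(b.charAt(--j));
            }
        }
        return new Result(rowA.reverse().toString(), rowB.reverse().toString(), score[n][m]);
    }

    public static void main(String[] args) {
        Result r = align("ACGT", "AGT", 1, -1, -2);
        System.out.println(r.alignedA);
        System.out.println(r.alignedB);
        System.out.println("score = " + r.score);
    }
}
```

Smith–Waterman differs mainly in clamping every cell at zero and starting the traceback from the highest-scoring cell, which is why the two algorithms are often implemented side by side.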

2.5 ModFinder module

The ModFinder module provides new methods to identify and classify protein modifications in protein 3D structures. More than 400 different types of protein modifications (phosphorylation, glycosylation, disulfide bonds, metal chelation, etc.) were collected and curated based on annotations in PSI-MOD (Montecchi-Palazzi et al., 2008), RESID (Garavelli, 2004) and RCSB PDB (Berman et al., 2000). The module provides an API for detecting protein modifications within protein structures. Figure 1 shows a web-based interface for displaying modifications, which was created using the ModFinder module. Future developments are planned to include additional protein modifications by integrating other resources such as UniProt (Farriol-Mathis et al., 2004).

Fig. 1. An example application using the ModFinder module and the protein structure module. Protein modifications are mapped onto the sequence and structure of ferredoxin I (PDB ID 1GAO; Chen et al., 2002). Two possible iron–sulfur clusters are shown on the protein sequence (3Fe–4S (F3S): orange triangles/lines; 4Fe–4S (SF4): purple diamonds/lines). The 4Fe–4S cluster is displayed in the Jmol structure window above the sequence display.
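One simple way to find a structural modification such as a disulfide bond is a distance test between atoms. The sketch below is an assumption about how such a check could look and does not describe how ModFinder works internally. The class names are hypothetical, the input is a list of cysteine sulfur (SG) atom coordinates in ångströms, and the cutoff is a free parameter chosen by the caller (a typical S–S bond length is roughly 2.05 Å, so a cutoff somewhat above that is assumed in the example).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical distance-based check; not the ModFinder API or its algorithm.
class DisulfideSketch {
    static final class SulfurAtom {
        final String residueId;
        final double x;
        final double y;
        final double z;
        SulfurAtom(String residueId, double x, double y, double z) {
            this.residueId = residueId;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    static double distance(SulfurAtom a, SulfurAtom b) {
        double dx = a.x - b.x;
        double dy = a.y - b.y;
        double dz = a.z - b.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Returns "idA-idB" for every pair of sulfur atoms within the cutoff. */
    static List<String> findCandidateBonds(List<SulfurAtom> atoms, double cutoff) {
        List<String> bonds = new ArrayList<>();
        for (int i = 0; i < atoms.size(); i++) {
            for (int j = i + 1; j < atoms.size(); j++) {
                if (distance(atoms.get(i), atoms.get(j)) <= cutoff) {
                    bonds.add(atoms.get(i).residueId + "-" + atoms.get(j).residueId);
                }
            }
        }
        return bonds;
    }

    public static void main(String[] args) {
        // Made-up coordinates, for illustration only.
        List<SulfurAtom> atoms = Arrays.asList(
            new SulfurAtom("C10", 0.0, 0.0, 0.0),
            new SulfurAtom("C25", 2.05, 0.0, 0.0),
            new SulfurAtom("C40", 10.0, 0.0, 0.0));
        System.out.println(findCandidateBonds(atoms, 2.5));
    }
}
```

A curated catalogue of modification types, as described above, is what turns such raw geometric hits into classified modifications.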

2.6 Amino acid properties module

The goal of the amino acid properties module is to provide a range of accurate physicochemical properties for proteins. The following peptide properties can currently be calculated: molecular weight, extinction coefficient, instability index, aliphatic index, grand average of hydropathy, isoelectric point and amino acid composition.
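Two of the listed properties are simple enough to sketch directly. The code below is not the BioJava module and its names are hypothetical. It assumes the Kyte–Doolittle hydropathy scale for the grand average of hydropathy (the mean hydropathy over all residues) and the commonly used aliphatic index formula, X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percentage of the residue. Only the 20 standard amino acids are accepted.

```java
// Hypothetical illustration of two peptide properties; not the BioJava API.
class PeptidePropertiesSketch {
    private static final String RESIDUES = "ARNDCQEGHILKMFPSTWYV";
    // Kyte-Doolittle hydropathy values, in the same order as RESIDUES.
    private static final double[] HYDROPATHY = {
        1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
        3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2
    };

    /** Grand average of hydropathy: mean hydropathy value over all residues. */
    static double gravy(String peptide) {
        requireNonEmpty(peptide);
        double sum = 0.0;
        for (int i = 0; i < peptide.length(); i++) {
            int index = RESIDUES.indexOf(peptide.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException("Unsupported residue: " + peptide.charAt(i));
            }
            sum += HYDROPATHY[index];
        }
        return sum / peptide.length();
    }

    /** Aliphatic index from the mole percentages of Ala, Val, Ile and Leu. */
    static double aliphaticIndex(String peptide) {
        requireNonEmpty(peptide);
        double ala = molePercent(peptide, 'A');
        double val = molePercent(peptide, 'V');
        double ile = molePercent(peptide, 'I');
        double leu = molePercent(peptide, 'L');
        return ala + 2.9 * val + 3.9 * (ile + leu);
    }

    private static double molePercent(String peptide, char residue) {
        int count = 0;
        for (int i = 0; i < peptide.length(); i++) {
            if (peptide.charAt(i) == residue) {
                count++;
            }
        }
        return 100.0 * count / peptide.length();
    }

    private static void requireNonEmpty(String peptide) {
        if (peptide.isEmpty()) {
            throw new IllegalArgumentException("Peptide must not be empty");
        }
    }

    public static void main(String[] args) {
        System.out.println(gravy("AVIL"));
        System.out.println(aliphaticIndex("AVIL"));
    }
}
```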

To aid proteomic studies, the module includes precise molecular weights for common isotopically labeled or post-translationally modified amino acids. Additional types of PTMs can be defined using simple XML configuration files. This flexibility is especially valuable in situations where the exact mass of the peptide is important, such as mass spectrometry experiments.

2.7 Protein disorder module

BioJava now includes a port of the Regional Order Neural Network (RONN) predictor (Yang et al., 2005) for predicting disordered regions of proteins. BioJava’s implementation supports multiple threads, making it approximately 3.2 times faster than the original C implementation on a modern quad-core machine.

The protein disorder module is distributed both as part of the BioJava library and as a standalone command line executable. The executable is optimized for use in automated analysis pipelines to predict disorder in multiple proteins. It can produce output optimized for either human readers or machine parsing.
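The multi-threaded, multi-protein usage pattern can be sketched with the standard Java concurrency utilities. This is not the RONN port and makes no claim about how BioJava schedules its work: the class name is hypothetical, and the predictor is passed in as a function so that the example can run with a trivial placeholder in place of the neural network.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration of running a per-protein predictor in parallel.
class ParallelPredictionSketch {
    /**
     * Applies the predictor to every sequence using a fixed-size thread pool.
     * Results are returned in the iteration order of the input map.
     */
    static Map<String, double[]> predictAll(Map<String, String> proteins,
                                            Function<String, double[]> predictor,
                                            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per protein; the pool runs them concurrently.
            Map<String, Future<double[]>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : proteins.entrySet()) {
                final String sequence = entry.getValue();
                Callable<double[]> task = () -> predictor.apply(sequence);
                pending.put(entry.getKey(), pool.submit(task));
            }
            // Collect the results, waiting for each task to finish.
            Map<String, double[]> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<double[]>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while predicting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Prediction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> proteins = new LinkedHashMap<>();
        proteins.put("protein1", "MKVLAA");
        proteins.put("protein2", "GSGSGS");
        // Placeholder predictor: one zero score per residue, standing in for a real model.
        Function<String, double[]> placeholder = seq -> new double[seq.length()];
        Map<String, double[]> scores = predictAll(proteins, placeholder, 4);
        for (Map.Entry<String, double[]> entry : scores.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().length + " scores");
        }
    }
}
```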

2.8 Web service access module

More and more bioinformatics tools are becoming accessible through the web. As such, BioJava now contains a web services module that allows bioinformatics services to be accessed using REST protocols. Currently, two services are implemented: NCBI Blast through the Blast URLAPI (previously known as QBlast) and the HMMER web service at hmmer.janelia.org (Finn et al., 2011).
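A REST-style service of this kind is driven by plain HTTP requests with query parameters. The sketch below only builds request URLs and performs no network access. It is not the BioJava web services module; the class name is hypothetical, and the endpoint address and the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) are assumptions based on the public NCBI BLAST URL API that should be checked against the current NCBI documentation before use.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration of building BLAST URL API requests; no network calls are made.
class BlastUrlSketch {
    // Assumed endpoint; verify against the current NCBI documentation.
    static final String ENDPOINT = "https://blast.ncbi.nlm.nih.gov/Blast.cgi";

    /** URL that submits a search; the service replies with a request ID (RID). */
    static String submitUrl(String program, String database, String query) {
        return ENDPOINT + "?CMD=Put"
            + "&PROGRAM=" + encode(program)
            + "&DATABASE=" + encode(database)
            + "&QUERY=" + encode(query);
    }

    /** URL that polls for, or retrieves, the results of an earlier submission. */
    static String resultUrl(String requestId) {
        return ENDPOINT + "?CMD=Get&FORMAT_TYPE=XML&RID=" + encode(requestId);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(submitUrl("blastp", "nr", "MKVLAAGIVGLLLAQ"));
        System.out.println(resultUrl("EXAMPLE-RID")); // placeholder request ID
    }
}
```

The submit-then-poll split reflects the asynchronous nature of such services: a search is queued first and its results are fetched later using the returned identifier.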

3 CONCLUSION

The BioJava 3 library provides a powerful API for analyzing DNA, RNA and proteins. It contains state-of-the-art algorithms to perform various calculations and provides a flexible framework for rapid application development in bioinformatics. The library also provides lightweight interfaces to other projects that specialize in visualization tools. The transition to Maven made managing external dependencies much easier, allowing the use of external libraries without overly complicating the installation procedure for users.

The BioJava project site provides an online cookbook which demonstrates the use of all modules through short recipes of common tasks. We are looking forward to extending the BioJava 3 library with more functionality over the coming years and welcome contributions of novel components by the community.

ACKNOWLEDGEMENTS

The authors thank everybody who contributed code, documentation or ideas, in particular A. Al-Hossary, R. Thornton, J. Warren, A. Draeger, G. Waldon and G. Barton. Each contribution is appreciated, although the total list of contributors is too long to be reproduced here. They also thank the Open Bioinformatics Foundation for project hosting.

Funding: The RCSB PDB (NSF DBI 0829586 to A.P., P.W.R. and P.E.B.); Google Summer of Code in 2010 and 2011 (to J.G., M.C. and C.H.K.) and Scottish Universities Life Sciences Alliance (SULSA) (to P.T.).

Conflict of Interest: None declared.

REFERENCES

Berman H.M., et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
Bernstein F.C., et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 1977;112:535–542.
Besemer J., Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 2005;33(Web Server issue):W451–W454.
Blanco E., Abril J.F. Computational gene annotation in new genome assemblies using GeneID. Methods Mol. Biol. 2009;537:243–261.
Bliven S., Prlić A. Circular permutation in proteins. PLoS Comput. Biol. 2012;8:e1002445.
Chen K., et al. Azotobacter vinelandii ferredoxin I: a sequence and structure comparison approach to alteration of [4Fe-4S]2+/+ reduction potential. J. Biol. Chem. 2002;277:5603–5610.
Cock P.J.A., et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423.
Cock P.J.A., et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38:1767–1771.
Farriol-Mathis N., et al. Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics. 2004;4:1537–1550.
Finn R.D., et al. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39(Web Server issue):W29–W37.
Fitzgerald P.M.D., et al. Macromolecular dictionary (mmCIF). In: Hall S.R., McMahon B., editors. 2006. Online. Vol. G. Springer, pp. 295–443. http://www.springerlink.com/content/v4r60un4575p5234/
Garavelli J.S. The RESID Database of Protein Modifications as a resource and annotation tool. Proteomics. 2004;4:1527–1533.
Goto N., et al. BioRuby: bioinformatics software for the Ruby programming language. Bioinformatics. 2010;26:2617–2619.
Guan X., Uberbacher E.C. Alignments of DNA and protein sequences containing frameshift errors. Comput. Appl. Biosci. 1996;12:31–40.
Hanson R.M. Jmol a paradigm shift in crystallographic visualization. J. Appl. Crystallogr. 2010;43:1250–1260.
Holland R.C.G., et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097.
Kelley D.R., et al. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 2011;40:1–12.
Montecchi-Palazzi L., et al. The PSI-MOD community standard for representation of protein modification data. 2008. http://www.ncbi.nlm.nih.gov/pubmed/18688235
Moreland J.L., et al. The Molecular Biology Toolkit (MBT): a modular platform for developing molecular visualization applications. BMC Bioinformatics. 2005;6:21.
Needleman S.B., Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 1970;48:443–453.
Prlić A., et al. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics. 2010;26:2983–2985.
Rice P., et al. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16:276–277.
Rose P.W., et al. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 2011;39(Database issue):D392–D401.
Shindyalov I.N., Bourne P.E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998;11:739–747.
Smith T.F., Waterman M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197.
Stajich J.E., et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618.
Stein L.D., et al. The Generic Genome Browser: a building block for a model organism system database. Genome Res. 2002;12:1599–1610.
Velankar S., et al. E-MSD: an integrated data resource for bioinformatics. Nucleic Acids Res. 2005;33(Database issue):D262–D265.
Yang Z.R., et al. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics. 2005;21:3369–3376.
Ye Y., Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics. 2003;19(Suppl. 2):II246–II255.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
