Abstract 


Background

Mass spectrometry (MS) coupled with online separation methods is commonly applied for differential and quantitative profiling of biological samples in metabolomic as well as proteomic research. Such approaches are used for systems biology, functional genomics, and biomarker discovery, among others. An ongoing challenge of these molecular profiling approaches, however, is the development of better data processing methods. Here we introduce a new generation of a popular open-source data processing toolbox, MZmine 2.

Results

A key concept of the MZmine 2 software design is the strict separation of core functionality and data processing modules, with emphasis on easy usability and support for high-resolution spectra processing. Data processing modules take advantage of embedded visualization tools, allowing for immediate previews of parameter settings. Newly introduced functionality includes the identification of peaks using online databases, MSn data support, improved isotope pattern support, scatter plot visualization, and a new method for peak list alignment based on the random sample consensus (RANSAC) algorithm. The performance of the RANSAC alignment was evaluated using synthetic datasets as well as actual experimental data, and the results were compared to those obtained using other alignment algorithms.

Conclusions

MZmine 2 is freely available under a GNU GPL license and can be obtained from the project website at: http://mzmine.sourceforge.net/. The current version of MZmine 2 is suitable for processing large batches of data and has been applied to both targeted and non-targeted metabolomic analyses.

BMC Bioinformatics. 2010; 11: 395.
Published online 2010 Jul 23. https://doi.org/10.1186/1471-2105-11-395
PMCID: PMC2918584
PMID: 20650010

MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data

Background

Mass spectrometry (MS) coupled with online separation methods, such as liquid chromatography (LC), is commonly applied for differential and quantitative profiling of biological samples in metabolomic and proteomic research. Such approaches are useful in the domains of systems biology, functional genomics, and biomarker discovery. One of the ongoing challenges of such molecular profiling approaches is the development of better data processing methods. Several software packages have been developed for this purpose, and have been extensively reviewed by Katajamaa and Orešič [1].

The recent introduction of mzML, an open and universal format for MS data [2], represents an important milestone in the effort to address the issues of MS data exchange and standardization. It also underlines the need for a flexible and universal software framework to provide the necessary support for data import, export, and visualization, thus allowing the rapid development of specialized data-processing methods.

MZmine was first introduced in 2005 as an open-source software toolbox for LC-MS data processing [3]. The first version of MZmine defined the data analysis workflow and implemented simple methods for data processing and visualization [3,4]. The software has been applied to numerous metabolomic analyses [5-10] and comparative studies with other related software packages have been performed [9,11]. A weakness of MZmine was insufficient modularity in its initial design, thus limiting the possibility of expanding the software with new methods developed by the scientific community. For this reason, the new release, MZmine 2, was completely redesigned to support modularity. Here we describe the architecture of MZmine 2 as well as its basic features. We also introduce a new and efficient method for peak list alignment that was implemented in MZmine 2.

Implementation

MZmine 2 was developed using Java technology, and is therefore platform independent. The software has been tested on the Windows, Mac OS X, and Linux platforms. We focused on three main aims during the software design and implementation.

First, the framework should be flexible and allow for easy and straightforward development of new data processing modules. We addressed this by keeping a strict separation between the application core and individual modules for data processing and visualization (Figure 1). A compact data model was designed and the code of each Java class was kept short and intuitive. To support the development of new modules, we provided an online tutorial available at the project web site.

Figure 1

MZmine 2 software architecture and its main modules

Second, the graphical interface of the application should be intuitive and easy to use. For this purpose, critical data processing methods such as peak picking were linked to embedded visualization modules, providing online previews during parameter setup. Additionally, the use of any data-processing method in MZmine 2 does not remove the original (unprocessed) data, giving the user the option to return to previous results or raw data at any stage of data processing.

The third goal was to provide good support for processing high-resolution MS data, e.g., as obtained from Orbitrap or Fourier transform ion cyclotron resonance MS instruments. We designed the data import and peak detection modules to maintain the precision of the imported data without any degradation due to inadequate resampling. Because the use of high-resolution data suggests an increased data volume, MZmine 2 was tested and optimized with large datasets (on the order of gigabytes).

The flexibility of the Java environment allows MZmine 2 to take advantage of several open-source libraries, including JFreeChart (http://www.jfree.org/jfreechart/) for the TIC, spectra, 2D and other visualizers, VisAD (http://www.ssec.wisc.edu/~billh/visad.html) for the 3D visualizer, Chemistry Development Kit (CDK) [12] for calculating isotopic distributions, JChemPaint (http://jchempaint.sourceforge.net/) for rendering 2D molecular structures, and Jmol (http://jmol.sourceforge.net/) for rendering 3D molecular structures. These libraries are included in the MZmine 2 distribution.

Results

The typical MS data processing workflow comprises raw data file import, filtering/smoothing (optional), peak picking, peak list deisotoping, alignment, gap filling, and normalization [4]. The MZmine 2 modules cover all these workflow stages and also include additional functionality for the visualization and interpretation of the results. Only features new to MZmine 2 are described in this section.

Project management

One of the new core features of MZmine 2 is project management, which allows the user to track and store intermediate results. Each data-processing step can be performed multiple times with different parameters and the results can be observed and compared. The data processing pipeline settings (e.g., algorithms and parameters used, reference peak lists) can be stored for future applications. Direct export of the peak list data to comma-separated values (CSV) or XML files is also possible.

Raw data file format support

MZmine 2 can read and process both unit mass resolution and accurate mass resolution MS data in both continuous and centroid modes, including fragmentation (MSn) scans. Raw data import is modularized and the currently supported file formats are mzML (1.0 and 1.1), mzXML (2.0, 2.1 and 3.0), mzData (1.04 and 1.05), NetCDF, and the RAW format used natively by Thermo Fisher Scientific instruments (requires installation of Thermo Xcalibur). Support for other file formats can be implemented as additional plug-ins.

Data visualization

MZmine 2 includes several visualization modules (Figure 2), all of which were newly implemented for this release. Following the goal of providing the user with an intuitive interface, the visualizers automatically annotate raw data with the obtained peak picking and identification results, allowing for quick orientation when large amounts of data are being processed.

Figure 2

Screenshot of MZmine 2 showing multiple visualization modules. The specific panels included are: (A) imported samples, (B) peak lists including single peak list contents, (C) peak shapes for an identified metabolite across multiple samples, (D) MS/MS spectrum of a metabolite, (E) combined base peak plot for multiple samples, (F) scatter plot of peak areas across two samples, (G) 2D plot of a detected peak, mass-to-charge ratio vs. retention time, (H) 3D view of a detected peak, and (I) intensity plot for specific peaks across multiple samples.

Quantitative results in the form of peak lists may be observed using a table visualizer or chart-plotting modules (Figure 2I). The scatter plot visualizer (Figure 2F) has proven to be very useful for efficient comparison of multiple samples [13].

Peak detection

Feature detection is a critical step in MS data processing. The peak detection methods and their implementations should be flexible enough to deal with great differences in data obtained from different instruments, such as variable mass resolution, chromatographic resolution and peak shape, or background noise. In MZmine 2, peak detection is performed in several customizable steps (Figure 3). Previews are provided to allow for optimal selection of parameter values.

Figure 3

Peak detection modules with previews. (A) Mass detection (centroiding) module. Recognized m/z peaks are shown in red. In the insets, details of a single m/z peak are shown, indicating the full width at half maximum approach to the m/z value calculation. (B) Fourier transform mass spectrometry shoulder peaks filter. In the preview panel, the main detected peak is indicated with the red line, while shoulder peaks are indicated with the yellow lines. (C) Peak deconvolution. Each individual recognized peak within the chromatogram is indicated by a different color. (D) Experimental peak shape modeler. A Gaussian peak model (pink) is fitted to the deconvoluted chromatographic peak's data points (blue).

In the first step (Figure 3A), each MS spectrum is processed individually and converted to pairs of m/z and intensity values (in other words, each mass spectrum is centroided). Several algorithms are provided as plug-ins, each suitable for a different type of mass spectra. The "Local maxima" algorithm is a simple algorithm suitable for demonstrating the process: it detects each local maximum in the spectrum. The "Recursive threshold" algorithm is based on an earlier method implemented in MZmine [3,4] and adds two additional parameters of minimum and maximum peak m/z width. This method reduces the false positives by avoiding detection of noise peaks. The "Wavelet transform" algorithm is particularly suitable for noisy data. It processes each spectrum using continuous wavelet transform, matching the m/z peaks to the "Mexican hat" wavelet model. This algorithm is based on a previously reported method [14]. The "Exact mass" algorithm assumes high quality spectra (high mass resolution, low noise) and determines the center of each m/z peak using the "full width at half maximum" paradigm: the m/z value is placed in the middle of the line that crosses the peak at half of the maximum intensity (as shown in the insets in Figure 3A). Finally, the "Centroid" algorithm is suitable for already centroided data. It detects all data points above the specified noise level as m/z peaks.
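As an illustration of the two simplest ideas described above, the following Python sketch detects local maxima in a profile spectrum and places a peak center at the midpoint of the half-maximum line. It is not the MZmine 2 implementation (which is written in Java); the function names, the `noise_level` parameter, and the use of linear interpolation between data points are assumptions made for this example.

```python
def local_maxima(mz, intensity, noise_level=0.0):
    """Return (m/z, intensity) pairs for every local maximum above noise_level."""
    peaks = []
    for i in range(1, len(mz) - 1):
        # A point is a local maximum if it rises above its left neighbour
        # and is not exceeded by its right neighbour.
        if intensity[i] > noise_level and intensity[i - 1] < intensity[i] >= intensity[i + 1]:
            peaks.append((mz[i], intensity[i]))
    return peaks


def fwhm_center(mz, intensity, apex):
    """Estimate the peak centre as the midpoint of the line crossing the peak
    at half of its maximum intensity (apex is the index of the peak top)."""
    half = intensity[apex] / 2.0

    def crossing(step):
        i = apex
        # Walk away from the apex while the next point is still above half max.
        while 0 <= i + step < len(mz) and intensity[i + step] > half:
            i += step
        j = i + step
        if not 0 <= j < len(mz):
            return mz[i]  # profile never drops to half max; use the last point
        # Linear interpolation between point i (above half) and j (at or below half).
        frac = (intensity[i] - half) / (intensity[i] - intensity[j])
        return mz[i] + frac * (mz[j] - mz[i])

    return (crossing(-1) + crossing(+1)) / 2.0


# Hypothetical profile data for a single, slightly asymmetric m/z peak.
mz_values = [100.0, 100.1, 100.2, 100.3, 100.4]
intensities = [0.0, 80.0, 100.0, 20.0, 0.0]
apexes = local_maxima(mz_values, intensities, noise_level=10.0)
center = fwhm_center(mz_values, intensities, apex=2)
```

For an asymmetric peak such as the one above, the half-maximum midpoint differs from the m/z of the most intense data point, which is the motivation for using it on high-resolution data.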

Data obtained by Fourier transform mass spectrometry instruments provide very high mass resolution, but suffer from the presence of noise signals known as "shoulder peaks" (Figure 3B). These peaks are residues of the Fourier transform function calculated by the instrument and their intensity is usually below 5% of the intensity of the main (true) m/z peak. To remove these noise peaks, we introduced an optional filtration plug-in that builds a theoretical model (such as Gaussian or Lorentzian) with given mass resolution around each peak, and removes all noise peaks below this model. Peaks are processed in the order of decreasing intensity. In the preview (Figure 3B), the main m/z signal is indicated by the red color, while the shoulder peaks subject to removal are indicated in yellow. Again, it is possible to implement other filtration algorithms as plug-ins.
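The filtering idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MZmine 2 code: it uses a Gaussian model only, defines the model width as FWHM = m/z divided by the resolution, and the function name `filter_shoulder_peaks` is hypothetical.

```python
import math


def filter_shoulder_peaks(peaks, resolution):
    """Remove peaks lying below a Gaussian model built around each stronger peak.

    peaks      -- list of (mz, intensity) tuples from one centroided spectrum
    resolution -- assumed mass resolution, so that FWHM = mz / resolution
    Returns the surviving peaks sorted by m/z.
    """
    # Process peaks in the order of decreasing intensity.
    remaining = sorted(peaks, key=lambda p: p[1], reverse=True)
    kept = []
    while remaining:
        main_mz, main_int = remaining.pop(0)
        kept.append((main_mz, main_int))
        # Convert FWHM to the Gaussian standard deviation.
        sigma = (main_mz / resolution) / (2.0 * math.sqrt(2.0 * math.log(2.0)))
        # Keep only peaks that reach or exceed the model height at their m/z.
        remaining = [
            (mz, inten)
            for mz, inten in remaining
            if inten >= main_int * math.exp(-((mz - main_mz) ** 2) / (2.0 * sigma ** 2))
        ]
    return sorted(kept)


# Hypothetical spectrum: one main peak, one shoulder peak, one distant small peak.
spectrum = [(500.000, 1000.0), (500.002, 30.0), (500.050, 30.0)]
filtered = filter_shoulder_peaks(spectrum, resolution=100000)
```

In this example the small peak 0.002 m/z away from the main signal falls under the model and is dropped, while the equally small peak 0.05 m/z away is retained because the model has decayed to nearly zero there.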

The next step consists of an algorithm that connects consecutive m/z values spanning over multiple scans into chromatogram objects. The default algorithm provided by MZmine 2 connects m/z values in the order of their intensity, with the most intense peaks connected first. A chromatogram spanning a given minimal time range is constructed for each m/z value (within user-defined tolerance). Each chromatogram is then deconvoluted into individual chromatographic peaks (Figure 3C). Several algorithms are provided as plug-ins. The "Baseline cut-off" algorithm recognizes each chromatographic peak that has an intensity above a given minimum level and spans over a given minimum time range. The "Noise amplitude" algorithm adds another parameter specifying the intensity range, which is considered noisy. The algorithm then finds the intensity level where most of the noise is concentrated and sets the baseline level to this intensity, individually for each chromatogram. Following the setting of the baseline, the procedure is the same as the "Baseline cut-off" algorithm. The "Savitzky-Golay" algorithm uses the smoothed second derivative of the chromatogram curve to detect the borders of individual peaks. The "Local minimum search" algorithm attempts to identify local minima in the chromatogram as border points between individual peaks. Several restrictions are placed on possible peak shapes, such as minimum absolute and relative intensities, or a minimum ratio between peak maximum and edge.
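The "Baseline cut-off" rule lends itself to a compact illustration. The Python sketch below is an assumption-based reading of the description above (contiguous runs of points above a baseline that last at least a minimum time); the function name, parameter names, and return format are hypothetical and not taken from MZmine 2.

```python
def baseline_cutoff(rt, intensity, baseline, min_duration):
    """Split one chromatogram into peaks using a fixed baseline.

    A peak is a contiguous run of data points with intensity above `baseline`
    whose retention time span is at least `min_duration`.
    Returns a list of (start_rt, end_rt, apex_rt, apex_intensity) tuples.
    """
    peaks = []
    start = None
    # Iterate one step past the end so that a run touching the last point is closed.
    for i in range(len(rt) + 1):
        above = i < len(rt) and intensity[i] > baseline
        if above and start is None:
            start = i  # a new run begins
        elif not above and start is not None:
            end = i - 1  # the run ended at the previous point
            if rt[end] - rt[start] >= min_duration:
                apex = max(range(start, end + 1), key=lambda k: intensity[k])
                peaks.append((rt[start], rt[end], rt[apex], intensity[apex]))
            start = None
    return peaks


# Hypothetical chromatogram: one real peak and one single-scan spike.
times = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
signal = [0, 5, 20, 50, 20, 5, 0, 30, 0, 0]
detected = baseline_cutoff(times, signal, baseline=10, min_duration=2)
```

Here the single-scan spike at retention time 7 exceeds the baseline but is rejected by the minimum duration requirement, which is the role that parameter plays in suppressing noise.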

We also implemented an experimental module, which fits the (potentially noisy) set of data points of each deconvoluted peak with an ideal peak model such as Gaussian or Exponentially Modified Gaussian (Figure 3D). Such an approach may reduce the chromatographic noise between samples, but the practical applicability of this method has not yet been thoroughly validated.

Peak identification

Assignment of intuitive metabolite or peptide names to detected m/z values greatly assists with the process of data interpretation. In MZmine 2, identification of peaks can be performed either by searching a custom database of m/z values and retention times, or by connecting to an online resource such as PubChem [15], KEGG [16], METLIN [17], or HMDB [18] directly from the MZmine 2 interface (Figure 4). For each ion subjected to identification, its neutral molecular mass ($m_{\text{neutral}}$) is calculated from its m/z value. For that purpose, the charge of the ion (z) can be automatically determined from its isotope pattern. Ionization mode (positive or negative) and ionization adduct (e.g., H+, Na+, K+) are selected by the user as parameters. Neutral mass is then calculated as $m_{\text{neutral}} = (m/z \times z) \pm m_{\text{adduct}}$, where the sign (±) is defined by the ionization mode and $m_{\text{adduct}}$ is the mass of the selected ionization adduct. The neutral mass $m_{\text{neutral}}$ is the primary term for database search, within user-specified tolerance. Isotopic pattern similarity can be used as a second filter to select optimal candidates, by comparing the ratios of the detected isotopes and matching isotopes from the predicted isotopic pattern of the database compound. Because the online identification module is itself modularized, support for other molecular databases can be easily added. For proteomic applications, a module allowing identification of peptide peaks using the MASCOT [19] search engine and MS/MS spectra is under development.
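The neutral mass formula and the tolerance search can be illustrated with a short Python sketch. Several details are assumptions made for the example: the sign convention (the adduct mass is subtracted in positive mode and added in negative mode, as for [M+H]+ and [M-H]- ions), the function names, and the tiny in-memory `database` dictionary that stands in for an online resource.

```python
PROTON_MASS = 1.007276  # mass of the H+ adduct in Da


def neutral_mass(mz, charge, adduct_mass, mode):
    """Compute m_neutral = (m/z * z) -/+ m_adduct.

    Assumed sign convention: subtract the adduct mass in positive mode
    (e.g. [M+H]+) and add it in negative mode (e.g. [M-H]-).
    """
    if mode not in ("positive", "negative"):
        raise ValueError("mode must be 'positive' or 'negative'")
    sign = -1.0 if mode == "positive" else 1.0
    return mz * charge + sign * adduct_mass


def search_by_mass(neutral, database, tolerance):
    """Return names of database compounds within `tolerance` Da of `neutral`."""
    return [name for name, mass in database.items() if abs(mass - neutral) <= tolerance]


# Hypothetical two-compound database of monoisotopic masses (Da).
database = {"glucose": 180.06339, "caffeine": 194.08038}

# A singly charged ion observed at m/z 181.0707 in positive mode with an H+ adduct.
mass = neutral_mass(181.0707, charge=1, adduct_mass=PROTON_MASS, mode="positive")
candidates = search_by_mass(mass, database, tolerance=0.001)
```

In practice the candidate list returned by such a mass search would then be narrowed using the isotope pattern comparison described above.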

Figure 4

Peak identification using the PubChem Compound database. (A) A peak list showing the row selected for identification. (B) Dialog for setting search parameters. (C) Table of candidates obtained from the database within a given mass tolerance. (D) 2D and 3D structural views of the candidate compound.

RANdom SAmple Consensus (RANSAC) aligner

The purpose of peak list alignment is to match relevant peaks across multiple samples. The original MZmine software introduced a simple alignment algorithm that first creates an empty master peak list and then aligns each peak from given peak lists (samples) to the best candidate of the master list using a two-dimensional alignment window (AW) represented by user-specified m/z and retention time tolerances. If no suitable candidate is found, a new row is created in the master list. In MZmine 2, this algorithm is referred to as the "Join aligner". One disadvantage of the Join aligner is the inability to cope with a non-linear deviation of the retention times among samples. For this purpose, we introduced a new peak list alignment method based on the RANSAC algorithm.
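A minimal Python sketch of the Join aligner idea follows. It is an illustration under assumptions rather than the MZmine 2 code: peaks are reduced to (m/z, retention time) pairs, each master row is compared by the averages of its peaks, and the "best candidate" is chosen by a hypothetical score that sums the two deviations relative to their tolerances.

```python
def join_align(samples, mz_tol, rt_tol):
    """Align peak lists into a master list.

    samples -- list of peak lists; each peak is an (mz, rt) tuple
    Returns a list of rows; each row maps a sample index to its aligned peak.
    """
    master = []  # starts empty, as in the description above
    for s_idx, peaks in enumerate(samples):
        for mz, rt in peaks:
            best_row, best_score = None, None
            for row in master:
                if s_idx in row:
                    continue  # one peak per sample in each row
                row_mz = sum(p[0] for p in row.values()) / len(row)
                row_rt = sum(p[1] for p in row.values()) / len(row)
                d_mz, d_rt = abs(mz - row_mz), abs(rt - row_rt)
                # The peak must fall inside the two-dimensional alignment window.
                if d_mz <= mz_tol and d_rt <= rt_tol:
                    score = d_mz / mz_tol + d_rt / rt_tol  # assumed scoring rule
                    if best_score is None or score < best_score:
                        best_row, best_score = row, score
            if best_row is None:
                master.append({s_idx: (mz, rt)})  # no candidate: create a new row
            else:
                best_row[s_idx] = (mz, rt)
    return master


# Two hypothetical samples sharing one peak.
sample_a = [(100.0, 60.0), (200.0, 120.0)]
sample_b = [(100.01, 62.0), (300.0, 200.0)]
rows = join_align([sample_a, sample_b], mz_tol=0.05, rt_tol=5.0)
```

Because the retention time tolerance is a fixed window, a peak whose retention time has drifted by more than `rt_tol` ends up in a new row, which is the limitation that motivates the RANSAC aligner.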

The RANSAC algorithm [20] is a non-deterministic iterative algorithm that estimates parameters of a mathematical model from a set of observed data, which may include outliers. The probability of obtaining a good result increases with the number of iterations. In each iteration, a random subset of observed data points is selected and a model is fit to this data. In our specific case, we used 4 points to find a non-linear model. The remaining data is tested against the fitted model and if a value fits well, it is considered a part of the model. Finally, the model is evaluated and when the iteration is finished, the model with the most data points fitted to it is considered the best.
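The general RANSAC loop can be sketched in Python as follows. For brevity this example fits a straight line from two randomly chosen points, whereas the method described above uses four points and a non-linear model; the function name, the `threshold` parameter, and the seeded random generator are assumptions of the example.

```python
import random


def ransac_line(points, iterations, threshold, rng):
    """Return (slope, intercept, inliers) for the line supported by the most points.

    points     -- list of (x, y) tuples, possibly containing outliers
    iterations -- number of random trials
    threshold  -- maximum vertical distance for a point to count as fitting
    rng        -- random.Random instance (seeded for reproducibility)
    """
    best = (None, None, [])
    for _ in range(iterations):
        # Select a random minimal subset and fit a model to it.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line cannot be expressed as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Test all data against the fitted model.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        # Keep the model with the most data points fitted to it.
        if len(inliers) > len(best[2]):
            best = (slope, intercept, inliers)
    return best


# Hypothetical data: 20 points on y = x + 2 plus 5 gross outliers.
on_line = [(float(x), float(x) + 2.0) for x in range(20)]
outliers = [(3.0, 40.0), (7.0, -10.0), (12.0, 60.0), (15.0, -20.0), (18.0, 90.0)]
slope, intercept, inliers = ransac_line(on_line + outliers, 200, 0.5, random.Random(0))
```

With 80% of the data on the line, roughly 63% of the random pairs consist of two true points, so a few hundred iterations make it practically certain that the correct model is found at least once.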

The RANSAC method of alignment makes use of two user-defined two-dimensional windows, the RANSAC window (RW) and the alignment window (AW). The RW is defined by the m/z threshold $\mathit{rm}_0$ and the retention time threshold $\mathit{rr}_0$; the AW uses the same m/z threshold $\mathit{rm}_0$ but a different retention time threshold $\mathit{ar}_0$. The retention time threshold of the RW should be as large as the maximum observed deviation in the retention time among all peaks. The procedure for aligning a sample $S_j$ with the master list L is as follows:

Step 1: For every row i in L, let

$r_i$ = the average retention time of all individual peaks in the row

$m_i$ = the average m/z of all individual peaks in the row

$RW_i = \{(m, r) \mid m_i - \mathit{rm}_0 \le m \le m_i + \mathit{rm}_0 \text{ and } r_i - \mathit{rr}_0 \le r \le r_i + \mathit{rr}_0\}$, the RANSAC window for row i.

Then, for each row i in L, mark all peaks in sample $S_j$ that lie in $RW_i$ as candidate alignments.

Step 2: Build a scatter plot representation of all candidate alignments, and apply the RANSAC algorithm to build a candidate model for alignment. This model represents a list of matching retention times.

Step 3: Apply the locally-weighted scatterplot smoothing (LOESS) method for regression [21] on all points in the model obtained with RANSAC.

Step 4: Using this regression model, for each row i in L, predict the correction for the retention time shift to locate the new center $(m_i, r'_i)$ of the alignment window $AW_i$. RANSAC alignment can correct the retention time deviation by centering the position of the AW to the correct position in the new sample.

Thus, the alignment window $AW_i = \{(m, r) \mid m_i - \mathit{rm}_0 \le m \le m_i + \mathit{rm}_0 \text{ and } r'_i - \mathit{ar}_0 \le r \le r'_i + \mathit{ar}_0\}$.

Step 5: For each row i in L, apply the Join algorithm for alignment using the alignment window $AW_i$.
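The window logic of Steps 1, 4, and 5 can be sketched in Python as below. This is a simplified illustration, not the MZmine 2 implementation: master rows and peaks are reduced to (m/z, retention time) pairs, the RANSAC and LOESS stages (Steps 2 and 3) are replaced by a plain piecewise-linear interpolation through already selected model points, and all function names are hypothetical.

```python
def candidate_alignments(master_rows, sample_peaks, rm0, rr0):
    """Step 1: collect (row RT, peak RT) pairs for peaks inside each RANSAC window."""
    pairs = []
    for m_i, r_i in master_rows:
        for m, r in sample_peaks:
            if abs(m - m_i) <= rm0 and abs(r - r_i) <= rr0:
                pairs.append((r_i, r))
    return pairs


def predict_rt(model_points, r_i):
    """Stand-in for the LOESS regression of Step 3: piecewise-linear interpolation
    through the model points, with a constant shift outside their range."""
    pts = sorted(model_points)
    if r_i <= pts[0][0]:
        return pts[0][1] + (r_i - pts[0][0])
    if r_i >= pts[-1][0]:
        return pts[-1][1] + (r_i - pts[-1][0])
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= r_i <= x1:
            if x1 == x0:
                return y0  # duplicate x values: avoid division by zero
            return y0 + (y1 - y0) * (r_i - x0) / (x1 - x0)


def in_alignment_window(peak, row, predicted_rt, rm0, ar0):
    """Steps 4 and 5: test a peak against the window centred at (m_i, r'_i)."""
    m, r = peak
    m_i, _ = row
    return abs(m - m_i) <= rm0 and abs(r - predicted_rt) <= ar0


# Hypothetical master list and sample with a growing retention time shift.
master_rows = [(100.0, 60.0), (200.0, 120.0), (300.0, 180.0)]
sample_peaks = [(100.01, 70.0), (200.0, 135.0), (300.02, 200.0), (150.0, 61.0)]
pairs = candidate_alignments(master_rows, sample_peaks, rm0=0.05, rr0=30.0)
shifted_center = predict_rt(pairs, 120.0)
matched = in_alignment_window((200.0, 135.0), (200.0, 120.0), shifted_center, 0.05, 5.0)
```

In this example the peak at 135 s lies 15 s away from its master row, outside a 5 s alignment window, yet it is matched once the window is re-centered on the predicted retention time.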

Figure 5 shows a preview of the RANSAC alignment in MZmine 2. Each dot represents a candidate alignment of two peaks. Red dots represent those candidate alignments that were fitted to the best model (blue line).

Figure 5

RANSAC aligner. Dialog shows preview of RANSAC alignment of two peak lists using the given parameters. Each possible candidate alignment (peak pair) within a defined m/z and retention time tolerance is shown as a dot. A model is fitted to the data (blue line) and red dots indicate those fitting to the model and therefore selected for the final alignment.

RANSAC aligner performance

Two types of errors can be introduced during the alignment process [11]. Either two non-related peaks could be matched, or the matching of two related peaks could be omitted. A variable called "precision" represents the proportion of true alignments out of all alignments found by the algorithm. The proportion of peaks that are correctly aligned by the algorithm out of all true alignments inside the dataset is called "recall". These two variables together represent the quality of the alignment. To test whether the newly introduced RANSAC algorithm performs better than the Join alignment, the results of two different approaches were compared.
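The two quality measures can be written down directly. The Python sketch below assumes that each alignment is represented as a hashable pair of peak identifiers and that a table of true alignments is available; the function name and the identifiers are hypothetical.

```python
def precision_recall(found, truth):
    """Compute alignment precision and recall.

    found -- alignments reported by the algorithm (iterable of peak-id pairs)
    truth -- true alignments present in the data set
    precision = true alignments found / all alignments found
    recall    = true alignments found / all true alignments
    """
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: three reported alignments, four true ones, two in common.
reported = [("a1", "b1"), ("a2", "b2"), ("a3", "b4")]
ground_truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
precision, recall = precision_recall(reported, ground_truth)
```

The wrongly matched pair lowers precision, while the two true alignments that were missed lower recall, corresponding to the two types of error described above.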

First, 12 synthetic datasets were created using samples from 12 different lipidomic studies. A single sample from each study was used as a seed to create a synthetic set of 20 samples. These 20 samples contained identical information (peaks), but a random non-linear deviation in the retention time was introduced into each one. The MZmine 2 projects of all 12 datasets are available on-line (see Dataset download). Each dataset was aligned using the RANSAC aligner and Join aligner with three different retention time tolerance thresholds (50 s, 20 s, and 5 s). Parameters used for alignment are specified in Table 1. Run times of the RANSAC aligner were measured and are reported in Table 2. Precision and recall values were calculated and the average results are shown in Figure 6 (numerical results are available in Additional file 1). Only the use of the RANSAC algorithm achieved 100% in both precision and recall performance on these synthetic data sets.

Table 1

Parameter values used for aligning the 12 synthetic data sets and the real proteomic (P1 and P2) and metabolomic (M1 and M2) data sets using the RANSAC and Join aligners.

| Parameter | 12 synthetic data sets | Proteomics data set P1 | Proteomics data set P2 | Metabolomics data set M1 | Metabolomics data set M2 |
|---|---|---|---|---|---|
| m/z tolerance | 0.05 m/z | 1.5 m/z | 1.5 m/z | 0.03 m/z | 0.025 m/z |
| RT tolerance after correction | 0:25 | 02:30 | 2:30 | 00:50 | 00:30 |
| RT tolerance | 0:50 | 03:30 | 05:00 | 00:30 | 00:30 |
| RANSAC iterations | 5000 | 50000 | 50000 | 15000 | 15000 |
| Minimum number of points | 20% | 2.00% | 0.10% | 20.00% | 20.00% |
| Threshold value | 4 seconds | 4 seconds | 15 seconds | 4 seconds | 4 seconds |
| Non-linear model | yes | yes | no | yes | yes |

Table 2

Run times of the RANSAC aligner for aligning the 12 synthetic data sets and the real proteomic (P1 and P2) and metabolomic (M1 and M2) data sets.

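| Data set | Run 1 (min) | Run 2 (min) | Run 3 (min) | Average (min) |
|---|---|---|---|---|
| Synthetic data set 1 | 0.17 | 0.15 | 0.16 | 0.16 |
| Synthetic data set 2 | 0.32 | 0.31 | 0.31 | 0.31 |
| Synthetic data set 3 | 0.44 | 0.41 | 0.42 | 0.42 |
| Synthetic data set 4 | 0.46 | 0.45 | 0.45 | 0.45 |
| Synthetic data set 5 | 0.62 | 0.67 | 0.74 | 0.68 |
| Synthetic data set 6 | 0.39 | 0.38 | 0.39 | 0.39 |
| Synthetic data set 7 | 0.54 | 0.54 | 0.55 | 0.55 |
| Synthetic data set 8 | 0.25 | 0.26 | 0.25 | 0.25 |
| Synthetic data set 9 | 0.79 | 0.80 | 0.84 | 0.81 |
| Synthetic data set 10 | 0.73 | 0.73 | 0.72 | 0.73 |
| Synthetic data set 11 | 5.24 | 4.11 | 5.17 | 4.84 |
| Synthetic data set 12 | 7.79 | 7.78 | 7.64 | 7.74 |
| M1 | 62.27 | 58.08 | 63.56 | 61.30 |
| M2 | 147.64 | 163.62 | 146.79 | 152.69 |
| P1, fraction 000 | 4.95 | 6.49 | 7.72 | 6.39 |
| P1, fraction 020 | 0.50 | 0.50 | 0.57 | 0.52 |
| P1, fraction 040 | 0.76 | 0.70 | 0.65 | 0.70 |
| P1, fraction 060 | 1.11 | 1.06 | 1.14 | 1.10 |
| P1, fraction 080 | 0.61 | 0.57 | 0.67 | 0.62 |
| P1, fraction 100 | 0.46 | 0.48 | 0.51 | 0.48 |
| P2, fraction 000 | 22.47 | 22.94 | 21.12 | 22.18 |
| P2, fraction 020 | 1.35 | 1.31 | 1.18 | 1.28 |
| P2, fraction 040 | 0.65 | 0.73 | 0.71 | 0.70 |
| P2, fraction 080 | 0.31 | 0.36 | 0.39 | 0.35 |
| P2, fraction 100 | 0.47 | 0.43 | 0.49 | 0.46 |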

Run times were obtained on an AMD Opteron 1.8 GHz dual-core system with 10 GB RAM, running Linux.

Figure 6

Performance comparison of RANSAC aligner and Join aligner for 12 synthetic datasets. For each dataset, peak lists were aligned using the RANSAC aligner and the Join aligner with three different retention time tolerance thresholds (50 s, 20 s, and 5 s). Plot shows the average recall and precision values for all datasets. Error bars indicate standard deviations.

Our second approach for the comparison was to use the real proteomic (P1 and P2) and metabolomic (M1 and M2) datasets introduced by Lange et al. [11], together with their tables of "ground truth" alignments and an evaluation script for calculating the alignment precision and recall values. We applied the MZmine 2 Join and RANSAC aligners to align all the datasets with the parameters specified in Table 1. Run times of the RANSAC aligner are reported in Table 2. Precision and recall values were calculated using the provided evaluation script and compared to already published results in Table 3. We used the latest available evaluation results published at http://msbi.ipb-halle.de/msbi/caap at the time of writing. Compared to the Join aligner, the RANSAC aligner provided better results in 11 of 13 alignments, with worse results obtained in only a single case (P2 dataset fraction 00). We assume that the high number of features in this fraction (over 6800 rows after alignment) made it somewhat difficult for the RANSAC algorithm to build a suitable model. Notably, in all fractions of dataset P1, the RANSAC aligner provided the best results among all the tested algorithms. Complete datasets P1, P2, M1, and M2, as well as all alignment results, are available online (see Dataset download).

Table 3

Performance comparison of MZmine 2 alignment methods (right side of the table) to previously published results (left side of the table) obtained using several different software packages [11].

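The columns msInspect through XCMS are the results published by Lange et al. (2008), available at the time of writing at http://msbi.ipb-halle.de/msbi/caap. The columns Join aligner and RANSAC aligner are the MZmine 2 results.

| Data set | Fraction | Measure | msInspect | MZmine (version 0.6) | OpenMS | SpecArray | XAlign | XCMS without RT correction | XCMS with correction | Join aligner | RANSAC aligner |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proteomics data set P1 | 00 | Recall | 0.52 | 0.81 | 0.86 | 0.61 | 0.82 | 0.72 | 0.62 | 0.80 | 0.86 |
| Proteomics data set P1 | 00 | Precision | 0.38 | 0.81 | 0.86 | 0.61 | 0.82 | 0.54 | 0.58 | 0.80 | 0.86 |
| Proteomics data set P1 | 20 | Recall | 0.56 | 0.90 | 0.92 | 0.62 | 0.85 | 0.88 | 0.81 | 0.90 | 0.93 |
| Proteomics data set P1 | 20 | Precision | 0.45 | 0.90 | 0.92 | 0.62 | 0.85 | 0.84 | 0.80 | 0.90 | 0.93 |
| Proteomics data set P1 | 40 | Recall | 0.63 | 0.90 | 0.94 | 0.75 | 0.87 | 0.92 | 0.81 | 0.87 | 0.94 |
| Proteomics data set P1 | 40 | Precision | 0.48 | 0.90 | 0.94 | 0.75 | 0.87 | 0.85 | 0.80 | 0.87 | 0.94 |
| Proteomics data set P1 | 60 | Recall | 0.73 | 0.84 | 0.96 | 0.71 | 0.87 | 0.91 | 0.78 | 0.89 | 0.97 |
| Proteomics data set P1 | 60 | Precision | 0.54 | 0.84 | 0.96 | 0.71 | 0.87 | 0.80 | 0.75 | 0.89 | 0.97 |
| Proteomics data set P1 | 80 | Recall | 0.70 | 0.94 | 0.96 | 0.74 | 0.90 | 0.94 | 0.89 | 0.94 | 0.97 |
| Proteomics data set P1 | 80 | Precision | 0.57 | 0.94 | 0.96 | 0.74 | 0.90 | 0.88 | 0.88 | 0.94 | 0.97 |
| Proteomics data set P1 | 100 | Recall | 0.82 | 0.94 | 0.94 | 0.77 | 0.96 | 0.95 | 0.96 | 0.95 | 0.96 |
| Proteomics data set P1 | 100 | Precision | 0.56 | 0.94 | 0.94 | 0.77 | 0.96 | 0.89 | 0.96 | 0.95 | 0.96 |
| Proteomics data set P2 | 00 | Recall | 0.23 | 0.62 | 0.77 | 0.07 | 0.65 | 0.70 | 0.58 | 0.63 | 0.56 |
| Proteomics data set P2 | 00 | Precision | 0.07 | 0.49 | 0.65 | 0.05 | 0.49 | 0.31 | 0.44 | 0.53 | 0.49 |
| Proteomics data set P2 | 20 | Recall | 0.67 | 0.87 | 0.92 | 0.57 | 0.84 | 0.89 | 0.86 | 0.81 | 0.93 |
| Proteomics data set P2 | 20 | Precision | 0.24 | 0.71 | 0.77 | 0.42 | 0.70 | 0.55 | 0.66 | 0.69 | 0.78 |
| Proteomics data set P2 | 40 | Recall | 0.44 | 0.79 | 0.76 | 0.60 | 0.71 | 0.72 | 0.72 | 0.74 | 0.78 |
| Proteomics data set P2 | 40 | Precision | 0.26 | 0.76 | 0.74 | 0.41 | 0.69 | 0.56 | 0.69 | 0.73 | 0.77 |
| Proteomics data set P2 | 80 | Recall | 0.73 | 0.60 | 0.80 | 0.65 | 0.58 | 0.64 | 0.49 | 0.61 | 0.61 |
| Proteomics data set P2 | 80 | Precision | 0.34 | 0.56 | 0.70 | 0.44 | 0.56 | 0.50 | 0.45 | 0.58 | 0.61 |
| Proteomics data set P2 | 100 | Recall | 0.82 | 0.80 | 0.90 | 0.63 | 0.85 | 0.95 | 0.85 | 0.85 | 0.88 |
| Proteomics data set P2 | 100 | Precision | 0.39 | 0.64 | 0.75 | 0.44 | 0.69 | 0.65 | 0.69 | 0.71 | 0.75 |
| Metabolomics data set M1 | - | Recall | 0.27 | 0.92 | 0.87 | - | 0.88 | 0.98 | 0.94 | 0.90 | 0.91 |
| Metabolomics data set M1 | - | Precision | 0.46 | 0.73 | 0.69 | - | 0.70 | 0.60 | 0.70 | 0.74 | 0.74 |
| Metabolomics data set M2 | - | Recall | 0.23 | 0.98 | 0.93 | - | 0.93 | 0.97 | 0.98 | 0.98 | 0.98 |
| Metabolomics data set M2 | - | Precision | 0.47 | 0.84 | 0.79 | - | 0.79 | 0.58 | 0.78 | 0.83 | 0.83 |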

Conclusions

The development of MZmine 2 was motivated by the need for a flexible and modular software platform that would allow the bioinformatic and analytical community to contribute new methods for specific stages of MS-based data processing. Great emphasis was placed on achieving the three main goals of a flexible, extendable, and modular design; user-friendly graphic interface; and good support for high-resolution MS data. The authors of this manuscript work in the field of metabolomics utilizing an LC-MS analytical platform, and therefore the currently developed modules were tested mainly on LC-MS data. The flexibility of MZmine 2, however, allows for easy expansion to other dataset types such as gas chromatography-MS, as well as interoperation with popular proteomics search engines such as MASCOT.

Several other software packages have been introduced for LC-MS based data processing, such as XCMS2 [22], Trans Proteomic Pipeline [23], Trequips [24], OpenMS-TOPP [25], and ProteoWizard [26]. None of these tools, however, share the same goals with MZmine 2, most of them being command-line oriented with fixed feature sets, aiming specifically for either proteomic or metabolomic research. Rather than a single piece of software, the developmental aim of MZmine 2 is to create a universal platform through which researchers can contribute individual processing modules and implement and share novel ideas, spanning multiple research fields and analytical methods.

MZmine 2 is available for download at the project web site, together with a printable manual, an animated tutorial, a module development tutorial, and further relevant project information such as a source code repository and developers' mailing list. The current version of the framework is already suitable for processing large batches of data, for both targeted and non-targeted analyses, and has been applied in metabolomic research [13,27].

Dataset download

The data associated with this manuscript may be downloaded from ProteomeCommons.org Tranche using the following hash:

equation image

The hash may be used to validate the files were published as part of this manuscript's dataset, and to check that the data have not changed since publication.

Availability and requirements

Project name: MZmine 2

Project home page: http://mzmine.sourceforge.net

Operating system(s): Platform independent

Programming language: Java

Other requirements: Java Runtime Environment (JRE) 1.6, Java3D

License: GNU GPL

Abbreviations

AW: Alignment window; LC-MS: Liquid chromatography-mass spectrometry; MS: Mass spectrometry; RANSAC: Random sample consensus; RW: RANSAC window

Authors' contributions

TP designed the data model and overall architecture of the MZmine 2 framework and implemented most of the raw data visualization and peak identification modules. SC implemented the project serialization and RANSAC aligner. AVB implemented the peak detection module with previews, scatter plot and histogram visualizers, and isotope pattern support, and contributed to the online database search module development. MO participated in software testing and provided feedback on the framework design. All authors read and approved the final manuscript.

Supplementary Material

Additional file 1:

Numerical values for Figure 6. Precision and recall values of RANSAC and Join aligner results for 12 synthetic data sets.

Acknowledgements

We thank the present and past MZmine 2 contributors Mikko Katajamaa, Yosuke Kawasaki, Jarkko Miettinen, John Rush, Marco Schaerfke, and Sasha Tkachev. We also thank the Okinawa Institute of Science and Technology Promotion Corporation for providing the funding and Mitsuhiro Yanagida for supporting the MZmine 2 development in his laboratory. We are very grateful to the developers of open-source libraries such as JFreeChart, VisAD, Jmol, and CDK. This work was in part supported by the EU-funded project ETHERPATHS (FP7-KBBE-222639, http://www.etherpaths.org/).

References

  1. Katajamaa M, Oresic M. Data processing for mass spectrometry-based metabolomics. J Chromatogr A. 2007;1158(1-2):318–328. doi:10.1016/j.chroma.2007.04.021.
  2. Orchard S, Hoogland C, Bairoch A, Eisenacher M, Kraus HJ, Binz PA. Managing the data explosion. A report on the HUPO-PSI Workshop. August 2008, Amsterdam, The Netherlands. Proteomics. 2009;9(3):499–501. doi:10.1002/pmic.200800838.
  3. Katajamaa M, Miettinen J, Oresic M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics. 2006;22(5):634–636. doi:10.1093/bioinformatics/btk039.
  4. Katajamaa M, Oresic M. Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics. 2005;6:179. doi:10.1186/1471-2105-6-179.
  5. Laaksonen R, Katajamaa M, Paiva H, Sysi-Aho M, Saarinen L, Junni P, Lutjohann D, Smet J, Van Coster R, Seppanen-Laakso T, Lehtimäki T, Soini J, Oresic M. A systems biology strategy reveals biological pathways and plasma biomarker candidates for potentially toxic statin-induced changes in muscle. PLoS One. 2006;1:e97. doi:10.1371/journal.pone.0000097.
  6. Oresic M, Simell S, Sysi-Aho M, Nanto-Salonen K, Seppanen-Laakso T, Parikka V, Katajamaa M, Hekkala A, Mattila I, Keskinen P, Yetukuri L, Reinikainen A, Lähde J, Suortti T, Hakalax J, Simell T, Hyöty H, Veijola R, Ilonen J, Lahesmaa R, Knip M, Simell O. Dysregulation of lipid and amino acid metabolism precedes islet autoimmunity in children who later progress to type 1 diabetes. J Exp Med. 2008;205(13):2975–2984. doi:10.1084/jem.20081800.
  7. Gopalacharyulu PV, Velagapudi VR, Lindfors E, Halperin E, Oresic M. Dynamic network topology changes in functional modules predict responses to oxidative stress in yeast. Mol Biosyst. 2009;5(3):276–287. doi:10.1039/b815347g.
  8. Medina-Gomez G, Gray SL, Yetukuri L, Shimomura K, Virtue S, Campbell M, Curtis RK, Jimenez-Linan M, Blount M, Yeo GS, Lopez M, Seppänen-Laakso T, Ashcroft FM, Oresic M, Vidal-Puig A. PPAR gamma 2 prevents lipotoxicity by controlling adipose tissue expandability and peripheral lipid metabolism. PLoS Genet. 2007;3(4):e64. doi:10.1371/journal.pgen.0030064.
  9. Kind T, Tolstikov V, Fiehn O, Weiss RH. A comprehensive urinary metabolomic approach for identifying kidney cancer. Anal Biochem. 2007;363(2):185–195. doi:10.1016/j.ab.2007.01.028.
  10. Timischl B, Dettmer K, Kaspar H, Thieme M, Oefner PJ. Development of a quantitative, validated capillary electrophoresis-time of flight-mass spectrometry method with integrated high-confidence analyte identification for metabolomics. Electrophoresis. 2008;29(10):2203–2214. doi:10.1002/elps.200700517.
  11. Lange E, Tautenhahn R, Neumann S, Gropl C. Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics. 2008;9(1):375. doi:10.1186/1471-2105-9-375.
  12. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E. The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci. 2003;43(2):493–500.
  13. Pluskal T, Nakamura T, Villar-Briones A, Yanagida M. Metabolic profiling of the fission yeast S. pombe: quantification of compounds under different temperatures and genetic perturbation. Mol Biosyst. 2010;6(1):182–198. doi:10.1039/b908784b.
  14. Tautenhahn R, Bottcher C, Neumann S. Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics. 2008;9:504. doi:10.1186/1471-2105-9-504.
  15. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009. pp. W623–633.
  16. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. doi:10.1093/nar/28.1.27.
  17. Smith CA, O'Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, Custodio DE, Abagyan R, Siuzdak G. METLIN: a metabolite mass spectral database. Ther Drug Monit. 2005;27(6):747–751. doi:10.1097/01.ftd.0000179845.53213.39.
  18. Wishart DS, Knox C, Guo AC, Eisner R, Young N, Gautam B, Hau DD, Psychogios N, Dong E, Bouatra S, Mandal R, Sinelnikov I, Xia J, Jia L, Cruz JA, Lim E, Sobsey CA, Shrivastava S, Huang P, Liu P, Fang L, Peng J, Fradette R, Cheng D, Tzur D, Clements M, Lewis A, De Souza A, Zuniga A, Dawe M, Xiong Y, Clive D, Greiner R, Nazyrova A, Shaykhutdinov R, Li L, Vogel HJ, Forsythe I. HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 2009. pp. D603–610.
  19. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20(18):3551–3567. doi:10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
  20. Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM. 1981;24:381–395. doi:10.1145/358669.358692.
  21. Cleveland WS, Devlin SJ. Locally weighted regression - an approach to regression-analysis by local fitting. J Am Stat Assoc. 1988;83(403):596–610. doi:10.2307/2289282.
  22. Benton HP, Wong DM, Trauger SA, Siuzdak G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal Chem. 2008;80(16):6382–6389. doi:10.1021/ac800795f.
  23. Keller A, Eng J, Zhang N, Li XJ, Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol. 2005;1:2005.0017. doi:10.1038/msb4100024.
  24. Gehlenborg N, Yan W, Lee IY, Yoo H, Nieselt K, Hwang D, Aebersold R, Hood L. Prequips--an extensible software platform for integration, visualization and analysis of LC-MS/MS proteomics data. Bioinformatics. 2009;25(5):682–683. doi:10.1093/bioinformatics/btp005.
  25. Kohlbacher O, Reinert K, Gropl C, Lange E, Pfeifer N, Schulz-Trieglaff O, Sturm M. TOPP--the OpenMS proteomics pipeline. Bioinformatics. 2007;23(2):e191–197. doi:10.1093/bioinformatics/btl299.
  26. Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008;24(21):2534–2536. doi:10.1093/bioinformatics/btn323.
  27. Oresic M, Seppanen-Laakso T, Yetukuri L, Backhed F, Hanninen V. Gut microbiota affects lens and retinal lipid composition. Exp Eye Res. 2009;89(5):604–607. doi:10.1016/j.exer.2009.06.018.

Articles from BMC Bioinformatics are provided here courtesy of BMC
