GitHub - csmiller/EMIRGE: EMIRGE reconstructs full length ribosomal genes from short read sequencing data.

csmiller / EMIRGE Public

Notifications You must be signed in to change notification settings
Fork 29
Star 37

EMIRGE reconstructs full length ribosomal genes from short read sequencing data.

37 stars 29 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
pykseq		pykseq
utils		utils
.bumpversion.cfg		.bumpversion.cfg
.gitignore		.gitignore
.travis.yml		.travis.yml
README.txt		README.txt
_emirge.pyx		_emirge.pyx
_emirge_C.h		_emirge_C.h
_emirge_amplicon.pyx		_emirge_amplicon.pyx
emirge.py		emirge.py
emirge_amplicon.py		emirge_amplicon.py
emirge_makedb.py		emirge_makedb.py
emirge_rename_fasta.py		emirge_rename_fasta.py
requirements.txt		requirements.txt
setup.py		setup.py

Repository files navigation

EMIRGE: Expectation-Maximization Iterative Reconstruction of Genes from the Environment
Copyright (C) 2010-2016 Christopher S. Miller ([email protected])

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>

https://github.com/csmiller/EMIRGE

EMIRGE reconstructs full length ribosomal genes from short read
sequencing data. In the process, it also provides estimates of the
sequences' abundances.

EMIRGE uses a modification of the EM algorithm to iterate between
estimating the expected value of the abundance of all SSU sequences
present in a sample and estimating the probabilities for each read
that a specific sequence generated that read. At the end of each
iteration, those probabilities are used to re-calculate (correct) a
consensus sequence for each reference SSU sequence, and the mapping is
repeated, followed by the estimations of probabilities. The
iterations should usually stop when the reference sequences no longer
change from one iteration to the next. Practically, 40-80 iterations is
usually sufficient for many samples. Right now EMIRGE uses Bowtie
alignments internally, though in theory a different mapper could be
used.

EMIRGE was designed for Illumina reads in FASTQ format, from pipeline
version >= 1.3

There are two versions of EMIRGE:

1. emirge.py -- this version was designed for metagenomic data
2. emirge_amplicon.py -- this version was designed for PCR amplicon data, and can handle up to a few million reads where the entire sequencing allocation is devoted to a single gene. In theory it could also be used for RNASeq data where rRNA makes up a large percentage of the reads.

CITATIONS
------------------------------

If you use EMIRGE in your work, please cite these manuscripts as appropriate.

Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF (2011) EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome biology 12: R44. doi:10.1186/gb-2011-12-5-r44.

Miller CS, Handley KM, Wrighton KC, Frischkorn KR, Thomas BC, Banfield JF (2013) Short-Read Assembly of Full-Length 16S Amplicons Reveals Bacterial Diversity in Subsurface Sediments. PloS one 8: e56018. doi:10.1371/journal.pone.0056018.

DEPENDENCIES
------------------------------

EMIRGE expects the following programs to be installed and available in your path:

-python (tested with version 2.6), with the following packages installed:
-BioPython
-Cython
-pysam
-scipy / numpy
-usearch (www.drive5.com/usearch/ -- tested with usearch version 6.0.203; versions earlier than this are incompatible).
-samtools (http://samtools.sourceforge.net/ -- tested with verison 0.1.18)
-bowtie (http://bowtie-bio.sourceforge.net/index.shtml -- tested with version 0.12.7 and 0.12.8)
-vsearch (https://github.com/torognes/vsearch -- optional, but required if you use emirge_makedb.py)
INSTALLATION
------------------------------

After installing the dependencies listed above, type the following to build EMIRGE:

$ python setup.py build

To install (you may skip straight to this step), type the following as root, or with sudo:

$ python setup.py install

You can also type the following for more options:
$ python setup.py --help install

For example, to install to a location in your home directory where you
have permission to write, you might type something like:

$ python setup.py install --prefix=$HOME/software

HELP
------------------------------

There is a google group (similar to a mailing list) for asking questions
about EMIRGE:
https://groups.google.com/group/emirge-users

Also, there is some additional information (including a Frequently
Asked Questions section) on the github wiki:
https://github.com/csmiller/EMIRGE/wiki

Although I encourage use of the google group due to increased volume
of support emails, please feel free to contact me directly
([email protected]) with any problems, bug reports, or questions

At the moment, there is no manual, though running the following is helpful:

emirge.py --help

EMIRGE OUTPUT
------------------------------

Once an EMIRGE run is completed, run emirge_rename_fasta.py on the
final iterations directory, for example:

emirge_rename_fasta.py iter.40 > renamed.fasta

Also see:

emirge_rename_fasta.py --help

Running emirge_rename_fasta.py will provide you with a fasta file with
EMIRGE output. Dissecting a single example header:

>3326|AF427479.1.1480_m01 Prior=0.000367 Length=1480 NormPrior=0.000414
1 2 3 4 5 6

1. The internal EMIRGE ID -- unique for each sequence
2. The accession number of the starting candidate sequence
3. an optional suffix indicating this sequence was split out from another due to evidence in the mapping reads of 2 or more "strains."
4. The Prior, or abundance estimate (used in original publication)
5. The length of the sequence
6. The length-normalized abundance estimate (anecdotally, this is sometimes more accurate if there are lots of different sequence lengths)

CANDIDATE SSU DATABASE
------------------------------

You can download a standard candidate SSU database by running the
following command:
python emirge_makedb.py

This script is included with EMIRGE, and by default downloads the
current version of the Silva small subunit rRNA database
(http://www.arb-silva.de/). Sequences are clustered by default using
vsearch at 97% sequence identity, short and long sequences are
removed, and non-standard characters are changed to be within {ACTG}.
See emirge_makedb.py --help for details.

You can use any reference SSU database with emirge, though SILVA is
recommended. If you use another database, you should run
utils/fix_nonstandard_chars.py on your fasta file to remove ambiguous
characters, which bowtie cannot map against. You will also need to
first build a bowtie index, with something like:
$ bowtie-build SSU_candidate_db.fasta SSU_candidate_db_btindex

You might also consider changing the offrate (see
http://bowtie-bio.sourceforge.net/manual.shtml)

OTHER
------------------------------

** A note about single-end sequencing:

EMIRGE was designed for and tested on paired-end sequencing reads.
However, you can now use EMIRGE on single-end reads as well: simply
omit the -2 parameter. Although I have done some basic testing on
single-end reads, runs with single reads have NOT been as extensively
tested as runs with paired reads. Please let me know how it works for
you if you try EMIRGE with single-end reads.