About
GERMLINE is an algorithm for discovering long shared segments of Identity by Descent (IBD) between pairs of individuals in a large population. It takes as input genotype or haplotype marker data for individuals (as well as an optional known pedigree) and generates a list of all pairwise segmental sharing.
GERMLINE uses a novel hashing & extension algorithm which allows for segment identification in haplotype data in time proportional to the number of individuals. Presently, GERMLINE can execute on phased or un-phased data. We have found that performance is much improved with phasing, and that phasing followed by running GERMLINE is still significantly faster than comparable IBD algorithms. Utilities for easily phasing data for GERMLINE are available below. GERMLINE can identify shared segments of any specified length, as well as allow for any number of mismatching markers.
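The hashing idea can be pictured with a toy sketch. This is not GERMLINE's implementation; it only illustrates how bucketing haplotypes by the exact content of fixed-size slices finds candidate pairs without comparing every pair directly. The function name `seed_matches`, the string encoding of haplotypes, and the tiny slice size are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, slice_size):
    """Return {(id_a, id_b): [slice_index, ...]} for haplotypes that are
    identical over a whole slice. Toy illustration only."""
    n_markers = len(next(iter(haplotypes.values())))
    seeds = defaultdict(list)
    for start in range(0, n_markers, slice_size):
        # Bucket haplotypes by the exact alleles they carry in this slice.
        buckets = defaultdict(list)
        for hap_id, alleles in haplotypes.items():
            buckets[alleles[start:start + slice_size]].append(hap_id)
        # Every pair that lands in the same bucket shares this slice exactly.
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                seeds[(a, b)].append(start // slice_size)
    return dict(seeds)


# Hypothetical haplotypes over 8 biallelic markers, coded 0/1.
haps = {
    "A": "00110101",
    "B": "00110111",
    "C": "11110101",
}
print(seed_matches(haps, slice_size=4))
```

In this sketch A and B share the first slice and A and C share the second. A full method would then extend such seeds into longer segments while tolerating mismatches; that step is omitted here.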
The program has been developed in Itsik Pe'er's Lab of Computational Genetics at Columbia University. It has been built in C++ and tested in the Red Hat Linux environment; the source is available here in a tar.gz package, and pre-compiled binaries are available under the Utilities section. GERMLINE is distributed under the GPL license.
If you use GERMLINE in a published analysis, please cite: Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe'er I (2008) Whole population, genomewide mapping of hidden relatedness. Genome Research.
This work has recently been applied to optimally selecting individuals for sequencing and inferring previously un-typed variants in: Low-pass Genomewide Sequencing and Variant Imputation Using Identity-by-descent in an Isolated Human Population. (2011) A Gusev, MJ Shah, EE Kenny, A Ramachandran, JK Lowe, J Salit, CC Lee, EC Levandowsky, TN Weaver, QC Doan, HE Peckham, SF McLaughlin, MR Lyons, VN Sheth, M Stoffel, FM De La Vega, JM Friedman, JL Breslow, I Pe'er (in submission, pre-pub version).
Usage
From the command line, extract germline with tar xzvf germline-X-X-X.tar.gz, enter the extracted directory, and compile germline with make all. A simple test-case using shortened HapMap samples can be run using make test. The executable is run as germline <options>, which prompts the user for input/output file information and runs the algorithm.
Input
GERMLINE accepts as input the following formats:
- PLINK / ped+map
- PHASE / HapMap
NOTE: Although the PLINK format is not intended for haplotypes, GERMLINE expects the respective alleles to appear in order; i.e. the first allele at every marker corresponds to one haplotype and the second allele to the other. PLINK arbitrarily re-orders the alleles when processing files, so the haplotypes may no longer be intact afterwards. We therefore do not recommend handling phased data with PLINK prior to GERMLINE analysis. To target specific regions, use GERMLINE's -from_snp and -to_snp flags instead.
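The allele-order convention in the note above can be made concrete with a minimal sketch. It assumes the standard PED layout of six leading sample columns followed by two alleles per marker, with whitespace-separated fields; the function name `split_ped_haplotypes` and the sample line are hypothetical.

```python
def split_ped_haplotypes(ped_line):
    """Split one PED line into (family_id, individual_id, hap0, hap1),
    reading the first allele of every marker as haplotype 0 and the
    second allele as haplotype 1."""
    fields = ped_line.split()
    family_id, individual_id = fields[0], fields[1]
    alleles = fields[6:]  # the first six PED columns are sample metadata
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per marker")
    hap0 = alleles[0::2]  # first allele of each marker
    hap1 = alleles[1::2]  # second allele of each marker
    return family_id, individual_id, hap0, hap1


# Hypothetical sample with three markers.
line = "FAM1 IND1 0 0 1 -9 A G C C T A"
print(split_ped_haplotypes(line))
```

If a tool swaps the two alleles at any heterozygous marker, hap0 and hap1 each become a mixture of the two original haplotypes, which is why phased files should not be re-processed in a way that re-orders alleles.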
Output
Upon completion, GERMLINE generates a .match and .log file in the specified location. Each line in the .match file corresponds to a pairwise shared segment, with the following fields:
- Family ID 1
- Individual ID 1
- Family ID 2
- Individual ID 2
- Chromosome
- Segment start (bp)
- Segment end (bp)
- Segment start (SNP)
- Segment end (SNP)
- Total SNPs in segment
- Genetic length of segment
- Units for genetic length (cM or MB)
- Mismatching SNPs in segment
- 1 if Individual 1 is homozygous in match; 0 otherwise
- 1 if Individual 2 is homozygous in match; 0 otherwise
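For downstream scripting, the fields listed above can be read into a named record. The following is a minimal sketch that assumes the fields are whitespace-separated, appear in the listed order, and that IDs contain no whitespace; the names `Match` and `parse_match_line` and the sample line are hypothetical and not part of GERMLINE.

```python
from typing import NamedTuple


class Match(NamedTuple):
    family_id_1: str
    individual_id_1: str
    family_id_2: str
    individual_id_2: str
    chromosome: str
    start_bp: int
    end_bp: int
    start_snp: str
    end_snp: str
    total_snps: int
    genetic_length: float
    length_units: str  # "cM" or "MB"
    mismatching_snps: int
    homozygous_1: bool
    homozygous_2: bool


def parse_match_line(line):
    """Parse one line of a .match file into a Match record."""
    f = line.split()
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return Match(
        f[0], f[1], f[2], f[3], f[4],
        int(f[5]), int(f[6]), f[7], f[8],
        int(f[9]), float(f[10]), f[11], int(f[12]),
        f[13] == "1", f[14] == "1",
    )


# Made-up example line; the values are illustrative only.
sample = "FAM1 IND1 FAM2 IND2 1 1000000 5000000 rs0001 rs0999 999 4.2 cM 0 0 0"
match = parse_match_line(sample)
print(match.individual_id_1, match.individual_id_2, match.genetic_length, match.length_units)
```

Keeping the units field next to the length makes it harder to mix cM and MB values by accident when filtering or summing segments.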
Binary Output
To save space, GERMLINE can also generate binary output using the -bin_out flag. This flag will generate three files:
- *.bsid Two columns per line for each sample: FAM ID,SAMPLE ID.
- *.bmid Four columns per line for each marker: CHROMOSOME,RSID,GENETIC DISTANCE,PHYSICAL DISTANCE.
- *.bmatch Binary match file containing integer pointers to samples (from bsid file), markers (from bmid file) and boolean meta-data.
The binary files can be converted back to the standard flat format described above by using the parse_bmatch utility provided with the code. Load the three generated files using
parse_bmatch [BMATCH FILE] [BSID FILE] [BMID FILE] and the flat match output will be printed to standard out. See the parse_bmatch.cpp code for binary format details.
Options
The program has several command line options to direct the segmental sharing process:
Flag | Default | Description |
---|---|---|
-map | - | File location for genetic distance map. Uses the PLINK map format. |
-min_m | 3 | Minimum length for match to be used for imputation (in cM or MB). |
-err_hom | 2 | The maximum number of mismatching homozygous markers for a slice to still be considered part of a match. |
-err_het | 0 | The maximum number of mismatching heterozygous markers for a slice to still be considered part of a match. |
-from_snp | - | Indicate the ID of the first SNP to start processing from. |
-to_snp | - | Indicate the ID of the last SNP to end processing with. |
-h_extend | | Extends from exact seeds using haplotypes rather than genotypes; useful when data is well-phased (e.g. trios). |
-homoz | | Allow self matches (runs of homozygosity). |
-homoz-only | | Analyze and report only autozygous/homozygous segments; no IBD is reported, but the analysis is significantly faster. |
-haploid | | Treat each input individual as two distinct and separate haplotypes. Output IDs will have .0/.1 suffix corresponding to each haplotype. The -err_het flag will have no effect in this analysis. |
-bin_out | | Generate output matches in binary format; creates *.bmatch, *.bsid and *.bmid files. These files can be converted to flat output using the parse_bmatch utility included and compiled in the package. |
-bits | 128 | Size of each slice (in markers) used for exact matching seeds. |
-w_extend | | Extend the match beyond the slice end to the first mismatching marker. |
Utilities
We have created some script utilities for converting between data formats; the source code is available below. All scripts can
be compiled using g++ [script file] -o [output name].
Title | Usage | Download |
---|---|---|
Phasing Pipeline | Pipeline for phasing PLINK format data with BEAGLE and processing in GERMLINE. README for detailed usage. | phasing_pipeline.tar.gz |
Binaries | Pre-compiled binaries of GERMLINE v1.5.0 for Linux 32/64 bit and Windows (cygwin). | 32b, 64b, WIN |
Contact
For any questions or comments, please contact the developers directly at: {gusev,itsik}@cs.columbia.edu.
Change Log
1.5.1 (03.07.12)
Fixed minor formatting bug with runs of homozygosity
1.5.0 (09.17.10)
Major computational overhaul - algorithm should run 2-3 times faster with 4-fold memory reduction on large datasets.
Added a binary output option (flag -bin_out) to reduce output size, see documentation.
Added a stand-alone C++ script to convert from binary file to standard readable match file.
Added -haploid flag to treat each input individual as two distinct haplotypes; output IDs will have .0/.1 suffix indicating which respective haplotype the match is on. This is effective for short windows with very well phased data.
1.4.2 (07.22.10)
Fixed bug with -to_snp/-from_snp commands
Added -homoz-only flag for simple homozygosity analysis
1.4.1 (03.22.10)
Fixed bug in PHASE format input
Fixed bug in files with multiple chromosomes
Fixed bug with overlapping segments from -h_extend feature
1.4.0 (08.14.09)
Allows self-matches between individuals (include -homoz flag)
Added columns to indicate whether match is homozygous or heterozygous
1.3.0 (12.22.08)
Now handles unphased data (omit -h_extend flag).
Added options for homozygous / heterozygous inexact matching.
Output now specifies if match is in cM or MB.
1.2.1 (09.17.08)
Now using the boost dynamic_bitset libraries. These are packaged with the source and do not affect installation/dependency.
Added -bits flag to explicitly define word-length.
Included sample input data & test-case called upon compilation by 'make test_case'.
1.2.0 (09.03.08)
Output format has changed to provide more detailed SNP information (see above).
Can now iteratively process multi-chromosomal data (for PLINK / PED format only).
Genotype calling has been removed for the time being.
Genetic map restructured (see above) and processed as a parameter.
1.0.2 (08.12.08)
Updated the HapMap format input - auto-detection of trio or unrelated input.
1.0.1 (06.09.08)
Added options to perform analysis on specific region (see -from_snp, -to_snp flags).
Added option to print haplotypes and matches (see: -haps, -print flags).