Program to check a QC'd plink .bim file against the HRC, 1000G or CAAPA reference SNP list in advance of imputation
Downloads
The version below is designed for use in a non-interactive session, e.g. on a high-performance cluster; all other versions require the terminal to remain open for the duration of the program run.
HRC-1000G-check-bim-v4.2.13-NoReadKey.zip
- Changes V4.2.1 to V4.2.2
- Added two new options for allele frequency thresholds (-t <difference>, -n)
- -t 0.3 sets the allele frequency difference threshold to 0.3; the default if not set is 0.2. Use this to change the allele frequency difference used to exclude SNPs in the final file (range 0 - 1); the larger the threshold, the fewer SNPs will be excluded.
- -n specifies that you do not wish to exclude any SNPs based on allele frequency difference; if -n is used, -t has no effect. (An example combining several of the flags from this change list is shown after it.)
- Changes V4.2.2 to V4.2.3
- Fixed bug whereby SNPs that were incorrectly mapped in the bim
file were updated only for position and not chromosome.
- Changes V4.2.3 to V4.2.4
- Added support for X chromosome in HRC release v1.1
- Changes V4.2.4 to V4.2.5
- Fixed bug with Chromosome X being left out of plink command file
- Changes V4.2.5 to V4.2.6
- Added the ability to use gzipped reference panels
- Implemented a check to ensure the same number of variants are
present in the .bim and .frq files
- Changes V4.2.6 to V4.2.7
- Minor update to the usage information display
- Changes V4.2.7 to V4.2.8
- Added new flag -c to specify checking individual chromosome(s) rather than assuming genome wide
- Changes V4.2.8 to V4.2.9
- Update to allow reading of bim files with allele codes 1,2,3,4
- Fixed a minor bug that resulted in a warning of a null entry if there were no differences between the bim file and the reference
- Changes V4.2.9 to V4.2.10
- Update to check path and name of bim and frequency file are correct and report meaningfully if not
- Changes V4.2.10 to V4.2.11
- Update to plink commands to ensure Ref/Alt are set correctly and retained
- Changes V4.2.11 to V4.2.13
- Added -a flag to disable automatic removal of palindromic SNPs with MAF > 0.4
- Updated to read bgzipped reference panels, in addition to the gzipped/unzipped versions
- Added -l to specify the path to the plink executable to use, useful to change between plink versions, defaults to plink if the file specified is not found
- Added -o to specify/create a directory, and any intermediate directories, to store all the output files
- Updated the plink run script to use absolute paths to avoid issues on running
- Changes V4.2.13 to V4.3.0
- Reworked the reference panel loading to reduce runtime memory usage and run time
- It is highly recommended to use this version for TOPMed; previous versions will also work with TOPMed but require >300GB RAM
- This version still works with all the reference panels listed below, again with a reduction in memory usage
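As an illustration of how several of the flags added in these releases combine, a hypothetical invocation might look like the following (the fileset name, output directory and plink path are placeholders, not files shipped with the program):

    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h \
        -n -a -o checked_output -l /usr/local/bin/plink

Here -n disables the allele frequency difference exclusion, -a keeps palindromic SNPs with MAF > 0.4, -o writes all output to the checked_output directory and -l points to a specific plink executable.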
Summary of checks and outputs:
- Checks:
- Strand, alleles, position, Ref/Alt assignments and
frequency differences.
In addition to the reference file, v4 and above require the plink .bim file and the .frq file produced by the plink --freq command (see the example after this list).
- Produces:
- A set of plink commands to update or remove SNPs (see below and the changes in V4.2.2) based on the checks, as well as a file (FreqPlot) of cohort allele frequency vs reference panel allele frequency.
- Updates:
- Strand, position, Ref/Alt assignment
- Removes:
- A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference (threshold adjustable or check removable from V4.2.2), and SNPs not in the reference panel
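For example, the .frq file can be generated from the same plink fileset before running the checks (mystudy is a placeholder fileset name; plink 1.9 is assumed):

    plink --bfile mystudy --freq --out mystudy

This writes mystudy.frq, which is then passed to the checking script with -f alongside the matching mystudy.bim (-b).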
Usage with HRC reference panel:
Requires the unzipped tab delimited HRC reference (currently
v1.1 HRC.r1-1.GRCh37.wgs.mac5.sites.tab) from the Haplotype
Reference Consortium Website here:
http://www.haplotype-reference-consortium.org/site
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h
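A worked example against the HRC panel, using placeholder file names; the second command assumes the set of plink commands produced by the check is written to a script called Run-plink.sh, as in recent versions (check the program output for the exact file name):

    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
    sh Run-plink.sh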
Usage with the TOPMed reference panel
NOTE: It is recommended to use v4.3.0 for the TOPMed panel; previous versions will work with the TOPMed panel but will require around 300GB RAM.
v4.3.0 reduces the memory usage substantially and can be used with all other reference panels as well, with a corresponding reduction in memory usage.
The TOPMed reference panel is not available for direct download from this site; it needs to be created from the VCF of dbSNP submitted sites (currently ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz), which can be downloaded from the Bravo website:
https://bravo.sph.umich.edu/freeze5/hg38/
Once downloaded, the VCF can be converted to an HRC formatted reference legend using the code here:
CreateTOPMed.zip
Usage: ./CreateTOPMed.pl -i ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
By default this will create a file filtered for variants flagged as PASS only; if you wish to use all variants, the -a flag overrides this.
To override the default output file naming, use -o filename.
To run the checks, the usage is the same as for the HRC panel, namely using the flags -h -r PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab.gz
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h
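Putting the two steps together, a sketch of the TOPMed workflow (mystudy is a placeholder fileset name and the VCF is assumed to be in the current directory):

    ./CreateTOPMed.pl -i ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab.gz -h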
Usage with 1000G reference panel
Requires the unzipped 1000G legend file (instructions to create this are below), which can also be downloaded here (1.4G in size):
1000GP_Phase3_combined.legend.gz
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -g -p <population>
Population can be one of the following:
ALL All samples
AFR African
AMR Ad Mixed American
EAS East Asian
EUR European
SAS South Asian
The 1000G population will default to ALL if not specified.
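For example, to check a placeholder fileset against the 1000G panel using the European reference frequencies (versions from V4.2.6 can read the gzipped legend directly; otherwise gunzip it first):

    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r 1000GP_Phase3_combined.legend.gz -g -p EUR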
Usage with CAAPA reference panel
The CAAPA reference panel legend can be downloaded here (513MB):
all.caapa.sorted.zip
Many thanks to Kathleen Barnes and Michelle Daya of CAAPA for sharing this, and also to Margaret Parker and Michael Cho for the initial reformatting.
Usage is the same as for the HRC panel, namely using the flags
-h -r all.caapa.sorted.txt
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r all.caapa.sorted.txt -h
Usage with Asia reference panel
The Asia Genome reference panel legend can be downloaded here (912MB):
ASIA.Genome.Reference.legend.zip
Usage is the same as for the 1000G panel, namely using the flags
-g -r ASIA.Genome.Reference.legend
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r ASIA.Genome.Reference.legend -g -p <population>
Population can be one of the following:
ALL All Samples
OCE Oceania
NEA North East Asian
SEA South East Asian
SAS South Asian
AFR African
AMR American
WER West Eurasia
The population will default to ALL if not specified.
1000G reference panel file
The reference file for 1000G does not exist as a single file, so some steps are necessary to create it. You will need the legend files from the 1000GP_Phase3.tgz file on the IMPUTE website:
https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
Download the .tgz, extract all the .legend files and place them in a directory together with the following script:
concatenate-1000GP.zip
Then run the script (perl concatenate-1000GP.pl); this will create a file (1000GP_Phase3_combined.legend) suitable for use with the checking program. A sketch of the steps is shown below.
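A sketch of those steps, assuming concatenate-1000GP.pl has been unzipped into the working directory (the exact layout of the extracted archive may differ):

    tar -xzf 1000GP_Phase3.tgz
    # move all the .legend files into the current directory, gunzipping them first if needed
    perl concatenate-1000GP.pl

The result, 1000GP_Phase3_combined.legend, is then passed to the checking script with -r.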
Original script
The original script for HRC-only checking can still be downloaded here, but I recommend using V4.2.1 or above as there are more features and a number of issues have been corrected in the most recent versions:
HRC-check-bim.zip
Perl script to extract a subset of data
from the UKBioBank phenotype data download.
Requires the Stata .dct and R .tab files to run.
Run without any arguments to get a description of usage.
Download:
Version 1.0
extract.zip
Version 1.2.2
extract-v1.2.2.zip
Phenotype File Summary and Fingerprinting
This program was developed to give an overview of the phenotypes (columns) in a tab delimited text file. To do this it attempts to determine the type of each column (numeric, text, mixed text/numeric or coded variables) and summarise it appropriately.
Downloads
- v3.1
- phenotype_qc.zip
- v4.1
- Updated to better accommodate the missing column names in the
dct file.
- phenotype_qc.v4.1.zip
Usage:
Version 3.1
perl phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h]
Version 4.1
Version 4.1 introduces a new method for extracting the data from the phenotype file. This is necessary for large data sets (>5GB), where v3.1 would not perform well or at all. This version works on data sets of any size; however, it will only work in a Unix/Linux environment (Cygwin is possible but not tested) as it requires the UNIX cut command to extract the data. The option -k specifies how many columns at a time to extract using cut; the default is 25, which is a trade-off between the number of calls to cut and the amount of memory required. If you have problems with memory usage, set this to a smaller number. Higher numbers are possible, but on the test sets to date the time to complete a run increases, probably due to memory allocation.
The second new option (-p) plots all the numeric phenotypes, allowing a quick visual check of the distribution of values. Running this version, with or without plotting, still requires GD::Graph to be installed; for instructions on how to do this see here: GD::Graph
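If GD::Graph is not already present, one common route is to install it from CPAN, for example with cpanminus (this assumes the GD/libgd development libraries are already available on your system):

    cpanm GD::Graph

The standard CPAN shell (cpan GD::Graph) works equally well.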
v4.1 Usage:
phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h -p -k <#columns per chunk>]
Phenotype QC program command line options:
- -f <phenotypeFile>: A tab delimited phenotype file; all rows should contain the same number of columns.
- -i <sampleIDcolumn>: Set the column that contains the sample ids; column numbers start from 0, the default is 0.
- -m <missingValueIdentifier>: Set the missing value identifier; the default is NA.
- -c <column>: Specify a single column to check; if using a header file this can be either the human readable or original header.
- -s <columnFile>: Provide a file containing a list of columns to extract, one per row. As with single column checking this can be either header; takes precedence over -c.
- -l <headerLookup>: Provide an optional header file for the data set, for use where the headers within the file are not human readable; requires two tab delimited columns, the first of which should contain the ids that match the header in phenotypeFile.
- -h: Show the help message.
Version 4.1 only:
- -k <#columns per chunk>: Specify the number of columns to extract in each chunk for processing; the default if not specified is 25.
- -p: Specify to plot every column identified as numeric, or numeric coded; the default is not to plot.
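A hypothetical version 4.1 run combining several of the options above (the phenotype and header lookup file names are placeholders):

    perl phenotype_qc.pl -f ukb_phenotypes.tab -i 0 -m NA \
        -l header_lookup.txt -p -k 10

This treats column 0 as the sample ID, uses NA as the missing value, plots every numeric column and extracts 10 columns per chunk to keep memory usage down.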
- Numeric: this category reports:
  - Total number of non-missing values.
  - Total number of missing entries.
  - Mean.
  - Median.
  - Standard deviation.
  - Number of entries that are more than three standard deviations above or below the mean.
- Text, Mixed and Coded: these categories all report:
  - Total number of non-missing values.
  - Total number of missing entries.
  - Total number of unique entries; if the number of unique entries is below 50 these are displayed with counts for each entry.
- Coded:
  - Entries in this category can be text coded, numeric coded or mixed coded.
  - A column is described as containing a coded set of variables if the total number of unique values is less than 0.1% of the total number of entries, or 20, whichever is the greater. For example, in a file with 100,000 rows a column would be treated as coded if it contains no more than 100 unique values.
The latest versions (v3.1 & v4.1) also produce a "fingerprint" summary of each row (assumed to be a sample) and each column (assumed to be a phenotype); these are designed to verify whether a sample or phenotype has changed between data releases.
The fingerprint files (prefixed FP for fingerprint phenotype and FPS for fingerprint sample) both contain the following columns:
- Sample ID or column header.
- Total numeric values.
- Total non-numeric values.
- Total non-missing values.
- Total unique values.
- Total missing values.
- Determined type.
These files can be used to determine whether a phenotype has changed between data releases; a script to do this comparison is under development and will be posted as soon as it is finalised.
Excel to Tab
A Perl wrapper using Spreadsheet::Read to convert various
spreadsheet formats (xls, xlsx, csv, ods) to tab delimited text
files.
Where the format supports sheets, the output file name will include the sheet name and there will be one file per sheet.
Depending on your system you may need to add one or more of
the following modules to your Perl install:
- Spreadsheet::Read
- Spreadsheet::ReadSXC
- Spreadsheet::ParseExcel
- Spreadsheet::ParseXLSX
- Text::CSV_XS
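If any of these are missing, they can usually be added from CPAN; for example, with cpanminus:

    cpanm Spreadsheet::Read Spreadsheet::ParseXLSX Text::CSV_XS

(The standard cpan client can be used in the same way.)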
Download
xls2tab.zip
Usage:
xls2tab -i <inputfile> [-o <outputfilestem> -f]
- -i <inputfile>: Path and name of the input file.
- -o <outputfilestem>: Optional; if supplied this will form the stem for all the output. For example, using MyFile will lead to MyFile.Sheet1.txt, MyFile.Sheet2.txt (assuming there are 2 sheets named Sheet1 and Sheet2). If not supplied the name will be the same as the input, with the sheet name, if it exists, and the file extension .txt.
- -f: Optional; tells the program to select the formatted values from the cells, not the raw values. Default is for raw values.
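For example, to convert a hypothetical workbook, writing one tab delimited file per sheet and keeping the formatted cell values:

    xls2tab -i Results.xlsx -o Results -f

With two sheets named Sheet1 and Sheet2 this would produce Results.Sheet1.txt and Results.Sheet2.txt.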