Program to check a QC'd plink .bim file against the HRC, 1000G or CAAPA reference SNP list in advance of imputation
Downloads
The version below is designed for use in a non-interactive session, e.g. on a high-performance cluster; all other versions require the terminal to remain open for the duration of the program run.
HRC-1000G-check-bim-v4.2.13-NoReadKey.zip
- Changes V4.2.1 to V4.2.2
- Added two new options for allele frequency thresholds (-t <difference>, -n)
- -t 0.3 sets the allele frequency difference threshold to 0.3; the default if not set is 0.2. Use this to change the allele frequency difference used to exclude SNPs in the final file (range 0 - 1); the larger the threshold, the fewer SNPs will be excluded.
- -n specifies that you do not wish to exclude any SNPs based on allele frequency difference; if -n is used, -t has no effect. (An example combining several of the flags from this change list is shown after it.)
- Changes V4.2.2 to V4.2.3
- Fixed bug whereby SNPs that were incorrectly mapped in the bim
file were updated only for position and not chromosome.
- Changes V4.2.3 to V4.2.4
- Added support for X chromosome in HRC release v1.1
- Changes V4.2.4 to V4.2.5
- Fixed bug with Chromosome X being left out of plink command file
- Changes V4.2.5 to V4.2.6
- Added the ability to use gzipped reference panels
- Implemented a check to ensure the same number of variants are
present in the .bim and .frq files
- Changes V4.2.6 to V4.2.7
- Minor update to the usage information display
- Changes V4.2.7 to V4.2.8
- Added new flag -c to specify checking individual chromosome(s) rather than assuming genome wide
- Changes V4.2.8 to V4.2.9
- Update to allow reading of bim files with allele codes 1,2,3,4
- Fixed a minor bug that resulted in a warning of a null entry if there were no differences between the bim file and the reference
- Changes V4.2.9 to V4.2.10
- Update to check path and name of bim and frequency file are correct and report meaningfully if not
- Changes V4.2.10 to V4.2.11
- Update to plink commands to ensure Ref/Alt are set correctly and retained
- Changes V4.2.11 to V4.2.13
- Added -a flag to disable automatic removal of palindromic SNPs with MAF > 0.4
- Updated to read bgzipped reference panels, in addition to the gzipped/unzipped versions
- Added -l to specify the path to the plink executable to use, useful to change between plink versions, defaults to plink if the file specified is not found
- Added -o to specify/create a directory, and any intermediate directories, to store all the output files
- Updated the plink run script to use absolute paths to avoid issues on running
- Changes V4.2.13 to V4.3.0
- Reworked the reference panel loading to reduce runtime memory usage and run time
- It is highly recommended to use this version for TOPMed; previous versions will also work with TOPMed but require >300GB RAM
- This version still works with all the reference panels listed below, again with a reduction in memory usage
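As an illustration of how several of the flags added in these releases combine, a hypothetical invocation might look like the following (the fileset name, output directory and plink path are placeholders, not files shipped with the program):

    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h \
        -n -a -o checked_output -l /usr/local/bin/plink

Here -n disables the allele frequency difference exclusion, -a keeps palindromic SNPs with MAF > 0.4, -o writes all output to the checked_output directory and -l points to a specific plink executable.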
Summary of checks and outputs:
- Checks:
- Strand, alleles, position, Ref/Alt assignments and
frequency differences.
In addition to the reference file, v4 and above require the plink .bim file and the .frq file produced by the plink --freq command (see the example after this list).
- Produces:
- A set of plink commands to update or remove SNPs (see below and the changes in V4.2.2) based on the checks, as well as a file (FreqPlot) of cohort allele frequency vs reference panel allele frequency.
- Updates:
- Strand, position, Ref/Alt assignment
- Removes:
- A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference (threshold adjustable or check removable from V4.2.2), and SNPs not in the reference panel
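For example, the .frq file can be generated from the same plink fileset before running the checks (mystudy is a placeholder fileset name; plink 1.9 is assumed):

    plink --bfile mystudy --freq --out mystudy

This writes mystudy.frq, which is then passed to the checking script with -f alongside the matching mystudy.bim (-b).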
Usage with HRC reference panel:
Requires the unzipped tab delimited HRC reference (currently
v1.1 HRC.r1-1.GRCh37.wgs.mac5.sites.tab) from the Haplotype
Reference Consortium Website here:
http://www.haplotype-reference-consortium.org/site
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h
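A worked example against the HRC panel, using placeholder file names; the second command assumes the set of plink commands produced by the check is written to a script called Run-plink.sh, as in recent versions (check the program output for the exact file name):

    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
    sh Run-plink.sh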
Usage with the TOPMed reference panel
NOTE: It is recommended to use v4.3.0 for the TOPMed panel; previous versions will work with the TOPMed panel but will require around 300GB RAM.
v4.3.0 reduces the memory usage substantially and can be used with all other reference panels as well, with a corresponding reduction in memory usage.
The TOPMed reference panel is not available for direct download from this site; it needs to be created from the VCF of dbSNP submitted sites (currently ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz), which can be downloaded from the Bravo website:
https://bravo.sph.umich.edu/freeze5/hg38/
Once downloaded, the VCF can be converted to an HRC formatted reference legend using the code here:
CreateTOPMed.zip
Usage: ./CreateTOPMed.pl -i ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
By default this will create a file filtered for variants flagged as PASS only; if you wish to use all variants, the -a flag overrides this.
To override the default output file naming, use -o filename.
To run the checks, the usage is the same as for the HRC panel, namely using the flags -h -r PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab.gz
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h
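Putting the two steps together, a sketch of the TOPMed workflow (mystudy is a placeholder fileset name and the VCF is assumed to be in the current directory):

    ./CreateTOPMed.pl -i ALL.TOPMed_freeze5_hg38_dbSNP.vcf.gz
    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r PASS.Variants.TOPMed_freeze5_hg38_dbSNP.tab.gz -h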
Usage with 1000G reference panel
Requires the unzipped 1000G legend file (instructions to create this are below), which can also be downloaded here (1.4G in size):
1000GP_Phase3_combined.legend.gz
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -g -p <population>
Population can be one of the following:
ALL All samples
AFR African
AMR Ad Mixed American
EAS East Asian
EUR European
SAS South Asian
The 1000G population will default to ALL if not specified.
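For example, to check a placeholder fileset against the 1000G panel using the European reference frequencies (versions from V4.2.6 can read the gzipped legend directly; otherwise gunzip it first):

    perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
        -r 1000GP_Phase3_combined.legend.gz -g -p EUR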
Usage with CAAPA reference panel
The CAAPA reference panel legend can be downloaded here (513MB):
all.caapa.sorted.zip
Many thanks to Kathleen Barnes and Michelle Daya of CAAPA for sharing this, and also to Margaret Parker and Michael Cho for the initial reformatting.
Usage is the same as for the HRC panel, namely using the flags
-h -r all.caapa.sorted.txt
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r all.caapa.sorted.txt -h
Usage with Asia reference panel
The Asia Genome reference panel legend can be downloaded here (912MB):
ASIA.Genome.Reference.legend.zip
Usage is the same as for the 1000G panel, namely using the flags
-g -r ASIA.Genome.Reference.legend
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r ASIA.Genome.Reference.legend -g -p <population>
Population can be one of the following:
ALL All Samples
OCE Oceania
NEA North East Asian
SEA South East Asian
SAS South Asian
AFR African
AMR American
WER West Eurasia
The population will default to ALL if not specified.
1000G reference panel file
The reference file for 1000G does not exist as a single file, so some steps are necessary to create it. You will need the legend files from the 1000GP_Phase3.tgz file on the IMPUTE website:
https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
Download the .tgz, extract all the .legend files and place them in a directory together with the following script:
concatenate-1000GP.zip
Then run the script (perl concatenate-1000GP.pl); this will create a file (1000GP_Phase3_combined.legend) suitable for use with the checking program. A sketch of the steps is shown below.
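A sketch of those steps, assuming concatenate-1000GP.pl has been unzipped into the working directory (the exact layout of the extracted archive may differ):

    tar -xzf 1000GP_Phase3.tgz
    # move all the .legend files into the current directory, gunzipping them first if needed
    perl concatenate-1000GP.pl

The result, 1000GP_Phase3_combined.legend, is then passed to the checking script with -r.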
Original script
The original script for HRC-only checking can still be downloaded here, but I recommend using V4.2.1 or above as there are more features and a number of issues have been corrected in the most recent versions:
HRC-check-bim.zip
Perl script to extract a subset of data
from the UKBioBank phenotype data download.
Requires the Stata .dct and R .tab files to run.
Run without any arguments to get a description of usage.
Download:
Version 1.0
extract.zip
Version 1.2.2
extract-v1.2.2.zip
Phenotype File Summary and Fingerprinting
This program was developed to give an overview of the phenotypes (columns) in a tab delimited text file. To do this it attempts to determine the type of each column (numeric, text, mixed text/numeric or coded variables) and summarise it appropriately.
Downloads
- v3.1
- phenotype_qc.zip
- v4.1
- Updated to better accommodate the missing column names in the
dct file.
- phenotype_qc.v4.1.zip
Usage:
Version 3.1
perl phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h]
Version 4.1
Version 4.1 introduces a new method for extracting the data from the phenotype file. This is necessary for large data sets (>5GB), where v3.1 would not perform well or at all. This version works on data sets of any size; however, it will only work in a Unix/Linux environment (Cygwin is possible but not tested) as it requires the UNIX cut command to extract the data. The option -k specifies how many columns at a time to extract using cut; the default is 25, which is a trade-off between the number of calls to cut and the amount of memory required. If you have problems with memory usage, set this to a smaller number. Higher numbers are possible, but on the test sets to date the time to complete a run increases, probably due to memory allocation.
The second new option (-p) plots all the numeric phenotypes, allowing a quick visual check of the distribution of values. Running this version, with or without plotting, still requires GD::Graph to be installed; for instructions on how to do this see here: GD::Graph
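If GD::Graph is not already present, one common route is to install it from CPAN, for example with cpanminus (this assumes the GD/libgd development libraries are already available on your system):

    cpanm GD::Graph

The standard CPAN shell (cpan GD::Graph) works equally well.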
v4.1 Usage:
phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h -p -k <#columns per chunk>]
Phenotype QC program command line options:
- -f <phenotypeFile>: A tab delimited phenotype file; all rows should contain the same number of columns.
- -i <sampleIDcolumn>: Set the column that contains the sample ids; column numbers start from 0, the default is 0.
- -m <missingValueIdentifier>: Set the missing value identifier; the default is NA.
- -c <column>: Specify a single column to check; if using a header file this can be either the human readable or original header.
- -s <columnFile>: Provide a file containing a list of columns to extract, one per row. As with single column checking this can be either header; takes precedence over -c.
- -l <headerLookup>: Provide an optional header file for the data set, for use where the headers within the file are not human readable; requires two tab delimited columns, the first of which should contain the ids that match the header in phenotypeFile.
- -h: Show the help message.
Version 4.1 only:
- -k <#columns per chunk>: Specify the number of columns to extract in each chunk for processing; the default if not specified is 25.
- -p: Specify to plot every column identified as numeric, or numeric coded; the default is not to plot.
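A hypothetical version 4.1 run combining several of the options above (the phenotype and header lookup file names are placeholders):

    perl phenotype_qc.pl -f ukb_phenotypes.tab -i 0 -m NA \
        -l header_lookup.txt -p -k 10

This treats column 0 as the sample ID, uses NA as the missing value, plots every numeric column and extracts 10 columns per chunk to keep memory usage down.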
- Numeric: this category reports:
  - Total number of non-missing values.
  - Total number of missing entries.
  - Mean.
  - Median.
  - Standard deviation.
  - Number of entries that are more than three standard deviations above or below the mean.
- Text, Mixed and Coded: these categories all report:
  - Total number of non-missing values.
  - Total number of missing entries.
  - Total number of unique entries; if the number of unique entries is below 50 these are displayed with counts for each entry.
- Coded:
  - Entries in this category can be text coded, numeric coded or mixed coded.
  - A column is described as containing a coded set of variables if the total number of unique values is less than 0.1% of the total number of entries, or 20, whichever is the greater. For example, in a file with 100,000 rows a column would be treated as coded if it contains no more than 100 unique values.
The latest versions (v3.1 & v4.1) also produce a "fingerprint" summary of each row (assumed to be a sample) and each column (assumed to be a phenotype); these are designed to verify whether a sample or phenotype has changed between data releases.
The fingerprint files (prefixed FP for fingerprint phenotype and FPS for fingerprint sample) both contain the following columns:
- Sample ID or column header.
- Total numeric values.
- Total non-numeric values.
- Total non-missing values.
- Total unique values.
- Total missing values.
- Determined type.
These files can be used to determine whether a phenotype has changed between data releases; a script to do this comparison is under development and will be posted as soon as it is finalised.
Excel to Tab
A Perl wrapper using Spreadsheet::Read to convert various
spreadsheet formats (xls, xlsx, csv, ods) to tab delimited text
files.
Where the format supports sheets, the output file name will include the sheet name and there will be one file per sheet.
Depending on your system you may need to add one or more of
the following modules to your Perl install:
- Spreadsheet::Read
- Spreadsheet::ReadSXC
- Spreadsheet::ParseExcel
- Spreadsheet::ParseXLSX
- Text::CSV_XS
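If any of these are missing, they can usually be added from CPAN; for example, with cpanminus:

    cpanm Spreadsheet::Read Spreadsheet::ParseXLSX Text::CSV_XS

(The standard cpan client can be used in the same way.)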
Download
xls2tab.zip
Usage:
xls2tab -i <inputfile> [-o <outputfilestem> -f]
- -i <inputfile>: Path and name of the input file.
- -o <outputfilestem>: Optional; if supplied this will form the stem for all the output. For example, using MyFile will lead to MyFile.Sheet1.txt, MyFile.Sheet2.txt (assuming there are 2 sheets named Sheet1 and Sheet2). If not supplied the name will be the same as the input, with the sheet name, if it exists, and the file extension .txt.
- -f: Optional; tells the program to select the formatted values from the cells, not the raw values. Default is for raw values.
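For example, to convert a hypothetical workbook, writing one tab delimited file per sheet and keeping the formatted cell values:

    xls2tab -i Results.xlsx -o Results -f

With two sheets named Sheet1 and Sheet2 this would produce Results.Sheet1.txt and Results.Sheet2.txt.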