GAIA: An easy-to-use web-based application for interaction analysis of case-control data
Abstract
Background
The advent of cheap, large scale genotyping has led to widespread adoption of genetic association mapping as the tool of choice in the search for loci underlying susceptibility to common complex disease. Whilst simple single locus analysis is relatively trivial to conduct, this is not true of more complex analyses such as those involving interactions between loci. The importance of testing for interactions between loci in association analysis has been highlighted in a number of recent high profile publications.
Results
Genetic Association Interaction Analysis (GAIA) is a web-based application for testing for statistical interactions between loci. It is based upon the widely used case-control study design for genetic association analysis and is designed so that non-specialists may routinely apply tests for interaction. GAIA allows simple testing of both additive and additive plus dominance interaction models and includes permutation testing to appropriately correct for multiple testing. The application will find use both in candidate gene based studies and in genome-wide association studies. For large scale studies GAIA includes a screening approach which prioritizes loci (based on the significance of main effects at one or both loci) for further interaction analysis.
Conclusion
GAIA is available at http://www.bbu.cf.ac.uk/html/research/biostats.htm
Background
Genetic association mapping is one of the primary tools used to identify loci involved in common complex disease. Such analyses are typically implemented by testing for a difference between allele frequencies at a locus in a population sample of cases and controls. However, such an approach only considers one locus at a time. Most common diseases will be genetically complex, with multiple loci contributing to disease susceptibility. Epistasis is the phenomenon where the phenotypic effect of one locus changes as a result of the genotype at one or more other loci. The importance of epistasis has been strongly emphasised recently [1-3], with the poor replication rate of human genetic association studies cited as being partly attributable to the lack of consideration given to epistatic effects [4,5]. Another recent paper [6] has suggested the power of large scale studies may be substantially improved by considering interactions among loci.
Appropriate analysis of population data may be invaluable in identifying loci that exhibit significant interaction (in the statistical sense). Although analyses which consider interaction terms can be implemented in packages such as R [7] or STATA (for scripts and further details see [8]), such analyses are difficult for non-specialists to implement and cannot be readily applied to large numbers of genetic markers. Since large volumes of population data are now being generated in many molecular genetics laboratories there is an urgent need for applications which can streamline the data analysis stage of genetic association mapping projects. GAIA, a freely available, easy-to-use web application, allows non-specialist users to routinely test for interactions.
Methods
Regression model
GAIA uses Perl CGI scripts to code the data, with the R statistical environment [7] used for the necessary statistical routines. The application uses a regression model which allows the user to test for pairwise locus-locus interactions between genes. For the case-control data sets typically employed, this utilizes the logistic regression model
$$\log\left(\frac{p}{1-p}\right) = \mu + a_1 x_1 + d_1 z_1 + a_2 x_2 + d_2 z_2 + i_{aa} x_1 x_2 + i_{ad} x_1 z_2 + i_{da} z_1 x_2 + i_{dd} z_1 z_2$$
where p is the probability of each individual being a case, xi and zi (i = 1, 2, indexing the two loci) are dummy variables with xi = 1, zi = -0.5 for one homozygote genotype, xi = 0, zi = 0.5 for the heterozygote genotype and xi = -1, zi = -0.5 for the other homozygote [9]. We assume a diallelic locus such as a single nucleotide polymorphism (SNP). μ corresponds to the mean effect. The terms a1, d1, a2, d2 represent the parameters corresponding to the additive and dominance effects at the two SNPs (i.e. the main effects). Similarly, iaa, iad, ida, idd represent the epistatic interaction effects.
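The genotype coding described above can be illustrated with a minimal Python sketch. This is not GAIA's own code (GAIA uses Perl and R); the function names, the choice of which homozygote receives x = 1, and the pairing of each interaction term with a product of dummy variables (following the parameterization in [9]) are assumptions made for illustration.

```python
# Hypothetical helper names; not part of GAIA.
TERMS = ["mu", "a1", "d1", "a2", "d2", "iaa", "iad", "ida", "idd"]

def code_genotype(copies):
    """Return (x, z) for a diallelic genotype.

    `copies` is the number of copies (0, 1 or 2) of an arbitrarily chosen
    reference allele. Assumption: the homozygote with 2 copies gets x = 1.
    """
    coding = {2: (1.0, -0.5), 1: (0.0, 0.5), 0: (-1.0, -0.5)}
    return coding[copies]

def design_row(copies1, copies2):
    """Build one row of the design matrix for a pair of SNPs.

    Columns are ordered as in TERMS: intercept, main effects at each
    locus, then the four products used for the interaction terms.
    """
    x1, z1 = code_genotype(copies1)
    x2, z2 = code_genotype(copies2)
    return [1.0, x1, z1, x2, z2, x1 * x2, x1 * z2, z1 * x2, z1 * z2]

# Example: homozygote at SNP 1, heterozygote at SNP 2.
row = design_row(2, 1)
print(dict(zip(TERMS, row)))
```

Dropping the z columns and all products except x1 * x2 would give the design for the additive only model.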
Implemented tests
GAIA allows an "additive plus dominance" 4 degree of freedom (df) test of interaction in which a model with the terms iaa, iad, ida, idd is compared with one without them; in GAIA this is referred to as the "interaction only" p-value. The "interaction only" p-value tests the significance of the interaction model terms over and above any main effects. Also implemented is an 8 df test of overall significance (model with all terms compared with model with only μ fitted); in GAIA this is referred to as the "overall" p-value. The "overall" p-value tests the significance of all terms in the model (i.e. the joint significance of both main effects and interaction effects). By dropping dominance terms GAIA can perform an "additive only" 1 df interaction test (i.e. a model with a1x1, a2x2 and iaa versus a model with a1x1 and a2x2) and a 3 df overall significance test (i.e. a model with a1x1, a2x2 and iaa versus a model with only μ). The significance of the relevant model terms can be evaluated by comparing twice the log-likelihood difference between models with a χ2 distribution. Alternatively, permutation based tests can be applied (see below). The parsimonious models with only additive effects will be powerful when dominance effects are moderate or small. For markers with small minor allele frequencies it may not be possible to fit all the interaction terms with the additive plus dominance model; this arises because some of the relevant locus-locus genotypes are not present in the data. In general, with n and m model terms for each marker, the interaction test will have n × m df. This means that interaction tests involving multiallelic markers and/or haplotypes will have large numbers of degrees of freedom. Tests based on large df are unlikely to be powerful for general screening of genes and hence are not implemented in the web application. Multiallelic markers can of course be downcoded to two alleles for use in the application.
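As a rough illustration of the likelihood ratio comparison described above, the following Python sketch converts two maximised log-likelihoods into a test statistic and an asymptotic χ2 p-value using only the standard library. The function names and the example log-likelihood values are hypothetical; GAIA itself relies on R for these routines.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    if stat <= 0:
        return 1.0
    half = stat / 2.0
    # Base cases: df = 1 (via erfc) or df = 2 (exponential tail).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def lrt_pvalue(loglik_full, loglik_reduced, df):
    """Twice the log-likelihood difference, referred to chi-square on df."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)

def interaction_df(n_terms_1, n_terms_2):
    """df of the interaction test with n and m model terms per marker."""
    return n_terms_1 * n_terms_2

# Hypothetical log-likelihoods for the 4 df "interaction only" comparison.
stat, p = lrt_pvalue(-950.0, -960.5, interaction_df(2, 2))
print(stat, p)
```

With additive and dominance terms at each SNP, `interaction_df(2, 2)` gives the 4 df of the "interaction only" test; with additive terms only, `interaction_df(1, 1)` gives 1 df.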
GAIA is intended for use with genes that are in linkage equilibrium (for example genes far apart on the same chromosome or on different chromosomes). For nearby gene pairs the application still gives valid results but this sort of data is probably better dealt with by constructing haplotype based tests of association.
Input format
GAIA has a flexible input format allowing either i) the data for both genes to be coded in a single file, or ii) separate data files for each gene; files are automatically merged based on matching values in the first field. The input file is required to be in "Linkage" format, which is essentially the de facto standard for coding genetic data [10]. The user then specifies the marker(s) of interest and the analysis options required. In addition to additive only and additive plus dominance interaction models the program can output standard allelic (i.e. additive only) and genotypic (i.e. additive plus dominance) tests for each SNP singly.
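The merging of two input files on the first field can be sketched as follows. The column layout assumed here (pedigree ID, individual ID, father, mother, sex, affection status, then two alleles per marker) is the conventional Linkage pedigree layout; whether GAIA expects exactly this layout, and how it treats individuals present in only one file, are assumptions, and the function names are hypothetical. See [10] and the GAIA web page for the authoritative format.

```python
def parse_linkage_line(line):
    """Split one whitespace-delimited Linkage-style record.

    Assumed columns: pedigree ID, individual ID, father, mother, sex,
    affection status, followed by two alleles per marker.
    """
    fields = line.split()
    key = fields[0]                      # first field, used for merging
    status = int(fields[5])              # affection status column
    alleles = fields[6:]
    genotypes = [(alleles[i], alleles[i + 1])
                 for i in range(0, len(alleles) - 1, 2)]
    return key, status, genotypes

def merge_by_first_field(lines_gene1, lines_gene2):
    """Join two files on the first field, keeping records found in both.

    Assumption: records missing from either file are dropped.
    """
    second = {}
    for line in lines_gene2:
        key, _, genotypes = parse_linkage_line(line)
        second[key] = genotypes
    merged = {}
    for line in lines_gene1:
        key, status, genotypes = parse_linkage_line(line)
        if key in second:
            merged[key] = (status, genotypes + second[key])
    return merged

# Toy records with made-up identifiers.
gene1 = ["F1 1 0 0 1 2 1 2", "F2 1 0 0 2 1 1 1"]
gene2 = ["F2 1 0 0 2 1 2 2", "F3 1 0 0 1 1 1 2"]
print(merge_by_first_field(gene1, gene2))
```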
Permutation testing
Although Bonferroni corrections can be readily applied for independent statistical tests, markers within the same gene are likely to have correlated allele frequencies. We hence apply a permutation procedure which appropriately takes into account this non-independence [11]. The two different possible significance tests require different permutation tests. The test for the significance of the interaction terms (over and above the main effects) is performed by permuting the interaction model terms, with the main effects and disease status remaining as in the original data set. The test for overall significance is performed by permuting the status variable and keeping the other model terms fixed. The possible permutation tests are discussed further in a paper by Carlborg and Andersson [12]. The relevant test statistics are recalculated a large number of times, with the appropriate permutation procedure applied each time. By sorting the resultant p-values we can calculate an empirical p-value. The permutation procedure can also be used to correct for the non-independence of the two possible interaction tests.
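The permutation of the status variable used for the overall test can be sketched in a few lines of Python. This is an illustration only: the test statistic below (absolute difference in mean additive coding between cases and controls) is a stand-in for the regression-based statistic, the "+1" adjustment in the empirical p-value is one common convention rather than something stated in the text, and all function names are hypothetical. The permutation scheme for the "interaction only" test is not shown.

```python
import random

def empirical_pvalue(observed, permuted):
    """Proportion of permuted statistics at least as large as the observed one.

    Assumption: the observed data set is counted once (the +1 terms).
    """
    exceed = sum(1 for s in permuted if s >= observed)
    return (exceed + 1) / (len(permuted) + 1)

def permute_status_pvalue(status, stat_fn, n_perm=1000, seed=0):
    """Shuffle case/control labels, keeping genotypes fixed."""
    rng = random.Random(seed)
    observed = stat_fn(status)
    shuffled = list(status)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted.append(stat_fn(shuffled))
    return empirical_pvalue(observed, permuted)

def mean_diff_stat(x):
    """Stand-in statistic: |mean x in cases - mean x in controls|."""
    def stat(status):
        cases = [xi for xi, s in zip(x, status) if s == 1]
        controls = [xi for xi, s in zip(x, status) if s == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
    return stat

# Toy data: additive coding perfectly separated by status.
x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
status = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(permute_status_pvalue(status, mean_diff_stat(x), n_perm=200, seed=1))
```

To correct across several markers, `stat_fn` could instead return the largest statistic over all markers, so that each permutation contributes its best result; this is one common choice and not necessarily the procedure GAIA uses.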
In the simple case where the permutation test is conducted on a single marker, the permutation p-values should be very similar to the (asymptotic) p-values from a χ2 distribution on the appropriate degrees of freedom. We therefore validated the "interaction only" test by comparing the asymptotic p-value (from a
Results
Candidate gene example
We used GAIA to analyse data from a case-control study of the candidate gene GENEX in schizophrenia (real gene name suppressed). We wanted to test for interactions between GENEX and the gene GENEXInteractingProtein (which are on different chromosomes). There was a clear biological motivation for testing for interaction between SNPs in these two genes. We used GAIA to test 20 SNPs in GENEX for interactions with a SNP in GENEXInteractingProtein. 673 cases and 716 controls were available. The additive "interaction only" test was utilised. Applying GAIA to the available case-control data yielded evidence for an interaction, with a p-value of 0.00033 for the significance of the interaction (over and above the main effects) for a marker in GENEX. A few of the 20 SNPs in GENEX were in strong linkage disequilibrium, so applying a Bonferroni correction for the multiple markers tested would be overly conservative (the Bonferroni corrected p-value was 0.0066). Applying the web application's permutation correction for multiple testing yielded a p-value of 0.0058. In a test run on this data set, 10000 permutations took 10 hours with GAIA. It is interesting to note that, when we examined GENEX and GENEXInteractingProtein, we did not find significant main effects for either of the two SNPs that were found to interact (although we did find significant main effects for some of the other SNPs in GENEX).
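The Bonferroni figure quoted above follows from multiplying the nominal p-value by the number of SNPs tested. A minimal Python check, with a hypothetical function name:

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Nominal p-value and number of GENEX SNPs taken from the text.
corrected = bonferroni(0.00033, 20)
print(round(corrected, 4))
```

The permutation-corrected value of 0.0058 is slightly smaller because the permutation procedure accounts for the correlation between SNPs.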
Large scale analysis
With approximately 30000 human genes and hence potentially
In a wider, chromosome- or genome-wide context, there may also be value in applying interaction analysis, with the improved power outweighing the cost of the multiple-testing correction [6]. For large scale data GAIA can implement a screening approach. Loci are screened on a SNP by SNP basis with SNPs reaching a nominal level of significance (p < 0.05 for the additive single marker test) followed through to a secondary stage. The user can then apply one of the following approaches:
1. test for interactions between the nominally significant SNPs
2. test for interactions between the nominally significant SNPs and all of the SNPs in the original data set
To perform this procedure in GAIA, the web application is first used to generate a file containing the nominally significant SNPs. This file is then either i) reloaded into GAIA in both input boxes (first approach), or ii) reloaded into one input box with the original data file loaded into the other input box (second approach). GAIA is then run as usual on the relevant subset of SNPs. Although both the "interaction only" and the "overall" test can be applied here, recent research suggests that utilising the "overall" test may be particularly fruitful (see also discussion section). More detailed instructions on performing the screening procedure are included on the GAIA web page.
Although the first screening approach is computationally less intensive and requires fewer tests, we would recommend the second of the two approaches. This is because the increase in the number of tests is modest and it is rather restrictive to only test pairs where both have main effects. In many realistic scenarios where epistatic effects are important, the main effect of at least one of the interacting loci would not be significant [6] and hence should ideally not be discarded in the screening stage.
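The two screening approaches can be expressed as a short Python sketch that lists the SNP pairs each approach would test. The SNP names, p-values and function names are made up for illustration; in GAIA itself the screening is carried out through the web interface as described above.

```python
from itertools import combinations

def screen(single_marker_p, threshold=0.05):
    """Keep SNPs whose single marker p-value is below the threshold."""
    return [snp for snp, p in single_marker_p.items() if p < threshold]

def pairs_first_approach(hits):
    """Approach 1: pairs among the nominally significant SNPs only."""
    return list(combinations(hits, 2))

def pairs_second_approach(hits, all_snps):
    """Approach 2: each significant SNP against every SNP in the data set."""
    pairs = set()
    for h in hits:
        for s in all_snps:
            if h != s:
                pairs.add(tuple(sorted((h, s))))  # avoid counting a pair twice
    return sorted(pairs)

# Toy single marker results (hypothetical SNP names and p-values).
p_values = {"rs1": 0.01, "rs2": 0.2, "rs3": 0.04, "rs4": 0.5, "rs5": 0.9}
hits = screen(p_values)
print(len(pairs_first_approach(hits)))
print(len(pairs_second_approach(hits, list(p_values))))
```

With k significant SNPs out of n, the first approach yields k(k - 1)/2 pairs and the second yields k(k - 1)/2 + k(n - k) pairs.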
To test the feasibility of applying GAIA to a large number of markers we applied the screening approach to a set of 600 SNPs typed in 135 cases and 289 controls across chromosome 10. Testing for all possible pairwise combinations would involve 600 × 599/2 = 179700 interaction tests. Whilst not impossible, this number of tests on these data would take 15 hours computing time with GAIA. Applying the first and second screening approaches above reduced the number of interaction tests to
With large numbers of markers the application's capacity for permutation analysis is limited. However, since a relatively small proportion of such loci will be correlated, the Bonferroni correction will not be overly conservative when applied to large numbers of markers (cf. the situation with a small number of markers within a candidate gene, where LD may be strong and hence the Bonferroni correction rather conservative).
Discussion
GAIA allows researchers to apply two different tests. One test, the "interaction only" test, considers the significance of the interaction terms on their own (over and above main effects). The other test, the "overall" test, considers the overall significance of both the main (or marginal) effects and the interaction effects together (i.e. a model with the terms a1, d1, a2, d2, iaa, iad, ida, idd compared with a model without them). The tests will be useful in different situations. The "interaction only" test will be most useful in candidate gene studies; for example in the schizophrenia data described here, there was evidence for statistical interaction between two biologically related genes. The "overall" test will be useful as a replacement for association testing of large numbers of loci singly. The "overall" test was discussed in this context by Marchini et al. [6]; they show that models with interaction terms can be more powerful than simpler models which ignore interaction. Power improvements were shown both for a brute force approach which tested all possible interactions and for an approach which screened loci for nominal significance [6]. In many realistic scenarios they show that the improved power outweighs the cost of the multiple-testing correction. Essentially, the increase in significance when fitting the "correct" model scales better with sample size than the magnitude of the multiple-testing correction [3]. Models similar to those described by Marchini et al. were considered recently by Millstein et al. [15], who apply a slightly different set of sequential tests in which tests are done by selectively conditioning on previous results from single locus tests. Another approach which addressed some of the same issues (but not in a human genetics context) is that of Carlborg and Andersson [12]. In GAIA the implemented approach for sequential testing involves applying the screening approach described in the previous section.
Consider the 600 SNP example outlined earlier. With 600 SNPs, one would expect to find approximately 30 SNPs that were significant on their own (at a nominal 5% significance level and assuming only a small proportion of loci actually influence disease risk). A useful screen (similar to that described in strategy III from Marchini et al) with the "overall" test would therefore be to compute the 600 × 30 -
Logistic regression based interaction analysis has been utilised by various authors [6,9,15]. Although this method can be applied using standard statistical packages, GAIA facilitates simple application of the method with the added advantage of permutation analysis and simple screening for inclusion of SNPs in the interaction test. A non-parametric alternative to parametric analyses such as logistic regression is Multifactor Dimensionality Reduction (MDR) [16]. The MDR approach avoids specifying a particular model for the interactions and instead bases its inferences on sets of "high" and "low" risk multilocus genotypes. This approach can be powerful for certain models of interaction with little or no main effects. However, for many realistic models of interaction, the MDR approach has been shown to be less powerful than approaches based on logistic regression [15].
GAIA does not currently accommodate family-based association design data. Tests analogous to the family based Transmission Distortion Test (TDT, [17] and refinements) can be conducted through the use of conditional logistic regression [18] and this accommodates the linear modeling of interactions. However, such methods are most powerful when there are informative transmissions from heterozygote parents, and the use of highly polymorphic markers (with high heterozygosity) undesirably leads to large numbers of degrees of freedom in the tests for interactions described above. This, combined with the larger sample sizes usually available, means that the case-control design is likely to be most suitable for interaction analysis.
It is important to differentiate between biological epistasis (e.g. where two or more genes are involved in the same biological pathway and are jointly responsible for the end phenotype) and statistical epistasis (i.e. the deviation of the terms iaa, iad, ida, idd from zero in the linear model stated above). Biological epistasis occurs at the individual level whereas statistical epistasis is necessarily based upon populations. There is no direct relationship between these two definitions of epistasis, and the existence of a number of possible parameterizations of the penetrances (parameters that define the genotype-phenotype relationship for binary traits) means that the significance of the interaction terms may be scale dependent [9]. In GAIA we utilise the log odds of the penetrance; this function is widely used in epidemiological studies and yields results comparable to those obtained from standard contingency tables when applied to single SNPs. For further discussion of the biological/statistical epistasis issue see [9,14].
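The scale dependence of statistical interaction can be illustrated with a small Python sketch. It uses a deliberately simplified setting, assumed here purely for illustration: two binary risk factors (rather than three-genotype SNPs) with made-up penetrances that are exactly additive on the log odds scale. The same penetrances show a non-zero interaction contrast on the untransformed penetrance scale.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))

def expit(v):
    """Inverse of the log odds transform."""
    return 1.0 / (1.0 + math.exp(-v))

def interaction_contrast(pen, scale):
    """Departure from additivity of a 2 x 2 penetrance table on a given scale."""
    return (scale(pen[(1, 1)]) - scale(pen[(1, 0)])
            - scale(pen[(0, 1)]) + scale(pen[(0, 0)]))

# Hypothetical penetrances: additive on the log odds scale by construction.
pen = {(a, b): expit(-2.0 + 1.0 * a + 1.0 * b)
       for a in (0, 1) for b in (0, 1)}

on_log_odds = interaction_contrast(pen, logit)        # essentially zero
on_penetrance = interaction_contrast(pen, lambda p: p)  # clearly non-zero
print(on_log_odds, on_penetrance)
```

A data set generated from these penetrances would therefore show no interaction in a logistic model but would show one in a model that is linear in the penetrances.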
Conclusion
GAIA allows non-specialists access to interaction analysis of genetic association data. In our lab the application has allowed such users to routinely screen the candidate genes they are currently interested in against a set of established loci for a variety of genetically complex psychiatric diseases. By combining appropriate biological information on the genes underlying the detected statistical interactions, GAIA users should be able to better understand the aetiology of the disease under study. GAIA also allows interaction analysis to be applied on a larger scale. A practical screening facility which discards loci not showing main effects at either locus is provided in GAIA to make large scale analysis tractable.
Availability and requirements
GAIA is accessible via http://www.bbu.cf.ac.uk/html/research/biostats.htm and is freely available for use by academics and non-academics. The source code for GAIA (perl and R) is available from the above URL. GAIA currently runs on 2 separate Intel based PCs (both accessible via the above URL).
Authors' contributions
SM and IAK planned the study. SM and IAK wrote and tested the software. SM wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
We thank the Higher Education Funding Council for Wales and QIMR for financial support. We are indebted to the participants in our schizophrenia study. We thank Michael O'Donovan and Peter Visscher for helpful comments.
References
1. Carlborg O, Haley CS. Epistasis: too often neglected in complex trait studies? Nat Rev Genet. 2004;5:618–625. doi:10.1038/nrg1407.
2. Moore JH. A global view of epistasis. Nature Genet. 2005;37:13–14. doi:10.1038/ng0105-13.
3. Daly MJ, Altshuler D. Partners in crime. Nature Genet. 2005;37:337–338. doi:10.1038/ng0405-337.
4. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med. 2002;4:45–61.
5. Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56:73–82. doi:10.1159/000073735.
6. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genet. 2005;37:413–417. doi:10.1038/ng1537.
7. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2004. http://www.R-project.org ISBN 3-900051-00-3.
8. Cordell HJ, Clayton DG. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: Application to HLA in type 1 diabetes. Am J Hum Genet. 2002;70:124–141. doi:10.1086/338007.
9. Cordell HJ. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002;11:2463–2468. doi:10.1093/hmg/11.20.2463.
10. Linkage User's Guide. http://linkage.rockefeller.edu/soft/linkage
11. Zhao J, Curtis D, Sham P. Model-free analysis and permutation tests for allelic associations. Hum Hered. 2000;50:133–139. doi:10.1159/000022901.
12. Carlborg O, Andersson L. Use of randomization testing to detect multiple epistatic QTLs. Genet Res. 2002;79:175–184. doi:10.1017/S001667230200558X.
13. Culverhouse R, Suarez BK, Lin J, Reich T. A perspective on epistasis: Limits of models displaying no main effect. Am J Hum Genet. 2002;70:461–471. doi:10.1086/338759.
14. Moore JH, Williams SM. Traversing the conceptual divide between biological and statistical epistasis: Systems biology and a more modern synthesis. Bioessays. 2005;27:637–646. doi:10.1002/bies.20236.
15. Millstein J, Conti DV, Gilliland FD, Gauderman WJ. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006;78:15–27. doi:10.1086/498850.
16. Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003;19:376–382. doi:10.1093/bioinformatics/btf869.
17. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium – the insulin gene region and insulin-dependent diabetes-mellitus (IDDM). Am J Hum Genet. 1993;52:506–516.
18. Schaid DJ. General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol. 1996;13:423–449. doi:10.1002/(SICI)1098-2272(1996)13:5<423::AID-GEPI1>3.0.CO;2-3.
Articles from BMC Medical Genetics are provided here courtesy of BMC
Full text links
Read article at publisher's site: https://doi.org/10.1186/1471-2350-7-34
Read article for free, from open access legal sources, via Unpaywall: https://bmcmedgenet.biomedcentral.com/track/pdf/10.1186/1471-2350-7-34