pyseer: a comprehensive tool for microbial pangenome-wide association studies.

Lees JA; Galardini M; Bentley SD; Weiser JN; Corander J

doi:10.1093/bioinformatics/bty539

pyseer: a comprehensive tool for microbial pangenome-wide association studies.

Affiliations

1. Department of Microbiology, New York University School of Medicine, New York, NY, USA.
Authors
Lees JA¹
Weiser JN¹
(2 authors)
2. European Bioinformatics Institute (EMBL-EBI), Hinxton, UK.
Authors
Galardini M²
(1 author)
3. Department of Parasites and Microbes, Wellcome Sanger Institute, Hinxton, UK.
Authors
Bentley SD³
Corander J³
(2 authors)

ORCIDs linked to this article

Bioinformatics (Oxford, England), 01 Dec 2018, 34(24):4310-4312
https://doi.org/10.1093/bioinformatics/bty539 PMID: 30535304 PMCID: PMC6289128

This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.

Free full text in Europe PMC

This article is based on a previously available preprint.

Abstract

Summary

Genome-wide association studies (GWAS) in microbes have different challenges to GWAS in eukaryotes. These have been addressed by a number of different methods. pyseer brings these techniques together in one package tailored to microbial GWAS, allows greater flexibility of the input data used, and adds new methods to interpret the association results.

Availability and implementation

pyseer is written in python and is freely available at https://github.com/mgalardini/pyseer, or can be installed through pip. Documentation and a tutorial are available at http://pyseer.readthedocs.io.

Supplementary information

Supplementary data are available at Bioinformatics online.

Free full text

Bioinformatics. 2018 Dec 15; 34(24): 4310–4312.

Published online 2018 Jul 2. https://doi.org/10.1093/bioinformatics/bty539

PMCID: PMC6289128

PMID: 30535304

pyseer: a comprehensive tool for microbial pangenome-wide association studies

John A Lees,¹ Marco Galardini,² Stephen D Bentley,³ Jeffrey N Weiser,¹ and Jukka Corander^3,^4,⁵

Oliver Stegle, Associate Editor

Author information Article notes Copyright and License information Disclaimer

This article has been cited by other articles in PMC.

Go to:

Associated Data

Supplementary Materials: Supplementary Material.
bty539_supplementary_material.pdf (306K)

Go to:

Abstract

Summary

Genome-wide association studies (GWAS) in microbes have different challenges to GWAS in eukaryotes. These have been addressed by a number of different methods. pyseer brings these techniques together in one package tailored to microbial GWAS, allows greater flexibility of the input data used, and adds new methods to interpret the association results.

Availability and implementation

pyseer is written in python and is freely available at https://github.com/mgalardini/pyseer, or can be installed through pip. Documentation and a tutorial are available at http://pyseer.readthedocs.io.

Supplementary information

Supplementary data are available at Bioinformatics online.

Go to:

1 Introduction

Finding genetic variation associated with bacterial phenotypes such as antibiotic resistance, virulence and host specificity has great potential for a better understanding of the evolution of these traits, and may be able to inform new clinical interventions. Genome-wide association studies (GWAS) address this in a hypothesis-free manner. The recent availability of thousands of whole-genome sequences from bacterial populations has made this approach possible, though issues of strong clonal population structure and a variable pangenome (sequence variation of both core regions present in every isolate, and in variably present accessory regions) must be accounted for in any successful analysis (Power et al., 2017).

One method that addresses these issues in a scalable manner is SEER (Lees et al., 2016), which uses non-redundant variable length k-mers between 9-100 bases as a generalized variant to represent variation across the pangenome. Linear models with a control for population structure are then used to perform the association. Other methods include: bugwas, which uses a linear mixed model (LMM) and also checks for lineage effects (Earle et al., 2016); scoary, which tests clusters of orthologous genes (COGs) with and without accounting for population structure (Brynildsrud et al., 2016); and phylogenetic regression, which uses a known phylogeny to adjust for the covariance between samples (Garland and Ives, 2000).

Each of these methods has its own set of advantages and limitations, so recent GWAS analyses have used a combination of these techniques along with methods designed for human genetics, often in a somewhat ad-hoc manner that requires tedious file format conversion and familiarity with a wide range of methods and software packages (Lees et al., 2017).

As the number of large bacterial population datasets increases, we recognize the need to make these methods more accessible to the field. We have therefore re-implemented SEER in an easy-to-use and install python package, pyseer. pyseer also includes many additional features covering new association models, input sources and output processing. This brings together all the methods mentioned above into a single piece of software, and further enables new combinations of analysis. Our method is equally applicable to viral GWAS, and indeed any species with structured populations and/or genomes with significant variation in gene content.

Go to:

2 Materials and methods

The foundation of pyseer is a direct python re-implementation of SEER (originally written in C++). K-mers of variable length counted from draft assemblies are used as the input, and their association with a phenotype of interest is assessed by fitting a generalized linear model to each k-mer. To control for population structure, multi-dimensional scaling of a pairwise distance matrix is performed and these components are included as fixed effects in each regression. In the case of a binary phenotype with a highly penetrant variant (those with high effect sizes), Firth regression is performed to maintain power. This adds a penalty to the logistic regression likelihood when there is a large error on the slope, which alleviates underestimation of p-values in cases with nearly separable data (Heinze and Schemper, 2002). Significant k-mers, after adjusting for multiple testing, can then be mapped to a reference annotation to find regions of the genome associated with the phenotype.

After re-implementation, results were the same as SEER (Supplementary Fig. S1) and pyseer is of comparable speed (Supplementary Table S1). We then expanded the features in pyseer to bring together the different methods mentioned above.

2.1 Input sources

In addition to k-mers, the original focus of SEER, it can be convenient to test for association of SNPs and INDELs called against a reference genome. These can be used to test for association of close or adjacent SNPs, which would be split into many low frequency k-mers. The consequence of these variants can be directly predicted (e.g. synonymous, frameshifting) which would require more downstream processing for k-mers. Presence or absence of COGs and aligned intergenic regions can also be useful, as shown by scoary (Brynildsrud et al., 2016).

pyseer can natively read all these types of variant from VCF or Rtab files, which also allows for any user defined input type (for example copy number variants). We also enable grouping of variants by genomic region to perform burden testing, now allowing users to perform analysis of rare variation while still accounting for population structure.

An important technical difficulty of using SEER, and most common linear mixed model implementations, is making sure that the membership and order of samples between the variant calls, phenotype file and population structure all match up correctly. Many methods do not check for this, and can run an incorrect analysis without producing a warning. In pyseer this label matching is done automatically, and the intersection of samples used is reported to the user.

2.2 Association models

pyseer implements the same fixed effect generalized linear regression model as SEER, including Firth regression. In pyseer we have added two new multidimensional scaling algorithms and a more streamlined interface with mash (Ondov et al., 2016) to compute population structure. Alternatively, when a high quality phylogeny is available, pyseer can use this to adjust for population structure in a manner analogous to phylogenetic regression.

As a major alternative we have included an LMM, which uses random effects to control for population structure, and has been shown to work across a range of scenarios. We have used the FaST-LMM implementation, which allows association in linear time (Lippert et al., 2011), using a kinship matrix estimated from a subset of variants or from a phylogeny. This now allows association of all k-mers under the mixed model in a few hours, which was previously computationally and bioinformatically challenging.

We have also included a method to estimate possible lineage effects, based on the procedure used in bugwas. Those variants which are associated with both the phenotype and with a lineage (which can be determined by pyseer or defined by the user) associated with the phenotype can be prioritized for further analysis outside of GWAS.

2.3 Output processing

It has been suggested that the number of unique variant patterns is a sensible way to set the multiple testing threshold, however these can be difficult to count due to the size of k-mer data. pyseer uses hashing to efficiently calculate this. We have also added tools to make interpretation of significant k-mers more streamlined. The user can map their results against any number of reference genomes for interactive display in phandango (Hadfield et al., 2018) (Supplementary Fig. S2). We have added the ability to summarize k-mer results at the gene level by iteratively mapping to reference and draft annotations, which can be used to show complementary information about effect size, coverage and minor allele frequency (Supplementary Fig. S3).

Go to:

3 Discussion

Starting with a re-implementation of SEER, we have added the models and input types used by other microbial and human GWAS approaches into a single package. Analyses which were previously challenging to perform, such as association of all variable length k-mers with a LMM, can be performed in an efficient and user-friendly manner. We have also enabled new types of analysis, such as population structure corrected burden testing, and gene level summaries of k-mers. Our package includes comprehensive documentation and a tutorial which shows how to perform GWAS using the new input sources and both association models, as well as how to interpret significant k-mers. We have implemented unit tests in our code to ensure consistency of output as features are added.

With pyseer we have therefore reconciled many of the existing methods for regression-based microbial GWAS into a single package. Our focus on documentation and ease-of-use of pyseer will make GWAS more accessible to the microbial genomics community.

Go to:

Supplementary Material

Click here for additional data file.^{(306K, pdf)}

Go to:

Acknowledgements

We wish to thank Leonor Sanchez-Buso for an early version of the k-mer annotation script. We would also like to thank Pedro Beltrao and Chrispin Chaguza for constructive comments.

Funding

JNW was funded by grants from the United States Public Health Service (AI038446 and AI105168). JC was funded by the ERC (grant no. 742158). SDB is funded by Wellcome (grant 098051).

Conflict of Interest: none declared.

Go to:

References

Brynildsrud O. et al. (2016) Rapid scoring of genes in microbial pan-genome-wide association studies with scoary. Genome Biol., 17, 238.. [Europe PMC free article] [Abstract] [Google Scholar]
Earle S.G. et al. (2016) Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat. Microbiol., 1, 16041.. [Europe PMC free article] [Abstract] [Google Scholar]
Garland T., Ives A.R. (2000) Using the past to predict the present: confidence intervals for regression equations in phylogenetic comparative methods. Am. Nat., 155, 346–364. [Abstract] [Google Scholar]
Hadfield J. et al. (2018) Phandango: an interactive viewer for bacterial population genomics. Bioinformatics, 34, 292–293. [Europe PMC free article] [Abstract] [Google Scholar]
Heinze G., Schemper M. (2002) A solution to the problem of separation in logistic regression. Stat. Med., 21, 2409–2419. [Abstract] [Google Scholar]
Lees J.A. et al. (2016) Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat. Commun., 7, 12797.. [Europe PMC free article] [Abstract] [Google Scholar]
Lees J.A. et al. (2017) Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration. Elife, 6, e26255. [Europe PMC free article] [Abstract] [Google Scholar]
Lippert C. et al. (2011) FaST linear mixed models for genome-wide association studies. Nat. Methods, 8, 833–835. [Abstract] [Google Scholar]
Ondov B.D. et al. (2017) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol., 17, 1–14. [Europe PMC free article] [Abstract] [Google Scholar]
Power R.A. et al. (2017) Microbial genome-wide association studies: lessons from human GWAS. Nat. Rev. Genet., 18, 41–50. [Abstract] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

Full text links

Read article at publisher's site: https://doi.org/10.1093/bioinformatics/bty539

Read article for free, from open access legal sources, via Unpaywall: https://academic.oup.com/bioinformatics/article-pdf/34/24/4310/27088675/bty539.pdf

Citations & impact

Impact metrics

138

Citations

Jump to Citations

Citations of article over time

Alternative metrics

Altmetric item for https://www.altmetric.com/details/44404310

Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/44404310

Article citations

Using GWAS and Machine Learning to Identify and Predict Genetic Variants Associated with Foodborne Bacteria Phenotypic Traits.
Tsoumtsa Meda L, Lagarde J, Guillier L, Roussel S, Douarre PE
Methods Mol Biol, 2852:223-253, 01 Jan 2025
Cited by: 0 articles | PMID: 39235748
Review
Rifaximin prophylaxis causes resistance to the last-resort antibiotic daptomycin.
Turner AM, Li L, Monk IR, Lee JYH, Ingle DJ, Portelli S, Sherry NL, Isles N, Seemann T, Sharkey LK, Walsh CJ, Reid GE, Nie S, Eijkelkamp BA, Holmes NE, Collis B, Vogrin S, Hiergeist A, Weber D, [...] Carter GP
Nature, 23 Oct 2024
Cited by: 0 articles | PMID: 39443798
Integrative genomic, virulence, and transcriptomic analysis of emergent Streptococcus dysgalactiae subspecies equisimilis (SDSE) emm type stG62647 isolates causing human infections.
Eraso JM, Olsen RJ, Long SW, Gadd R, Boukthir S, Faili A, Kayal S, Musser JM
mBio, 15(11):e0257824, 17 Oct 2024
Cited by: 0 articles | PMID: 39417630 | PMCID: PMC11559094
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Intragenomic conflicts with plasmids and chromosomal mobile genetic elements drive the evolution of natural transformation within species.
Mazzamurro F, Chirakadavil JB, Durieux I, Poiré L, Plantade J, Ginevra C, Jarraud S, Wilharm G, Charpentier X, P C Rocha E
PLoS Biol, 22(10):e3002814, 14 Oct 2024
Cited by: 1 article | PMID: 39401218 | PMCID: PMC11472951
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Association of pathogenic determinants of Fusobacterium necrophorum with bacteremia, and Lemierre's syndrome.
Carrara A, Bertelli C, Gardiol C, Marquis B, Andrey DO, Schrenzel J, Pillonel T, Greub G
Sci Rep, 14(1):19804, 27 Aug 2024
Cited by: 0 articles | PMID: 39191804 | PMCID: PMC11350119
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC

Go to all (138) article citations

Data

Data behind the article

This data has been text mined from the article, or deposited into data resources.

BioStudies: supplemental material and supporting data

http://www.ebi.ac.uk/biostudies/studies/S-EPMC6289128?xr=true

Funding

Funders who supported this work.

European Research Council (1)

Scalable inference algorithms for Bayesian evolutionary epidemiology (SCARABEE)
Prof Jukka CORANDER, University of Oslo
Grant ID: 742158
73 publications

NIAID NIH HHS (2)

Grant ID: R01 AI105168
32 publications
Grant ID: R37 AI038446
70 publications

SDB

United States Public Health Service (2)

Grant ID: AI038446
2 publications
Grant ID: AI105168
2 publications

Wellcome Trust (1)

Wellcome Trust Sanger Institute - generic account for deposition of all core- funded research papers
Prof Sir Michael Stratton, Wellcome Trust Sanger Institute
Grant ID: 098051
1205 publications

Search life-sciences literature (45,103,589 articles, preprints and more)

pyseer: a comprehensive tool for microbial pangenome-wide association studies.

Author information

Affiliations

Authors

Authors

Authors

ORCIDs linked to this article

Abstract

Summary

Availability and implementation

Supplementary information

Free full text

pyseer: a comprehensive tool for microbial pangenome-wide association studies

John A Lees

Marco Galardini

Stephen D Bentley

Jeffrey N Weiser

Jukka Corander

Associated Data

Abstract

Summary

Availability and implementation

Supplementary information

1 Introduction

2 Materials and methods

2.1 Input sources

2.2 Association models

2.3 Output processing

3 Discussion

Supplementary Material

Supplementary Material

Acknowledgements

Funding

References

Full text links

Citations & impact

Impact metrics

Citations of article over time

Alternative metrics

Article citations

Data

Data behind the article

BioStudies: supplemental material and supporting data

Similar Articles

Funding

European Research Council (1)﻿

NIAID NIH HHS (2)﻿

SDB

United States Public Health Service (2)﻿

Wellcome Trust (1)﻿

Partnerships & funding

European Research Council (1)

NIAID NIH HHS (2)

United States Public Health Service (2)

Wellcome Trust (1)