Benchmarking bacterial genome-wide association study (GWAS) methods using simulated genomes and phenotypes

2019
Author(s): Morteza M. Saber, Jesse Shapiro

Abstract

Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, which makes it challenging to separate true "hits" (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically evaluated. Here we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination, and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely used univariate GWAS approaches (cluster-based, dimensionality-reduction, and linear mixed models, implemented in PLINK, pySEER, and GEMMA) and one relatively new whole-genome elastic net model implemented in pySEER, across a range of simulated sample sizes, recombination rates, and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of the clustering and dimensionality-reduction approaches to correcting for population structure varied considerably with the choice of parameters. Notably, the elastic net whole-genome model was consistently among the highest-performing methods and had the highest power to detect causal variants with both low and high effect sizes. Most methods reached good performance (Recall > 0.75) in identifying causal mutations of strong effect size (log odds ratio ≥ 2) with a sample size of 2000 genomes. However, only elastic nets reached reasonable performance (Recall = 0.35) for detecting markers with weaker effects (log OR ∼1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to univariate models. However, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for whole-genome models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods.

Author summary

Microbial populations contain measurable phenotypic differences with important clinical and environmental consequences, such as antibiotic resistance, virulence, host preference and transmissibility. A major challenge is to discover the genes and mutations in bacterial genomes that control these phenotypes. Bacterial genome-wide association studies (GWASs) are a family of methods that statistically associate phenotypes with genotypes, such as point mutations and other variants across the genome. However, unlike sexual organisms such as humans, bacteria reproduce clonally, meaning that causal mutations tend to be strongly linked to other mutations on the same chromosome. This genome-wide linkage makes it challenging to statistically separate causal mutations from non-causal false-positive associations. Several GWAS methods are currently available, but it is not clear which is the most powerful and accurate for bacteria. To systematically evaluate these methods, we developed BacGWASim, a computational pipeline to simulate the evolution of bacterial genomes and phenotypes. Using simulated genomes, we found that GWAS methods varied widely in their performance. In general, causal mutations of strong effect (e.g. those under strong selection for antibiotic resistance) could be easily identified with relatively small sample sizes of around 1000 genomes, but more complex phenotypes controlled by mutations of weaker effect required 3000 genomes or more. We found that a recently developed GWAS method called elastic net was particularly good at identifying causal mutations in highly clonal populations with strong linkage between mutations, but there is still room for improvement. The BacGWASim computer code is publicly available to enable further comparisons and benchmarking of new methods.
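The abstract reports method performance as recall and precision against the simulated causal mutations. As a minimal sketch of how those two metrics could be computed from a set of GWAS hits, assuming that hits and causal variants are compared by exact identifier match (the function name and the variant identifiers below are hypothetical and are not taken from BacGWASim):

```python
def precision_recall(predicted_hits, causal_variants):
    """Return (precision, recall) of predicted hits against known causal variants.

    Both arguments are iterables of variant identifiers. Matching by exact
    identifier is an assumption made for this illustration.
    """
    predicted = set(predicted_hits)
    causal = set(causal_variants)
    true_positives = len(predicted & causal)
    # Precision: fraction of reported hits that are truly causal.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of causal variants that were reported as hits.
    recall = true_positives / len(causal) if causal else 0.0
    return precision, recall


# Hypothetical example: 4 simulated causal variants, 4 reported hits, 3 shared.
causal = {"snp_101", "snp_205", "snp_333", "snp_480"}
predicted = {"snp_101", "snp_205", "snp_333", "snp_999"}
precision, recall = precision_recall(predicted, causal)
print(precision, recall)
```

In a simulation benchmark the causal set is known by construction, which is what makes both metrics directly computable.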

2018
Author(s): Magali Jaillard, Leandro Lima, Maud Tournoud, Pierre Mahé, Alex van Belkum, ...

Abstract

Motivation: Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or fine-assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are hard to interpret.

Methods: Here, we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes identified by the association model into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is fast, alignment-free and only requires a set of contigs and phenotypes. It produces annotated subgraphs representing local polymorphisms as well as mobile genetic elements (MGE) and offers a graphical framework to interpret GWAS results.

Results: We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature.

Conclusion: Our novel method proved its efficiency to retrieve any type of phenotype-associated genetic variant without prior knowledge. All experiments were computed in less than two hours and produced a compact set of meaningful subgraphs, thereby outperforming other GWAS approaches and facilitating the interpretation of the results.

Availability: Open-source tool available at https://gitlab.com/leoisl/dbgwas
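To illustrate the alignment-free starting point that k-mer-based GWAS methods share, the sketch below builds a k-mer presence/absence table across genomes. It is only an illustration under stated assumptions: it uses canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement), it does not build or compact a De Bruijn graph, and it is not DBGWAS's implementation. All function names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers found in seq."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Treat a k-mer and its reverse complement as the same feature.
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def presence_absence(genomes, k):
    """Map each k-mer to a 0/1 list, one entry per genome in input order."""
    per_genome = {name: canonical_kmers(seq, k) for name, seq in genomes.items()}
    all_kmers = sorted(set().union(*per_genome.values()))
    return {
        kmer: [int(kmer in per_genome[name]) for name in genomes]
        for kmer in all_kmers
    }


# Hypothetical toy contigs; real inputs would be assembled genomes.
genomes = {"isolate_A": "ACGTAC", "isolate_B": "ACGTTC"}
matrix = presence_absence(genomes, 3)
for kmer, row in matrix.items():
    print(kmer, row)
```

Each row of such a table can then be tested for association with a phenotype. Many overlapping k-mers carry the same presence/absence pattern, which is the redundancy the abstract says makes raw k-mer results hard to interpret.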


Author(s): Kevin C Ma, Tatum D Mortimer, Marissa A Duckett, Allison L Hicks, Nicole E Wheeler, ...

Abstract

The emergence of resistance to azithromycin complicates treatment of N. gonorrhoeae, the etiologic agent of gonorrhea. Population genomic analyses of clinical isolates have demonstrated that some azithromycin resistance remains unexplained after accounting for the contributions of known resistance mutations in the 23S rRNA and the MtrCDE efflux pump. Bacterial genome-wide association studies (GWAS) offer a promising approach for identifying novel resistance genes but must adequately address the challenge of controlling for genetic confounders while maintaining power to detect variants with lower effect sizes. Compared to a standard univariate GWAS, conducting GWAS conditioned on known resistance mutations with high effect sizes substantially reduced the number of variants that reached genome-wide significance and identified a G70D mutation in the 50S ribosomal protein L4 (encoded by the gene rplD) as significantly associated with increased azithromycin minimum inhibitory concentrations (β = 1.03, 95% CI [0.76, 1.30]). The role and prevalence of these rplD mutations in conferring macrolide resistance in N. gonorrhoeae had been unclear. Here, we experimentally confirmed our GWAS results, identified other resistance-associated mutations in RplD, and showed that in total these RplD binding site mutations are prevalent (present in 5.42% of 4850 isolates) and geographically and temporally widespread (identified in 21/65 countries across two decades). Overall, our findings demonstrate the utility of conditional associations for improving the performance of microbial GWAS and advance our understanding of the genetic basis of macrolide resistance in a prevalent multidrug-resistant pathogen.


2018
Vol 8 (1)
Author(s): Gabriel Costa Monteiro Moreira, Clarissa Boschiero, Aline Silva Mello Cesar, James M. Reecy, Thaís Fernanda Godoy, ...

2020
Vol 27 (9)
pp. 1425-1430
Author(s): Inès Krissaane, Carlos De Niz, Alba Gutiérrez-Sacristán, Gabor Korodi, Nneka Ede, ...

Abstract

Objective: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies.

Methods: We provide a straightforward and innovative methodology to optimize cloud configuration in order to conduct genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680), for analysis and exploration of genomic variant datasets.

Results: Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics.

Conclusions: We present a timely piece addressing one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?

