Biological Machine Learning Combined with Campylobacter Population Genomics Reveals Virulence Gene Allelic Variants Cause Disease

DJ Darwin R. Bandoy; Bart C. Weimer

doi:10.3390/microorganisms8040549

Biological Machine Learning Combined with Campylobacter Population Genomics Reveals Virulence Gene Allelic Variants Cause Disease

Microorganisms ◽

10.3390/microorganisms8040549 ◽

2020 ◽

Vol 8 (4) ◽

pp. 549 ◽

Cited By ~ 2

Author(s):

DJ Darwin R. Bandoy ◽

Bart C. Weimer

Keyword(s):

Machine Learning ◽

Bacterial Population ◽

Genome Wide Association Study ◽

Population Genomics ◽

Virulence Gene ◽

Spatiotemporal Analysis ◽

Nucleotide Polymorphisms ◽

Allelic Variants ◽

Automated Annotation

Highly dimensional data generated from bacterial whole-genome sequencing is providing an unprecedented scale of information that requires an appropriate statistical analysis framework to infer biological function from populations of genomes. The application of genome-wide association study (GWAS) methods is an appropriate framework for bacterial population genome analysis that yields a list of candidate genes associated with a phenotype, but it provides an unranked measure of importance. Here, we validated a novel framework to define infection mechanism using the combination of GWAS, machine learning, and bacterial population genomics that ranked allelic variants that accurately identified disease. This approach parsed a dataset of 1.2 million single nucleotide polymorphisms (SNPs) and indels that resulted in an importance ranked list of associated alleles of porA in Campylobacter jejuni using spatiotemporal analysis over 30 years. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This framework, termed μPathML, defined intestinal and extraintestinal groups that have differential allelic porA variants that cause abortion. Divergent variants containing indels that defeated automated annotation were rescued using biological context and knowledge that resulted in defining rare, divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in infectious disease mechanisms.
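The idea of turning an unranked list of phenotype-associated variants into an importance-ranked one can be illustrated with a small sketch. It is not the μPathML framework: the abstract does not name the learning algorithm, so information gain (the split criterion used by decision trees) stands in for the importance measure. The variant names and genotype calls are hypothetical toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(variant, phenotype):
    """Reduction in phenotype entropy after splitting genomes by allele."""
    n = len(phenotype)
    gain = entropy(phenotype)
    for allele in set(variant):
        subset = [y for x, y in zip(variant, phenotype) if x == allele]
        gain -= len(subset) / n * entropy(subset)
    return gain


def rank_variants(genotypes, phenotype):
    """Return (variant, score) pairs sorted from most to least informative."""
    scores = {name: information_gain(calls, phenotype)
              for name, calls in genotypes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 8 genomes, phenotype 1 = disease-associated isolate.
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = {
    "variant_A": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the phenotype exactly
    "variant_B": [1, 1, 1, 0, 1, 0, 0, 0],  # partially associated
    "variant_C": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the phenotype
}

ranking = rank_variants(genotypes, phenotype)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A single-variant score like this ignores interactions between variants, which is one reason a study might prefer a multivariate learner for the ranking step.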


Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease

10.1101/739540 ◽

2019 ◽

Author(s):

DJ Darwin R. Bandoy ◽

Bart C. Weimer

Keyword(s):

Machine Learning ◽

Bacterial Population ◽

Genome Wide Association Study ◽

Biological Function ◽

Population Genomics ◽

Accurate Estimation ◽

Allelic Variants ◽

Genome Wide ◽

Spatial Locations

Highly dimensional data generated from bacterial whole genome sequencing is providing an unprecedented scale of information that requires appropriate statistical frameworks of analysis to infer biological function from bacterial genomic populations. Application of genome wide association study (GWAS) methods is an emerging approach with bacterial population genomics that yields a list of genes associated with a phenotype with an undefined importance among the candidates in the list. Here, we validate the combination of GWAS, machine learning, and pathogenic bacterial population genomics as a novel scheme to identify SNPs and rank allelic variants to determine associations for accurate estimation of disease phenotype. This approach parsed a dataset of 1.2 million SNPs that resulted in a ranked importance of associated alleles of Campylobacter jejuni porA using multiple spatial locations over a 30-year period. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This approach, termed BioML, defined intestinal and extraintestinal groups that have differential allelic variants that cause abortion. Divergent variants containing indels that defeated gene callers were rescued using biological context and knowledge that resulted in defining rare and divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in abortion, and more broadly infectious disease.


Identification of Nitrogen Fixation Genes in Lactococcus Isolated from Maize Using Population Genomics and Machine Learning

Microorganisms ◽

10.3390/microorganisms8122043 ◽

2020 ◽

Vol 8 (12) ◽

pp. 2043

Author(s):

Shawn M. Higdon ◽

Bihua C. Huang ◽

Alan B. Bennett ◽

Bart C. Weimer

Keyword(s):

Machine Learning ◽

Nitrogen Fixation ◽

Genome Wide Association Study ◽

Population Genomics ◽

Genomic Analysis ◽

Oxidation Reduction ◽

Genome Wide ◽

Biological Nitrogen ◽

Carbohydrate Catabolism ◽

Comparative Population

Sierra Mixe maize is a landrace variety from Oaxaca, Mexico, that utilizes nitrogen derived from the atmosphere via an undefined nitrogen fixation mechanism. The diazotrophic microbiota associated with the plant’s mucilaginous aerial root exudate composed of complex carbohydrates was previously identified and characterized by our group where we found 23 lactococci capable of biological nitrogen fixation (BNF) without containing any of the proposed essential genes for this trait (nifHDKENB). To determine the genes in Lactococcus associated with this phenotype, we selected 70 lactococci from the dairy industry that are not known to be diazotrophic to conduct a comparative population genomic analysis. This showed that the diazotrophic lactococcal genomes were distinctly different from the dairy isolates. Examining the pangenome followed by genome-wide association study and machine learning identified genes with the functions needed for BNF in the maize isolates that were absent from the dairy isolates. Many of the putative genes received an ‘unknown’ annotation, which led to the domain analysis of the 135 homologs. This revealed genes with molecular functions needed for BNF, including mucilage carbohydrate catabolism, glycan-mediated host adhesion, iron/siderophore utilization, and oxidation/reduction control. This is the first report of this pathway in this organism to underpin BNF. Consequently, we proposed a model needed for BNF in lactococci that plausibly accounts for BNF in the absence of the nif operon in this organism.


27 Machine Learning Algorithms Based on Haplotype Libraries for Classification of Stillbirth Susceptibility in Holstein Cows

Journal of Animal Science ◽

10.1093/jas/skab235.025 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 15-16

Author(s):

Pablo A S Fonseca ◽

Massimo Tornatore ◽

Angela Cánovas

Keyword(s):

Machine Learning ◽

Genome Wide Association Study ◽

Linear Models ◽

Predictive Accuracy ◽

Area Under The Curve ◽

Predictive Performance ◽

Machine Learning Algorithms ◽

Economic Losses ◽

Nucleotide Polymorphisms ◽

Birth Records

Reduced fertility is one of the main causes of economic losses in dairy farms. The cost of a stillbirth is estimated at US$938 per case in Holstein herds. Machine learning (ML) is gaining popularity in the livestock sector as a means to identify hidden patterns and due to its potential to address dimensionality problems. Here we investigate the application of ML algorithms for the prediction of cows with higher stillbirth susceptibility in two scenarios: cows with >25% and >33.33% of stillbirths among birth records. These thresholds correspond to percentiles 75 (still_75) and 90 (still_90), respectively. A total of 10,570 cows and 50,541 birth records were collected to perform a haplotype-based genome-wide association study. Five hundred significant pseudo single nucleotide polymorphisms (pseudo-SNPs) (false discovery rate < 0.05) were used as input features of ML-based predictions to determine if the cow is in the top-75 and top-90 percentiles. Table 1 shows the classification performance of the investigated ML and linear models. The ML models outperformed linear models for both thresholds. In general, still_75 showed higher F1 values compared to still_90, suggesting a lower misclassification ratio when a less stringent threshold is used. We observe that the accuracy of the models in our study is higher than ML-based prediction accuracies in other breeds, e.g., the accuracies of 0.46 and 0.67 that were achieved using SNPs for body weight in Brahman and fertility traits in Nellore, respectively. The XGBoost algorithm shows the highest balanced accuracy (BA; 0.625), F1-score (0.588) and area under the curve (AUC; 0.688), suggesting that XGBoost can achieve the highest predictive performance and the lowest difference in misclassification ratio between classes. ML applied over haplotype libraries is an interesting approach for the detection of animals with higher susceptibility to stillbirths due to its highest predictive accuracy and relatively lower misclassification ratio.
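For readers less familiar with the two headline metrics, the following minimal sketch shows how balanced accuracy and the F1-score are derived from a binary confusion matrix. The counts are hypothetical and are not taken from the study; the function name is likewise an illustration only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (balanced accuracy, F1) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return balanced_accuracy, f1


# Hypothetical counts for a "high stillbirth susceptibility" classifier.
ba, f1 = classification_metrics(tp=30, fp=20, tn=60, fn=10)
print(f"balanced accuracy = {ba:.3f}, F1 = {f1:.3f}")
```

Balanced accuracy averages the per-class recall, so it is less flattering than plain accuracy when one class (here, cows above the 75th or 90th percentile) is much rarer than the other.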


Machine learning suggests polygenic contribution to cognitive dysfunction in amyotrophic lateral sclerosis (ALS)

10.1101/2019.12.23.19014407 ◽

2019 ◽

Author(s):

Katerina Placek ◽

Michael Benatar ◽

Joanne Wuu ◽

Evadnie Rampersaud ◽

Laura Hennessy ◽

...

Keyword(s):

Machine Learning ◽

Amyotrophic Lateral Sclerosis ◽

Cognitive Dysfunction ◽

Premotor Cortex ◽

Temporal Cortex ◽

Anterior Cingulate ◽

Polygenic Risk Score ◽

Nucleotide Polymorphisms ◽

Lateral Sclerosis

Amyotrophic lateral sclerosis (ALS) is a multi-system disease characterized primarily by progressive muscle weakness. Cognitive dysfunction is commonly observed in patients; however, factors influencing risk for cognitive dysfunction remain elusive. Using sparse canonical correlation analysis (sCCA), an unsupervised machine-learning technique, we observed that single nucleotide polymorphisms collectively associate with baseline cognitive performance in a large ALS patient cohort from the multicenter Clinical Research in ALS and Related Disorders for Therapeutic Development (CReATe) Consortium (N=327). We demonstrate that a polygenic risk score derived using sCCA relates to longitudinal cognitive decline in the same cohort, and also to in vivo cortical thinning in the orbital frontal cortex, anterior cingulate cortex, lateral temporal cortex, premotor cortex, and hippocampus (N=114) as well as post mortem motor cortical neuronal loss (N=88) in independent ALS cohorts from the University of Pennsylvania Integrated Neurodegenerative Disease Biobank. Our findings suggest that common genetic polymorphisms may exert a polygenic contribution to the risk of cortical disease vulnerability and cognitive dysfunction in ALS.


Vibrio harveyi virulence gene expression in vitro and in vivo during infection in black tiger shrimp Penaeus monodon

Diseases of Aquatic Organisms ◽

10.3354/dao03475 ◽

2020 ◽

Vol 139 ◽

pp. 153-160

Author(s):

S Peeralil ◽

TC Joseph ◽

V Murugadas ◽

PG Akhilnath ◽

VN Sreejith ◽

...

Keyword(s):

Gene Expression ◽

Virulence Genes ◽

Vibrio Harveyi ◽

Penaeus Monodon ◽

Challenge Test ◽

Virulence Gene ◽

Economic Losses ◽

Virulence Gene Expression

Luminescent Vibrio harveyi is common in sea and estuarine waters. It produces several virulence factors and negatively affects larval penaeid shrimp in hatcheries, resulting in severe economic losses to shrimp aquaculture. Although V. harveyi is an important pathogen of shrimp, its pathogenicity mechanisms have yet to be completely elucidated. In the present study, strains of V. harveyi were isolated and characterized from diseased Penaeus monodon postlarvae from hatcheries in Kerala, India, from September to December 2016. All 23 tested isolates were positive for lipase, phospholipase, caseinase, gelatinase and chitinase activity, and 3 of the isolates (MFB32, MFB71 and MFB68) showed potential for significant biofilm formation. Based on the presence of virulence genes, the isolates of V. harveyi were grouped into 6 genotypes, predominated by vhpA+ flaB+ ser+ vhh1- luxR+ vopD- vcrD+ vscN-. One isolate from each genotype was randomly selected for in vivo virulence experiments, and the LD50 ranged from 1.7 ± 0.5 × 10^3 to 4.1 ± 0.1 × 10^5 CFU ml^-1. The expression of genes during the infection in postlarvae was high in 2 of the isolates (MFB12 and MFB32), consistent with the result of the challenge test. However, in MFB19, even though all genes tested were present, their expression level was very low and likely contributed to its lack of virulence. Because of the significant variation in gene expression, the presence of virulence genes alone cannot be used as a marker for pathogenicity of V. harveyi.


Comprehensive Genome-Wide Association Analysis Reveals the Genetic Basis of Root System Architecture in Soybean

Frontiers in Plant Science ◽

10.3389/fpls.2020.590740 ◽

2020 ◽

Vol 11 ◽

Author(s):

Waldiodio Seck ◽

Davoud Torkamaneh ◽

François Belzile

Keyword(s):

Root System ◽

System Architecture ◽

Phenotypic Variation ◽

Genetic Basis ◽

Genome Wide Association Study ◽

Primary Root ◽

Root System Architecture ◽

Genome Wide Association ◽

Nucleotide Polymorphisms ◽

Genome Wide

Increasing the understanding of the genetic basis of the variability in root system architecture (RSA) is essential to improve resource-use efficiency in agriculture systems and to develop climate-resilient crop cultivars. Roots being underground, their direct observation and detailed characterization are challenging. Here, we characterized twelve RSA-related traits in a panel of 137 early maturing soybean lines (Canadian soybean core collection) using rhizoboxes and two-dimensional imaging. Significant phenotypic variation (P < 0.001) was observed among these lines for different RSA-related traits. This panel was genotyped with 2.18 million genome-wide single-nucleotide polymorphisms (SNPs) using a combination of genotyping-by-sequencing and whole-genome sequencing. A total of 10 quantitative trait locus (QTL) regions were detected for root total length and primary root diameter through a comprehensive genome-wide association study. These QTL regions explained from 15 to 25% of the phenotypic variation and contained two putative candidate genes with homology to genes previously reported to play a role in RSA in other species. These genes can serve to accelerate future efforts aimed at dissecting the genetic architecture of RSA and breeding more resilient varieties.


Divergence of a genomic island leads to the evolution of melanization in a halophyte root fungus

The ISME Journal ◽

10.1038/s41396-021-01023-8 ◽

2021 ◽

Author(s):

Zhilin Yuan ◽

Irina S. Druzhinina ◽

John G. Gibbons ◽

Zhenhui Zhong ◽

Yves Van de Peer ◽

...

Keyword(s):

Membrane Trafficking ◽

Molecular Mechanisms ◽

Genomic Island ◽

Population Genomics ◽

Growth Yield ◽

Fixation Index ◽

Nucleotide Polymorphisms ◽

Evolutionary Mechanisms ◽

Melanin Biosynthesis ◽

Two Populations

Understanding how organisms adapt to extreme living conditions is central to evolutionary biology. Dark septate endophytes (DSEs) constitute an important component of the root mycobiome and they are often able to alleviate host abiotic stresses. Here, we investigated the molecular mechanisms underlying the beneficial association between the DSE Laburnicola rhizohalophila and its host, the native halophyte Suaeda salsa, using population genomics. Based on genome-wide Fst (pairwise fixation index) and Vst analyses, which compared the variance in allele frequencies of single-nucleotide polymorphisms (SNPs) and copy number variants (CNVs), respectively, we found a high level of genetic differentiation between two populations. CNV patterns revealed population-specific expansions and contractions. Interestingly, we identified a ~20 kbp genomic island of high divergence with a strong sign of positive selection. This region contains a melanin-biosynthetic polyketide synthase gene cluster linked to six additional genes likely involved in biosynthesis, membrane trafficking, regulation, and localization of melanin. Differences in growth yield and melanin biosynthesis between the two populations grown under 2% NaCl stress suggested that this genomic island contributes to the observed differences in melanin accumulation. Our findings provide a better understanding of the genetic and evolutionary mechanisms underlying the adaptation to saline conditions of the L. rhizohalophila–S. salsa symbiosis.
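The pairwise fixation index mentioned above compares allele-frequency variance between populations. The sketch below uses the textbook heterozygosity form of Wright's Fst for one biallelic SNP and two equally weighted populations. The abstract does not say which estimator the authors used, so this form, the function name and the frequencies are assumptions for illustration.

```python
def fst_biallelic(p1, p2):
    """Wright's Fst for one biallelic SNP in two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # allele fixed or absent in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies for three SNPs.
for p1, p2 in [(0.5, 0.5), (0.8, 0.2), (1.0, 0.0)]:
    print(f"p1={p1}, p2={p2}: Fst={fst_biallelic(p1, p2):.2f}")
```

Values near 0 indicate similar allele frequencies in both populations, and values near 1 indicate near-fixed differences, which is the pattern a scan for a highly divergent genomic island would look for.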


EXPRESS: Effect of genetic liability to visceral adiposity on stroke and its subtypes: a Mendelian randomization study

International Journal of Stroke ◽

10.1177/17474930211006285 ◽

2021 ◽

pp. 174749302110062

Author(s):

Bin Yan ◽

Jian Yang ◽

Li Qian ◽

Fengjie Gao ◽

Ling Bai ◽

...

Keyword(s):

Sensitivity Analysis ◽

Ischemic Stroke ◽

Genome Wide Association Study ◽

Mendelian Randomization ◽

Visceral Adiposity ◽

Large Artery ◽

Nucleotide Polymorphisms ◽

Genetic Liability ◽

A Genome ◽

Increased Risk

Background: Observational studies have found an association between visceral adiposity and stroke. Aims: The purpose of this study was to investigate the role and genetic effect of visceral adipose tissue (VAT) accumulation on stroke and its subtypes. Methods: In this two-sample Mendelian randomization (MR) study, genetic variants (221 single nucleotide polymorphisms; P < 5 × 10^-8) used as instrumental variables for MR analysis were obtained from a genome-wide association study (GWAS) of VAT. The outcome datasets for stroke and its subtypes were obtained from the MEGASTROKE consortium (up to 67,162 cases and 453,702 controls). MR standard analysis (inverse variance weighted method) was conducted to investigate the effect of genetic liability to visceral adiposity on stroke and its subtypes. Sensitivity analyses (MR-Egger, weighted median, MR-PRESSO) were also used to assess horizontal pleiotropy and remove outliers. Multi-variable MR analysis was employed to adjust for potential confounders. Results: In the standard MR analysis, genetically determined visceral adiposity (per 1 SD) was significantly associated with a higher risk of stroke (odds ratio [OR] 1.30; 95% confidence interval [CI] 1.21-1.41, P = 1.48 × 10^-11), ischemic stroke (OR 1.30; 95% CI 1.20-1.41, P = 4.01 × 10^-10), and large artery stroke (OR 1.49; 95% CI 1.22-1.83, P = 1.16 × 10^-4). The significant association was also found in the sensitivity and multi-variable MR analyses. Conclusions: Genetic liability to visceral adiposity was significantly associated with an increased risk of stroke, ischemic stroke, and large artery stroke. The effect of genetic susceptibility to visceral adiposity on stroke warrants further investigation.
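The inverse variance weighted method named in the Methods combines per-variant Wald ratios into one causal estimate. The sketch below shows the fixed-effect form under first-order weights. The abstract does not say whether a fixed- or random-effects version was used, and the effect sizes below are hypothetical rather than values from the VAT or MEGASTROKE datasets.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted MR estimate and standard error."""
    # Per-SNP Wald ratio: outcome effect divided by exposure effect.
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    # First-order weight: inverse variance of each Wald ratio.
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    beta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return beta, se


# Hypothetical summary statistics for three instruments.
beta_exposure = [0.1, 0.2, 0.1]    # SNP effect on the exposure (per SD)
beta_outcome = [0.03, 0.05, 0.02]  # SNP effect on the outcome (log odds)
se_outcome = [0.01, 0.01, 0.01]    # standard error of the outcome effect

beta, se = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
odds_ratio = math.exp(beta)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

Exponentiating the pooled log-odds estimate and its confidence limits gives an odds ratio per 1 SD of exposure, the same scale on which the abstract reports its results.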


An artificial neural network approach integrating plasma proteomics and genetic data identifies PLXNA4 as a new susceptibility locus for pulmonary embolism

Scientific Reports ◽

10.1038/s41598-021-93390-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Misbah Razzaq ◽

Maria Jesus Iglesias ◽

Manal Ibrahim-Kosta ◽

Louisa Goumidi ◽

Omar Soukarieh ◽

...

Keyword(s):

Pulmonary Embolism ◽

Genome Wide Association Study ◽

Susceptibility Gene ◽

Biological Traits ◽

Nucleotide Polymorphisms ◽

Ann Model ◽

Neural Network Approach ◽

A Genome ◽

Increased Risk ◽

Artificial Neural

Venous thromboembolism is the third most common cardiovascular disease and is composed of two entities, deep vein thrombosis (DVT) and its potentially fatal form, pulmonary embolism (PE). While PE is observed in ~40% of patients with documented DVT, there are limited biomarkers that can help identify patients at high PE risk. To fill this need, we implemented an artificial neural network (ANN) with two hidden layers on 376 antibodies and 19 biological traits measured in the plasma of 1388 DVT patients, with or without PE, of the MARTHA study. We used the LIME algorithm to obtain a linear approximation of the resulting ANN prediction model. As MARTHA patients were typed with genotyping DNA arrays, a genome wide association study (GWAS) was conducted on the LIME estimate. Detected single nucleotide polymorphisms (SNPs) were tested for association with PE risk in MARTHA. Main findings were replicated in the EOVT study composed of 143 PE patients and 196 DVT only patients. The derived ANN model for PE achieved an accuracy of 0.89 and 0.79 in our training and testing sets, respectively. A GWAS on the LIME approximation identified a strong statistical association peak (rs1424597: p = 5.3 × 10^-7) at the PLXNA4 locus. Homozygote carriers for the rs1424597-A allele were then more frequently observed in PE than in DVT patients from the MARTHA (2% vs. 0.4%, p = 0.005) and the EOVT (3% vs. 0%, p = 0.013) studies. In a sample of 112 COVID-19 patients known to have endotheliopathy leading to acute lung injury and an increased risk of PE, decreased PLXNA4 levels were associated (p = 0.025) with worsened respiratory function. Using an original integrated proteomics and genetics strategy, we identified PLXNA4 as a new susceptibility gene for PE whose exact role now needs to be further elucidated.


Genome-wide association study in almost 195,000 individuals identifies 50 previously unidentified genetic loci for eye color

Science Advances ◽

10.1126/sciadv.abd1239 ◽

2021 ◽

Vol 7 (11) ◽

pp. eabd1239

Author(s):

Mark Simcoe ◽

Ana Valdes ◽

Fan Liu ◽

Nicholas A. Furlotte ◽

David M. Evans ◽

...

Keyword(s):

Association Study ◽

Genome Wide Association Study ◽

Color Variation ◽

Genome Wide Association ◽

Nucleotide Polymorphisms ◽

Melanin Pigmentation ◽

Eye Color ◽

Human Eye ◽

Genome Wide ◽

Genomic Regions

Human eye color is highly heritable, but its genetic architecture is not yet fully understood. We report the results of the largest genome-wide association study for eye color to date, involving up to 192,986 European participants from 10 populations. We identify 124 independent associations arising from 61 discrete genomic regions, including 50 previously unidentified. We find evidence for genes involved in melanin pigmentation, but we also find associations with genes involved in iris morphology and structure. Further analyses in 1636 Asian participants from two populations suggest that iris pigmentation variation in Asians is genetically similar to Europeans, albeit with smaller effect sizes. Our findings collectively explain 53.2% (95% confidence interval, 45.4 to 61.0%) of eye color variation using common single-nucleotide polymorphisms. Overall, our study outcomes demonstrate that the genetic complexity of human eye color considerably exceeds previous knowledge and expectations, highlighting eye color as a genetically highly complex human trait.
