Biological Machine Learning Combined with Campylobacter Population Genomics Reveals Virulence Gene Allelic Variants Cause Disease

2020 ◽  
Vol 8 (4) ◽  
pp. 549 ◽  
Author(s):  
DJ Darwin R. Bandoy ◽  
Bart C. Weimer

Highly dimensional data generated from bacterial whole-genome sequencing is providing an unprecedented scale of information that requires an appropriate statistical analysis framework to infer biological function from populations of genomes. The application of genome-wide association study (GWAS) methods is an appropriate framework for bacterial population genome analysis that yields a list of candidate genes associated with a phenotype, but it provides an unranked measure of importance. Here, we validated a novel framework to define infection mechanism using the combination of GWAS, machine learning, and bacterial population genomics that ranked allelic variants that accurately identified disease. This approach parsed a dataset of 1.2 million single nucleotide polymorphisms (SNPs) and indels that resulted in an importance ranked list of associated alleles of porA in Campylobacter jejuni using spatiotemporal analysis over 30 years. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This framework, termed μPathML, defined intestinal and extraintestinal groups that have differential allelic porA variants that cause abortion. Divergent variants containing indels that defeated automated annotation were rescued using biological context and knowledge that resulted in defining rare, divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in infectious disease mechanisms.

2019 ◽  
Author(s):  
DJ Darwin R. Bandoy ◽  
Bart C. Weimer

Abstract: Highly dimensional data generated from bacterial whole-genome sequencing is providing an unprecedented scale of information that requires appropriate statistical frameworks of analysis to infer biological function from bacterial genomic populations. Application of genome-wide association study (GWAS) methods is an emerging approach in bacterial population genomics that yields a list of genes associated with a phenotype, but leaves the relative importance of the candidates in the list undefined. Here, we validate the combination of GWAS, machine learning, and pathogenic bacterial population genomics as a novel scheme to identify SNPs and rank allelic variants to determine associations for accurate estimation of disease phenotype. This approach parsed a dataset of 1.2 million SNPs and produced an importance-ranked list of associated alleles of Campylobacter jejuni porA using multiple spatial locations over a 30-year period. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This approach, termed BioML, defined intestinal and extraintestinal groups that have differential allelic variants that cause abortion. Divergent variants containing indels that defeated gene callers were rescued using biological context and knowledge, which resulted in defining rare and divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in abortion and, more broadly, infectious disease.


2020 ◽  
Vol 8 (12) ◽  
pp. 2043
Author(s):  
Shawn M. Higdon ◽  
Bihua C. Huang ◽  
Alan B. Bennett ◽  
Bart C. Weimer

Sierra Mixe maize is a landrace variety from Oaxaca, Mexico, that utilizes nitrogen derived from the atmosphere via an undefined nitrogen fixation mechanism. The diazotrophic microbiota associated with the plant’s mucilaginous aerial root exudate composed of complex carbohydrates was previously identified and characterized by our group where we found 23 lactococci capable of biological nitrogen fixation (BNF) without containing any of the proposed essential genes for this trait (nifHDKENB). To determine the genes in Lactococcus associated with this phenotype, we selected 70 lactococci from the dairy industry that are not known to be diazotrophic to conduct a comparative population genomic analysis. This showed that the diazotrophic lactococcal genomes were distinctly different from the dairy isolates. Examining the pangenome followed by genome-wide association study and machine learning identified genes with the functions needed for BNF in the maize isolates that were absent from the dairy isolates. Many of the putative genes received an ‘unknown’ annotation, which led to the domain analysis of the 135 homologs. This revealed genes with molecular functions needed for BNF, including mucilage carbohydrate catabolism, glycan-mediated host adhesion, iron/siderophore utilization, and oxidation/reduction control. This is the first report of this pathway in this organism to underpin BNF. Consequently, we proposed a model needed for BNF in lactococci that plausibly accounts for BNF in the absence of the nif operon in this organism.


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 15-16
Author(s):  
Pablo A S Fonseca ◽  
Massimo Tornatore ◽  
Angela Cánovas

Abstract: Reduced fertility is one of the main causes of economic losses in dairy farms. The cost of a stillbirth is estimated at US$938 per case in Holstein herds. Machine learning (ML) is gaining popularity in the livestock sector as a means to identify hidden patterns and because of its potential to address dimensionality problems. Here we investigate the application of ML algorithms for the prediction of cows with higher stillbirth susceptibility in two scenarios: cows with >25% and >33.33% of stillbirths among birth records. These thresholds correspond to the 75th (still_75) and 90th (still_90) percentiles, respectively. A total of 10,570 cows and 50,541 birth records were collected to perform a haplotype-based genome-wide association study. Five hundred significant pseudo single nucleotide polymorphisms (pseudo-SNPs) (false discovery rate < 0.05) were used as input features of ML-based predictions to determine whether a cow is in the top-75 and top-90 percentiles. Table 1 shows the classification performance of the investigated ML and linear models. The ML models outperformed linear models for both thresholds. In general, still_75 showed higher F1 values than still_90, suggesting a lower misclassification ratio when a less stringent threshold is used. We observe that the accuracy of the models in our study is higher than ML-based prediction accuracies in other breeds, e.g. the accuracies of 0.46 and 0.67 that were achieved using SNPs for body weight in Brahman and fertility traits in Nellore, respectively. The XGBoost algorithm shows the highest balanced accuracy (BA; 0.625), F1-score (0.588) and area under the curve (AUC; 0.688), suggesting that XGBoost can achieve the highest predictive performance and the lowest difference in misclassification ratio between classes.
ML applied over haplotype libraries is an interesting approach for the detection of animals with higher susceptibility to stillbirths because of its highest predictive accuracy and relatively lower misclassification ratio.
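
The balanced accuracy and F1-score reported above are both derived from a binary confusion matrix. The sketch below shows the standard definitions only; the counts are invented for illustration and are not taken from the study, and the names `Confusion`, `balanced_accuracy` and `f1_score` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from a binary classifier (illustrative values only)."""
    tp: int  # susceptible cows correctly flagged
    fp: int  # non-susceptible cows wrongly flagged
    tn: int  # non-susceptible cows correctly passed
    fn: int  # susceptible cows missed


def balanced_accuracy(c: Confusion) -> float:
    # Mean of sensitivity and specificity, so each class counts equally
    # even when one class is much rarer than the other.
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return (sensitivity + specificity) / 2


def f1_score(c: Confusion) -> float:
    # Harmonic mean of precision and recall for the positive class.
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts, not data from the abstract.
example = Confusion(tp=30, fp=20, tn=60, fn=10)
print(round(balanced_accuracy(example), 3), round(f1_score(example), 3))
```

Because balanced accuracy averages the per-class recall, it is less flattering than plain accuracy when the positive class is rare, which is presumably why it is reported for a percentile-based threshold such as still_90.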


2019 ◽  
Author(s):  
Katerina Placek ◽  
Michael Benatar ◽  
Joanne Wuu ◽  
Evadnie Rampersaud ◽  
Laura Hennessy ◽  
...  

Abstract: Amyotrophic lateral sclerosis (ALS) is a multi-system disease characterized primarily by progressive muscle weakness. Cognitive dysfunction is commonly observed in patients; however, factors influencing risk for cognitive dysfunction remain elusive. Using sparse canonical correlation analysis (sCCA), an unsupervised machine-learning technique, we observed that single nucleotide polymorphisms collectively associate with baseline cognitive performance in a large ALS patient cohort from the multicenter Clinical Research in ALS and Related Disorders for Therapeutic Development (CReATe) Consortium (N=327). We demonstrate that a polygenic risk score derived using sCCA relates to longitudinal cognitive decline in the same cohort, and also to in vivo cortical thinning in the orbital frontal cortex, anterior cingulate cortex, lateral temporal cortex, premotor cortex, and hippocampus (N=114) as well as post mortem motor cortical neuronal loss (N=88) in independent ALS cohorts from the University of Pennsylvania Integrated Neurodegenerative Disease Biobank. Our findings suggest that common genetic polymorphisms may exert a polygenic contribution to the risk of cortical disease vulnerability and cognitive dysfunction in ALS.


2020 ◽  
Vol 139 ◽  
pp. 153-160
Author(s):  
S Peeralil ◽  
TC Joseph ◽  
V Murugadas ◽  
PG Akhilnath ◽  
VN Sreejith ◽  
...  

Luminescent Vibrio harveyi is common in sea and estuarine waters. It produces several virulence factors and negatively affects larval penaeid shrimp in hatcheries, resulting in severe economic losses to shrimp aquaculture. Although V. harveyi is an important pathogen of shrimp, its pathogenicity mechanisms have yet to be completely elucidated. In the present study, V. harveyi was isolated and characterized from diseased Penaeus monodon postlarvae from hatcheries in Kerala, India, from September to December 2016. All 23 tested isolates were positive for lipase, phospholipase, caseinase, gelatinase and chitinase activity, and 3 of the isolates (MFB32, MFB71 and MFB68) showed potential for significant biofilm formation. Based on the presence of virulence genes, the isolates of V. harveyi were grouped into 6 genotypes, predominated by vhpA+ flaB+ ser+ vhh1- luxR+ vopD- vcrD+ vscN-. One isolate from each genotype was randomly selected for in vivo virulence experiments, and the LD50 ranged from 1.7 ± 0.5 × 10³ to 4.1 ± 0.1 × 10⁵ CFU ml⁻¹. The expression of genes during the infection in postlarvae was high in 2 of the isolates (MFB12 and MFB32), consistent with the result of the challenge test. However, in MFB19, even though all genes tested were present, their expression level was very low, which likely contributed to its lack of virulence. Because of the significant variation in gene expression, the presence of virulence genes alone cannot be used as a marker for pathogenicity of V. harveyi.


2020 ◽  
Vol 11 ◽  
Author(s):  
Waldiodio Seck ◽  
Davoud Torkamaneh ◽  
François Belzile

Improving our understanding of the genetic basis of the variability in root system architecture (RSA) is essential to improve resource-use efficiency in agricultural systems and to develop climate-resilient crop cultivars. Because roots are underground, their direct observation and detailed characterization are challenging. Here, we characterized twelve RSA-related traits in a panel of 137 early maturing soybean lines (Canadian soybean core collection) using rhizoboxes and two-dimensional imaging. Significant phenotypic variation (P < 0.001) was observed among these lines for different RSA-related traits. This panel was genotyped with 2.18 million genome-wide single-nucleotide polymorphisms (SNPs) using a combination of genotyping-by-sequencing and whole-genome sequencing. A total of 10 quantitative trait locus (QTL) regions were detected for root total length and primary root diameter through a comprehensive genome-wide association study. These QTL regions explained from 15 to 25% of the phenotypic variation and contained two putative candidate genes with homology to genes previously reported to play a role in RSA in other species. These genes can serve to accelerate future efforts aimed at dissecting the genetic architecture of RSA and breeding more resilient varieties.


2021 ◽  
Author(s):  
Zhilin Yuan ◽  
Irina S. Druzhinina ◽  
John G. Gibbons ◽  
Zhenhui Zhong ◽  
Yves Van de Peer ◽  
...  

Abstract: Understanding how organisms adapt to extreme living conditions is central to evolutionary biology. Dark septate endophytes (DSEs) constitute an important component of the root mycobiome and they are often able to alleviate host abiotic stresses. Here, we investigated the molecular mechanisms underlying the beneficial association between the DSE Laburnicola rhizohalophila and its host, the native halophyte Suaeda salsa, using population genomics. Based on genome-wide Fst (pairwise fixation index) and Vst analyses, which compared the variance in allele frequencies of single-nucleotide polymorphisms (SNPs) and copy number variants (CNVs), respectively, we found a high level of genetic differentiation between two populations. CNV patterns revealed population-specific expansions and contractions. Interestingly, we identified a ~20 kbp genomic island of high divergence with a strong sign of positive selection. This region contains a melanin-biosynthetic polyketide synthase gene cluster linked to six additional genes likely involved in biosynthesis, membrane trafficking, regulation, and localization of melanin. Differences in growth yield and melanin biosynthesis between the two populations grown under 2% NaCl stress suggested that this genomic island contributes to the observed differences in melanin accumulation. Our findings provide a better understanding of the genetic and evolutionary mechanisms underlying the adaptation to saline conditions of the L. rhizohalophila–S. salsa symbiosis.
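
The pairwise fixation index mentioned above can be illustrated for a single biallelic SNP. The abstract does not say which Fst estimator was used, so the sketch below assumes the textbook heterozygosity-based definition, Fst = (H_T - H_S) / H_T, with two equally weighted populations and no sample-size correction; `fst_biallelic` is a hypothetical helper name and the allele frequencies are invented.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Fst for one biallelic SNP given the allele frequency in each of
    two equally weighted populations (assumed textbook definition)."""
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The site is monomorphic overall, so there is nothing to partition.
        return 0.0
    # Mean expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Made-up frequencies: identical, strongly diverged, and fixed differences.
for p1, p2 in [(0.3, 0.3), (0.9, 0.1), (1.0, 0.0)]:
    print(p1, p2, round(fst_biallelic(p1, p2), 3))
```

Values near 0 indicate shared allele frequencies and values near 1 indicate nearly fixed differences; a genome scan of the kind described would compute such a statistic per site or per window and look for outlier regions.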


2021 ◽  
pp. 174749302110062
Author(s):  
Bin Yan ◽  
Jian Yang ◽  
Li Qian ◽  
Fengjie Gao ◽  
Ling Bai ◽  
...  

Background: Observational studies have found an association between visceral adiposity and stroke. Aims: The purpose of this study was to investigate the role and genetic effect of visceral adipose tissue (VAT) accumulation on stroke and its subtypes. Methods: In this two-sample Mendelian randomization (MR) study, genetic variants (221 single nucleotide polymorphisms; P < 5×10⁻⁸) used as instrumental variables for MR analysis were obtained from a genome-wide association study (GWAS) of VAT. The outcome datasets for stroke and its subtypes were obtained from the MEGASTROKE consortium (up to 67,162 cases and 453,702 controls). Standard MR analysis (inverse variance weighted method) was conducted to investigate the effect of genetic liability to visceral adiposity on stroke and its subtypes. Sensitivity analyses (MR-Egger, weighted median, MR-PRESSO) were also used to assess horizontal pleiotropy and remove outliers. Multivariable MR analysis was employed to adjust for potential confounders. Results: In the standard MR analysis, genetically determined visceral adiposity (per 1 SD) was significantly associated with a higher risk of stroke (odds ratio [OR] 1.30; 95% confidence interval [CI] 1.21-1.41, P = 1.48×10⁻¹¹), ischemic stroke (OR 1.30; 95% CI 1.20-1.41, P = 4.01×10⁻¹⁰), and large artery stroke (OR 1.49; 95% CI 1.22-1.83, P = 1.16×10⁻⁴). The significant association was also found in the sensitivity analyses and the multivariable MR analysis. Conclusions: Genetic liability to visceral adiposity was significantly associated with an increased risk of stroke, ischemic stroke, and large artery stroke. The effect of genetic susceptibility to visceral adiposity on stroke warrants further investigation.
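
As a rough illustration of the inverse variance weighted method named in the Methods, the sketch below combines per-SNP Wald ratios into a single causal estimate and converts it to an odds ratio with a 95% confidence interval. It assumes the fixed-effect form of the estimator with first-order weights; the function names are hypothetical and the summary statistics are invented, not taken from the VAT GWAS or MEGASTROKE.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate from per-SNP summary statistics.

    Each SNP contributes a Wald ratio (outcome effect / exposure effect),
    weighted by the inverse variance of that ratio.
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    se = math.sqrt(1 / total_weight)
    return beta, se


def odds_ratio_with_ci(beta, se, z=1.96):
    # The outcome is on the log-odds scale, so exponentiate the estimate
    # and its interval bounds.
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Made-up summary statistics for three instruments (illustration only).
bx = [0.10, 0.20, 0.05]   # SNP effect on the exposure, per SD
by = [0.03, 0.05, 0.02]   # SNP effect on the outcome, log odds
se = [0.01, 0.02, 0.01]   # standard error of the outcome effect

beta, beta_se = ivw_estimate(bx, by, se)
print(odds_ratio_with_ci(beta, beta_se))
```

Sensitivity methods such as MR-Egger and the weighted median relax the assumption, built into this estimator, that every instrument is valid and free of horizontal pleiotropy.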


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Misbah Razzaq ◽  
Maria Jesus Iglesias ◽  
Manal Ibrahim-Kosta ◽  
Louisa Goumidi ◽  
Omar Soukarieh ◽  
...  

Abstract: Venous thromboembolism is the third most common cardiovascular disease and is composed of two entities, deep vein thrombosis (DVT) and its potentially fatal form, pulmonary embolism (PE). While PE is observed in ~40% of patients with documented DVT, there are few biomarkers that can help identify patients at high PE risk. To fill this need, we implemented an artificial neural network (ANN) with two hidden layers on 376 antibodies and 19 biological traits measured in the plasma of 1388 DVT patients, with or without PE, of the MARTHA study. We used the LIME algorithm to obtain a linear approximation of the resulting ANN prediction model. As MARTHA patients were genotyped on DNA arrays, a genome-wide association study (GWAS) was conducted on the LIME estimate. Detected single nucleotide polymorphisms (SNPs) were tested for association with PE risk in MARTHA. Main findings were replicated in the EOVT study, composed of 143 PE patients and 196 DVT-only patients. The derived ANN model for PE achieved an accuracy of 0.89 and 0.79 in our training and testing sets, respectively. A GWAS on the LIME approximation identified a strong statistical association peak (rs1424597: p = 5.3 × 10⁻⁷) at the PLXNA4 locus. Homozygous carriers of the rs1424597-A allele were then more frequently observed in PE than in DVT patients from the MARTHA (2% vs. 0.4%, p = 0.005) and the EOVT (3% vs. 0%, p = 0.013) studies. In a sample of 112 COVID-19 patients known to have endotheliopathy leading to acute lung injury and an increased risk of PE, decreased PLXNA4 levels were associated (p = 0.025) with worsened respiratory function. Using an original integrated proteomics and genetics strategy, we identified PLXNA4 as a new susceptibility gene for PE whose exact role now needs to be further elucidated.


2021 ◽  
Vol 7 (11) ◽  
pp. eabd1239
Author(s):  
Mark Simcoe ◽  
Ana Valdes ◽  
Fan Liu ◽  
Nicholas A. Furlotte ◽  
David M. Evans ◽  
...  

Human eye color is highly heritable, but its genetic architecture is not yet fully understood. We report the results of the largest genome-wide association study for eye color to date, involving up to 192,986 European participants from 10 populations. We identify 124 independent associations arising from 61 discrete genomic regions, including 50 previously unidentified. We find evidence for genes involved in melanin pigmentation, but we also find associations with genes involved in iris morphology and structure. Further analyses in 1636 Asian participants from two populations suggest that iris pigmentation variation in Asians is genetically similar to Europeans, albeit with smaller effect sizes. Our findings collectively explain 53.2% (95% confidence interval, 45.4 to 61.0%) of eye color variation using common single-nucleotide polymorphisms. Overall, our study outcomes demonstrate that the genetic complexity of human eye color considerably exceeds previous knowledge and expectations, highlighting eye color as a genetically highly complex human trait.

