Identifying genetic determinants of complex phenotypes from whole genome sequence data

Mapping Intimacies ◽

10.1101/181222 ◽

2017 ◽

Cited By ~ 1

Author(s):

George S. Long ◽

Mohammed Hussen ◽

Jonathan Dench ◽

Stéphane Aris-Brosou

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Association Studies ◽

Machine Learning Algorithms ◽

Whole Genome Sequence ◽

Genome Wide Association Studies ◽

Genetic Determinants ◽

Data Set ◽

Adaptive Boosting ◽

Complex Phenotypes

Abstract A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known. To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than RF, it was never < 50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.
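The chunking idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy alignment, and above all `site_score` are hypothetical. In particular, `site_score` is a simple accuracy-gain statistic standing in for the feature importance that a learner such as AB or RRF would produce on each chunk.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def chunk_sites(n_sites: int, chunk_size: int) -> List[range]:
    """Split site indices 0..n_sites-1 into consecutive chunks."""
    return [range(start, min(start + chunk_size, n_sites))
            for start in range(0, n_sites, chunk_size)]


def site_score(column: Sequence[str], labels: Sequence[int]) -> float:
    """Hypothetical stand-in for a learner's feature importance.

    Gain in accuracy from predicting the phenotype with the majority label
    of each residue, compared with always predicting the overall majority.
    """
    per_residue = {}
    for residue, label in zip(column, labels):
        per_residue.setdefault(residue, Counter())[label] += 1
    correct = sum(max(counts.values()) for counts in per_residue.values())
    baseline = max(Counter(labels).values())
    return (correct - baseline) / len(labels)


def rank_sites(alignment: Sequence[str], labels: Sequence[int],
               chunk_size: int, top_per_chunk: int = 1) -> List[Tuple[float, int]]:
    """Score each chunk independently, keep its best sites, then merge."""
    n_sites = len(alignment[0])
    kept = []
    for chunk in chunk_sites(n_sites, chunk_size):
        scored = [(site_score([seq[i] for seq in alignment], labels), i)
                  for i in chunk]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        kept.extend(scored[:top_per_chunk])
    return sorted(kept, key=lambda pair: (-pair[0], pair[1]))


# Toy alignment: only site 2 (K vs R) separates the two phenotype classes.
alignment = ["AAKA", "AAKA", "AARA", "AARA"]
phenotype = [0, 0, 1, 1]
ranking = rank_sites(alignment, phenotype, chunk_size=2)
print(ranking)
```

Because each chunk is processed independently, the per-chunk work could in principle be run in parallel or with a much smaller feature matrix per model, which is one plausible reading of why chunking reduces runtime.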

Multiple similarly effective solutions exist for biomedical feature selection and classification problems

Scientific Reports ◽

10.1038/s41598-017-13184-8 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 9

Author(s):

Jiamei Liu ◽

Cheng Xu ◽

Weifeng Yang ◽

Yayun Shu ◽

Weiwei Zheng ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Association Studies ◽

Binary Classification ◽

Learning Algorithms ◽

Optimal Solution ◽

Machine Learning Algorithms ◽

Disease Classification ◽

Genome Wide Association Studies ◽

Classification Problems

Abstract Binary classification is widely used to support decisions on various biomedical big data questions, such as clinical drug trials comparing treated participants with controls, and genome-wide association studies (GWASs) comparing participants with and without a phenotype. A machine learning model is trained for this purpose by optimizing its power to discriminate samples from the two groups. However, most classification algorithms tend to generate a single locally optimal solution according to the input dataset and the mathematical assumptions made about it. Here we demonstrate, from the perspectives of both disease classification and feature selection, that multiple different solutions may have similar classification performance. Existing machine learning algorithms may therefore have ignored a horde of fish by catching only one good one. Since most existing machine learning algorithms generate a solution by optimizing a mathematical objective, understanding the biological mechanisms behind the investigated classification question may require considering both the generated solution and the ignored ones.

Linkage disequilibrium maps for European and African populations constructed from whole genome sequence data

Scientific Data ◽

10.1038/s41597-019-0227-y ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 1

Author(s):

Alejandra Vergara-Lope ◽

M. Reza Jabalameli ◽

Clare Horscroft ◽

Sarah Ennis ◽

Andrew Collins ◽

...

Keyword(s):

Linkage Disequilibrium ◽

Genome Sequence ◽

Sequence Data ◽

Association Studies ◽

Large Population ◽

Whole Genome Sequence ◽

European Ancestry ◽

Genome Wide Association Studies ◽

Whole Genome ◽

Genome Wide

Abstract Quantification of linkage disequilibrium (LD) patterns in the human genome is essential for genome-wide association studies, selection signature mapping and studies of recombination. Whole genome sequence (WGS) data provides optimal source data for this quantification as it is free from biases introduced by the design of array genotyping platforms. The Malécot-Morton model of LD allows the creation of a cumulative map for each chromosome, analogous to an LD form of a linkage map. Here we report LD maps generated from WGS data for a large population of European ancestry, as well as populations of Baganda, Ethiopian and Zulu ancestry. We achieve high average genetic marker densities of 2.3–4.6/kb. These maps show good agreement with prior, low resolution maps and are consistent between populations. Files are provided in BED format to allow researchers to readily utilise this resource.
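As a rough illustration of what a cumulative LD map is: in the Malécot-Morton framework as it is usually presented, association between markers declines with physical distance d roughly as exp(−εd), and map length in LD units is the sum of ε_i·d_i over successive marker intervals. The Python sketch below accumulates such a map. It assumes the per-interval ε values have already been estimated; the function name and the toy numbers are hypothetical, and this is not the authors' pipeline.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_ld_map(epsilons: Sequence[float],
                      distances_kb: Sequence[float]) -> List[float]:
    """Return the cumulative LD-unit position of each marker.

    epsilons[i] is the assumed decline parameter for the interval between
    marker i and marker i+1, and distances_kb[i] is that interval's length.
    The first marker is placed at 0 LD units.
    """
    if len(epsilons) != len(distances_kb):
        raise ValueError("need one epsilon per marker interval")
    interval_ldu = [eps * dist for eps, dist in zip(epsilons, distances_kb)]
    return [0.0] + list(accumulate(interval_ldu))


# Three intervals between four markers; the last interval has epsilon = 0,
# so it adds no LD-map length (a "block" of unbroken LD).
ld_positions = cumulative_ld_map([0.5, 2.0, 0.0], [2.0, 1.0, 4.0])
print(ld_positions)
```

Flat stretches of the resulting map correspond to regions of strong LD, while steep steps correspond to intervals where LD breaks down quickly.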

Identification of disease-associated loci using machine learning for genotype and network data integration

Bioinformatics ◽

10.1093/bioinformatics/btz310 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5182-5190 ◽

Cited By ~ 4

Author(s):

Luis G Leal ◽

Alessia David ◽

Marjo-Riitta Jarvelin ◽

Sylvain Sebert ◽

Minna Männikkö ◽

...

Keyword(s):

Machine Learning ◽

Gene Networks ◽

Association Studies ◽

R Package ◽

Biological Data ◽

Machine Learning Algorithms ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Omics Data ◽

Missing Heritability

Abstract Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation: An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data

Briefings in Bioinformatics ◽

10.1093/bib/bbz041 ◽

2019 ◽

Vol 21 (3) ◽

pp. 1047-1057 ◽

Cited By ~ 57

Author(s):

Zhen Chen ◽

Pei Zhao ◽

Fuyi Li ◽

Tatiana T Marquez-Lago ◽

André Leier ◽

...

Keyword(s):

Machine Learning ◽

Dimensionality Reduction ◽

Sequence Data ◽

Machine Learning Algorithms ◽

User Friendliness ◽

Data Set ◽

Protein Sequence Data ◽

Learning Analysis ◽

High Throughput Manner ◽

Online Web

Abstract With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.

A conditional multi-trait sequence GWAS discovers pleiotropic candidate genes and variants for sheep wool, skin wrinkle and breech cover traits

Genetics Selection Evolution ◽

10.1186/s12711-021-00651-0 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Sunduimijid Bolormaa ◽

Andrew A. Swan ◽

Paul Stothard ◽

Majid Khansefid ◽

Nasir Moghaddar ◽

...

Keyword(s):

Candidate Genes ◽

Genome Sequence ◽

Sequence Data ◽

Association Studies ◽

Whole Genome Sequence ◽

Genome Wide Association Studies ◽

Whole Genome ◽

Merino Wool ◽

Causal Variants ◽

Wool Proteins

Abstract Background: Imputation to whole-genome sequence is now possible in large sheep populations. It is therefore of interest to use this data in genome-wide association studies (GWAS) to investigate putative causal variants and genes that underpin economically important traits. Merino wool is globally sought after for luxury fabrics, but some key wool quality attributes are unfavourably correlated with the characteristic skin wrinkle of Merinos. In turn, skin wrinkle is strongly linked to susceptibility to “fly strike” (Cutaneous myiasis), which is a major welfare issue. Here, we use whole-genome sequence data in a multi-trait GWAS to identify pleiotropic putative causal variants and genes associated with changes in key wool traits and skin wrinkle. Results: A stepwise conditional multi-trait GWAS (CM-GWAS) identified putative causal variants and related genes from 178 independent quantitative trait loci (QTL) of 16 wool and skin wrinkle traits, measured on up to 7218 Merino sheep with 31 million imputed whole-genome sequence (WGS) genotypes. Novel candidate gene findings included the MAT1A gene that encodes an enzyme involved in the sulphur metabolism pathway critical to production of wool proteins, and the ESRP1 gene. We also discovered a significant wrinkle variant upstream of the HAS2 gene, which in dogs is associated with the exaggerated skin folds in the Shar-Pei breed. Conclusions: The wool and skin wrinkle traits studied here appear to be highly polygenic with many putative candidate variants showing considerable pleiotropy. Our CM-GWAS identified many highly plausible candidate genes for wool traits as well as breech wrinkle and breech area wool cover.

SeqBreed: a python tool to evaluate genomic prediction in complex scenarios

10.1101/748624 ◽

2019 ◽

Author(s):

M. Pérez-Enciso ◽

L. C. Ramírez-Ayala ◽

L.M. Zingaretti

Keyword(s):

Genomic Prediction ◽

Predictive Accuracy ◽

Sequence Data ◽

Association Studies ◽

Single Step ◽

Genome Wide Association ◽

Drosophila Genome ◽

Genome Wide Association Studies ◽

Complex Phenotypes ◽

Genome Wide

Abstract Background: Genomic Prediction (GP) is the procedure whereby molecular information is used to predict complex phenotypes. Although GP can significantly enhance predictive accuracy, it can be expensive and difficult to implement. To help in designing optimum experiments, including genome wide association studies and genomic selection experiments, we have developed SeqBreed, a generic and flexible python3 forward simulator. Results: SeqBreed accommodates sex and mitochondrion chromosomes as well as autopolyploidy. It can simulate any number of complex phenotypes determined by any number of causal loci. SeqBreed implements several GP methods, including single step GBLUP. We demonstrate its functionality with Drosophila Genome Reference Panel (DGRP) sequence data and with tetraploid potato genotypes. Conclusions: SeqBreed is a flexible and easy to use tool appropriate for optimizing GP or genome wide association studies. It incorporates some of the most popular GP methods and includes several visualization tools. Code is open and can be freely modified. Software, documentation and examples are available at https://github.com/miguelperezenciso/SeqBreed.

Machine learning methods applied to genotyping data capture interactions between single nucleotide variants in late onset Alzheimer's disease

10.1101/2021.08.30.21262815 ◽

2021 ◽

Author(s):

Magdalena Arnal Segura ◽

Dietmar Fernandez ◽

Claudia Giambartolomei ◽

Giorgio Bini ◽

Eleftherios Samaras ◽

...

Keyword(s):

Machine Learning ◽

Alzheimer's Disease ◽

Late Onset ◽

Hot Spot ◽

Association Studies ◽

Machine Learning Algorithms ◽

Genome Wide Association Studies ◽

Single Nucleotide Variants ◽

Single Nucleotide

INTRODUCTION: Genome-wide association studies (GWAS) in late onset Alzheimer's disease (LOAD) provide lists of individual genetic determinants. However, GWAS are not good at capturing the synergistic effects among multiple genetic variants and lack good specificity. METHODS: We applied tree-based machine learning algorithms (MLs) to discriminate LOAD (> 700 individuals) and age-matched unaffected subjects using single nucleotide variants (SNVs) from AD studies, obtaining specific genomic profiles with the prioritized SNVs. RESULTS: The MLs prioritized a set of SNVs located in the closely neighbouring genes PVRL2, TOMM40, APOE and APOC1. The captured genomic profiles in this region showed a clear interaction between rs405509 and rs1160985. Additionally, rs405509, located in the APOE promoter, interacts with rs429358 among others, seemingly neutralizing their predisposing effect. Interactions are characterized by their association with specific comorbidities and the presence of eQTLs and sQTLs. DISCUSSION: Our approach efficiently discriminates LOAD from controls, capturing genomic profiles defined by interactions among SNVs in a hot-spot region.

Comparing power and precision of within-breed and multibreed genome-wide association studies of production traits using whole-genome sequence data for 5 French and Danish dairy cattle breeds

Journal of Dairy Science ◽

10.3168/jds.2016-11073 ◽

2016 ◽

Vol 99 (11) ◽

pp. 8932-8945 ◽

Cited By ~ 21

Author(s):

Irene van den Berg ◽

Didier Boichard ◽

Mogens Sandø Lund

Keyword(s):

Dairy Cattle ◽

Sequence Data ◽

Association Studies ◽

Genome Wide Association ◽

Whole Genome Sequence ◽

Genome Wide Association Studies ◽

Whole Genome ◽

Production Traits ◽

Genome Wide ◽

Danish Dairy

VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data

GigaScience ◽

10.1093/gigascience/giaa077 ◽

2020 ◽

Vol 9 (8) ◽

Author(s):

Arash Bayat ◽

Piotr Szul ◽

Aidan R O’Brien ◽

Robert Dunne ◽

Brendan Hosking ◽

...

Keyword(s):

Machine Learning ◽

Association Studies ◽

Genomic Data ◽

Genome Wide Association ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Epistatic Interactions ◽

Genomic Variants ◽

Complex Phenotypes ◽

Genome Wide

Abstract Background: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS only considers the additive effect of individual genes but not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. Findings: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. Conclusions: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.
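The contrast this abstract draws between additive scores and epistatic interactions can be made concrete with a small Python sketch. It assumes genotypes coded as allele counts (0, 1 or 2) and hand-picked weights. The function names, weights and the pairwise product term are hypothetical illustrations, not VariantSpark's API or model.

```python
from typing import Dict, Sequence, Tuple


def polygenic_risk_score(genotype: Sequence[int],
                         weights: Sequence[float]) -> float:
    """Additive score: each variant contributes weight * allele count."""
    return sum(w * g for w, g in zip(weights, genotype))


def score_with_epistasis(genotype: Sequence[int],
                         weights: Sequence[float],
                         pair_weights: Dict[Tuple[int, int], float]) -> float:
    """Additive score plus an illustrative pairwise interaction term.

    pair_weights maps a pair of variant indices (i, j) to the weight of the
    product genotype[i] * genotype[j]; this is one simple way to encode an
    interaction, chosen here only for illustration.
    """
    score = polygenic_risk_score(genotype, weights)
    for (i, j), w in pair_weights.items():
        score += w * genotype[i] * genotype[j]
    return score


genotype = [1, 0, 2]            # allele counts at three variants
weights = [0.5, 1.0, 0.25]      # hypothetical per-variant effects
pairs = {(0, 2): 0.25}          # hypothetical interaction of variants 0 and 2

additive = polygenic_risk_score(genotype, weights)
with_interaction = score_with_epistasis(genotype, weights, pairs)
print(additive, with_interaction)
```

With n variants there are n(n−1)/2 candidate pairs, so exhaustively testing interaction terms becomes expensive very quickly at genome scale, which is the computational obstacle the abstract refers to.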

Increasing calling accuracy, coverage, and read-depth in sequence data by the use of haplotype blocks

PLoS Genetics ◽

10.1371/journal.pgen.1009944 ◽

2021 ◽

Vol 17 (12) ◽

pp. e1009944

Author(s):

Torsten Pook ◽

Adnane Nemri ◽

Eric Gerardo Gonzalez Segovia ◽

Daniel Valle Torres ◽

Henner Simianer ◽

...

Keyword(s):

Data Quality ◽

Genomic Prediction ◽

Sequence Data ◽

Association Studies ◽

Genomic Data ◽

Read Depth ◽

Error Rates ◽

Whole Genome Sequence ◽

Genome Wide Association Studies ◽

Haplotype Blocks

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing genotyping technologies when resources are limited. In this work, we are proposing a new imputation pipeline (“HBimpute”) that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputing error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par or even slightly better than results obtained with high-density array data (600k). In particular for genomic prediction, we observe slightly higher data quality for the sequence data compared to the 600k array in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.
