S/HIC: Robust identification of soft and hard sweeps using machine learning

Mapping Intimacies ◽

10.1101/024547 ◽

2015 ◽

Cited By ~ 1

Author(s):

Daniel R. Schrider ◽

Andrew D. Kern

Keyword(s):

Machine Learning ◽

Population Sample ◽

Natural Populations ◽

Supervised Machine Learning ◽

Whole Genome Sequencing Data ◽

Human Populations ◽

Demographic Model ◽

Selective Sweeps ◽

Sequencing Data ◽

Standing Variation

Abstract: Detecting the targets of adaptive natural selection from whole genome sequencing data is a central problem for population genetics. However, to date most methods have shown sub-optimal performance under realistic demographic scenarios. Moreover, over the past decade there has been a renewed interest in determining the importance of selection from standing variation in adaptation of natural populations, yet very few methods for inferring this model of adaptation at the genome scale have been introduced. Here we introduce a new method, S/HIC, which uses supervised machine learning to precisely infer the location of both hard and soft selective sweeps. We show that S/HIC has unrivaled accuracy for detecting sweeps under demographic histories that are relevant to human populations, and distinguishing sweeps from linked as well as neutrally evolving regions. Moreover we show that S/HIC is uniquely robust among its competitors to model misspecification. Thus even if the true demographic model of a population differs catastrophically from that specified by the user, S/HIC still retains impressive discriminatory power. Finally we apply S/HIC to the case of resequencing data from human chromosome 18 in a European population sample and demonstrate that we can reliably recover selective sweeps that have been identified earlier using less specific and sensitive methods.


On the Unfounded Enthusiasm for Soft Selective Sweeps III: The Supervised Machine Learning Algorithm That Isn’t

Genes ◽

10.3390/genes12040527 ◽

2021 ◽

Vol 12 (4) ◽

pp. 527

Author(s):

Eran Elhaik ◽

Dan Graur

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

A Priori ◽

Neutral Theory ◽

Dominant Mode ◽

Supervised Machine Learning ◽

Training Dataset ◽

Selective Sweeps ◽

Two Factors ◽

Negative Controls

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.


Hierarchical boosting: a machine-learning framework to detect and classify hard selective sweeps in human populations

Bioinformatics ◽

10.1093/bioinformatics/btv493 ◽

2015 ◽

pp. btv493 ◽

Cited By ~ 29

Author(s):

Marc Pybus ◽

Pierre Luisi ◽

Giovanni Marco Dall'Olio ◽

Manu Uzkudun ◽

Hafid Laayouni ◽

...

Keyword(s):

Machine Learning ◽

Human Populations ◽

Selective Sweeps ◽

Learning Framework


Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees

Journal of Nucleic Acids ◽

10.1155/2012/652979 ◽

2012 ◽

Vol 2012 ◽

pp. 1-10 ◽

Cited By ~ 14

Author(s):

Philip H. Williams ◽

Rod Eyles ◽

Georg Weiller

Keyword(s):

Machine Learning ◽

Sequence Data ◽

Mature Mirnas ◽

Read Count ◽

Supervised Machine Learning ◽

Sequencing Data ◽

Tree Model ◽

Rigorous Testing ◽

Plant Mirna ◽

Leave One Out

MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require “read count” to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA* duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.
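As a rough illustration of how sensitivity and specificity can be obtained from leave-one-out validation, here is a minimal Python sketch. It is not the authors' method: the classifier is a 1-nearest-neighbour stand-in rather than a C5.0 decision tree, and the feature values (duplex energy, mismatch count) and function names are hypothetical.

```python
# Toy illustration only: a 1-nearest-neighbour stand-in, NOT the C5.0 tree
# described in the abstract. All feature values below are made up.

def predict_1nn(train, query):
    """Return the label of the training example closest to `query`."""
    nearest = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return nearest[1]


def leave_one_out(data):
    """Hold out each example once; return (sensitivity, specificity)."""
    tp = tn = fp = fn = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every example except the held-out one
        pred = predict_1nn(train, features)
        if label == 1 and pred == 1:
            tp += 1
        elif label == 1 and pred == 0:
            fn += 1
        elif label == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn)  # fraction of true miRNAs recovered
    specificity = tn / (tn + fp)  # fraction of non-miRNAs correctly rejected
    return sensitivity, specificity


# Hypothetical features: (duplex energy, mismatches); label 1 = miRNA.
toy = [
    ((-30.0, 1), 1), ((-28.0, 2), 1), ((-29.0, 1), 1),
    ((-10.0, 6), 0), ((-12.0, 5), 0), ((-11.0, 7), 0),
]
sens, spec = leave_one_out(toy)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

The two classes in this toy set are well separated, so every held-out example is classified correctly; real data would not be this clean.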


Machine Learning Predicts Accurately Mycobacterium tuberculosis Drug Resistance From Whole Genome Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2019.00922 ◽

2019 ◽

Vol 10 ◽

Cited By ~ 7

Author(s):

Wouter Deelder ◽

Sofia Christakoudi ◽

Jody Phelan ◽

Ernest Diez Benavente ◽

Susana Campino ◽

...

Keyword(s):

Machine Learning ◽

Drug Resistance ◽

Mycobacterium Tuberculosis ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data


Drainage-structuring of ancestral variation and a common functional pathway shape limited genomic convergence in natural high- and low-predation guppies

10.1101/2020.10.14.339333 ◽

2020 ◽

Author(s):

James R. Whiting ◽

Josephine R. Paris ◽

Mijke J. van der Zee ◽

Paul J. Parsons ◽

Detlef Weigel ◽

...

Keyword(s):

Poecilia Reticulata ◽

Natural Populations ◽

Whole Genome Sequencing Data ◽

Phenotypic Change ◽

Whole Genome ◽

Sequencing Data ◽

Adaptive Variation ◽

Genome Level ◽

Functional Pathway ◽

Three Rivers

Abstract: Studies of convergence in wild populations have been instrumental in understanding adaptation by providing strong evidence for natural selection. At the genetic level, we are beginning to appreciate that the re-use of the same genes in adaptation occurs through different mechanisms and can be constrained by underlying trait architectures and demographic characteristics of natural populations. Here, we explore these processes in naturally adapted high- (HP) and low-predation (LP) populations of the Trinidadian guppy, Poecilia reticulata. As a model for phenotypic change this system provided some of the earliest evidence of rapid and repeatable evolution in vertebrates; the genetic basis of which has yet to be studied at the whole-genome level. We collected whole-genome sequencing data from ten populations (176 individuals) representing five independent HP-LP river pairs across the three main drainages in Northern Trinidad. We evaluate population structure, uncovering several LP bottlenecks and variable between-river introgression that can lead to constraints on the sharing of adaptive variation between populations. Consequently, we found limited selection on common genes or loci across all drainages. Using a pathway type analysis, however, we find evidence of repeated selection on different genes involved in cadherin signalling. Finally, we found a large repeatedly selected haplotype on chromosome 20 in three rivers from the same drainage. Taken together, despite limited sharing of adaptive variation among rivers, we found evidence of convergent evolution associated with HP-LP environments in pathways across divergent drainages and at a previously unreported candidate haplotype within a drainage.


Evaluation of parameters affecting performance and reliability of machine learning-based antibiotic susceptibility testing from whole genome sequencing data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1007349 ◽

2019 ◽

Vol 15 (9) ◽

pp. e1007349 ◽

Cited By ~ 20

Author(s):

Allison L. Hicks ◽

Nicole Wheeler ◽

Leonor Sánchez-Busó ◽

Jennifer L. Rakeman ◽

Simon R. Harris ◽

...

Keyword(s):

Machine Learning ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Antibiotic Susceptibility ◽

Susceptibility Testing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Antibiotic Susceptibility Testing ◽

Sequencing Data ◽

Evaluation Of Parameters


A Likelihood Approach for Uncovering Selective Sweep Signatures from Haplotype Data

Molecular Biology and Evolution ◽

10.1093/molbev/msaa115 ◽

2020 ◽

Vol 37 (10) ◽

pp. 3023-3046

Author(s):

Alexandre M Harris ◽

Michael DeGiorgio

Keyword(s):

Selective Sweep ◽

Natural Populations ◽

Haplotype Frequency ◽

Human Populations ◽

Ratio Test ◽

Whole Genome ◽

Selective Sweeps ◽

Test Statistic ◽

Haplotypic Diversity ◽

Haplotype Data

Abstract Selective sweeps are frequent and varied signatures in the genomes of natural populations, and detecting them is consequently important in understanding mechanisms of adaptation by natural selection. Following a selective sweep, haplotypic diversity surrounding the site under selection decreases, and this deviation from the background pattern of variation can be applied to identify sweeps. Multiple methods exist to locate selective sweeps in the genome from haplotype data, but none leverages the power of a model-based approach to make their inference. Here, we propose a likelihood ratio test statistic T to probe whole-genome polymorphism data sets for selective sweep signatures. Our framework uses a simple but powerful model of haplotype frequency spectrum distortion to find sweeps and additionally make an inference on the number of presently sweeping haplotypes in a population. We found that the T statistic is suitable for detecting both hard and soft sweeps across a variety of demographic models, selection strengths, and ages of the beneficial allele. Accordingly, we applied the T statistic to variant calls from European and sub-Saharan African human populations, yielding primarily literature-supported candidates, including LCT, RSPH3, and ZNF211 in CEU, SYT1, RGS18, and NNT in YRI, and HLA genes in both populations. We also searched for sweep signatures in Drosophila melanogaster, finding expected candidates at Ace, Uhg1, and Pimet. Finally, we provide open-source software to compute the T statistic and the inferred number of presently sweeping haplotypes from whole-genome data.


Exploring the Occurrence of Classic Selective Sweeps in Humans Using Whole-Genome Sequencing Data Sets

Molecular Biology and Evolution ◽

10.1093/molbev/msu118 ◽

2014 ◽

Vol 31 (7) ◽

pp. 1850-1868 ◽

Cited By ~ 52

Author(s):

Maud Fagny ◽

Etienne Patin ◽

David Enard ◽

Luis B. Barreiro ◽

Lluis Quintana-Murci ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Data Sets ◽

Whole Genome ◽

Selective Sweeps ◽

Sequencing Data


TypeTE: a tool to genotype mobile element insertions from whole genome resequencing data

10.1101/791665 ◽

2019 ◽

Cited By ~ 1

Author(s):

Clement Goubert ◽

Jainy Thomas ◽

Lindsay M. Payer ◽

Jeffrey M. Kidd ◽

Julie Feusier ◽

...

Keyword(s):

Population Genomics ◽

Mobile Element ◽

Whole Genome Sequencing Data ◽

Human Populations ◽

Whole Genome ◽

Structural Variants ◽

Sequencing Data ◽

1000 Genomes ◽

Standard Set ◽

Whole Genome Resequencing

Abstract: Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alu are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alu and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline -- TypeTE -- which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a ‘gold standard’ set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.
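To make the idea of genotype likelihoods from re-mapped reads more concrete, here is a minimal Python sketch. The abstract does not describe TypeTE's actual likelihood model, so this assumes a simple binomial model with a fixed per-read error rate; the function name, the error rate, and the read counts are all hypothetical.

```python
from math import comb

# Assumed model (not taken from the abstract): each read independently
# supports the presence allele with a probability that depends on genotype.

def genotype_likelihoods(presence_reads, total_reads, error_rate=0.01):
    """Return binomial likelihoods for absence/absence, heterozygous,
    and presence/presence genotypes given read support counts."""
    p_presence = {
        "0/0": error_rate,        # presence reads arise only through error
        "0/1": 0.5,               # either allele is equally likely to be sampled
        "1/1": 1.0 - error_rate,  # absence reads arise only through error
    }
    k, n = presence_reads, total_reads
    return {
        genotype: comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for genotype, p in p_presence.items()
    }


# Hypothetical locus: 5 of 10 reads map to the reconstructed presence allele.
likelihoods = genotype_likelihoods(5, 10)
best = max(likelihoods, key=likelihoods.get)
print(best, likelihoods)
```

With balanced read support the heterozygous genotype has the highest likelihood under this model; a production caller would typically also account for mapping quality and allele-specific mapping bias.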


Drainage-structuring of ancestral variation and a common functional pathway shape limited genomic convergence in natural high- and low-predation guppies

PLoS Genetics ◽

10.1371/journal.pgen.1009566 ◽

2021 ◽

Vol 17 (5) ◽

pp. e1009566

Author(s):

James R. Whiting ◽

Josephine R. Paris ◽

Mijke J. van der Zee ◽

Paul J. Parsons ◽

Detlef Weigel ◽

...

Keyword(s):

Poecilia Reticulata ◽

Natural Populations ◽

Whole Genome Sequencing Data ◽

Phenotypic Change ◽

Whole Genome ◽

Sequencing Data ◽

Adaptive Variation ◽

Genome Level ◽

Functional Pathway ◽

Three Rivers

Studies of convergence in wild populations have been instrumental in understanding adaptation by providing strong evidence for natural selection. At the genetic level, we are beginning to appreciate that the re-use of the same genes in adaptation occurs through different mechanisms and can be constrained by underlying trait architectures and demographic characteristics of natural populations. Here, we explore these processes in naturally adapted high- (HP) and low-predation (LP) populations of the Trinidadian guppy, Poecilia reticulata. As a model for phenotypic change this system provided some of the earliest evidence of rapid and repeatable evolution in vertebrates; the genetic basis of which has yet to be studied at the whole-genome level. We collected whole-genome sequencing data from ten populations (176 individuals) representing five independent HP-LP river pairs across the three main drainages in Northern Trinidad. We evaluate population structure, uncovering several LP bottlenecks and variable between-river introgression that can lead to constraints on the sharing of adaptive variation between populations. Consequently, we found limited selection on common genes or loci across all drainages. Using a pathway type analysis, however, we find evidence of repeated selection on different genes involved in cadherin signaling. Finally, we found a large repeatedly selected haplotype on chromosome 20 in three rivers from the same drainage. Taken together, despite limited sharing of adaptive variation among rivers, we found evidence of convergent evolution associated with HP-LP environments in pathways across divergent drainages and at a previously unreported candidate haplotype within a drainage.
