Flexible Mixture Model Approaches That Accommodate Footprint Size Variability for Robust Detection of Balancing Selection

Xiaoheng Cheng; Michael DeGiorgio

doi:10.1093/molbev/msaa134

Flexible Mixture Model Approaches That Accommodate Footprint Size Variability for Robust Detection of Balancing Selection

Molecular Biology and Evolution ◽

10.1093/molbev/msaa134 ◽

2020 ◽

Vol 37 (11) ◽

pp. 3267-3291 ◽

Cited By ~ 1

Author(s):

Xiaoheng Cheng ◽

Michael DeGiorgio

Keyword(s):

Pain Perception ◽

Balancing Selection ◽

Genomic Data ◽

Composite Likelihood ◽

Ratio Test ◽

Data Set ◽

Robust Detection ◽

Population Genomic ◽

Genomic Regions

Abstract Long-term balancing selection typically leaves narrow footprints of increased genetic diversity, and therefore most detection approaches only achieve optimal performances when sufficiently small genomic regions (i.e., windows) are examined. Such methods are sensitive to window sizes and suffer substantial losses in power when windows are large. Here, we employ mixture models to construct a set of five composite likelihood ratio test statistics, which we collectively term B statistics. These statistics are agnostic to window sizes and can operate on diverse forms of input data. Through simulations, we show that they exhibit comparable power to the best-performing current methods, and retain substantially high power regardless of window sizes. They also display considerable robustness to high mutation rates and uneven recombination landscapes, as well as an array of other common confounding scenarios. Moreover, we applied a specific version of the B statistics, termed B2, to a human population-genomic data set and recovered many top candidates from prior studies, including the then-uncharacterized STPG2 and CCDC169–SOHLH2, both of which are related to gamete functions. We further applied B2 on a bonobo population-genomic data set. In addition to the MHC-DQ genes, we uncovered several novel candidate genes, such as KLRD1, involved in viral defense, and SCN9A, associated with pain perception. Finally, we show that our methods can be extended to account for multiallelic balancing selection and integrated the set of statistics into open-source software named BalLeRMix for future applications by the scientific community.

Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection

10.1101/645887 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xiaoheng Cheng ◽

Michael DeGiorgio

Keyword(s):

Pain Perception ◽

Balancing Selection ◽

Composite Likelihood ◽

Ratio Test ◽

Genomic Dataset ◽

Robust Detection ◽

Population Genomic ◽

Size Variability ◽

Genomic Regions

Abstract Long-term balancing selection typically leaves narrow footprints of increased genetic diversity, and therefore most detection approaches only achieve optimal performances when sufficiently small genomic regions (i.e., windows) are examined. Such methods are sensitive to window sizes and suffer substantial losses in power when windows are large. This issue creates a tradeoff between noise and power in empirical applications. Here, we employ mixture models to construct a set of five composite likelihood ratio test statistics, which we collectively term B statistics. These statistics are agnostic to window sizes and can operate on diverse forms of input data. Through simulations, we show that they exhibit comparable power to the best-performing current methods, and retain substantially high power regardless of window sizes. They also display considerable robustness to high mutation rates and uneven recombination landscapes, as well as an array of other common confounding scenarios. Moreover, we applied a specific version of the B statistics, termed B2, to a human population-genomic dataset and recovered many top candidates from prior studies, including the then-uncharacterized STPG2 and CCDC169-SOHLH2, both of which are related to gamete functions. We further applied B2 on a bonobo population-genomic dataset. In addition to the MHC-DQ genes, we uncovered several novel candidate genes, such as KLRD1, involved in viral defense, and SCN9A, associated with pain perception. Finally, we show that our methods can be extended to account for multi-allelic balancing selection, and integrated the set of statistics into open-source software named BalLeRMix for future applications by the scientific community.

A new test suggests that balancing selection maintains hundreds of non-synonymous polymorphisms in the human genome

10.1101/2021.02.08.430226 ◽

2021 ◽

Cited By ~ 1

Author(s):

Vivak Soni ◽

Michiel Vos ◽

Adam Eyre-Walker

Keyword(s):

Genetic Diversity ◽

Human Genome ◽

Balancing Selection ◽

Genomic Data ◽

Demographic Changes ◽

Simple Test ◽

Direct Estimate ◽

Population Genomic

Abstract The role that balancing selection plays in the maintenance of genetic diversity remains unresolved. Here we introduce a new test, based on the McDonald-Kreitman test, in which the number of polymorphisms that are shared between populations is contrasted to those that are private at selected and neutral sites. We show that this simple test is robust to a variety of demographic changes, and that it can also give a direct estimate of the number of shared polymorphisms that are directly maintained by balancing selection. We apply our method to population genomic data from humans and conclude that more than a thousand non-synonymous polymorphisms are subject to balancing selection.

Multiple targets of balancing selection in Leishmania donovani complex parasites

10.1101/2021.03.02.433528 ◽

2021 ◽

Author(s):

Cooper Alastair Grace ◽

Sarah Forrester ◽

Vladimir Costa Silva ◽

Aleksander Aare ◽

Hannah Kilford ◽

...

Keyword(s):

Candidate Genes ◽

Species Complex ◽

Leishmania Donovani ◽

Balancing Selection ◽

Genomic Data ◽

Causative Agents ◽

Signatures Of Selection ◽

Attachment Proteins ◽

Multiple Metrics

Abstract The Leishmania donovani species complex comprises the causative agents of visceral leishmaniasis, which causes 20,000-40,000 fatalities a year. Here, we conduct a screen for balancing selection in this species complex. We sequence 93 isolates of L. infantum from Brazil and use 387 publicly available L. donovani and L. infantum genomes to describe the global diversity of this species complex. We identify five genetically distinct populations that are sufficiently represented by genomic data to search for signatures of selection. We show that multiple metrics identify genes with robust signatures of balancing selection. We produce a curated set of 19 genes with robust signatures, including zeta toxin, nodulin-like and flagellum attachment proteins. Candidate genes were generally not shared between populations, consistent with divergent rather than long-term balancing selection in these species. This study highlights the extent of genetic divergence between L. donovani complex parasites and provides candidate genes for further study.

Quantifying adaptive evolution and the effects of natural selection across the Norway spruce genome

10.1101/2020.06.25.170902 ◽

2020 ◽

Author(s):

Xi Wang ◽

Pär K Ingvarsson

Keyword(s):

Natural Selection ◽

Norway Spruce ◽

Balancing Selection ◽

Enrichment Analysis ◽

Coding Regions ◽

Regulatory Changes ◽

Whole Genomes ◽

Adaptive Rate ◽

Genomic Regions

Abstract Detecting natural selection is one of the major goals of evolutionary genomics. Here, we sequence whole genomes of 34 Picea abies individuals and quantify the amount of selection across the genome. Using an estimate of the distribution of fitness effects, we show that negative selection is very limited in coding regions, while positive selection is rare in coding regions but very strong in non-coding regions, suggesting the great importance of regulatory changes in the evolution of Norway spruce. Additionally, we found a positive correlation between adaptive rate and recombination rate and a negative correlation between adaptive rate and gene density, suggesting a widespread influence of Hill-Robertson interference on the efficiency of protein adaptation in P. abies. Finally, population statistics in genomic regions under either positive or balancing selection were distinct from those in neutral regions, indicating an impact of selection on the genomic architecture of Norway spruce. Gene ontology enrichment analysis of genes located in regions identified as undergoing either positive or long-term balancing selection also highlighted specific molecular functions and biological processes that appear to be targets of selection in Norway spruce.

Speciation and introgression between Mimulus nasutus and Mimulus guttatus

10.1101/000109 ◽

2013 ◽

Cited By ~ 1

Author(s):

Yaniv Brandvain ◽

Amanda M Kenney ◽

Lex Fagel ◽

Graham Coop ◽

Andrea L Sweigart

Keyword(s):

Ecological Model ◽

Negative Relationship ◽

Genomic Data ◽

Mimulus Guttatus ◽

Species Pair ◽

Data Set ◽

Genomic Signatures ◽

Population Genomic ◽

History Of ◽

Biological Differentiation

Mimulus guttatus and M. nasutus are an evolutionary and ecological model sister species pair differentiated by ecology, mating system, and partial reproductive isolation. Despite extensive research on this system, the history of divergence and differentiation in this sister pair is unclear. We present and analyze a novel population genomic data set which shows that M. nasutus "budded" off of a central Californian M. guttatus population within the last 200 to 500 thousand years. In this time, the M. nasutus genome has accrued numerous genomic signatures of the transition to predominant selfing. Despite clear biological differentiation, we document ongoing, bidirectional introgression. We observe a negative relationship between the recombination rate and divergence between M. nasutus and sympatric M. guttatus samples, suggesting that selection acts against M. nasutus ancestry in M. guttatus.

The analysis of epigenomic evolution

10.1101/2021.03.03.433796 ◽

2021 ◽

Author(s):

Arne Sahm ◽

Philipp Koch ◽

Steve Horvath ◽

Steve Hoffmann

Keyword(s):

Binding Sites ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

Systematic Investigation ◽

Data Set ◽

Evolutionarily Conserved ◽

Immune Related Genes ◽

Genomic Regions ◽

Blood Data

While the investigation of the epigenome becomes increasingly important, still little is known about the long-term evolution of epigenetic marks, and systematic investigation strategies are still lacking. Here, we systematically demonstrate the transfer of classic phylogenetic methods, such as maximum likelihood based on substitution models, parsimony, and distance-based methods, to interval-scaled epigenetic data (available at GitHub). Using a great ape blood data set, we demonstrate that DNA methylation is evolutionarily conserved at the level of individual CpGs in promoters, enhancers and genic regions. Our analysis also reveals that this epigenomic conservation is significantly correlated with its transcription factor binding density. Binding sites for transcription factors involved in neuron differentiation and components of AP-1 evolve at a significantly higher rate at the methylation level than at the nucleotide level. Moreover, our models suggest an accelerated epigenomic evolution at binding sites of BRCA1, CBX2, and factors of the Polycomb repressive complex 2 in humans. For most genomic regions, the methylation-based reconstruction of phylogenetic trees is on par with sequence-based reconstruction. Most strikingly, phylogenetic reconstruction using methylation rates in enhancer regions was ineffective regardless of the chosen model. We identify a set of phylogenetically uninformative CpG sites enriched in enhancers controlling immune-related genes.

Identifying clinically relevant prognostic subgroups in node-positive postmenopausal HR+ early breast cancer patients treated with endocrine therapy: A combined analysis of 2,485 patients from ABCSG-8 and ATAC using the PAM50 risk of recurrence (ROR) score and intrinsic subtype.

Journal of Clinical Oncology ◽

10.1200/jco.2013.31.15_suppl.506 ◽

2013 ◽

Vol 31 (15_suppl) ◽

pp. 506-506 ◽

Cited By ~ 7

Author(s):

Michael Gnant ◽

Mitchell Dowsett ◽

Martin Filipits ◽

Elena Lopez-Knowles ◽

Richard Greil ◽

...

Keyword(s):

Adjuvant Chemotherapy ◽

Endocrine Therapy ◽

Recurrence Risk ◽

Distant Recurrence ◽

Ratio Test ◽

Prognostic Information ◽

Data Set ◽

Combined Analysis ◽

Node Positive

506 Background: Most postmenopausal women with node-positive HR+ EBC receive adjuvant chemotherapy. We hypothesized that a molecular-based characterization of residual risk after endocrine therapy using the ROR score and IS may identify node-positive patient subgroups with limited long-term recurrence risk after endocrine therapy better than clinical-pathological risk assessment by clinical treatment score (CTS) alone. Methods: Long-term follow-up and tissue samples were obtained from 2,485 postmenopausal HR+ patients from the ABCSG-8 (N=1,478) and transATAC (N=1,007) trials. The PAM50 test was conducted on RNA extracted from paraffin blocks using the NanoString nCounter Analysis system. The ability of ROR, IS and ROR-defined risk groups (ROR-RG) to add prognostic information to CTS was assessed by the likelihood ratio test in a prospectively defined analysis plan. Results: Patients in the combined data set were grouped by the number of positive nodes into 1 (N1), 2 (N2), or 2 or 3 (N2-3). Baseline hazards for these subgroups were similar in the two trials. ROR score, IS and ROR-RG added statistically significant prognostic information (10-year distant recurrence risk) beyond CTS in all groups. In patients with one positive node, the absolute 10-year risk of distant recurrence was 6.6% [95% CI: 3.3%-12.8%] in the PAM50 low-risk group (40% of patients) and 8.4% [5.3%-13.3%] in the Luminal A subgroup (69% of patients). Conclusions: The results of this combined analysis demonstrate that a significant proportion of N1 EBC patients have very limited long-term recurrence risk and suggest the same for some N2 patients. The PAM50 ROR score, IS and ROR-RG reliably provide additional prognostic information beyond CTS and may be useful in deciding which women with node-positive HR+ EBC can be spared adjuvant chemotherapy. [Table: see text]

A Composite-Likelihood Method for Detecting Incomplete Selective Sweep from Population Genomic Data

Genetics ◽

10.1534/genetics.115.175380 ◽

2015 ◽

Vol 200 (2) ◽

pp. 633-649 ◽

Cited By ~ 21

Author(s):

Ha My T. Vy ◽

Yuseob Kim

Keyword(s):

Selective Sweep ◽

Genomic Data ◽

Composite Likelihood ◽

Likelihood Method ◽

Population Genomic

Detection of shared balancing selection in the absence of trans-species polymorphism

10.1101/320390 ◽

2018 ◽

Author(s):

Xiaoheng Cheng ◽

Michael DeGiorgio

Keyword(s):

Balancing Selection ◽

Single Species ◽

Ease Of Use ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Model Based ◽

Higher Power ◽

Multiple Species ◽

Genomic Regions

Abstract Trans-species polymorphism has been widely used as a key sign of long-term balancing selection across multiple species. However, such sites are often rare in the genome, and could result from mutational processes or technical artifacts. Few methods are yet available to specifically detect footprints of trans-species balancing selection without using trans-species polymorphic sites. In this study, we develop summary- and model-based approaches that are each specifically tailored to uncover regions of long-term balancing selection shared by a set of species by using genomic patterns of intra-specific polymorphism and inter-specific fixed differences. We demonstrate that our trans-species statistics have substantially higher power than single-species approaches to detect footprints of trans-species balancing selection, and are robust to those that do not affect all tested species. We further apply our model-based methods to human and chimpanzee whole genome sequencing data. In addition to the previously established MHC and malaria resistance-associated FREM3/GYPE regions, we also find outstanding genomic regions involved in barrier integrity and innate immunity, such as the GRIK1/CLDN17 intergenic region, and the SLC35F1 and ABCA13 genes. Our findings not only echo the significance of pathogen defense, but also reveal novel candidates in maintaining balanced polymorphisms across human and chimpanzee lineages. Finally, we show that these trans-species statistics can be applied to and work well for an arbitrary number of species, and integrate them into open-source software packages for ease of use by the scientific community.

Analysis of variability and long-term trends of sea surface temperature over the China Seas derived from a newly merged regional data set

Climate Research ◽

10.3354/cr01471 ◽

2017 ◽

Vol 73 (3) ◽

pp. 217-231 ◽

Cited By ~ 2

Author(s):

Y Li ◽

L Mu ◽

Y Liu ◽

G Wang ◽

D Zhang ◽

...

Keyword(s):

Surface Temperature ◽

Sea Surface Temperature ◽

Sea Surface ◽

China Seas ◽

Regional Data ◽

Data Set ◽

Long Term Trends ◽

The China Seas
