Pathogenicity and selective constraint on variation near splice sites

Mapping Intimacies ◽

10.1101/256636 ◽

2018 ◽

Cited By ~ 2

Author(s):

Jenny Lord ◽

Giuseppe Gallone ◽

Patrick J. Short ◽

Jeremy F. McRae ◽

Holly Ironfield ◽

...

Keyword(s):

Splice Site ◽

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Selective Constraint ◽

Splice Sites ◽

Sequencing Data ◽

Pathogenic Variants ◽

Unbiased Manner ◽

Functional Relevance

Abstract: Mutations that perturb normal pre-mRNA splicing are significant contributors to human disease. We used exome sequencing data from 7,833 probands with developmental disorders (DD) and their unaffected parents, as well as >60,000 aggregated exomes from the Exome Aggregation Consortium, to investigate selection around the splice site and to quantify the contribution of splicing mutations to DDs. Patterns of purifying selection, a deficit of variants in highly constrained genes in healthy subjects, and an excess of de novo mutations in patients highlighted particular positions within and around the consensus splice site of greater functional relevance. Using mutational burden analyses in this large cohort of proband-parent trios, we could estimate in an unbiased manner the relative contributions of mutations at canonical dinucleotides (73%) and flanking non-canonical positions (27%), and calculated the positive predictive value of pathogenicity for different classes of mutations. We identified 18 patients with likely diagnostic de novo mutations in dominant DD-associated genes at non-canonical positions in splice sites. We estimate 35-40% of pathogenic variants in non-canonical splice site positions are missing from public databases.
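As a rough illustration of the burden arithmetic mentioned in this abstract, the sketch below assumes that the excess of de novo mutations in a site class is the observed count minus the count expected under a null mutation model, and that the positive predictive value is that excess divided by the observed count. The function names and all counts are hypothetical; they are not taken from the paper.

```python
def excess_and_ppv(observed: int, expected: float) -> tuple[float, float]:
    """Return (excess, ppv) for one class of splice-region positions.

    Assumption: excess = observed - expected (floored at zero) and
    ppv = excess / observed. This is a simplified reading of a burden
    analysis, not the authors' exact procedure.
    """
    if observed <= 0:
        raise ValueError("observed must be a positive count")
    if expected < 0:
        raise ValueError("expected must be non-negative")
    excess = max(observed - expected, 0.0)
    return excess, excess / observed


def relative_contributions(excess_by_class: dict[str, float]) -> dict[str, float]:
    """Share of the total excess attributable to each site class."""
    total = sum(excess_by_class.values())
    if total <= 0:
        raise ValueError("total excess must be positive")
    return {name: value / total for name, value in excess_by_class.items()}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
canonical_excess, canonical_ppv = excess_and_ppv(observed=40, expected=10.0)
flanking_excess, flanking_ppv = excess_and_ppv(observed=50, expected=40.0)
shares = relative_contributions(
    {"canonical": canonical_excess, "flanking": flanking_excess}
)
```

With these made-up inputs, the canonical class has the higher positive predictive value, while the split of total excess between the two classes is what a statement such as "73% versus 27%" summarizes.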

Contribution of retrotransposition to developmental disorders

Nature Communications ◽

10.1038/s41467-019-12520-y ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 10

Author(s):

Eugene J. Gardner ◽

Elena Prigmore ◽

Giuseppe Gallone ◽

Petr Danecek ◽

Kaitlin E. Samocha ◽

...

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Selective Constraint ◽

Protein Coding ◽

Genome Wide ◽

De Novo Gene ◽

The Impact ◽

Transcribed Sequences

Abstract: Mobile genetic elements (MEs) are segments of DNA which can copy themselves and other transcribed sequences through the process of retrotransposition (RT). In humans, several disorders have been attributed to RT, but the role of RT in severe developmental disorders (DD) has not yet been explored. Here we identify RT-derived events in 9738 exome sequenced trios with DD-affected probands. We ascertain 9 de novo MEs, 4 of which are likely causative of the patient’s symptoms (0.04%), as well as 2 de novo gene retroduplications. Beyond identifying likely diagnostic RT events, we estimate genome-wide germline ME mutation rate and selective constraint and demonstrate that coding RT events have signatures of purifying selection equivalent to those of truncating mutations. Overall, our analysis represents a comprehensive interrogation of the impact of retrotransposition on protein coding genes and a framework for future evolutionary and disease studies.

Evolutionary analysis across mammals reveals distinct classes of long noncoding RNAs

10.1101/031385 ◽

2015 ◽

Author(s):

Jenny Chen ◽

Alexander A. Shishkin ◽

Xiaopeng Zhu ◽

Sabah Kadri ◽

Itay Maza ◽

...

Keyword(s):

Transcriptome Sequencing ◽

De Novo ◽

Functional Characterization ◽

Purifying Selection ◽

Selective Constraint ◽

Evolutionary Analysis ◽

Sequencing Data ◽

Manual Curation ◽

First Time

BACKGROUND: Recent advances in transcriptome sequencing have enabled the discovery of thousands of long non-coding RNAs (lncRNAs) across multitudes of species. Though several lncRNAs have been shown to play important roles in diverse biological processes, the functions and mechanisms of most lncRNAs remain unknown. Two significant obstacles lie between transcriptome sequencing and functional characterization of lncRNAs: 1) identifying truly noncoding genes from de novo reconstructed transcriptomes, and 2) prioritizing hundreds of resulting putative lncRNAs from each sample for downstream experimental interrogation. RESULTS: We present slncky, a computational lncRNA discovery tool that produces a high-quality set of lncRNAs from RNA-Sequencing data and further prioritizes lncRNAs by characterizing selective constraint as a proxy for function. Our filtering pipeline is comparable to manual curation efforts and more sensitive than previously published approaches. Further, we develop, for the first time, a sensitive alignment pipeline for aligning lncRNA loci and propose new evolutionary metrics relevant for both sequence and transcript evolution. Our analysis reveals that selection acts in several distinct patterns, and uncovers two notable classes of lncRNAs: one showing strong purifying selection at RNA sequence and another where constraint is restricted to the regulation but not the sequence of the transcript. CONCLUSION: Our novel comparative methods for lncRNAs reveal 233 constrained lncRNAs out of tens of thousands of currently annotated transcripts, which we believe should be prioritized for further interrogation. To aid in their analysis we provide the slncky Evolution Browser as a resource for experimentalists.

A map of constrained coding regions in the human genome

10.1101/220814 ◽

2017 ◽

Cited By ~ 8

Author(s):

James M. Havrilla ◽

Brent S. Pedersen ◽

Ryan M. Layer ◽

Aaron R. Quinlan

Keyword(s):

Human Genome ◽

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Protein Domain ◽

De Novo Mutations ◽

Protein Coding ◽

Constrained Coding ◽

Coding Regions ◽

Pathogenic Variants

Abstract: Deep catalogs of genetic variation collected from many thousands of humans enable the detection of intraspecies constraint by revealing coding regions with a scarcity of variation. While existing techniques summarize constraint for entire genes, single metrics cannot capture the fine-scale variability in constraint within each protein-coding gene. To provide greater resolution, we have created a detailed map of constrained coding regions (CCRs) in the human genome by leveraging coding variation observed among 123,136 humans from the Genome Aggregation Database (gnomAD). The most constrained coding regions in our map are enriched for both pathogenic variants in ClinVar and de novo mutations underlying developmental disorders. CCRs also reveal protein domain families under high constraint, suggest unannotated or incomplete protein domains, and facilitate the prioritization of previously unseen variation in studies of disease. Finally, a subset of CCRs with the highest constraint likely exist within genes that cause yet unobserved human phenotypes owing to strong purifying selection.

MED12-Related (Neuro)Developmental Disorders: A Question of Causality

Genes ◽

10.3390/genes12050663 ◽

2021 ◽

Vol 12 (5) ◽

pp. 663

Author(s):

Stijn van de Plassche ◽

Arjan PM de Brouwer

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Expression Patterns ◽

Mediator Complex ◽

Gene Expression Patterns ◽

Facial Dysmorphism ◽

Regulation Of Transcription ◽

Feeding Difficulties ◽

Missense Variants ◽

Pathogenic Variants

MED12 is a member of the Mediator complex that is involved in the regulation of transcription. Missense variants in MED12 cause FG syndrome, Lujan-Fryns syndrome, and Ohdo syndrome, as well as non-syndromic intellectual disability (ID) in hemizygous males. Recently, female patients with de novo missense variants and de novo protein truncating variants in MED12 were described, resulting in a clinical spectrum centered around ID and Hardikar syndrome without ID. The missense variants are found throughout MED12, whether they are inherited in hemizygous males or de novo in females. They can result in syndromic or nonsyndromic ID. The de novo nonsense variants resulting in Hardikar syndrome, which is characterized by facial clefting, pigmentary retinopathy, biliary anomalies, and intestinal malrotation, are found more N-terminally, whereas the more C-terminally positioned variants are de novo protein truncating variants that cause a severe, syndromic phenotype consisting of ID, facial dysmorphism, short stature, skeletal abnormalities, feeding difficulties, and variable other abnormalities. This broad range of distinct phenotypes calls for a method to distinguish between pathogenic and non-pathogenic variants in MED12. We propose an isogenic iNeuron model to establish the unique gene expression patterns that are associated with the specific MED12 variants. The discovery of these patterns would help in future diagnostics and determine the causality of the MED12 variants.

Identification of UBAP1 mutations in juvenile hereditary spastic paraplegia in the 100,000 Genomes Project

European Journal of Human Genetics ◽

10.1038/s41431-020-00720-w ◽

2020 ◽

Vol 28 (12) ◽

pp. 1763-1768

Author(s):

Thomas Bourinaris ◽

Damian Smedley ◽

Valentina Cipriani ◽

Isabella Sheikh ◽

...

Keyword(s):

Hereditary Spastic Paraplegia ◽

De Novo ◽

Age At Onset ◽

Genetic Diagnosis ◽

Spastic Paraplegia ◽

Sequencing Data ◽

Juvenile Form ◽

Pathogenic Variants ◽

Degenerative Disorders ◽

Significant Gene

Abstract: Hereditary spastic paraplegia (HSP) is a group of heterogeneous inherited degenerative disorders characterized by lower limb spasticity. Fifty percent of HSP patients remain genetically undiagnosed. The 100,000 Genomes Project (100KGP) is a large UK-wide initiative to provide genetic diagnosis to previously undiagnosed patients and families with rare conditions. Over 400 HSP families were recruited to the 100KGP. In order to obtain genetic diagnoses, gene-based burden testing was carried out for rare, predicted pathogenic variants using candidate variants from the Exomiser analysis of the genome sequencing data. A significant gene-disease association was identified for UBAP1 and HSP. Three protein truncating variants were identified in 13 patients from 7 families. All patients presented with a juvenile form of pure HSP, with a median age at onset of 10 years, showing autosomal dominant inheritance or de novo occurrence. Additional clinical features included parkinsonism and learning difficulties, but their association with UBAP1 needs to be established.

Inherited variants in CHD3 demonstrate variable expressivity in Snijders Blok-Campeau syndrome

10.1101/2021.10.04.21264162 ◽

2021 ◽

Author(s):

Jet van der Spek ◽

Joery den Hoed ◽

Lot Snijders Blok ◽

Alexander J. M. Dingemans ◽

Dick Schijven ◽

...

Keyword(s):

De Novo ◽

Neurodevelopmental Disorder ◽

Underlying Mechanism ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Human Phenotype ◽

Variable Expressivity ◽

Pathogenic Variants ◽

Reduced Penetrance ◽

Coding Variants

Interpretation of next-generation sequencing data of individuals with an apparent sporadic neurodevelopmental disorder (NDD) often focusses on pathogenic variants in genes associated with NDD, assuming full clinical penetrance with limited variable expressivity. Consequently, inherited variants in genes associated with dominant disorders may be overlooked when the transmitting parent is clinically unaffected. While de novo variants explain a substantial proportion of cases with NDDs, a significant number remain undiagnosed, possibly explained by coding variants associated with reduced penetrance and variable expressivity. We characterized twenty families with inherited heterozygous missense or protein-truncating variants (PTVs) in CHD3, a gene in which de novo variants cause Snijders Blok-Campeau syndrome (SNIBCPS), characterized by intellectual disability, speech delay and recognizable facial features. Notably, the majority of the inherited CHD3 variants were maternally transmitted. Computational facial and human phenotype ontology-based comparisons demonstrated that the phenotypic features of probands with inherited CHD3 variants overlap with the phenotype previously associated with de novo variants in the gene, while carrier parents are mildly or not affected, suggesting variable expressivity. Additionally, similarly reduced expression levels of CHD3 protein in cells of an affected proband and of related healthy carriers with a CHD3 PTV suggested that compensation of expression from the wildtype allele is unlikely to be an underlying mechanism. Our results point to a significant role of inherited variation in SNIBCPS, a finding that is critical for correct variant interpretation and genetic counseling and warrants further investigation towards understanding the broader contributions of such variation to the landscape of human disease.

De novo pathogenic variant in SETX causes a rapidly progressive neurodegenerative disorder of early childhood-onset with severe axonal polyneuropathy

Acta Neuropathologica Communications ◽

10.1186/s40478-021-01277-5 ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Aristides Hadjinicolaou ◽

Kathie J. Ngo ◽

Daniel Y. Conway ◽

John P. Provias ◽

Steven K. Baker ◽

...

Keyword(s):

Network Analysis ◽

De Novo ◽

Neurodegenerative Disorder ◽

Neurological Diseases ◽

Patient Data ◽

Loss Of Function ◽

Sequencing Data ◽

Gene Variation ◽

Transcriptional Signature ◽

Pathogenic Variants

Abstract: Pathogenic variants in SETX cause two distinct neurological diseases, a loss-of-function recessive disorder, ataxia with oculomotor apraxia type 2 (AOA2), and a dominant gain-of-function motor neuron disorder, amyotrophic lateral sclerosis type 4 (ALS4). We identified two unrelated patients with the same de novo c.23C>T (p.Thr8Met) variant in SETX presenting with an early-onset, severe polyneuropathy. As rare private gene variation is often difficult to link to genetic neurological disease by DNA sequence alone, we used transcriptional network analysis to functionally validate these patients with severe de novo SETX-related neurodegenerative disorder. Weighted gene co-expression network analysis (WGCNA) was used to identify disease-associated modules from two different ALS4 mouse models and compared to confirmed ALS4 patient data to derive an ALS4-specific transcriptional signature. WGCNA of whole blood RNA-sequencing data from a patient with the p.Thr8Met SETX variant was compared to ALS4 and control patients to determine if this signature could be used to identify affected patients. WGCNA identified overlapping disease-associated modules in ALS4 mouse model data and ALS4 patient data. Mouse ALS4 disease-associated modules were not associated with AOA2 disease modules, confirming distinct disease-specific signatures. The expression profile of a patient carrying the c.23C>T (p.Thr8Met) variant was significantly associated with the human and mouse ALS4 signature, confirming the relationship between this SETX variant and disease. The similar clinical presentations of the two unrelated patients with the same de novo p.Thr8Met variant and the functional data provide strong evidence that the p.Thr8Met variant is pathogenic. The distinct phenotype expands the clinical spectrum of SETX-related disorders.

Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility

10.1101/2020.03.23.003277 ◽

2020 ◽

Author(s):

Pedro Manuel Martínez-García ◽

Miguel García-Torres ◽

Federico Divina ◽

José Terrón-Bautista ◽

Irene Delgado-Sainz ◽

...

Keyword(s):

Machine Learning ◽

Developmental Disorders ◽

Topoisomerase Ii ◽

Catalytic Mechanism ◽

De Novo ◽

Deep Understanding ◽

Genome Integrity ◽

Sequencing Data ◽

Genome Wide ◽

Genome Dynamics

Abstract: DNA topoisomerase II-β (TOP2B) is fundamental for removing topological problems linked to DNA metabolism and 3D chromatin architecture, but its cut-and-reseal catalytic mechanism can accidentally cause DNA double-strand breaks (DSBs) that can seriously compromise genome integrity. Understanding the factors that determine the genome-wide distribution of TOP2B is therefore not only essential for a complete knowledge of genome dynamics and organization, but also for the implications of TOP2-induced DSBs in the origin of oncogenic translocations and other types of chromosomal rearrangements. Here, we apply a machine-learning approach to the prediction of TOP2B binding sites using publicly available sequencing data. We achieve highly accurate predictions, with accessible chromatin and architectural factors being the most informative features. Strikingly, TOP2B binding is sufficiently explained by only three features: DNase I hypersensitivity, CTCF and cohesin binding, for which genome-wide data are widely available. Based on this, we develop a predictive model for TOP2B genome-wide binding that can be used across cell lines and species, and generate virtual probability tracks that accurately mirror experimental ChIP-seq data. Our results deepen our knowledge on how the accessibility and 3D organization of chromatin determine TOP2B function, and constitute a proof of principle regarding the in silico prediction of sequence-independent chromatin-binding factors.

Author summary: Type II DNA topoisomerases (TOP2) are a double-edged sword. They solve topological problems in the form of supercoiling, knots and tangles that inevitably accompany genome metabolism, but they do so at the cost of transiently cleaving DNA, with the risk that this entails for genome integrity, and the serious consequences for human health, such as neurodegeneration, developmental disorders or predisposition to cancer. A comprehensive analysis of TOP2 distribution throughout the genome is therefore essential for a deep understanding of its function and regulation, and how this can affect genome dynamics and stability. Here, we use machine learning to thoroughly explore genome-wide binding of TOP2B, a vertebrate TOP2 paralog that has been linked to genome organization and cancer-associated translocations. Our analysis shows that TOP2B-DNA binding can be accurately predicted exclusively using information on DNA accessibility and binding of genome-architecture factors. We show that such information is enough to generate virtual maps of TOP2B binding along the genome, which we validate with de novo experimental data. Our results highlight the importance of TOP2B for accessibility and 3D organization of chromatin, and show that computationally predicted TOP2 maps can be accurately obtained using minimal publicly available datasets, opening the door for their use in different organisms, cell types and conditions with experimental and/or clinical relevance.

Extreme purifying selection against point mutations in the human genome

10.1101/2021.08.23.457339 ◽

2021 ◽

Author(s):

Noah Dukler ◽

Mehreen R Mughal ◽

Ritika Ramani ◽

Yi-Fei Huang ◽

Adam Siepel

Keyword(s):

Human Genome ◽

De Novo ◽

Point Mutations ◽

Purifying Selection ◽

Selection Coefficient ◽

Sequencing Data ◽

Protein Coding ◽

Coding Regions ◽

Protein Coding Genes ◽

Selective Effects

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
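The "fractional depletion" idea in this abstract can be pictured with a deliberately simplified calculation. The sketch below assumes λs is approximated as one minus the ratio of the rare-variant rate at target sites to the rate at matched neutral sites. It ignores the controls for local and neighbor-dependent mutation rate that the abstract describes, and it is not the ExtRaINSIGHT implementation; the function name and the counts are hypothetical.

```python
def fractional_depletion(
    target_rare: int,
    target_sites: int,
    neutral_rare: int,
    neutral_sites: int,
) -> float:
    """Naive depletion estimate: 1 - (target rate / neutral rate).

    Assumption: each site either carries a rare variant or does not, and
    the neutral sites are already matched for mutation rate. A value near
    0 means no depletion; a value near 1 means almost no rare variants
    survive at the target sites.
    """
    if target_sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    if neutral_rare <= 0:
        raise ValueError("neutral sites must contain at least one rare variant")
    target_rate = target_rare / target_sites
    neutral_rate = neutral_rare / neutral_sites
    return 1.0 - target_rate / neutral_rate


# Hypothetical counts: half as many rare variants per site in the target set.
lambda_s = fractional_depletion(
    target_rare=50, target_sites=1000, neutral_rare=100, neutral_sites=1000
)
```

A negative result is possible with this naive estimator when the target set happens to carry more rare variants than the neutral set, which is one reason a real method needs a proper statistical model.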

5’ splice site GC>GT variants differ from GT>GC variants in terms of their functionality and pathogenicity

10.1101/829010 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jin-Huan Lin ◽

Emmanuelle Masson ◽

Arnaud Boulling ◽

Matthew Hayden ◽

David N. Cooper ◽

...

Keyword(s):

Splice Site ◽

Meta Analysis ◽

Proof Of Concept ◽

Splice Sites ◽

Gene Splicing ◽

Small Minority ◽

Pathogenic Variants ◽

Mammalian Genomes ◽

A Cell ◽

Full Length Gene

Abstract: In the human genome, most 5’ splice sites (~99%) employ the canonical GT dinucleotide whereas a small minority (~1%) use the non-canonical GC dinucleotide. The functionality and pathogenicity of 5’ splice site GT>GC (i.e., +2T>C) variants have been extensively studied but we still know very little about 5’ splice site GC>GT (+2C>T) variants. Herein, we sought to address this deficiency by performing a meta-analysis of identified +2C>T pathogenic variants together with a functional analysis of +2C>T substitutions using a cell culture-based full-length gene splicing assay. Our results establish a proof of concept that +2C>T variants are qualitatively different from +2T>C variants in terms of their functionality and pathogenicity and suggest that, in sharp contrast with +2T>C variants, most if not all +2C>T variants have no pathological relevance. Our findings have important implications for interpreting the clinical relevance of +2C>T variants but might also improve our understanding of the evolutionary basis of switching between GT and GC 5’ splice sites in mammalian genomes.
