A map of constrained coding regions in the human genome

Mapping Intimacies ◽

10.1101/220814 ◽

2017 ◽

Cited By ~ 8

Author(s):

James M. Havrilla ◽

Brent S. Pedersen ◽

Ryan M. Layer ◽

Aaron R. Quinlan

Keyword(s):

Human Genome ◽

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Protein Domain ◽

De Novo Mutations ◽

Protein Coding ◽

Constrained Coding ◽

Coding Regions ◽

Pathogenic Variants

ABSTRACT Deep catalogs of genetic variation collected from many thousands of humans enable the detection of intraspecies constraint by revealing coding regions with a scarcity of variation. While existing techniques summarize constraint for entire genes, single metrics cannot capture the fine-scale variability in constraint within each protein-coding gene. To provide greater resolution, we have created a detailed map of constrained coding regions (CCRs) in the human genome by leveraging coding variation observed among 123,136 humans from the Genome Aggregation Database (gnomAD). The most constrained coding regions in our map are enriched for both pathogenic variants in ClinVar and de novo mutations underlying developmental disorders. CCRs also reveal protein domain families under high constraint, suggest unannotated or incomplete protein domains, and facilitate the prioritization of previously unseen variation in studies of disease. Finally, a subset of CCRs with the highest constraint likely exist within genes that cause yet unobserved human phenotypes owing to strong purifying selection.


Extreme purifying selection against point mutations in the human genome

10.1101/2021.08.23.457339 ◽

2021 ◽

Author(s):

Noah Dukler ◽

Mehreen R Mughal ◽

Ritika Ramani ◽

Yi-Fei Huang ◽

Adam Siepel

Keyword(s):

Human Genome ◽

De Novo ◽

Point Mutations ◽

Purifying Selection ◽

Selection Coefficient ◽

Sequencing Data ◽

Protein Coding ◽

Coding Regions ◽

Protein Coding Genes ◽

Selective Effects

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
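The "fractional depletion" idea in the abstract above can be illustrated with a deliberately naive calculation. The sketch below is not ExtRaINSIGHT: it ignores the mutation-rate and neighbor-dependence corrections the abstract mentions, and the function name, arguments, and counts are hypothetical. It only shows what depletion of rare variants relative to matched neutral sites means as a ratio of rates.

```python
def fractional_depletion(target_rare, target_sites, neutral_rare, neutral_sites):
    """Naive depletion estimate: 1 - (rare-variant rate at target sites) /
    (rare-variant rate at matched, putatively neutral sites).

    Illustrative only; all inputs are hypothetical counts, and no correction
    for local mutation-rate variation is attempted.
    """
    if target_sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    neutral_rate = neutral_rare / neutral_sites
    if neutral_rate == 0:
        raise ValueError("neutral sites carry no rare variants; ratio undefined")
    target_rate = target_rare / target_sites
    return 1.0 - target_rate / neutral_rate


# Hypothetical counts: half as many rare variants per site as the neutral set.
depletion = fractional_depletion(30, 1000, 60, 1000)
print(f"fractional depletion: {depletion:.2f}")
```

A value near 0 means the target sites carry rare variants about as often as the neutral comparison set; a value near 1 means rare variants are almost entirely absent from the target sites.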


Contribution of retrotransposition to developmental disorders

Nature Communications ◽

10.1038/s41467-019-12520-y ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 10

Author(s):

Eugene J. Gardner ◽

Elena Prigmore ◽

Giuseppe Gallone ◽

Petr Danecek ◽

Kaitlin E. Samocha ◽

...

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Selective Constraint ◽

Protein Coding ◽

Genome Wide ◽

De Novo Gene ◽

The Impact ◽

Transcribed Sequences

Abstract Mobile genetic Elements (MEs) are segments of DNA which can copy themselves and other transcribed sequences through the process of retrotransposition (RT). In humans several disorders have been attributed to RT, but the role of RT in severe developmental disorders (DD) has not yet been explored. Here we identify RT-derived events in 9738 exome sequenced trios with DD-affected probands. We ascertain 9 de novo MEs, 4 of which are likely causative of the patient’s symptoms (0.04%), as well as 2 de novo gene retroduplications. Beyond identifying likely diagnostic RT events, we estimate genome-wide germline ME mutation rate and selective constraint and demonstrate that coding RT events have signatures of purifying selection equivalent to those of truncating mutations. Overall, our analysis represents a comprehensive interrogation of the impact of retrotransposition on protein coding genes and a framework for future evolutionary and disease studies.


Contribution of Retrotransposition to Developmental Disorders

10.1101/471375 ◽

2018 ◽

Cited By ~ 2

Author(s):

Eugene J. Gardner ◽

Elena Prigmore ◽

Giuseppe Gallone ◽

Petr Danecek ◽

Kaitlin E. Samocha ◽

...

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Mobile Genetic Elements ◽

Protein Coding ◽

Protein Coding Genes ◽

Genome Wide ◽

The Impact ◽

Transcribed Sequences

Abstract Mobile genetic Elements (MEs) are segments of DNA which, through an RNA intermediate, can generate new copies of themselves and other transcribed sequences through the process of retrotransposition (RT). In humans several disorders have been attributed to RT, but the role of RT in severe developmental disorders (DD) has not yet been explored. As such, we have identified RT-derived events in 9,738 exome sequenced trios with DD-affected probands as part of the Deciphering Developmental Disorders (DDD) study. We have ascertained 9 de novo MEs, 4 of which are likely causative of the patient’s symptoms (0.04% of probands), as well as 2 de novo gene retroduplications. Beyond identifying likely diagnostic RT events, we have estimated genome-wide germline ME mutagenesis and constraint and demonstrated that coding RT events have signatures of purifying selection equivalent to those of truncating mutations. Overall, our analysis represents a comprehensive interrogation of the impact of retrotransposition on protein coding genes and a framework for future evolutionary and disease studies.


Pathogenicity and selective constraint on variation near splice sites

10.1101/256636 ◽

2018 ◽

Cited By ~ 2

Author(s):

Jenny Lord ◽

Giuseppe Gallone ◽

Patrick J. Short ◽

Jeremy F. McRae ◽

Holly Ironfield ◽

...

Keyword(s):

Splice Site ◽

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Selective Constraint ◽

Splice Sites ◽

Sequencing Data ◽

Pathogenic Variants ◽

Unbiased Manner ◽

Functional Relevance

Abstract Mutations which perturb normal pre-mRNA splicing are significant contributors to human disease. We used exome sequencing data from 7,833 probands with developmental disorders (DD) and their unaffected parents, as well as >60,000 aggregated exomes from the Exome Aggregation Consortium, to investigate selection around the splice site, and quantify the contribution of splicing mutations to DDs. Patterns of purifying selection, a deficit of variants in highly constrained genes in healthy subjects and excess de novo mutations in patients highlighted particular positions within and around the consensus splice site of greater functional relevance. Using mutational burden analyses in this large cohort of proband-parent trios, we could estimate in an unbiased manner the relative contributions of mutations at canonical dinucleotides (73%) and flanking non-canonical positions (27%), and calculated the positive predictive value of pathogenicity for different classes of mutations. We identified 18 patients with likely diagnostic de novo mutations in dominant DD-associated genes at non-canonical positions in splice sites. We estimate 35-40% of pathogenic variants in non-canonical splice site positions are missing from public databases.


MED12-Related (Neuro)Developmental Disorders: A Question of Causality

Genes ◽

10.3390/genes12050663 ◽

2021 ◽

Vol 12 (5) ◽

pp. 663

Author(s):

Stijn van de Plassche ◽

Arjan PM de Brouwer

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Expression Patterns ◽

Mediator Complex ◽

Gene Expression Patterns ◽

Facial Dysmorphism ◽

Regulation Of Transcription ◽

Feeding Difficulties ◽

Missense Variants ◽

Pathogenic Variants

MED12 is a member of the Mediator complex that is involved in the regulation of transcription. Missense variants in MED12 cause FG syndrome, Lujan-Fryns syndrome, and Ohdo syndrome, as well as non-syndromic intellectual disability (ID) in hemizygous males. Recently, female patients with de novo missense variants and de novo protein truncating variants in MED12 were described, resulting in a clinical spectrum centered around ID and Hardikar syndrome without ID. The missense variants are found throughout MED12, whether they are inherited in hemizygous males or de novo in females. They can result in syndromic or nonsyndromic ID. The de novo nonsense variants resulting in Hardikar syndrome that is characterized by facial clefting, pigmentary retinopathy, biliary anomalies, and intestinal malrotation, are found more N-terminally, whereas the more C-terminally positioned variants are de novo protein truncating variants that cause a severe, syndromic phenotype consisting of ID, facial dysmorphism, short stature, skeletal abnormalities, feeding difficulties, and variable other abnormalities. This broad range of distinct phenotypes calls for a method to distinguish between pathogenic and non-pathogenic variants in MED12. We propose an isogenic iNeuron model to establish the unique gene expression patterns that are associated with the specific MED12 variants. The discovery of these patterns would help in future diagnostics and determine the causality of the MED12 variants.


EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

10.1101/307868 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alexander J. Hart ◽

Samuel Ginzburg ◽

Muyang (Sam) Xu ◽

Cera R. Fisher ◽

Nasim Rahmatpour ◽

...

Keyword(s):

Similarity Search ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Orthologous Gene ◽

Protein Domain ◽

Family Assessment ◽

Ontology Term ◽

Protein Coding ◽

Functional Gene Annotation

ABSTRACT EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.


Integrating healthcare and research genetic data empowers the discovery of 28 novel developmental disorders

10.1101/797787 ◽

2019 ◽

Cited By ~ 14

Author(s):

Joanna Kaplanis ◽

Kaitlin E. Samocha ◽

Laurens Wiel ◽

Zhancheng Zhang ◽

Kevin J. Arvai ◽

...

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Genetic Data ◽

Statistical Test ◽

Integrated Healthcare ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Diagnostic ◽

Simulation Based

Summary De novo mutations (DNMs) in protein-coding genes are a well-established cause of developmental disorders (DD). However, known DD-associated genes only account for a minority of the observed excess of such DNMs. To identify novel DD-associated genes, we integrated healthcare and research exome sequences on 31,058 DD parent-offspring trios, and developed a simulation-based statistical test to identify gene-specific enrichments of DNMs. We identified 285 significantly DD-associated genes, including 28 not previously robustly associated with DDs. Despite detecting more DD-associated genes than in any previous study, much of the excess of DNMs of protein-coding genes remains unaccounted for. Modelling suggests that over 1,000 novel DD-associated genes await discovery, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of dominant DDs.
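The abstract above mentions a simulation-based test for gene-specific enrichment of DNMs but gives no details. As a rough illustration of the general idea only, the sketch below simulates DNM counts for one gene under a null Poisson model and reports an empirical p-value. The Poisson null, the per-chromosome mutation rate, and every function and parameter name here are assumptions made for the example, not the authors' method.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method.
    Adequate for the small expected counts used in this sketch."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def dnm_enrichment_pvalue(observed, mu_per_chromosome, n_trios,
                          n_sim=10000, seed=0):
    """Empirical p-value for seeing at least `observed` DNMs in a gene.

    Assumed null model: DNM counts are Poisson with mean
    2 * n_trios * mu_per_chromosome (two transmitted chromosomes per child).
    """
    rng = random.Random(seed)
    expected = 2 * n_trios * mu_per_chromosome
    hits = sum(sample_poisson(expected, rng) >= observed for _ in range(n_sim))
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_sim + 1)


# Hypothetical gene: mutation rate 1e-5 per chromosome, 8 observed DNMs.
print(dnm_enrichment_pvalue(observed=8, mu_per_chromosome=1e-5, n_trios=31058))
```

The smallest p-value this sketch can report is 1 / (n_sim + 1), so a genome-wide analysis with a multiple-testing threshold would need far more simulations or an analytic tail probability.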


Distinctive functional regime of endogenous lncRNAs in dark regions of human genome

10.1101/2020.12.06.413880 ◽

2020 ◽

Author(s):

Anyou Wang ◽

Rong Hai

Keyword(s):

Human Genome ◽

Rna Processing ◽

Self Regulation ◽

Post Translational Modification ◽

Protein Coding ◽

Noncoding Regions ◽

Coding Regions ◽

Rnaseq Data ◽

Response To Stress ◽

Eukaryotic Genomes

Abstract Eukaryotic genomes gradually gain noncoding regions when advancing evolution and human genome actively transcribes >90% of its noncoding regions [1], suggesting their criticality in evolutionary human genome. Yet <1% of them have been functionally characterized [2], leaving most human genome in dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNAseq data. LncRNAs divergently distribute across chromosomes, independent of protein-coding regions. Their transcriptions barely initiate on promoters through polymerase II, but mostly on enhancers. Yet conventional enhancer activators (e.g. H3K4me1) only account for a small proportion of lncRNA activation, suggesting alternatively unknown mechanisms initiating the majority of lncRNAs. Meanwhile, lncRNA-self regulation also notably contributes to lncRNA activation. LncRNAs trans-regulate broad bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall lncRNAs govern their owned regime distinctive from protein’s.


SLC12A2 variants cause a neurodevelopmental disorder or cochleovestibular defect

Brain ◽

10.1093/brain/awaa176 ◽

2020 ◽

Vol 143 (8) ◽

pp. 2380-2387 ◽

Cited By ~ 2

Author(s):

Alisdair McNeill ◽

Emanuela Iovino ◽

Luke Mansard ◽

Christel Vache ◽

David Baux ◽

...

Keyword(s):

Hearing Loss ◽

Sensorineural Hearing Loss ◽

Developmental Disorders ◽

De Novo ◽

Neurodevelopmental Disorder ◽

Sensorineural Deafness ◽

Sensorineural Hearing ◽

De Novo Mutation ◽

Xenopus Laevis Oocytes ◽

De Novo Mutations

Abstract The SLC12 gene family consists of SLC12A1–SLC12A9, encoding electroneutral cation-coupled chloride co-transporters. SLC12A2 has been shown to play a role in corticogenesis and therefore represents a strong candidate neurodevelopmental disorder gene. Through trio exome sequencing we identified de novo mutations in SLC12A2 in six children with neurodevelopmental disorders. All had developmental delay or intellectual disability ranging from mild to severe. Two had sensorineural deafness. We also identified SLC12A2 variants in three individuals with non-syndromic bilateral sensorineural hearing loss and vestibular areflexia. The SLC12A2 de novo mutation rate was demonstrated to be significantly elevated in the Deciphering Developmental Disorders cohort. All tested variants were shown to reduce co-transporter function in Xenopus laevis oocytes. Analysis of SLC12A2 expression in foetal brain at 16–18 weeks post-conception revealed high expression in radial glial cells, compatible with a role in neurogenesis. Gene co-expression analysis in cells robustly expressing SLC12A2 at 16–18 weeks post-conception identified a transcriptomic programme associated with active neurogenesis. We identify SLC12A2 de novo mutations as the cause of a novel neurodevelopmental disorder and bilateral non-syndromic sensorineural hearing loss and provide further data supporting a role for this gene in human neurodevelopment.


Recovery of non-reference sequences missing from the human reference genome

BMC Genomics ◽

10.1186/s12864-019-6107-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract Background: The non-reference sequences (NRS) represent structure variations in human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structure variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their location. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or none) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS including many alternate alleles which are previously uncharacterized. We suggested that the origin of alternate alleles was associated with tandem repeats. Our results enriched the spectrum of genetic variations in human genome.
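The abstract above defines alternate alleles by an identity threshold: less than 90% identity with the reference allele, or none at all. A minimal sketch of that rule follows, assuming the two sequences have already been aligned to equal length with "-" marking gaps. The function names, the gap handling, and the use of `None` to represent "no identity" are illustrative assumptions, not the authors' pipeline.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same base.
    Assumes pre-aligned, equal-length strings; gap columns count as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def is_alternate_allele(identity_pct, threshold=90.0):
    """Apply the abstract's rule: below the threshold, or no identity at all.
    `None` stands for a sequence with no alignment to the reference allele."""
    return identity_pct is None or identity_pct < threshold


# Hypothetical 10-column alignment with one mismatch.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
print(identity, is_alternate_allele(identity))
```

In this sketch a sequence at exactly 90% identity is not called an alternate allele, since the stated definition is "less than 90%".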
