Identification of protein coding regions in the human genome by quadratic discriminant analysis

Eukaryotic genomes have gradually gained noncoding regions over the course of evolution, and the human genome actively transcribes >90% of its noncoding regions [1], suggesting that they are critical to the evolved human genome. Yet <1% of them have been functionally characterized [2], leaving most of the human genome in the dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of the human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNA-seq data. LncRNAs are distributed divergently across chromosomes, independent of protein-coding regions. Their transcription rarely initiates at promoters through polymerase II, but mostly at enhancers. Yet conventional enhancer activators (e.g. H3K4me1) account for only a small proportion of lncRNA activation, suggesting that alternative, unknown mechanisms initiate the majority of lncRNAs. Meanwhile, lncRNA self-regulation also contributes notably to lncRNA activation. LncRNAs trans-regulate broad bioprocesses, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall, lncRNAs govern their own regime, distinct from that of proteins.

Mutation severity spectrum of rare alleles in the human genome is predictive of disease type

10.1101/835462 ◽ 2019

Author(s): Jimin Pei ◽ Lisa Kinch ◽ Nick V. Grishin

Keyword(s): Human Genome ◽ Genetic Disorders ◽ Single Amino Acid ◽ Missense Mutations ◽ Single Nucleotide ◽ Protein Coding ◽ Coding Regions ◽ Structural And Functional Properties ◽ Disease Associations ◽ Disease Associated Genes

The human genome harbors a variety of genetic variations. Single-nucleotide changes that alter amino acids in protein-coding regions are one of the major causes of human phenotypic variation and diseases. These single-amino acid variations (SAVs) are routinely found in whole genome and exome sequencing. Evaluating the functional impact of such genomic alterations is crucial for diagnosis of genetic disorders. We developed DeepSAV, a deep-learning convolutional neural network to differentiate disease-causing and benign SAVs based on a variety of protein sequence, structural and functional properties. Our method outperforms most stand-alone programs and has predictive power similar to that of some of the best available. We transformed DeepSAV scores of rare SAVs observed in the general population into a mutation severity measure of protein-coding genes. This measure reflects a gene’s tolerance to deleterious missense mutations and serves as a useful tool to study gene-disease associations. Genes implicated in cancer, autism, and viral interaction are found by this measure to be intolerant to mutations, while genes associated with a number of other diseases are scored as tolerant. Among known disease-associated genes, those that are mutation-intolerant are likely to function in development and signal transduction pathways, while those that are mutation-tolerant tend to encode metabolic and mitochondrial proteins.

An Integrated Mass-Spectrometry Pipeline Identifies Novel Protein Coding-Regions in the Human Genome

PLoS ONE ◽ 10.1371/journal.pone.0008949 ◽ 2010 ◽ Vol 5 (1) ◽ pp. e8949 ◽ Cited By ~ 24

Author(s): Danny A. Bitton ◽ Duncan L. Smith ◽ Yvonne Connolly ◽ Paul J. Scutt ◽ Crispin J. Miller

Keyword(s): Mass Spectrometry ◽ Human Genome ◽ Protein Coding ◽ Coding Regions ◽ Novel Protein

Identification of Protein Coding Regions of Rice Genes Using Alternative Spectral Rotation Measure and Linear Discriminant Analysis

Genomics Proteomics & Bioinformatics ◽ 10.1016/s1672-0229(04)02022-4 ◽ 2004 ◽ Vol 2 (3) ◽ pp. 167-173 ◽ Cited By ~ 3

Author(s): Jiao Jin

Keyword(s): Discriminant Analysis ◽ Linear Discriminant Analysis ◽ Protein Coding ◽ Linear Discriminant ◽ Coding Regions ◽ Rotation Measure

Extreme purifying selection against point mutations in the human genome

10.1101/2021.08.23.457339 ◽ 2021

Author(s): Noah Dukler ◽ Mehreen R Mughal ◽ Ritika Ramani ◽ Yi-Fei Huang ◽ Adam Siepel

Keyword(s): Human Genome ◽ De Novo ◽ Point Mutations ◽ Purifying Selection ◽ Selection Coefficient ◽ Sequencing Data ◽ Protein Coding ◽ Coding Regions ◽ Protein Coding Genes ◽ Selective Effects

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
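
The "fractional depletion" idea in the abstract above can be illustrated with a minimal sketch. This is not the ExtRaINSIGHT implementation: the function name, its parameters, and the counts are hypothetical, and the sketch deliberately ignores the controls for local variation and neighbor-dependence in mutation rate that the abstract describes. It only compares the density of rare variants in a target set with the density in a matched, putatively neutral set.

```python
def fractional_depletion(rare_target, sites_target, rare_neutral, sites_neutral):
    """Naive fractional depletion of rare variants in a target set.

    Returns 1 - (rare-variant density in target) / (density in neutral set).
    Hypothetical helper for illustration; it does not model mutation-rate
    variation, which a real estimator would need to control for.
    """
    if sites_target <= 0 or sites_neutral <= 0:
        raise ValueError("site counts must be positive")
    if rare_neutral <= 0:
        raise ValueError("neutral set must contain at least one rare variant")
    target_rate = rare_target / sites_target      # rare variants per target site
    neutral_rate = rare_neutral / sites_neutral   # rare variants per neutral site
    return 1.0 - target_rate / neutral_rate


# Made-up counts, for illustration only (not data from the study).
depletion = fractional_depletion(
    rare_target=60, sites_target=1000,
    rare_neutral=400, sites_neutral=5000,
)
print(round(depletion, 3))
```

A value near 0 would mean the target set carries rare variants about as often as the neutral set, while a value near 1 would mean rare variants are almost entirely absent from the target set.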

Human knockouts in a cohort with a high rate of consanguinity

10.1101/031518 ◽ 2015 ◽ Cited By ~ 9

Author(s): Danesh Saleheen ◽ Pradeep Natarajan ◽ Wei Zhao ◽ Asif Rasheed ◽ Sumeet Khetarpal ◽ ...

Keyword(s): Human Genome ◽ Large Fraction ◽ High Rate ◽ Phenotypic Analysis ◽ Protein Coding ◽ Coding Regions ◽ Complete Inactivation ◽ Apolipoprotein C ◽ Adult Participants ◽ Complete Disruption

A major goal of biomedicine is to understand the function of every gene in the human genome. Null mutations can disrupt both copies of a given gene in humans, and phenotypic analysis of such 'human knockouts' can provide insight into gene function. To date, comprehensive analysis of genes knocked out in humans has been limited by the fact that null mutations are infrequent in the general population, so observing an individual homozygous null for a given gene is exceedingly rare. However, consanguineous unions are more likely to result in offspring who carry homozygous null mutations. In Pakistan, consanguinity rates are notably high. Here, we sequenced the protein-coding regions of 7,078 adult participants living in Pakistan and performed phenotypic analysis to identify homozygous null individuals and to understand consequences of complete gene disruption in humans. We enumerated 36,850 rare (<1% minor allele frequency) null mutations. These homozygous null mutations led to complete inactivation of 961 genes in at least one participant. Homozygosity for null mutations at APOC3 was associated with absent plasma apolipoprotein C-III levels; at PLA2G7, with absent enzymatic activity of soluble lipoprotein-associated phospholipase A2; at CYP2F1, with higher plasma interleukin-8 concentrations; and at either A3GALT2 or NRG4, with markedly reduced plasma insulin C-peptide concentrations. After physiologic challenge with oral fat, APOC3 knockouts displayed marked blunting of the usual post-prandial rise in plasma triglycerides compared to wild-type family members. These observations provide a roadmap to understand the consequences of complete disruption of a large fraction of genes in the human genome.

Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions

BMC Genomics ◽ 10.1186/1471-2164-14-141 ◽ 2013 ◽ Vol 14 (1) ◽ pp. 141 ◽ Cited By ~ 43

Author(s): Jainab Khatun ◽ Yanbao Yu ◽ John A Wrobel ◽ Brian A Risk ◽ Harsha P Gunawardena ◽ ...

Keyword(s): Human Genome ◽ Cell Line ◽ Protein Coding ◽ Coding Regions ◽ Proteogenomic Mapping

Robust discriminant analysis and its application to identify protein coding regions of rice genes

Mathematical Biosciences ◽ 10.1016/j.mbs.2011.04.007 ◽ 2011 ◽ Vol 232 (2) ◽ pp. 96-100 ◽ Cited By ~ 3

Author(s): Jiao Jin ◽ Jinbing An

Keyword(s): Discriminant Analysis ◽ Protein Coding ◽ Coding Regions

Non-coding RNAs and disease: the classical ncRNAs make a comeback

Biochemical Society Transactions ◽ 10.1042/bst20160089 ◽ 2016 ◽ Vol 44 (4) ◽ pp. 1073-1078 ◽ Cited By ~ 36

Author(s): Rogerio Alves de Almeida ◽ Marcin G. Fraczek ◽ Steven Parker ◽ Daniela Delneri ◽ Raymond T. O'Keefe

Keyword(s): Human Genome ◽ Human Disease ◽ Human Diseases ◽ Protein Coding ◽ Coding Regions ◽ Disease Biology ◽ The Future ◽ Future Potential ◽ Non Coding Rnas ◽ Disproportionate Number

Many human diseases have been attributed to mutations in the protein-coding regions of the human genome. The protein-coding portion of the human genome, however, is very small compared with the non-coding portion of the genome. As such, a disproportionate number of diseases are attributed to the coding portion compared with the non-coding portion of the genome. It is now clear that the non-coding portion of the genome produces many functional non-coding RNAs, and these RNAs are slowly being linked to human diseases. Here we discuss examples where mutations in classical non-coding RNAs have been linked to human disease and identify the future potential for the non-coding portion of the genome in disease biology.

A map of constrained coding regions in the human genome

10.1101/220814 ◽ 2017 ◽ Cited By ~ 8

Author(s): James M. Havrilla ◽ Brent S. Pedersen ◽ Ryan M. Layer ◽ Aaron R. Quinlan

Keyword(s): Human Genome ◽ Developmental Disorders ◽ De Novo ◽ Purifying Selection ◽ Protein Domain ◽ De Novo Mutations ◽ Protein Coding ◽ Constrained Coding ◽ Coding Regions ◽ Pathogenic Variants

Deep catalogs of genetic variation collected from many thousands of humans enable the detection of intraspecies constraint by revealing coding regions with a scarcity of variation. While existing techniques summarize constraint for entire genes, single metrics cannot capture the fine-scale variability in constraint within each protein-coding gene. To provide greater resolution, we have created a detailed map of constrained coding regions (CCRs) in the human genome by leveraging coding variation observed among 123,136 humans from the Genome Aggregation Database (gnomAD). The most constrained coding regions in our map are enriched for both pathogenic variants in ClinVar and de novo mutations underlying developmental disorders. CCRs also reveal protein domain families under high constraint, suggest unannotated or incomplete protein domains, and facilitate the prioritization of previously unseen variation in studies of disease. Finally, a subset of CCRs with the highest constraint likely exist within genes that cause yet unobserved human phenotypes owing to strong purifying selection.
