Protein-Coding Hotspots in the Human Genome: Annotation, Significance, and Their Conservation in Animal Models (mouse, fruit fly)

The shrinking human protein coding complement: are there fewer than 20,000 genes?

10.1101/001909 ◽

2014 ◽

Cited By ~ 2

Author(s):

Iakes Ezkurdia ◽

David Juan ◽

Jose Manuel Rodriguez ◽

Adam Frankish ◽

Mark Diekhans ◽

...

Keyword(s):

Protein Expression ◽

Human Genome ◽

Genome Annotation ◽

Large Scale ◽

Cellular Protein ◽

Human Protein ◽

Protein Coding ◽

Detection Rates ◽

Protein Coding Genes ◽

Peptide Mass

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we map the peptides detected in 7 large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We find that conservation across vertebrate species and the age of the gene family are key indicators of whether a peptide will be detected in proteomics experiments. We find peptides for most highly conserved genes and for practically all genes that evolved before bilateria. At the same time, there is almost no evidence of protein expression for genes that have appeared since primates, or for genes that do not have any protein-like features or cross-species conservation. We identify 19 non-protein-like features, such as weak conservation, no protein features, or ambiguous annotations in major databases, that are indicators of low peptide detection rates. We use these features to describe a set of 2,001 genes that are potentially non-coding, and show that many of these genes behave more like non-coding genes than protein-coding genes. We detect peptides for just 3% of these genes. We suggest that many of these 2,001 genes do not code for proteins under normal circumstances and that they should not be included in the human protein-coding gene catalogue. These potential non-coding genes will be revised as part of the ongoing human genome annotation effort.


Characterization of nucleic acids from extracellular vesicle-enriched human sweat

BMC Genomics ◽

10.1186/s12864-021-07733-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Geneviève Bart ◽

Daniel Fischer ◽

Anatoliy Samoylenko ◽

Artem Zhyvolozhnyi ◽

Pavlo Stehantsev ◽

...

Keyword(s):

Nucleic Acids ◽

Human Genome ◽

Body Fluids ◽

Lower Percentage ◽

Rna Seq ◽

Protein Coding ◽

Human Sweat ◽

Dna And Rna ◽

Ribonucleoprotein Complexes ◽

Eccrine Glands

Abstract
Background: Human sweat is a mixture of secretions from three types of glands: eccrine, apocrine, and sebaceous. Eccrine glands open directly onto the skin surface and produce large amounts of water-based fluid in response to heat, emotion, and physical activity, whereas the other glands produce oily fluids and waxy sebum. While most body fluids have been shown to contain nucleic acids, both as ribonucleoprotein complexes and associated with extracellular vesicles (EVs), these have not been investigated in sweat. In this study we aimed to explore and characterize the nucleic acids associated with sweat particles.
Results: We used next-generation sequencing (NGS) to characterize DNA and RNA in pooled and individual samples of EV-enriched sweat collected from volunteers performing rigorous exercise. In all sequenced samples, we identified DNA originating from all human chromosomes, but only the mitochondrial chromosome was highly represented, with 100% coverage. Most of the DNA mapped to unannotated regions of the human genome, with some regions highly represented in all samples. Approximately 5% of the reads mapped to other genomes, including bacteria (83%), archaea (3%), and viruses (13%); the bacterial species identified were consistent with those commonly colonizing the skin of the human upper body and arm. Small RNA-seq of EV-enriched pooled sweat RNA resulted in 74% of the trimmed reads mapping to the human genome, with 29% corresponding to unannotated regions. Over 70% of the RNA reads mapping to an annotated region were tRNA, while misc. RNA (18.5%), protein-coding RNA (5%), and miRNA (1.85%) were much less represented. RNA-seq of individually processed EV-enriched sweat collections generally resulted in a lower percentage of reads mapping to the human genome (7–45%), with 50–60% of those reads mapping to unannotated regions of the genome, 30–55% being tRNAs, and smaller percentages being rRNA, lincRNA, misc. RNA, and protein-coding RNA.
Conclusions: Our data demonstrate that sweat, like all other body fluids, contains a wealth of nucleic acids, including DNA and RNA of human and microbial origin, opening the possibility of investigating sweat as a source of biomarkers for specific health parameters.


Distinctive functional regime of endogenous lncRNAs in dark regions of human genome

10.1101/2020.12.06.413880 ◽

2020 ◽

Author(s):

Anyou Wang ◽

Rong Hai

Keyword(s):

Human Genome ◽

Rna Processing ◽

Self Regulation ◽

Post Translational Modification ◽

Protein Coding ◽

Noncoding Regions ◽

Coding Regions ◽

Rnaseq Data ◽

Response To Stress ◽

Eukaryotic Genomes

Abstract Eukaryotic genomes have gradually gained noncoding regions over the course of evolution, and the human genome actively transcribes >90% of its noncoding regions [1], suggesting that they are critical to the evolved human genome. Yet <1% of them have been functionally characterized [2], leaving most of the human genome in the dark. Here we systematically decode endogenous lncRNAs located in unannotated regions of the human genome and decipher a distinctive functional regime of lncRNAs hidden in massive RNA-seq data. LncRNAs are distributed divergently across chromosomes, independent of protein-coding regions. Their transcription rarely initiates at promoters through polymerase II, but mostly at enhancers. Yet conventional enhancer activators (e.g. H3K4me1) account for only a small proportion of lncRNA activation, suggesting that unknown alternative mechanisms initiate the majority of lncRNAs. Meanwhile, lncRNA self-regulation also contributes notably to lncRNA activation. LncRNAs trans-regulate a broad range of biological processes, including transcription and RNA processing, cell cycle, respiration, response to stress, chromatin organization, post-translational modification, and development. Overall, lncRNAs govern their own regime, distinct from that of proteins.


Overlapping protein-coding genes in human genome and their coincidental expression in tissues

Scientific Reports ◽

10.1038/s41598-019-49802-w ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 2

Author(s):

Chao-Hsin Chen ◽

Chao-Yu Pan ◽

Wen-chang Lin

Keyword(s):

Human Genome ◽

Expression Profiles ◽

Tissue Expression ◽

Human Protein ◽

Clear Understanding ◽

Overlapping Genes ◽

Genome Sequences ◽

Protein Coding ◽

Protein Coding Genes ◽

Overlapping Gene

Abstract The completion of human genome sequences and the advancement of next-generation sequencing technologies have engendered a clear understanding of all human genes. Overlapping genes are usually observed in compact genomes, such as those of bacteria and viruses. Notably, overlapping protein-coding genes also exist in the human genome. Accordingly, we used the current Ensembl gene annotations to identify overlapping human protein-coding genes. We analysed 19,200 well-annotated protein-coding genes and determined that 4,951 protein-coding genes overlapped with their adjacent genes. Approximately a quarter of all human protein-coding genes were overlapping genes. We observed different clusters of overlapping protein-coding genes, ranging from two genes (paired overlapping genes) to 22 genes. We also divided the paired overlapping protein-coding gene groups into four subtypes. We found that the divergent overlapping gene subtype had a stronger expression association than did the 5ʹ-tandem overlapping and 3ʹ-tandem overlapping subtypes. The majority of paired overlapping genes exhibited comparable coincidental tissue expression profiles; however, a few overlapping gene pairs displayed distinctive tissue expression association patterns. In summary, we have carefully examined the genomic features and distribution of human overlapping protein-coding genes and found coincidental expression in tissues for most overlapping protein-coding genes.
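
The overlap search described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Gene` record, the function names, and the strand-based labels ("tandem", "nested", "divergent", "convergent") are assumptions made for illustration, and they are simpler than the paper's four subtypes, whose exact definitions the abstract does not give. Coordinates are assumed to be 1-based and inclusive, as in a GTF file.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (not an Ensembl data structure)."""
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"


def overlapping_pairs(genes: List[Gene]) -> List[Tuple[Gene, Gene]]:
    """Return all pairs of genes on the same chromosome whose spans intersect."""
    ordered = sorted(genes, key=lambda g: (g.chrom, g.start, g.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Genes are sorted by chromosome and start, so once b lies on
            # another chromosome or starts past a's end, no later gene
            # can overlap a.
            if b.chrom != a.chrom or b.start > a.end:
                break
            pairs.append((a, b))
    return pairs


def orientation(a: Gene, b: Gene) -> str:
    """Label an overlapping pair by strand layout; a must not start after b."""
    if a.strand == b.strand:
        return "tandem"
    if b.end <= a.end or a.start == b.start:
        return "nested"
    # Opposite strands, partial overlap: a "-" gene on the left means the
    # two 5' ends overlap (head-to-head); otherwise the 3' ends overlap.
    return "divergent" if a.strand == "-" else "convergent"


def overlapping_fraction(genes: List[Gene]) -> float:
    """Fraction of genes that take part in at least one overlap."""
    involved = {g.name for pair in overlapping_pairs(genes) for g in pair}
    return len(involved) / len(genes) if genes else 0.0
```

For reference, the counts reported in the abstract give 4,951 / 19,200 ≈ 25.8%, which matches the stated "approximately a quarter".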


Recovery of non-reference sequences missing from the human reference genome

BMC Genomics ◽

10.1186/s12864-019-6107-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract
Background: Non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variation with NRS exist.
Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats.
Conclusions: Our study detected a large number of NRS, including many previously uncharacterized alternate alleles. We suggest that the origin of alternate alleles is associated with tandem repeats. Our results enrich the spectrum of genetic variation in the human genome.
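
The 90% identity cutoff mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how identity was computed, so the column-wise definition below (matches divided by alignment length, with gap columns counted as mismatches) and both function names are hypothetical, and the input is assumed to be a pair of already-aligned sequences of equal length.

```python
def percent_identity(ref_aln: str, alt_aln: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.

    Assumption: columns containing a gap ('-') count as mismatches.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    if not ref_aln:
        return 0.0
    matches = sum(
        1
        for r, a in zip(ref_aln.upper(), alt_aln.upper())
        if r == a and r != "-"
    )
    return 100.0 * matches / len(ref_aln)


def is_alternate_allele(identity: float, threshold: float = 90.0) -> bool:
    """Apply the 'less than 90% identity' rule stated in the abstract."""
    return identity < threshold
```

With this definition, a sequence at exactly 90% identity is not classed as an alternate allele, since the abstract's wording is "less than 90%".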


Predicting Coding Potential from Genome Sequence: Application to Betaherpesviruses Infecting Rats and Mice

Journal of Virology ◽

10.1128/jvi.79.12.7570-7596.2005 ◽

2005 ◽

Vol 79 (12) ◽

pp. 7570-7596 ◽

Cited By ~ 46

Author(s):

Luciano Brocchieri ◽

Thomas N. Kledal ◽

Samuel Karlin ◽

Edward S. Mocarski

Keyword(s):

Genome Annotation ◽

Mrna Splicing ◽

Overlapping Genes ◽

Genome Sequences ◽

Protein Coding ◽

Coding Regions ◽

Translation Signals ◽

Rats And Mice ◽

Coding Potential ◽

Exon Gene

ABSTRACT Prediction of protein-coding regions and other features of primary DNA sequence has greatly contributed to experimental biology. Significant challenges remain in genome annotation methods, including the identification of small or overlapping genes and the assessment of mRNA splicing or unconventional translation signals in expression. We have employed a combined analysis of compositional biases and conservation together with frame-specific G+C representation to reevaluate and annotate the genome sequences of mouse and rat cytomegaloviruses. Our analysis predicts that there are at least 34 protein-coding regions in these genomes that were not apparent in earlier annotation efforts. These include 17 single-exon genes, three new exons of previously identified genes, a newly identified four-exon gene for a lectin-like protein (in rat cytomegalovirus), and 10 probable frameshift extensions of previously annotated genes. This expanded set of candidate genes provides an additional basis for investigation in cytomegalovirus biology and pathogenesis.
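
As a rough illustration of what a frame-specific G+C measurement can look like, the Python sketch below computes the G+C fraction at each of the three codon positions for a chosen reading frame. It is a simplified stand-in rather than the authors' statistic: the function name and the `frame` parameter are assumptions, only the forward strand is considered, and no sliding window or significance test is included.

```python
from typing import Tuple


def gc_by_codon_position(seq: str, frame: int = 0) -> Tuple[float, float, float]:
    """G+C fraction at codon positions 1, 2 and 3 for the given frame (0, 1 or 2).

    Trailing bases that do not complete a codon are ignored.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    s = seq.upper()[frame:]
    usable = len(s) - len(s) % 3  # drop an incomplete final codon
    n_codons = usable // 3
    if n_codons == 0:
        return (0.0, 0.0, 0.0)
    counts = [0, 0, 0]
    for i in range(usable):
        if s[i] in "GC":
            counts[i % 3] += 1  # i % 3 is the position within the codon
    return (counts[0] / n_codons, counts[1] / n_codons, counts[2] / n_codons)
```

Comparing the three values across the three frames of a candidate region shows whether one frame has a markedly different G+C level at one codon position, which is the kind of frame-dependent compositional signal the abstract refers to.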


A heterodimer of evolved designer-recombinases precisely excises a human genomic DNA locus

Nucleic Acids Research ◽

10.1093/nar/gkz1078 ◽

2019 ◽

Vol 48 (1) ◽

pp. 472-485 ◽

Cited By ~ 5

Author(s):

Felix Lansing ◽

Maciej Paszkowski-Rogacz ◽

Lukas Theo Schmitt ◽

Paul Martin Schneider ◽

Teresa Rojo Romanos ◽

...

Keyword(s):

Human Genome ◽

Genome Editing ◽

Large Scale ◽

Genome Engineering ◽

Genomic Sequence ◽

Safe Alternative ◽

Protein Coding ◽

Dna Binding Specificity ◽

Human Genomic ◽

Human Genomic Sequence

Abstract Site-specific recombinases (SSRs) such as the Cre/loxP system are useful genome engineering tools that can be repurposed by altering their DNA-binding specificity. However, SSRs that delete a natural sequence from the human genome have not been reported thus far. Here, we describe the generation of an SSR system that precisely excises a 1.4 kb fragment from the human genome. Through a streamlined process of substrate-linked directed evolution we generated two separate recombinases that, when expressed together, act as a heterodimer to delete a human genomic sequence from chromosome 7. Our data indicates that designer-recombinases can be generated in a manageable timeframe for precision genome editing. A large-scale bioinformatics analysis suggests that around 13% of all human protein-coding genes could be targetable by dual designer-recombinase induced genomic deletion (dDRiGD). We propose that heterospecific designer-recombinases, which work independently of the host DNA repair machinery, represent an efficient and safe alternative to nuclease-based genome editing technologies.


The distribution pattern of genetic variation in the transcript isoforms of the alternatively spliced protein-coding genes in the human genome

Molecular BioSystems ◽

10.1039/c5mb00132c ◽

2015 ◽

Vol 11 (5) ◽

pp. 1378-1388 ◽

Cited By ~ 8

Author(s):

Ting Liu ◽

Kui Lin

Keyword(s):

Genetic Variation ◽

Distribution Pattern ◽

Human Genome ◽

Protein Coding ◽

Transcript Isoforms ◽

Protein Coding Genes ◽

Alternatively Spliced

The relationships among the types of transcripts, the classes of coding SNPs and the population frequencies in the human genome.


RNomics Analysis of novel in silico derived non-protein coding RNAs in the human genome

10.1240/sav_gbm_2005_h_001224 ◽

2005 ◽

Vol 2005 (Fall) ◽

Author(s):

Chenna Reddy Galiveti ◽

Steffen Hennig ◽

James Adjaye ◽

Ralf Sudbrak ◽

Michal Janitz ◽

...

Keyword(s):

Human Genome ◽

In Silico ◽

Protein Coding


Administrative Developments: Celera Genomics to Complete DNA Map

The Journal of Law Medicine & Ethics ◽

10.1111/j.1748-720x.2000.tb00010.x ◽

2000 ◽

Vol 28 (2) ◽

pp. 188-189

Author(s):

Jennifer Doran

Keyword(s):

Human Genome ◽

Fruit Fly ◽

Early Summer ◽

Craig Venter ◽

Congressional Committee ◽

Its Analysis ◽

Human Dna ◽

The Creation

On April 6, 2000, Dr. J. Craig Venter of Celera Genomics told a Congressional committee that his company finished its analysis of the human DNA and would have a completed map of the human genome by early summer, 2000. Scientists expect the completed human genome to revolutionize drug therapies through the creation of treatments tailored to specific genetic makeups. In order to create a map of the human genome, three billion letters of DNA that encode eighty thousand genes must be identified and ordered. In March, 2000, Celera released a successful sequence of the fruit fly genome, and it employed the same methods in creating the human genome.

