Natural selection on protein-coding genes in the human genome

Abstract The completion of human genome sequences and the advancement of next-generation sequencing technologies have engendered a clear understanding of all human genes. Overlapping genes are usually observed in compact genomes, such as those of bacteria and viruses. Notably, overlapping protein-coding genes also exist in human genome sequences. Accordingly, we used the current Ensembl gene annotations to identify overlapping human protein-coding genes. We analysed 19,200 well-annotated protein-coding genes and determined that 4,951 of them overlapped with their adjacent genes. Thus, approximately a quarter (about 26%) of all human protein-coding genes were overlapping genes. We observed clusters of overlapping protein-coding genes of different sizes, ranging from two genes (paired overlapping genes) to 22 genes. We also divided the paired overlapping protein-coding gene groups into four subtypes. We found that the divergent overlapping gene subtype had a stronger expression association than did the 5ʹ-tandem overlapping and 3ʹ-tandem overlapping subtypes. The majority of paired overlapping genes exhibited comparable coincidental tissue expression profiles; however, a few overlapping gene pairs displayed distinctive tissue expression association patterns. In summary, we have carefully examined the genomic features and distribution of human overlapping protein-coding genes and found coincidental expression in tissues for most overlapping protein-coding genes.
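The clustering step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' procedure, so the approach here (a sweep over genes sorted by start coordinate, ignoring strand) is an assumption, and the gene names and coordinates are hypothetical. The last line only re-derives the "approximately a quarter" figure from the two counts quoted in the abstract.

```python
from typing import List, Tuple

# (name, start, end): inclusive coordinates on a single chromosome.
# Strand is ignored here; that is a simplifying assumption of this sketch.
Gene = Tuple[str, int, int]


def overlap_clusters(genes: List[Gene]) -> List[List[str]]:
    """Group genes whose coordinate ranges overlap, directly or transitively.

    Genes that overlap nothing are left out of the result.
    """
    clusters: List[List[str]] = []
    current: List[str] = []
    current_end = -1
    for name, start, end in sorted(genes, key=lambda g: (g[1], g[2])):
        if current and start <= current_end:
            # The gene starts before the running cluster ends: same cluster.
            current.append(name)
            current_end = max(current_end, end)
        else:
            # Close the previous cluster if it held at least two genes.
            if len(current) > 1:
                clusters.append(current)
            current = [name]
            current_end = end
    if len(current) > 1:
        clusters.append(current)
    return clusters


# Hypothetical coordinates, for illustration only.
genes = [
    ("A", 100, 500), ("B", 450, 900), ("C", 1200, 1500),
    ("D", 2000, 2600), ("E", 2100, 2300), ("F", 2550, 3000),
]
clusters = overlap_clusters(genes)  # one pair and one three-gene cluster

# Counts quoted in the abstract: 4,951 overlapping genes out of 19,200.
overlapping_fraction = 4951 / 19200
```

In this sketch a "paired overlapping" group is simply a cluster of length two; assigning pairs to the four subtypes would also require strand and orientation, which the abstract does not specify in detail.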

The distribution pattern of genetic variation in the transcript isoforms of the alternatively spliced protein-coding genes in the human genome

Molecular BioSystems ◽

10.1039/c5mb00132c ◽

2015 ◽

Vol 11 (5) ◽

pp. 1378-1388 ◽

Cited By ~ 8

Author(s):

Ting Liu ◽

Kui Lin

Keyword(s):

Genetic Variation ◽

Distribution Pattern ◽

Human Genome ◽

Protein Coding ◽

Transcript Isoforms ◽

Protein Coding Genes ◽

Alternatively Spliced

The relationships among the types of transcripts, the classes of coding SNPs and the population frequencies in the human genome.

4. Proteins

10.1093/actrade/9780198723882.003.0004 ◽

2016 ◽

Author(s):

Aysha Divan ◽

Janice A. Royds

Keyword(s):

Alternative Splicing ◽

Human Genome ◽

The Body ◽

Biological Functions ◽

Protein Coding ◽

Post Translational Modifications ◽

Protein Coding Genes ◽

Composition And Structure ◽

A Cell ◽

Structure Of Proteins

Biological functions require protein and the protein makeup of a cell determines its behaviour and identity. Proteins, therefore, are the most abundant molecules in the body except for water. The approximately 20,000 protein coding genes in the human genome can, by alternative splicing, multiple translation starts, and post-translational modifications, produce over 1,000,000 different proteins, collectively called ‘the proteome’. It is the size of the proteome and not the genome that defines the complexity of an organism. ‘Proteins’ describes the composition and structure of proteins and how they are studied. What information is required in order to understand how proteins work and what happens when this function is impaired in disease?

Gene Expression Profile in Responsive and Non-Responsive Chronic Myeloid Leukemia Patients Treated with Dasatinib.

Blood ◽

10.1182/blood.v114.22.3260.3260 ◽

2009 ◽

Vol 114 (22) ◽

pp. 3260-3260

Author(s):

Rosana A Silveira ◽

Angela A Fachel ◽

Yuri B Moreira ◽

Marcia T Delamain ◽

Carmino Antonio De Souza ◽

...

Keyword(s):

Gene Expression ◽

Human Genome ◽

Mononuclear Cells ◽

Cytogenetic Response ◽

Post Treatment ◽

Differentially Expressed ◽

Regulation Of Transcription ◽

Protein Coding ◽

Altered Expression ◽

Protein Coding Genes

Abstract 3260 Poster Board III-1 Background: CML treatment with tyrosine kinase inhibitors induces high and durable rates of complete cytogenetic response. Despite treatment efficacy, a significant proportion of patients develop resistance to these drugs. We measured gene expression profiles in an attempt to identify gene pathways that may be associated with dasatinib resistance. Patients and Methods: Mononuclear cells were separated from peripheral blood samples from seven CML patients resistant to imatinib, collected before and after dasatinib treatment. Three patients who achieved partial cytogenetic response (PCyR; Ph-positive cells: 1% - 35%) within twelve months were considered responders (R), whereas four patients who failed to achieve PCyR within 12 months of treatment were classified as non-responders. RNA samples prepared from peripheral mononuclear cells were hybridized to Agilent Technologies 4×44K Whole Human Genome Microarrays (WHGM) and 4×44K intronic-exonic custom oligoarrays. The latter was developed by Verjovski-Almeida's group (Nakaya et al, Genome Biology 2007, 8:R43) and contains sense and antisense probes that map to intronic regions in the human genome representing totally (TIN) and partially (PIN) intronic non-coding RNAs (ncRNAs), in addition to probes for the corresponding protein-coding genes of the same loci. Raw microarray data were normalized with the Affy package from the Bioconductor platform in the R statistical language. Each sample was labeled in replicate with Cy3 or Cy5 and the two were considered technical replicates. Two independent statistical approaches, SAM (Significance Analysis of Microarrays) and Golub's discrimination score (SNR, Signal to Noise Ratio, with permutations), were applied to identify differentially expressed transcripts between responder and non-responder patients. For the intronic-exonic platform, the analysis parameters were FDR 10%, SNR>1.5 and p<0.01, and for the WHGM platform, the parameters were FDR 5%, SNR>1.5 and p<0.001. For this latter platform, we also performed a patient leave-one-out analysis. Functions of differentially expressed transcripts were annotated and compared using GO Biological Process categories (www.genetools.microarray.ntu.no/egon). Results: We identified 34 ncRNAs with altered expression (26 overexpressed and 8 underexpressed in responders) in pre-treatment samples and 33 ncRNAs (20 overexpressed and 13 underexpressed in responders) in post-treatment samples. Functions associated with protein-coding genes from the same genomic loci as those of the intronic differentially expressed ncRNAs were: regulation of transcription (PRMT5, SOD2, SSBP3, BCL7A, MLL), signal transduction (PRKCB1, RASGRP2, NF1, PXN) and apoptosis (BCL2, PCSK6, TNFAIP8, EIF4G2). WHGM platform data analysis showed 63 and 250 protein-coding genes differentially expressed in pre- and post-treatment samples, respectively. We observed a higher number of protein-coding genes with altered expression after treatment in the following functions: cell communication, immune response and metabolic process (p<0.02). Conclusions: Overall, these findings indicate that protein-coding genes and intronic ncRNAs may be related to dasatinib resistance and response to treatment. In particular, altered expression of ncRNAs transcribed from the introns of ‘regulation of transcription’ genes could be part of an important alternative mechanism of gene expression control during emergence of resistance. Support: FAPESP (2005/60266-8) Disclosures: No relevant conflicts of interest to declare.
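The signal-to-noise score with permutations mentioned in the methods can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it uses the usual form of Golub's score, the difference of class means divided by the sum of class standard deviations, and a simple label-shuffling p-value. The choice of sample standard deviation, the two-sided test, the number of permutations and the expression values are all assumptions made for the example.

```python
import random
from statistics import mean, stdev
from typing import Sequence


def snr(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Golub-style signal-to-noise ratio: (mean_a - mean_b) / (sd_a + sd_b).

    Uses the sample standard deviation (an assumption of this sketch).
    """
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))


def permutation_p_value(group_a: Sequence[float], group_b: Sequence[float],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided p-value for |SNR| obtained by shuffling the class labels."""
    rng = random.Random(seed)
    observed = abs(snr(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point ties.
        if abs(snr(pooled[:n_a], pooled[n_a:])) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical log-expression values for one transcript:
# three responders and four non-responders, as in the study design.
responders = [8.1, 7.9, 8.4]
non_responders = [6.0, 6.3, 5.8, 6.1]

score = snr(responders, non_responders)
p_value = permutation_p_value(responders, non_responders)
```

With three responders and four non-responders there are only 35 distinct ways to split the samples, so a label-permutation p-value for a single transcript cannot fall much below 1/35 in this simple scheme. How the stricter p-value thresholds quoted in the abstract were obtained is not stated there.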

Extreme purifying selection against point mutations in the human genome

10.1101/2021.08.23.457339 ◽

2021 ◽

Author(s):

Noah Dukler ◽

Mehreen R Mughal ◽

Ritika Ramani ◽

Yi-Fei Huang ◽

Adam Siepel

Keyword(s):

Human Genome ◽

De Novo ◽

Point Mutations ◽

Purifying Selection ◽

Selection Coefficient ◽

Sequencing Data ◽

Protein Coding ◽

Coding Regions ◽

Protein Coding Genes ◽

Selective Effects

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
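The idea of "fractional depletion" can be made concrete with a naive calculation. This is only a sketch of the underlying intuition and should not be read as the ExtRaINSIGHT estimator, which according to the abstract also controls for local variation and neighbor-dependence in mutation rate. The function name and the counts below are hypothetical.

```python
def fractional_depletion(target_rare: int, target_sites: int,
                         neutral_rare: int, neutral_sites: int) -> float:
    """Naive depletion of rare variants in target sites relative to neutral sites.

    Returns 1 - (rare-variant rate in target) / (rare-variant rate in neutral).
    A value near 0 means no depletion; a value near 1 means almost no
    rare variants are observed in the target set.
    """
    target_rate = target_rare / target_sites
    neutral_rate = neutral_rare / neutral_sites
    return 1.0 - target_rate / neutral_rate


# Hypothetical counts: 700 rare variants in 10,000 target sites versus
# 1,000 rare variants in 10,000 matched, putatively neutral sites.
lambda_naive = fractional_depletion(700, 10_000, 1_000, 10_000)
```

In this toy example the target set carries 30% fewer rare variants than the matched neutral set. A real analysis would need matched neutral sites with comparable mutation rates, which is the part the method described in the abstract is designed to handle.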

Obtaining estimates for the ages of all the protein-coding genes and most of the ontology-identified noncoding genes of the human genome, assigned to 19 phylostrata

Seminars in Oncology ◽

10.1053/j.seminoncol.2018.11.002 ◽

2019 ◽

Vol 46 (1) ◽

pp. 3-9 ◽

Cited By ~ 3

Author(s):

Thomas Litman ◽

Wilfred D. Stein

Keyword(s):

Human Genome ◽

Protein Coding ◽

Protein Coding Genes

Natural selection in avian protein-coding genes expressed in brain

Molecular Ecology ◽

10.1111/j.1365-294x.2008.03795.x ◽

2008 ◽

Vol 17 (12) ◽

pp. 3008-3017 ◽

Cited By ~ 40

Author(s):

ERIK AXELSSON ◽

LINA HULTIN-ROSENBERG ◽

MIKAEL BRANDSTRÖM ◽

MARTIN ZWAHLÉN ◽

DAVID F. CLAYTON ◽

...

Keyword(s):

Natural Selection ◽

Protein Coding ◽

Protein Coding Genes

Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome

Nucleic Acids Research ◽

10.1093/nar/gkt1300 ◽

2013 ◽

Vol 42 (5) ◽

pp. 2820-2832 ◽

Cited By ~ 14

Author(s):

Nicolas Philippe ◽

Elias Bou Samra ◽

Anthony Boureux ◽

Alban Mancheron ◽

Florence Rufflé ◽

...

Keyword(s):

Human Genome ◽

Rna Sequencing ◽

Dynamic Range ◽

Tiling Array ◽

Expression Data ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Protein Coding ◽

Protein Coding Genes

Abstract Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.
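The tag categorization described in this abstract can be sketched as a lookup against gene annotations. This is a hypothetical illustration and not the DigitagCT implementation: the `Gene` record, the `categorize_tag` function and the priority order (sense exonic, then sense intronic, then antisense, then intergenic) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gene:
    """Hypothetical protein-coding gene annotation on one chromosome."""
    start: int
    end: int
    strand: str                    # "+" or "-"
    exons: List[Tuple[int, int]]   # inclusive exon coordinates


def categorize_tag(pos: int, strand: str, genes: List[Gene]) -> str:
    """Assign a mapped tag to one of four categories.

    Priority (an assumption of this sketch): a sense exonic hit wins,
    then sense intronic, then antisense, otherwise intergenic.
    """
    intronic = False
    antisense = False
    for gene in genes:
        if not (gene.start <= pos <= gene.end):
            continue  # tag lies outside this gene's span
        if gene.strand != strand:
            antisense = True
        elif any(s <= pos <= e for s, e in gene.exons):
            return "protein-coding"
        else:
            intronic = True
    if intronic:
        return "intronic"
    if antisense:
        return "antisense"
    return "intergenic"


# One hypothetical gene with two exons on the forward strand.
annotations = [Gene(100, 1000, "+", [(100, 200), (800, 1000)])]
labels = [
    categorize_tag(150, "+", annotations),
    categorize_tag(500, "+", annotations),
    categorize_tag(500, "-", annotations),
    categorize_tag(5000, "+", annotations),
]
```

A linear scan over all genes is enough for a toy example; at the scale of hundreds of thousands of tags, an interval index over the annotation would be the more practical choice.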
