High GC Content Causes Orphan Proteins to be Intrinsically Disordered

Mapping Intimacies ◽

10.1101/103739 ◽

2017 ◽

Author(s):

Walter Basile ◽

Oxana Sachenkova ◽

Sara Light ◽

Arne Elofsson

Keyword(s):

Amino Acids ◽

Structural Properties ◽

De Novo ◽

Gc Content ◽

Large Degree ◽

Protein Coding ◽

Intrinsically Disordered ◽

A Genome ◽

Short Orfs ◽

Orphan Proteins

AbstractDe novo creation of protein coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the populationThese orphan proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should for instance not aggregate. Therefore, although the creation of short ORFs could be truly random, the fixation should be subjected to some selective pressure. The selective forces acting on orphan proteins have been elusive, and contradictory results have been reported. In Drosophila young proteins are more disordered than ancient ones, while the opposite trend is present in yeast. To the best of our knowledge no valid explanation for this difference has been proposed.To solve this riddle we studied structural properties and age of proteins in 187 eukaryotic organisms. We find that, with the exception of length, there are only small differences in the properties between proteins of different ages. However, when we take the GC content into account we noted that it could explain the opposite trends observed for orphans in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder promoting amino acids. This leads us to propose that intrinsic disorder is not a strong determining factor for fixation of orphan proteins. Instead these proteins largely resemble random proteins given a particular GC level. During evolution the properties of a protein change faster than the GC level causing the relationship between disorder and GC to gradually weaken.Author SummaryWe show that the GC content of a genome is of great importance for the properties of an orphan protein. GC content affects the frequency of the codons and this affects the probability for each amino acid to be included in a de novo created protein. The codons encoding for Ala, Pro and Gly contain 80% GC, while codons for Lys, Phe, Asn, Tyr and Ile contain 20% or less. The three high GC amino acids are all disorder promoting, while Phe, Tyr and Ile are order promoting. Therefore, random protein sequences at a high GC will be more disordered than the ones created at a low GC. The structural properties of the youngest proteins match to a large degree the properties of random proteins when the GC content is taken into account. In contrast, structural properties of ancient proteins only show a weak correlation with GC content. This suggests that even after fixation in the population, proteins largely resemble random proteins given a certain GC content. Thereafter, during evolution the correlation between structural properties and GC weakens.

Download Full-text

High GC Content Causes De Novo Created Proteins to be Intrinsically Disordered

10.1101/070003 ◽

2016 ◽

Author(s):

Walter Basile ◽

Oxana Sachenkova ◽

Sara Light ◽

Arne Elofsson

Keyword(s):

Structural Properties ◽

De Novo ◽

Gc Content ◽

Strong Relationship ◽

Large Degree ◽

Structural Features ◽

Protein Coding ◽

Intrinsically Disordered ◽

Ancient Proteins ◽

Short Orfs

AbstractDe novo creation of protein coding genes involves formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. De novo created proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should for instance not cause aggregation. Therefore, although the creation of the short ORFs could be truly random, but the fixation should be of subject to some selective pressure. The selective forces acting on de novo created proteins have been elusive and contradictory results have been reported. In Drosophila they are more disordered, i.e. are enriched in polar residues, than ancient proteins, while the opposite trend is present in yeast. To the best of our knowledge no valid explanation for this difference has been proposed.To solve this riddle we studied structural properties and age of all proteins in 187 eukaryotic species. We find that, on average, there are small differences between proteins of different ages, with the exception that younger proteins are shorter. However, when we take the GC content into account we find that this can explain the opposite trends observed in yeast (low GC) and drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids, and inversely correlated with transmembrane, helix and sheet promoting residues. We find that for the youngest proteins, i.e. the ones that are most likely to be de novo created, there exists a strong correlation with GC and structural properties. In contrast, this strong relationship is not seen for ancient proteins. This leads us to propose that structural features are not a strong determining factor for fixation of de novo created genes. Instead these proteins resemble random proteins given a particular GC level. The dependency on GC content is then gradually weakened during evolution.Author SummaryWe show that the GC content of a genomic area is of great importance for the properties of a protein-coding de novo created gene. The GC content affects the frequency of the codons and this affects the probability for each amino acid to be included in a de novo created protein. The codons encoding for Ala, Pro and Glu contain 80% GC, while codons for Lys, Phe, Asn, Tyr and Ile contain 20% or less. Pro and Gly are disorder-promoting, while Phe, Tyr and Ile are order-promoting. Therefore random protein sequences at a high GC will be more disordered than the ones created at a low GC. The structural properties of the youngest (orphan) proteins match to a large degree the properties of random proteins when the GC content is taken into account. In contrast structural properties of ancient proteins only show a weak correlation with GC content. This suggests that even after fixation of de novo created proteins largely resemble random proteins given a certain GC content. Thereafter, during evolution the correlation between structural properties and GC weakens.

Download Full-text

Draft genome assembly data of Anoxybacillus sp. strain MB8 isolated from Tattapani hot springs, India

10.1101/2021.06.09.447659 ◽

2021 ◽

Author(s):

VISHNU PRASOODANAN P K ◽

Shruti S. Menon ◽

Rituja Saxena ◽

Prashant Waiker ◽

Vineet K Sharma

Keyword(s):

Hot Springs ◽

De Novo ◽

Draft Genome ◽

Gc Content ◽

Central India ◽

Glycoside Hydrolases ◽

Rrna Gene ◽

Aerobic Bacterium ◽

Protein Coding ◽

Protein Coding Genes

Discovery of novel thermophiles has shown promising applications in the field of biotechnology. Due to their thermal stability, they can survive the harsh processes in the industries, which make them important to be characterized and studied. Members of Anoxybacillus are alkaline tolerant thermophiles and have been extensively isolated from manure, dairy-processed plants, and geothermal hot springs. This article reports the assembled data of an aerobic bacterium Anoxybacillus sp. strain MB8, isolated from the Tattapani hot springs in Central India, where the 16S rRNA gene shares an identity of 97% (99% coverage) with Anoxybacillus kamchatkensis strain G10. The de novo assembly and annotation performed on the genome of Anoxybacillus sp. strain MB8 comprises of 2,898,780 bp (in 190 contigs) with a GC content of 41.8% and includes 2,976 protein-coding genes,1 rRNA operon, 73 tRNAs, 1 tm-RNA and 10 CRISPR arrays. The predicted protein-coding genes have been classified into 21 eggNOG categories. The KEGG Automated Annotation Server (KAAS) analysis indicated the presence of assimilatory sulfate reduction pathway, nitrate reducing pathway, and genes for glycoside hydrolases (GHs) and glycoside transferase (GTs). GHs and GTs hold widespread applications, in the baking and food industry for bread manufacturing, and in the paper, detergent and cosmetic industry. Hence, Anoxybacillus sp. strain MB8 holds the potential to be screened and characterized for such commercially relevant enzymes.

Download Full-text

Draft Genome of the Macadamia Husk Spot Pathogen, Pseudocercospora macadamiae

Phytopathology ◽

10.1094/phyto-12-19-0460-a ◽

2020 ◽

Vol 110 (9) ◽

pp. 1503-1506

Author(s):

Olufemi A. Akinsanmi ◽

Lilia C. Carvalhais

Keyword(s):

Plant Disease Resistance ◽

Plant Disease ◽

De Novo ◽

Draft Genome ◽

Gc Content ◽

Disease Development ◽

Closely Related Species ◽

Protein Coding ◽

Protein Coding Genes ◽

The Family

Pseudocercospora macadamiae causes husk spot in macadamia in Australia. Lack of genomic resources for this pathogen has restricted acquiring knowledge on the mechanism of disease development, spread, and its role in fruit abscission. To address this gap, we sequenced the genome of P. macadamiae. The sequence was de novo assembled into a draft genome of 40 Mb, which is comparable to closely related species in the family Mycosphaerellaceae. The draft genome comprises 212 scaffolds, of which 99 scaffolds are over 50 kb. The genome has a 49% GC content and is predicted to contain 15,430 protein-coding genes. This draft genome sequence is the first for P. macadamiae and represents a valuable resource for understanding genome evolution and plant disease resistance.

Download Full-text

Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models

BMC Genomics ◽

10.1186/s12864-019-6064-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Jeanne Wilbrandt ◽

Bernhard Misof ◽

Kristen A. Panfilio ◽

Oliver Niehuis

Keyword(s):

Structural Properties ◽

Gene Structure ◽

Comparative Studies ◽

Gene Repertoire ◽

Manual Annotation ◽

Protein Coding ◽

Protein Coding Genes ◽

Gene Sets ◽

A Genome ◽

Gene Models

Abstract Background The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative for a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from either the automatically predicted or manually annotated gene models alike. Vice versa, some previously reported trends did not appear in either the automatic or manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.

Download Full-text

The human gene damage index as a gene-level approach to prioritizing exome variants

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1518646112 ◽

2015 ◽

Vol 112 (44) ◽

pp. 13615-13620 ◽

Cited By ~ 131

Author(s):

Yuval Itan ◽

Lei Shang ◽

Bertrand Boisson ◽

Etienne Patin ◽

Alexandre Bolze ◽

...

Keyword(s):

General Population ◽

De Novo ◽

Damage Index ◽

Monogenic Disease ◽

Sequence Length ◽

De Novo Mutations ◽

Protein Coding ◽

A Genome ◽

Gene Level ◽

Gene Damage

The protein-coding exome of a patient with a monogenic disease contains about 20,000 variants, only one or two of which are disease causing. We found that 58% of rare variants in the protein-coding exome of the general population are located in only 2% of the genes. Prompted by this observation, we aimed to develop a gene-level approach for predicting whether a given human protein-coding gene is likely to harbor disease-causing mutations. To this end, we derived the gene damage index (GDI): a genome-wide, gene-level metric of the mutational damage that has accumulated in the general population. We found that the GDI was correlated with selective evolutionary pressure, protein complexity, coding sequence length, and the number of paralogs. We compared GDI with the leading gene-level approaches, genic intolerance, and de novo excess, and demonstrated that GDI performed best for the detection of false positives (i.e., removing exome variants in genes irrelevant to disease), whereas genic intolerance and de novo excess performed better for the detection of true positives (i.e., assessing de novo mutations in genes likely to be disease causing). The GDI server, data, and software are freely available to noncommercial users from lab.rockefeller.edu/casanova/GDI.

Download Full-text

Draft Genome Assembly and Annotation of Red Raspberry Rubus Idaeus

10.1101/546135 ◽

2019 ◽

Cited By ~ 4

Author(s):

Haley Wight ◽

Junhui Zhou ◽

Muzi Li ◽

Sridhar Hannenhalli ◽

Stephen M. Mount ◽

...

Keyword(s):

De Novo ◽

Draft Genome ◽

Rubus Idaeus ◽

Slow Process ◽

Red Raspberry ◽

Protein Coding ◽

Draft Genome Assembly ◽

Protein Coding Genes ◽

A Genome ◽

Exceptional Value

AbstractThe red raspberry, Rubus idaeus, is widely distributed in all temperate regions of Europe, Asia, and North America and is a major commercial fruit valued for its taste, high antioxidant and vitamin content. However, Rubus breeding is a long and slow process hampered by limited genomic and molecular resources. Genomic resources such as a complete genome sequencing and transcriptome will be of exceptional value to improve research and breeding of this high value crop. Using a hybrid sequence assembly approach including data from both long and short sequence reads, we present the first assembly of the Rubus idaeus genome (Joan J. variety). The de novo assembled genome consists of 2,145 scaffolds with a genome completeness of 95.3% and an N50 score of 638 KB. Leveraging a linkage map, we anchored 80.1% of the genome onto seven chromosomes. Using over 1 billion paired-end RNAseq reads, we annotated 35,566 protein coding genes with a transcriptome completeness score of 97.2%. The Rubus idaeus genome provides an important new resource for researchers and breeders.

Download Full-text

Draft Genome Sequence of the UV-Resistant Antarctic Bacterium Sphingomonas sp. Strain UV9

Microbiology Resource Announcements ◽

10.1128/mra.01651-18 ◽

2019 ◽

Vol 8 (7) ◽

Cited By ~ 3

Author(s):

Juan J. Marizcurrena ◽

Danilo Morales ◽

Pablo Smircich ◽

Susana Castro-Sowinski

Keyword(s):

Genome Size ◽

Genome Sequence ◽

Draft Genome ◽

Gc Content ◽

Draft Genome Sequence ◽

Protein Coding ◽

Coding Sequences ◽

Antarctic Bacterium ◽

A Genome ◽

The Antarctic

We report the draft genome sequence of the Antarctic UV-resistant bacterium Sphingomonas sp. strain UV9. The strain has a genome size of 4.25 Mb, a 65.62% GC content, and 3,879 protein-coding sequences.

Download Full-text

De novo assembly and characterization of the first draft genome of quince (Cydonia oblonga Mill.)

Scientific Reports ◽

10.1038/s41598-021-83113-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Aysenur Soyturk ◽

Fatima Sen ◽

Ali Tevfik Uncu ◽

Ibrahim Celik ◽

Ayse Ozgur Uncu

Keyword(s):

Ab Initio ◽

De Novo ◽

Draft Genome ◽

Mirna Precursor ◽

Machine Learning Algorithms ◽

Genomic Research ◽

Support Vector ◽

Cydonia Oblonga ◽

Protein Coding ◽

A Genome

AbstractQuince (Cydonia oblonga Mill.) is the sole member of the genus Cydonia in the Rosacea family and closely related to the major pome fruits, apple (Malus domestica Borkh.) and pear (Pyrus communis L.). In the present work, whole genome shotgun paired-end sequencing was employed in order to assemble the first draft genome of quince. A genome assembly that spans 488.4 Mb of sequence corresponding to 71.2% of the estimated genome size (686 Mb) was produced in the study. Gene predictions via ab initio and homology-based sequence annotation strategies resulted in the identification of 25,428 and 30,684 unique putative protein coding genes, respectively. 97.4 and 95.6% of putative homologs of Arabidopsis and rice transcription factors were identified in the ab initio predicted genic sequences. Different machine learning algorithms were tested for classifying pre-miRNA (precursor microRNA) coding sequences, identifying Support Vector Machine (SVM) as the best performing classifier. SVM classification predicted 600 putative pre-miRNA coding loci. Repetitive DNA content of the assembly was also characterized. The first draft assembly of the quince genome produced in this work would constitute a foundation for functional genomic research in quince toward dissecting the genetic basis of important traits and performing genomics-assisted breeding.

Download Full-text

Beyond Microsatellite Instability: Intrinsic Disorder as a Potential Link Between Protein Short Tandem Repeats and Cancer

Frontiers in Bioinformatics ◽

10.3389/fbinf.2021.685844 ◽

2021 ◽

Vol 1 ◽

Author(s):

Max A. Verbiest ◽

Matteo Delucchi ◽

Tugce Bilgin Sonay ◽

Maria Anisimova

Keyword(s):

Amino Acids ◽

Microsatellite Instability ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Disordered Proteins ◽

Future Research ◽

Protein Coding ◽

Intrinsically Disordered ◽

Intrinsically Disordered Regions ◽

Short Tandem

Short tandem repeats (STRs) are abundant in genomic sequences and are known for comparatively high mutation rates; STRs therefore are thought to be a potent source of genetic diversity. In protein-coding sequences STRs primarily encode disorder-promoting amino acids and are often located in intrinsically disordered regions (IDRs). STRs are frequently studied in the scope of microsatellite instability (MSI) in cancer, with little focus on the connection between protein STRs and IDRs. We believe, however, that this relationship should be explicitly included when ascertaining STR functionality in cancer. Here we explore this notion using all canonical human proteins from SwissProt, wherein we detected 3,699 STRs. Over 80% of these consisted completely of disorder promoting amino acids. 62.1% of amino acids in STR sequences were predicted to also be in an IDR, compared to 14.2% for non-repeat sequences. Over-representation analysis showed STR-containing proteins to be primarily located in the nucleus where they perform protein- and nucleotide-binding functions and regulate gene expression. They were also enriched in cancer-related signaling pathways. Furthermore, we found enrichments of STR-containing proteins among those correlated with patient survival for cancers derived from eight different anatomical sites. Intriguingly, several of these cancer types are not known to have a MSI-high (MSI-H) phenotype, suggesting that protein STRs play a role in cancer pathology in non MSI-H settings. Their intrinsic link with IDRs could therefore be an attractive topic of future research to further explore the role of STRs and IDRs in cancer. We speculate that our observations may be linked to the known dosage-sensitivity of disordered proteins, which could hint at a concentration-dependent gain-of-function mechanism in cancer for proteins containing STRs and IDRs.

Download Full-text

Four high-quality draft genome assemblies of the marine heterotrophic nanoflagellate Cafeteria roenbergensis

10.1101/751586 ◽

2019 ◽

Cited By ~ 1

Author(s):

Thomas Hackl ◽

Roman Martin ◽

Karina Barenhoff ◽

Sarah Duponchel ◽

Dominik Heider ◽

...

Keyword(s):

De Novo ◽

Draft Genome ◽

Gc Content ◽

Illumina Miseq ◽

Evolutionary Analysis ◽

High Quality ◽

Protein Coding ◽

Repeat Content ◽

Genome Assemblies ◽

Cafeteria Roenbergensis

AbstractThe heterotrophic stramenopile Cafeteria roenbergensis is a globally distributed marine bacterivorous protist. This unicellular flagellate is host to the giant DNA virus CroV and the virophage mavirus. We sequenced the genomes of four cultured C. roenbergensis strains and generated 23.53 Gb of Illumina MiSeq data (99-282 × coverage per strain) and 5.09 Gb of PacBio RSII data (13-54 × coverage). Using the Canu assembler and customized curation procedures, we obtained high-quality draft genome assemblies with a total length of 34-36 Mbp per strain and contig N50 lengths of 148 kbp to 464 kbp. The C. roenbergensis genome has a GC content of ~70%, a repeat content of ~28%, and is predicted to contain approximately 7857-8483 protein-coding genes based on a combination of de novo, homology-based and transcriptome-supported annotation. These first high-quality genome assemblies of a Bicosoecid fill an important gap in sequenced Stramenopile representatives and enable a more detailed evolutionary analysis of heterotrophic protists.

Download Full-text