High GC Content Causes De Novo Created Proteins to be Intrinsically Disordered

Mapping Intimacies ◽

10.1101/070003 ◽

2016 ◽

Author(s):

Walter Basile ◽

Oxana Sachenkova ◽

Sara Light ◽

Arne Elofsson

Keyword(s):

Structural Properties ◽

De Novo ◽

GC Content ◽

Strong Relationship ◽

Large Degree ◽

Structural Features ◽

Protein Coding ◽

Intrinsically Disordered ◽

Ancient Proteins ◽

Short ORFs

Abstract De novo creation of protein coding genes involves formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. De novo created proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should for instance not cause aggregation. Therefore, although the creation of the short ORFs could be truly random, the fixation should be subject to some selective pressure. The selective forces acting on de novo created proteins have been elusive and contradictory results have been reported. In Drosophila they are more disordered, i.e. are enriched in polar residues, than ancient proteins, while the opposite trend is present in yeast. To the best of our knowledge no valid explanation for this difference has been proposed. To solve this riddle we studied structural properties and age of all proteins in 187 eukaryotic species. We find that, on average, there are only small differences between proteins of different ages, with the exception that younger proteins are shorter. However, when we take the GC content into account we find that this can explain the opposite trends observed in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids, and inversely correlated with transmembrane, helix and sheet promoting residues. We find that for the youngest proteins, i.e. the ones that are most likely to be de novo created, there exists a strong correlation between GC and structural properties. In contrast, this strong relationship is not seen for ancient proteins. This leads us to propose that structural features are not a strong determining factor for fixation of de novo created genes. Instead these proteins resemble random proteins given a particular GC level. The dependency on GC content is then gradually weakened during evolution.

Author Summary We show that the GC content of a genomic area is of great importance for the properties of a protein-coding de novo created gene. The GC content affects the frequency of the codons and this affects the probability for each amino acid to be included in a de novo created protein. The codons encoding Ala, Pro and Gly contain 80% GC, while codons for Lys, Phe, Asn, Tyr and Ile contain 20% or less. Pro and Gly are disorder-promoting, while Phe, Tyr and Ile are order-promoting. Therefore random protein sequences at a high GC will be more disordered than the ones created at a low GC. The structural properties of the youngest (orphan) proteins match to a large degree the properties of random proteins when the GC content is taken into account. In contrast, structural properties of ancient proteins only show a weak correlation with GC content. This suggests that even after fixation, de novo created proteins largely resemble random proteins given a certain GC content. Thereafter, during evolution the correlation between structural properties and GC weakens.
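
The codon GC fractions quoted in the author summary can be checked from the standard genetic code. The following is a minimal sketch, not the authors' code: the function name `mean_codon_gc` is hypothetical, and averaging with equal weight over synonymous codons is an assumption made here for illustration.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering
# (first base varies slowest, third base fastest); '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def mean_codon_gc():
    """Return {amino acid: mean GC fraction over its synonymous codons}.

    Assumption: every synonymous codon is weighted equally.
    """
    per_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        gc_fraction = sum(base in "GC" for base in codon) / 3
        per_aa.setdefault(aa, []).append(gc_fraction)
    return {aa: sum(vals) / len(vals) for aa, vals in per_aa.items()}


if __name__ == "__main__":
    ranked = sorted(mean_codon_gc().items(), key=lambda kv: -kv[1])
    for aa, gc in ranked:
        print(f"{aa}: {gc:.0%}")
```

Under this equal-weight assumption, Ala, Pro and Gly each average 5/6 (about 83%) GC, Lys, Phe, Asn and Tyr average 1/6 (about 17%), and Ile averages 1/9 (about 11%), in line with the rounded figures in the summary.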

High GC Content Causes Orphan Proteins to be Intrinsically Disordered

10.1101/103739 ◽

2017 ◽

Author(s):

Walter Basile ◽

Oxana Sachenkova ◽

Sara Light ◽

Arne Elofsson

Keyword(s):

Amino Acids ◽

Structural Properties ◽

De Novo ◽

GC Content ◽

Large Degree ◽

Protein Coding ◽

Intrinsically Disordered ◽

A Genome ◽

Short ORFs ◽

Orphan Proteins

Abstract De novo creation of protein coding genes involves the formation of short ORFs from noncoding regions; some of these ORFs might then become fixed in the population. These orphan proteins need to, at the bare minimum, not cause serious harm to the organism, meaning that they should for instance not aggregate. Therefore, although the creation of short ORFs could be truly random, the fixation should be subject to some selective pressure. The selective forces acting on orphan proteins have been elusive, and contradictory results have been reported. In Drosophila young proteins are more disordered than ancient ones, while the opposite trend is present in yeast. To the best of our knowledge no valid explanation for this difference has been proposed. To solve this riddle we studied structural properties and age of proteins in 187 eukaryotic organisms. We find that, with the exception of length, there are only small differences in the properties between proteins of different ages. However, when we take the GC content into account we find that it can explain the opposite trends observed for orphans in yeast (low GC) and Drosophila (high GC). GC content is correlated with codons coding for disorder-promoting amino acids. This leads us to propose that intrinsic disorder is not a strong determining factor for fixation of orphan proteins. Instead these proteins largely resemble random proteins given a particular GC level. During evolution the properties of a protein change faster than the GC level, causing the relationship between disorder and GC to gradually weaken.

Author Summary We show that the GC content of a genome is of great importance for the properties of an orphan protein. GC content affects the frequency of the codons and this affects the probability for each amino acid to be included in a de novo created protein. The codons encoding Ala, Pro and Gly contain 80% GC, while codons for Lys, Phe, Asn, Tyr and Ile contain 20% or less. The three high GC amino acids are all disorder-promoting, while Phe, Tyr and Ile are order-promoting. Therefore, random protein sequences at a high GC will be more disordered than the ones created at a low GC. The structural properties of the youngest proteins match to a large degree the properties of random proteins when the GC content is taken into account. In contrast, structural properties of ancient proteins only show a weak correlation with GC content. This suggests that even after fixation in the population, proteins largely resemble random proteins given a certain GC content. Thereafter, during evolution the correlation between structural properties and GC weakens.
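
The idea of "random proteins given a particular GC level" can be illustrated with a simple null model. This is a hedged sketch rather than the procedure used in the paper: it assumes each nucleotide is drawn independently with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, that stop codons are discarded, and it uses the hypothetical helper names `amino_acid_frequencies` and `fraction`.

```python
from itertools import product

# Standard genetic code in compact TCAG ordering; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acid_frequencies(gc):
    """Expected amino acid frequencies of a random ORF at GC fraction `gc`.

    Assumption: independent nucleotides, G and C equally likely,
    A and T equally likely, stop codons excluded and the rest renormalised.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must lie between 0 and 1")
    base_p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    freqs = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        p = base_p[codon[0]] * base_p[codon[1]] * base_p[codon[2]]
        freqs[aa] = freqs.get(aa, 0.0) + p
    total = sum(freqs.values())
    return {aa: p / total for aa, p in freqs.items()}


def fraction(freqs, residues):
    """Summed frequency of the given one-letter residues."""
    return sum(freqs[r] for r in residues)


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        f = amino_acid_frequencies(gc)
        print(f"GC={gc:.1f}  Ala+Pro+Gly={fraction(f, 'APG'):.3f}  "
              f"Phe+Tyr+Ile={fraction(f, 'FYI'):.3f}")
```

In this model the combined share of Ala, Pro and Gly rises with GC while the share of Phe, Tyr and Ile falls, which is the direction of the effect described in the summary. Real genomes deviate from independent nucleotide sampling, so the numbers are only indicative.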

Draft genome assembly data of Anoxybacillus sp. strain MB8 isolated from Tattapani hot springs, India

10.1101/2021.06.09.447659 ◽

2021 ◽

Author(s):

Vishnu Prasoodanan P K ◽

Shruti S. Menon ◽

Rituja Saxena ◽

Prashant Waiker ◽

Vineet K Sharma

Keyword(s):

Hot Springs ◽

De Novo ◽

Draft Genome ◽

GC Content ◽

Central India ◽

Glycoside Hydrolases ◽

rRNA Gene ◽

Aerobic Bacterium ◽

Protein Coding ◽

Protein Coding Genes

Discovery of novel thermophiles has shown promising applications in the field of biotechnology. Due to their thermal stability, they can survive harsh industrial processes, which makes them important to characterize and study. Members of Anoxybacillus are alkaline-tolerant thermophiles and have been extensively isolated from manure, dairy-processing plants, and geothermal hot springs. This article reports the assembled data of an aerobic bacterium, Anoxybacillus sp. strain MB8, isolated from the Tattapani hot springs in Central India, whose 16S rRNA gene shares an identity of 97% (99% coverage) with Anoxybacillus kamchatkensis strain G10. The de novo assembled and annotated genome of Anoxybacillus sp. strain MB8 comprises 2,898,780 bp (in 190 contigs) with a GC content of 41.8% and includes 2,976 protein-coding genes, 1 rRNA operon, 73 tRNAs, 1 tmRNA and 10 CRISPR arrays. The predicted protein-coding genes have been classified into 21 eggNOG categories. The KEGG Automated Annotation Server (KAAS) analysis indicated the presence of an assimilatory sulfate reduction pathway, a nitrate-reducing pathway, and genes for glycoside hydrolases (GHs) and glycoside transferases (GTs). GHs and GTs have widespread applications in the baking and food industry for bread manufacturing, and in the paper, detergent and cosmetic industries. Hence, Anoxybacillus sp. strain MB8 holds the potential to be screened and characterized for such commercially relevant enzymes.
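
A genome-wide GC content such as the one reported above is the share of G and C among all called bases, pooled over contigs. The sketch below shows one way to compute it. The function name `gc_content` is hypothetical, and excluding ambiguous bases such as N from the denominator is an assumption, since conventions differ between tools.

```python
def gc_content(contigs):
    """GC fraction pooled over an iterable of contig sequences.

    Assumption: only unambiguous A/C/G/T bases count towards the
    denominator; N and other ambiguity codes are ignored.
    """
    gc = 0
    called = 0
    for seq in contigs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        called += sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("no unambiguous bases found")
    return gc / called


if __name__ == "__main__":
    toy_contigs = ["ATGCGC", "ttaaNNgc"]  # toy data, not from the paper
    print(f"GC content: {gc_content(toy_contigs):.1%}")
```

Pooling bases before dividing weights each contig by its length, so long contigs dominate the result; averaging per-contig percentages would give a different number.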

Draft Genome of the Macadamia Husk Spot Pathogen, Pseudocercospora macadamiae

Phytopathology ◽

10.1094/phyto-12-19-0460-a ◽

2020 ◽

Vol 110 (9) ◽

pp. 1503-1506

Author(s):

Olufemi A. Akinsanmi ◽

Lilia C. Carvalhais

Keyword(s):

Plant Disease Resistance ◽

Plant Disease ◽

De Novo ◽

Draft Genome ◽

GC Content ◽

Disease Development ◽

Closely Related Species ◽

Protein Coding ◽

Protein Coding Genes ◽

The Family

Pseudocercospora macadamiae causes husk spot in macadamia in Australia. The lack of genomic resources for this pathogen has limited knowledge of the mechanism of disease development and spread, and of its role in fruit abscission. To address this gap, we sequenced the genome of P. macadamiae. The sequence was de novo assembled into a draft genome of 40 Mb, which is comparable to closely related species in the family Mycosphaerellaceae. The draft genome comprises 212 scaffolds, of which 99 scaffolds are over 50 kb. The genome has a 49% GC content and is predicted to contain 15,430 protein-coding genes. This draft genome sequence is the first for P. macadamiae and represents a valuable resource for understanding genome evolution and plant disease resistance.

Four high-quality draft genome assemblies of the marine heterotrophic nanoflagellate Cafeteria roenbergensis

10.1101/751586 ◽

2019 ◽

Cited By ~ 1

Author(s):

Thomas Hackl ◽

Roman Martin ◽

Karina Barenhoff ◽

Sarah Duponchel ◽

Dominik Heider ◽

...

Keyword(s):

De Novo ◽

Draft Genome ◽

GC Content ◽

Illumina MiSeq ◽

Evolutionary Analysis ◽

High Quality ◽

Protein Coding ◽

Repeat Content ◽

Genome Assemblies ◽

Cafeteria roenbergensis

Abstract The heterotrophic stramenopile Cafeteria roenbergensis is a globally distributed marine bacterivorous protist. This unicellular flagellate is host to the giant DNA virus CroV and the virophage mavirus. We sequenced the genomes of four cultured C. roenbergensis strains and generated 23.53 Gb of Illumina MiSeq data (99-282 × coverage per strain) and 5.09 Gb of PacBio RSII data (13-54 × coverage). Using the Canu assembler and customized curation procedures, we obtained high-quality draft genome assemblies with a total length of 34-36 Mbp per strain and contig N50 lengths of 148 kbp to 464 kbp. The C. roenbergensis genome has a GC content of ~70%, a repeat content of ~28%, and is predicted to contain approximately 7857-8483 protein-coding genes based on a combination of de novo, homology-based and transcriptome-supported annotation. These first high-quality genome assemblies of a Bicosoecid fill an important gap in sequenced Stramenopile representatives and enable a more detailed evolutionary analysis of heterotrophic protists.
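
The contig N50 quoted above is a contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the usual definition follows. The function name `n50` and the toy lengths are illustrative, not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until
    the running sum reaches at least half of the total assembly length;
    the length of the contig that crosses that threshold is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]  # toy data: total 20, half 10
    print(n50(toy_lengths))  # 6 + 5 = 11 >= 10, so the N50 is 5
```

Because the statistic depends only on the length distribution, a higher N50 indicates a more contiguous assembly but says nothing about its correctness.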

From de novo to ‘de nono’: most novel protein coding genes identified with phylostratigraphy represent old genes or recent duplicates

10.1101/287193 ◽

2018 ◽

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Sequence Similarity ◽

GC Content ◽

Protein Coding ◽

Protein Coding Genes ◽

Gene Sets ◽

De Novo Genes ◽

De Novo Gene ◽

Similarity Searches ◽

Novel Protein

Abstract The evolution of novel protein-coding genes from noncoding regions of the genome is one of the most compelling lines of evidence for genetic innovation in nature. One popular approach to identify de novo genes is phylostratigraphy, which consists of determining the approximate time of origin (age) of a gene based on its distribution along a species phylogeny. Several studies have revealed significant flaws in determining the age of genes, including de novo genes, using phylostratigraphy alone. However, the rate of false positives in de novo gene surveys based on phylostratigraphy remains unknown. Here, I re-analyze the findings from three studies, two of which identified tens to hundreds of rodent-specific de novo genes adopting a phylostratigraphy-centered approach. Most of the putative de novo genes discovered in these investigations are no longer included in recently updated mouse gene sets. Using a combination of synteny information and sequence similarity searches, I show that about 60% of the remaining 381 putative de novo genes share homology with genes from other vertebrates, originated through gene duplication, and/or share no synteny information with non-rodent mammals. These results led to an estimated rate of ∼12 de novo genes per million years in mouse. Contrary to a previous study (Wilson et al. 2017), I found no evidence supporting the preadaptation hypothesis of de novo gene formation. Nearly half of the de novo genes confirmed in this study are within older genes, indicating that co-option of preexisting regulatory regions and a higher GC content may facilitate the origin of novel genes.

The architecture of the Plasmodiophora brassicae nuclear and mitochondrial genomes

Scientific Reports ◽

10.1038/s41598-019-52274-7 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 2

Author(s):

Suzana Stjelja ◽

Johan Fogelqvist ◽

Christian Tellgren-Roth ◽

Christina Dixelius

Keyword(s):

De Novo ◽

Genetic Recombination ◽

Plasmodiophora brassicae ◽

Phylogenetic Analyses ◽

GC Content ◽

Nuclear Genome ◽

Protein Coding ◽

Clubroot Disease ◽

Long Read ◽

Mt Genome

Abstract Plasmodiophora brassicae is a soil-borne pathogen that attacks roots of cruciferous plants causing clubroot disease. The pathogen belongs to the Plasmodiophorida order in Phytomyxea. Here we used long-read SMRT technology to clarify the P. brassicae e3 genomic constituents along with comparative and phylogenetic analyses. Twenty contigs representing the nuclear genome and one mitochondrial (mt) contig were generated, together comprising 25.1 Mbp. Thirteen of the 20 nuclear contigs represented chromosomes from telomere to telomere characterized by [TTTTAGGG] sequences. Seven active gene candidates encoding synaptonemal complex-associated and meiotic-related protein homologs were identified, a finding that argues for possible genetic recombination events. The circular mt genome is large (114,663 bp), gene dense and intron rich. It shares high synteny with the mt genome of Spongospora subterranea, except in a unique 12 kb region delimited by shifts in GC content and containing tandem minisatellite- and microsatellite repeats with partially palindromic sequences. De novo annotation identified 32 protein-coding genes, 28 structural RNA genes and 19 ORFs. ORFs predicted in the repeat-rich region showed similarities to diverse organisms suggesting possible evolutionary connections. The data generated here form a refined platform for the next step involving functional analysis, all to clarify the complex biology of P. brassicae.

The whole-genome sequence analysis of Morchella sextelata

Scientific Reports ◽

10.1038/s41598-019-51831-4 ◽

2019 ◽

Vol 9 (1) ◽

Author(s):

Mei-Han ◽

Qingshan-Wang ◽

Baiyintala ◽

Wuhanqimuge

Keyword(s):

De Novo ◽

GC Content ◽

Gene Clusters ◽

Essential Amino Acids ◽

Biologically Active ◽

Whole Genome Sequence ◽

Daily Consumption ◽

Illumina HiSeq ◽

Protein Coding ◽

Morphological Studies

Abstract Morchella are macrofungi and are also called morels, as they exhibit a morel-like upper cap structure. Morels contain abundant essential amino acids, vitamins and biologically active compounds, which provide substantial health benefits. Approximately 80 species of Morchella have been reported, and even more species have been isolated. However, the lack of wild Morchella resources and the difficulties associated with culturing Morchella have caused a shortage in the morels available for daily consumption. Additionally, in-depth genomic and morphological studies are still needed. In this study, to provide genomic data for further investigations of culturing techniques and the biological functions of Morchella sextelata (M. sextelata), de novo genome sequencing was carried out on the Illumina HiSeq 4000 platform using both the Illumina 150 and PacBio systems. The final estimated genome size of M. sextelata was 52.93 Mb, containing 59 contigs and a GC content of 47.37%. A total of 9,550 protein-coding genes were annotated. In addition, the repeat sequences, gene components and gene functions were analyzed using various databases. Furthermore, the secondary metabolite gene clusters and the predicted structures of their products were analyzed. Finally, a genomic comparison of different species of Morchella was performed.

De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data

10.1101/267062 ◽

2018 ◽

Cited By ~ 5

Author(s):

Adam Ameur ◽

Huiwen Che ◽

Marcel Martin ◽

Ignas Bunikis ◽

Johan Dahlberg ◽

...

Keyword(s):

De Novo ◽

GC Content ◽

Variant Calling ◽

Whole Genome Sequencing Data ◽

Personal Genome ◽

Sequencing Data ◽

Swedish Population ◽

Protein Coding ◽

Sequencing Project ◽

Population Scale

Abstract We have performed de novo assembly of two Swedish genomes using long-read sequencing and optical mapping, resulting in total assembly sizes of nearly 3 Gb and hybrid scaffold N50 values of over 45 Mb. A further analysis revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have elevated GC-content and are primarily located in centromeric or telomeric regions. A BLAST search showed that 31% of the NS are different from any sequences deposited in nucleotide databases. The remaining NS correspond to human (62%) or primate (6%) nucleotide entries, while 1% of hits show the highest similarity to other species, including mouse and a few different classes of parasitic worms. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are missing from GRCh38 also at chromosomes 14, 17 and 21. Inclusion of these novel sequences into the GRCh38 reference radically improves the alignment and variant calling of whole-genome sequencing data at several genomic loci. Through a re-analysis of 200 samples from a Swedish population-scale sequencing project, we obtained over 75,000 putative novel SNVs per individual when using a custom version of GRCh38 extended with 17.3 Mb of NS. In addition, about 10,000 false positive SNV calls per individual were removed from the GRCh38 autosomes and sex chromosomes in the re-analysis, with some of them located in protein coding regions.

Hybrid genome de novo assembly with methylome analysis of the anaerobic thermophilic subsurface bacterium Thermanaerosceptrum fracticalcis strain DRI-13T

BMC Genomics ◽

10.1186/s12864-021-07535-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Trevor R. Murphy ◽

Rui Xiao ◽

Scott D. Hamilton-Brehm

Keyword(s):

DNA Methyltransferase ◽

De Novo ◽

GC Content ◽

Hybrid Assembly ◽

Protein Coding ◽

Transcription Start Sites ◽

Circular Genome ◽

Unknown Protein ◽

Hybrid Genome ◽

Methylome Analysis

Abstract Background There is a dearth of sequenced and closed microbial genomes from environments more than 500 m below the terrestrial surface. Coupled with even fewer cultured isolates, study and understanding of how life endures in the extreme oligotrophic subsurface environments is greatly hindered. Using a de novo hybrid assembly of Illumina and Oxford Nanopore sequences we produced a circular genome with corresponding methylome profile of the recently characterized thermophilic, anaerobic, and fumarate-respiring subsurface bacterium, Thermanaerosceptrum fracticalcis, strain DRI-13T to understand how this microorganism survives the deep subsurface. Results The hybrid assembly produced a single circular genome of 3.8 Mb in length with an overall GC content of 45%. Out of the total 4022 annotated genes, 3884 are protein coding, 87 are RNA encoding genes, and the remaining 51 genes were associated with regulatory features of the genome including riboswitches and T-box leader sequences. Approximately 24% of the protein coding genes were hypothetical. Analysis of the strain DRI-13T genome revealed: 1) energy conservation by bifurcation hydrogenase when growing on fumarate, 2) four novel bacterial prophages, 3) a methylation profile including 76.4% N6-methyladenine and 3.81% 5-methylcytosine corresponding to novel DNA methyltransferase motifs, as well as a cluster of 45 genes of unknown protein families that have enriched DNA mCpG proximal to the transcription start sites, and 4) a putative core of bacteriophage exclusion (BREX) genes surrounded by hypothetical proteins, with predicted functions as helicases, nucleases, and exonucleases. Conclusions The de novo hybrid assembly of the strain DRI-13T genome has provided a more contiguous and accurate view of the subsurface bacterium T. fracticalcis, strain DRI-13T. This genome analysis reveals a physiological focus supporting syntrophy, non-homologous double stranded DNA repair, mobility/adherence/chemotaxis, unique methylome profile/recognized motifs, and a BREX defense system. The key to microbial subsurface survival may not rest on genetic diversity, but rather on specific syntrophy niches and novel methylation strategies.

Faculty Opinions recommendation of Predictions of Backbone Dynamics in Intrinsically Disordered Proteins Using De Novo Fragment-Based Protein Structure Predictions.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727869450.793535408 ◽

2017 ◽

Author(s):

Vladimir Uversky

Keyword(s):

Protein Structure ◽

Intrinsically Disordered Proteins ◽

De Novo ◽

Backbone Dynamics ◽

Disordered Proteins ◽

Intrinsically Disordered
