OMGene: Mutual improvement of gene models through optimisation of evolutionary conservation

2017
Author(s):  
Michael P. Dunne ◽  
Steven Kelly

Abstract Background The accurate determination of the genomic coordinates for a given gene – its gene model – is of vital importance to the utility of its annotation, and the accuracy of bioinformatic analyses derived from it. Currently-available methods of computational gene prediction, while on the whole successful, often disagree on the model for a given predicted gene, with some or all of the variant gene models failing to match the biologically observed structure. Many prediction methods can be bolstered by using experimental data such as RNA-seq and mass spectrometry. However, these resources are not always available, and rarely give a comprehensive portrait of an organism’s transcriptome due to temporal and tissue-specific expression profiles. Results Orthology between genes provides evolutionary evidence to guide the construction of gene models. OMGene (Optimise My Gene) aims to optimise gene models in the absence of experimental data by optimising the derived amino acid alignments for gene models within orthogroups. Using RNA-seq data sets from plants and fungi, considering intron/exon junction representation and exon coverage, and assessing the intra-orthogroup consistency of subcellular localisation predictions, we demonstrate the utility of OMGene for improving gene models in annotated genomes. Conclusions We show that significant improvements in the accuracy of gene model annotations can be made in both established and de novo annotated genomes by leveraging information from multiple species.

Plants
2021
Vol 10 (7)
pp. 1465
Author(s):  
Ramon de Koning ◽  
Raphaël Kiekens ◽  
Mary Esther Muyoka Toili ◽  
Geert Angenon

Raffinose family oligosaccharides (RFO) play an important role in plants but are also considered to be antinutritional factors. A profound understanding of the galactinol and RFO biosynthetic gene families and the expression patterns of the individual genes is a prerequisite for the sustainable reduction of the RFO content in the seeds, without compromising normal plant development and functioning. In this paper, an overview of the annotation and genetic structure of all galactinol and RFO biosynthesis genes is given for soybean and common bean. In common bean, three galactinol synthase genes, two raffinose synthase genes and one stachyose synthase gene were identified for the first time. To discover the expression patterns of these genes in different tissues, two expression atlases have been created through re-analysis of publicly available RNA-seq data. De novo expression analysis through an RNA-seq study during seed development of three varieties of common bean gave more insight into the expression patterns of these genes during seed development. The results of the expression analysis suggest that different classes of galactinol and RFO synthase genes have tissue-specific expression patterns in soybean and common bean. With the obtained knowledge, important galactinol and RFO synthase genes that specifically play a key role in the accumulation of RFOs in the seeds are identified. These candidate genes may play a pivotal role in reducing the RFO content in the seeds of important legumes, which could improve the nutritional quality of these beans and relieve the discomfort associated with their consumption.


BMC Genomics
2019
Vol 20 (1)
Author(s):  
Inés González-Castellano ◽  
Chiara Manfrin ◽  
Alberto Pallavicini ◽  
Andrés Martínez-Lage

Abstract Background The common littoral shrimp Palaemon serratus is an economically important decapod resource in some European communities. Aquaculture practices prevent the genetic deterioration of wild stocks caused by overfishing and at the same time enhance production. The biotechnological manipulation of sex-related genes has proven potential to improve aquaculture production, but the scarcity of genomic data about P. serratus hinders these applications. RNA-Seq analysis has been performed on ovary and testis samples to generate a reference gonadal transcriptome. Differential expression analyses were conducted between three ovary and three testis samples sequenced by Illumina HiSeq 4000 PE100 to reveal sex-related genes with sex-biased or sex-specific expression patterns. Results A total of 224.5 and 281.1 million paired-end reads were produced from ovary and testis samples, respectively. De novo assembly of ovary and testis trimmed reads yielded a transcriptome with 39,186 transcripts. Of the transcriptome, 29.57% retrieved at least one annotation, and 11,087 differentially expressed genes (DEGs) were detected between ovary and testis replicates. A total of 6,207 genes were up-regulated in ovaries, while 4,880 genes were up-regulated in testes. Candidate genes involved in sexual development and gonadal development processes were retrieved from the transcriptome. These sex-related genes were discussed taking into account whether they were up-regulated in ovary, up-regulated in testis or not differentially expressed between gonads, and in the framework of previous findings in other crustacean species. Conclusions This is the first transcriptome analysis of P. serratus gonads using RNA-Seq technology. Interesting findings about sex-related genes from an evolutionary perspective (such as Dmrt1) and for putative future aquaculture applications (Iag or vitellogenesis genes) are reported here. We provide a valuable dataset that will facilitate further research into the reproductive biology of this shrimp.


2020
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.


2019
Vol 6 (1)
Author(s):  
Yannick Cogne ◽  
Davide Degli-Esposti ◽  
Olivier Pible ◽  
Duarte Gouveia ◽  
Adeline François ◽  
...  

Abstract Gammarids are amphipods distributed worldwide in fresh and marine waters. They play an important role in aquatic ecosystems and are well-established sentinel species in ecotoxicology. In this study, we sequenced the transcriptomes of a male individual and a female individual for seven different taxonomic groups belonging to the two genera Gammarus and Echinogammarus: Gammarus fossarum A, G. fossarum B, G. fossarum C, Gammarus wautieri, Gammarus pulex, Echinogammarus berilloni, and Echinogammarus marinus. These taxa were chosen to explore the molecular diversity of transcribed genes of genotyped individuals from these groups. Transcriptomes were de novo assembled and annotated. High-quality assembly was confirmed by BUSCO comparison against the Arthropod dataset. The 14 RNA-Seq-derived protein sequence databases proposed here will be a significant resource for proteogenomics studies of these ecotoxicologically relevant non-model organisms. These transcriptomes represent reliable reference sequences for whole-transcriptome and proteome studies on other gammarids, for primer design to clone specific genes or monitor their specific expression, and for analyses of molecular differences between gammarid species.


2020
Author(s):  
Xinlu Yuan ◽  
Jianjun Diao ◽  
Anqing Du ◽  
Song Wen ◽  
Ligang Zhou ◽  
...  

Abstract Background: Nonalcoholic fatty liver disease (NAFLD) is primarily characterized by hepatic cholesterol accumulation. Circular RNA (circRNA), a type of noncoding RNA, is involved in the progression of many liver diseases. However, no studies on circRNA expression profiles in NAFLD have been reported previously. Methods: A NAFLD mouse model was constructed by providing a high-fat diet (HFD) for 32 weeks. The circRNA expression profiles in normal mice and NAFLD mice were determined using high-throughput RNA sequencing and bioinformatics methods, while the differentially expressed circRNAs were confirmed using Sanger sequencing and qRT-PCR. The circRNA-miRNA network was also predicted. The biological functions of circRNAs were annotated by Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). Results: Immunohistology and serology assays demonstrated the successful construction of the NAFLD mouse model. In total, 93 dysregulated circRNAs were observed in the NAFLD group, including 57 upregulated circRNAs and 36 downregulated circRNAs. The circRNA-miRNA network revealed the complex interaction between circRNAs and their potential miRNA targets in NAFLD. Tissue-specific expression of circRNAs was demonstrated. The differentially expressed circRNAs with important biological functions were also annotated using GO and KEGG. Both the DDAH1 and VAV3 genes were found to be associated with NAFLD development. Conclusions: Taken together, this study demonstrated the circRNA expression profile and features in NAFLD, which may provide potential biological markers for the pathogenesis of NAFLD.


2020
Author(s):  
Abolfazl Doostparast Torshizi ◽  
Jubao Duan ◽  
Kai Wang

Abstract The importance of cell type-specific gene expression in disease-relevant tissues is increasingly recognized in genetic studies of complex diseases. However, the vast majority of gene expression studies are conducted on bulk tissues, necessitating computational approaches to infer biological insights on cell type-specific contribution to diseases. Several computational methods are available for cell type deconvolution (that is, inference of cellular composition) from bulk RNA-Seq data, but cannot impute cell type-specific expression profiles. We hypothesize that with external prior information such as single cell RNA-seq (scRNA-seq) and population-wide expression profiles, it can be computationally tractable and identifiable to estimate both cellular composition and cell type-specific expression from bulk RNA-Seq data. Here we introduce CellR, which addresses cross-individual gene expression variations by employing genome-wide tissue-wise expression signatures from GTEx to adjust the weights of cell-specific gene markers. It then transforms the deconvolution problem into a linear programming model while taking into account inter/intra cellular correlations, and uses a multi-variate stochastic search algorithm to estimate the expression level of each gene in each cell type. Extensive analyses on several complex diseases such as schizophrenia, Alzheimer’s disease, Huntington’s disease, and type 2 diabetes validated the efficiency of CellR, while revealing how specific cell types contribute to different diseases. We conducted numerical simulations on human cerebellum to generate pseudo-bulk RNA-seq data and demonstrated its efficiency in inferring cell-specific expression profiles. Moreover, we inferred cell-specific expression levels from bulk RNA-seq data on schizophrenia and computed differentially expressed genes within certain cell types. Using the predicted gene expression profile on excitatory neurons, we were able to reproduce our recently published findings on TCF4 being a master regulator in schizophrenia and showed how this gene and its targets are enriched in excitatory neurons. In summary, CellR not only compares favorably (in both accuracy and stability of inference) against competing approaches on inferring cellular composition from bulk RNA-seq data, but also allows direct imputation of cell type-specific gene expression, opening new doors to re-analyze gene expression data on bulk tissues in complex diseases.


2021
Vol 12
Author(s):  
Pengpeng Zhang ◽  
Mingxuan Sheng ◽  
Chunyu Du ◽  
Zhe Chao ◽  
Haixia Xu ◽  
...  

Brown adipose tissue (BAT) is specialized for energy expenditure, so a better understanding of the regulators influencing BAT development could provide novel strategies to combat obesity. Many protein-coding genes, miRNAs, and lncRNAs have been investigated in BAT development; however, the expression patterns and functions of circRNAs in brown adipogenesis have not yet been reported. This study determined the circRNA expression profiles across brown adipogenesis (proliferation, early differentiated, and fully differentiated stages) by RNA-seq. We identified 3,869 circRNAs, 36.9% of which were novel. We found that the biogenesis of circRNAs was significantly related to linear mRNA transcription, and almost 70% of circRNAs were generated by alternative back-splicing. Next, we examined the cell-specific and differentiation stage-specific expression of circRNAs. Compared to white adipocytes, nearly 30% of them were specifically expressed in brown adipocytes. Further, time-series expression analysis showed that circRNAs were dynamically expressed, and 117 differentially expressed circRNAs (DECs) in brown adipogenesis were identified, with 77 upregulated and 40 downregulated. Experimental validation showed that the identified circRNAs could be successfully amplified and that the expression levels detected by RNA-seq were reliable. For the potential functions of the circRNAs, GO analysis suggested that the decreased circRNAs were enriched in cell proliferation terms, while the increased circRNAs were enriched in development and thermogenic terms. Bioinformatics predictions showed that DECs contained numerous binding sites of functional miRNAs. More interestingly, most of the circRNAs contained multiple binding sites for the same miRNA, indicating that they may function by acting as microRNA sponges. Collectively, we characterized the circRNA expression profiles during brown adipogenesis and provide numerous novel circRNA candidates for future studies of brown adipogenesis regulation.


2021
Author(s):  
Satoshi Okubo ◽  
Kaede Terauchi ◽  
Shinji Okada ◽  
Takao Yamaura ◽  
Takumi Misaka ◽  
...  

Abstract Background Curculigo latifolia is a perennial plant native to Southeast Asia whose fruits contain the taste-modifying protein neoculin, which binds to sweet receptors and makes sour fruits taste sweet. Although similar to snowdrop (Galanthus nivalis) agglutinin (GNA), which contains mannose-binding sites in its sequence and 3D structure, neoculin lacks such sites and has no lectin activity. Whether the fruits of C. latifolia and other Curculigo plants contain neoculin and/or GNA family members was unclear. Results Through de novo RNA-seq assembly of the fruits of C. latifolia and the related C. capitulata and detailed analysis of the expression patterns of neoculin and neoculin-like genes in both species, we assembled 85,697 transcripts from C. latifolia and 76,775 from C. capitulata using Trinity and annotated them using public databases. We identified 70,371 unigenes in C. latifolia and 63,704 in C. capitulata. In total, 38.6% of unigenes from C. latifolia and 42.6% from C. capitulata shared high similarity between the two species. We identified ten neoculin-related transcripts in C. latifolia and 15 in C. capitulata, encoding both the basic and acidic subunits of neoculin in both plants. We aligned these 25 transcripts and generated a phylogenetic tree. Many orthologs in the two species shared high similarity, despite the low number of common genes, suggesting that these genes likely existed before the two species diverged. The relative expression levels of these genes differed considerably between the two species: the transcripts per million (TPM) values of neoculin genes were 60 times higher in C. latifolia than in C. capitulata, whereas those of GNA family members were 15,000 times lower in C. latifolia than in C. capitulata. Conclusions The genetic diversity of neoculin-related genes strongly suggests that neoculin genes underwent duplication during evolution. The marked differences in their expression profiles between C. latifolia and C. capitulata may be due to mutations in regions involved in transcriptional regulation. Comprehensive analysis of the genes expressed in the fruits of these two Curculigo species helped elucidate the origin of neoculin at the molecular level.
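
The expression comparison in this abstract is stated in transcripts per million (TPM). As a rough illustration only, the standard TPM normalisation can be sketched as follows; the function name, read counts and transcript lengths are hypothetical and are not data or code from the study.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).

    counts and lengths_kb are parallel lists: reads mapped to each
    transcript and that transcript's length in kilobases.
    """
    # Reads per kilobase: normalise each count by transcript length.
    rpk = [count / length for count, length in zip(counts, lengths_kb)]
    # Scale so that the values within one sample sum to one million.
    scale = sum(rpk) / 1_000_000
    return [value / scale for value in rpk]


# Hypothetical counts and lengths (kb) for three transcripts in one sample.
values = tpm([100, 200, 300], [1.0, 2.0, 1.5])

# Ratio of the third transcript's TPM to the first transcript's TPM.
fold_change = values[2] / values[0]
```

Because TPM values always sum to one million within a sample, a ratio between samples such as the one reported above reflects a transcript's share of each sample's expression rather than absolute abundance.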

