EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

Mapping Intimacies ◽

10.1101/307868 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alexander J. Hart ◽

Samuel Ginzburg ◽

Muyang (Sam) Xu ◽

Cera R. Fisher ◽

Nasim Rahmatpour ◽

...

Keyword(s):

Similarity Search ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Orthologous Gene ◽

Protein Domain ◽

Family Assessment ◽

Ontology Term ◽

Protein Coding ◽

Functional Gene Annotation

ABSTRACT EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and it runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.
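The selection step described in this abstract, choosing one assignment from weighted metrics for similarity score, taxonomic relationship, and informativeness, can be illustrated with a minimal sketch. The abstract does not give EnTAP's actual formula, so the `Hit` fields, the weights, and the normalization below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One candidate annotation for a transcript (all fields are illustrative)."""
    database: str
    bit_score: float    # similarity search score
    tax_distance: int   # hypothetical: 0 = same species, larger = more distant
    informative: bool   # hypothetical: False for e.g. "hypothetical protein"


def combined_score(hit, max_bit_score, max_tax_distance, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three metrics, each scaled to the range 0..1.

    The weights are assumptions, not values taken from EnTAP.
    """
    w_sim, w_tax, w_info = weights
    sim = hit.bit_score / max_bit_score if max_bit_score else 0.0
    tax = (1.0 - hit.tax_distance / max_tax_distance) if max_tax_distance else 1.0
    info = 1.0 if hit.informative else 0.0
    return w_sim * sim + w_tax * tax + w_info * info


def select_best_hit(hits):
    """Return the hit with the highest combined score, or None if there are no hits."""
    if not hits:
        return None
    max_bit = max(h.bit_score for h in hits)
    max_tax = max(h.tax_distance for h in hits)
    return max(hits, key=lambda h: combined_score(h, max_bit, max_tax))
```

With these assumed weights, a slightly weaker but taxonomically closer and informative hit can outrank the top-scoring uninformative one, which is the kind of trade-off the abstract describes.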


De Novo Whole-Genome Sequencing of the Wood Rot Fungus Polyporus brumalis, Which Exhibits Potential Terpenoid Metabolism

Genome Announcements ◽

10.1128/genomea.00586-17 ◽

2017 ◽

Vol 5 (28) ◽

Author(s):

Su-Yeon Lee ◽

Ji-eun An ◽

Sun-Hwa Ryu ◽

Myungkil Kim

Keyword(s):

Single Molecule ◽

De Novo ◽

Gene Annotation ◽

Draft Genome ◽

Fungal Growth ◽

Protein Coding ◽

Sequencing Platform ◽

Protein Coding Genes ◽

Polyporus Brumalis ◽

Terpenoid Metabolism

ABSTRACT Polyporus brumalis is able to synthesize several sesquiterpenes during fungal growth. Using a single-molecule real-time sequencing platform, we present the 53-Mb draft genome of P. brumalis, which contains 6,231 protein-coding genes. Gene annotation and isolation support genetic information, which can increase the understanding of sesquiterpene metabolism in P. brumalis.


The Draft Genome of the Endangered Sichuan Partridge (Arborophila rufipectus) with Evolutionary Implications

Genes ◽

10.3390/genes10090677 ◽

2019 ◽

Vol 10 (9) ◽

pp. 677 ◽

Cited By ~ 1

Author(s):

Chuang Zhou ◽

Hongmei Tu ◽

Haoran Yu ◽

Shuai Zheng ◽

Bo Dai ◽

...

Keyword(s):

De Novo ◽

Genome Structure ◽

Draft Genome ◽

Enrichment Analysis ◽

Phylogenetic Position ◽

Nucleotide Polymorphisms ◽

Protein Coding ◽

Go Enrichment ◽

And Behavior ◽

Or Genes

The Sichuan partridge (Arborophila rufipectus, Phasianidae, Galliformes) is distributed in south-west China and is classified as endangered. To examine the evolution and genomic features of the Sichuan partridge, we de novo assembled the Sichuan partridge reference genome. The final draft assembly consisted of approximately 1.09 Gb and had a scaffold N50 of 4.57 Mb. About 1.94 million heterozygous single-nucleotide polymorphisms (SNPs) were detected, 17,519 protein-coding genes were predicted, and 9.29% of the genome was identified as repetitive elements. A total of 56 olfactory receptor (OR) genes were found in the Sichuan partridge, and conserved motifs were detected. Comparisons between the Sichuan partridge genome and the chicken genome revealed a conserved genome structure, and phylogenetic analysis demonstrated that Arborophila possessed a basal phylogenetic position within Phasianidae. Gene Ontology (GO) enrichment analysis of positively selected genes (PSGs) in the Sichuan partridge showed over-represented GO functions related to environmental adaptation, such as energy metabolism and behavior. Pairwise sequentially Markovian coalescent analysis revealed the recent demographic trajectory of the Sichuan partridge. Our data and findings provide valuable genomic resources not only for studying evolutionary adaptation but also for facilitating the long-term conservation and genetic diversity of this endangered species.
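The scaffold N50 quoted in this abstract is a standard assembly-contiguity statistic: the length of the scaffold at which the cumulative length, counted from the longest scaffold down, first reaches half of the total assembly size. A minimal sketch of that definition follows; the function name and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length of
    the scaffold at which the running total first covers at least half of the
    summed length of all scaffolds.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For example, for scaffold lengths 2 through 10 (total 54), the three longest scaffolds sum to 27, exactly half, so the N50 is 8.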


Molecular Characterization of Donacia provosti (Coleoptera: Chrysomelidae) Larval Transcriptome by De Novo Assembly to Discover Genes Associated with Underwater Environmental Adaptations

Insects ◽

10.3390/insects12040281 ◽

2021 ◽

Vol 12 (4) ◽

pp. 281

Author(s):

Haixia Zhan ◽

Youssef Dewer ◽

Cheng Qu ◽

Shiyong Yang ◽

Chen Luo ◽

...

Keyword(s):

Molecular Mechanisms ◽

De Novo ◽

Orthologous Gene ◽

Leaf Beetle ◽

Scientific Basis ◽

Protein Coding ◽

Effective Prevention ◽

Gene Pairs ◽

Major Pest ◽

And Control

Donacia provosti (Fairmaire, 1885) is a major pest of aquatic crops. It is widely distributed around the world and causes extensive damage to lotus and rice plants. Changes in gene regulation may play an important role in adaptive evolution, particularly during adaptation to feeding and living habits. However, little is known about the evolution and molecular mechanisms underlying the adaptation of D. provosti to its lifestyle and living habits. To address this question, we generated the first larval transcriptome of D. provosti. A total of 20,692 unigenes were annotated from seven public databases, and around 18,536 protein-coding genes were predicted from the analysis of the D. provosti transcriptome. About 5036 orthologous clusters were identified among four species, and 494 unique clusters were identified from D. provosti larvae, including some associated with visual perception. Furthermore, to reveal the molecular differences between D. provosti and the Colorado potato beetle Leptinotarsa decemlineata, the CDS of the two beetles were compared and 6627 orthologous gene pairs were identified. Based on the ratio of nonsynonymous to synonymous substitutions, 93 orthologous gene pairs were found to be evolving under positive selection. Interestingly, our results also show that 4 of these 93 orthologous gene pairs were associated with the “mTOR signaling pathway” and are predicted to be involved in the molecular mechanism of D. provosti adaptation to the underwater environment. This study provides an important scientific basis for building an effective prevention and control system for the aquatic leaf beetle Donacia provosti.
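The positive-selection screen mentioned in this abstract rests on the ratio of nonsynonymous to synonymous substitution rates (often written dN/dS or Ka/Ks). A minimal sketch of such a filter is given below. It assumes the per-pair dN and dS values have already been estimated by a separate tool, and it uses the conventional cutoff of a ratio above 1; the abstract does not state the exact criteria the authors applied, and the pair names and values in the usage note are made up.

```python
def omega(dn, ds):
    """Return dN/dS for one gene pair, or None when dS is not positive."""
    if ds <= 0:
        return None  # ratio undefined; such pairs are skipped here
    return dn / ds


def positively_selected(pairs, threshold=1.0):
    """Return names of gene pairs whose dN/dS exceeds the threshold.

    `pairs` is an iterable of (name, dN, dS) tuples. The default threshold of
    1.0 is the usual convention and is an assumption, not the paper's setting.
    """
    selected = []
    for name, dn, ds in pairs:
        ratio = omega(dn, ds)
        if ratio is not None and ratio > threshold:
            selected.append(name)
    return selected
```

Called on `[("pairA", 0.30, 0.10), ("pairB", 0.02, 0.20), ("pairC", 0.05, 0.0)]`, the filter keeps only `pairA`: `pairB` has a ratio well below 1, and `pairC` is skipped because its dS is zero.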


FertilityOnline, a straight pipeline for functional gene annotation and disease mutation discovery, identifies novel infertility causative mutations in SYCE1 and STAG3

10.1101/2020.08.05.238162 ◽

2020 ◽

Author(s):

Jianing Gao ◽

Huan Zhang ◽

Xiaohua Jiang ◽

Asim Ali ◽

Daren Zhao ◽

...

Keyword(s):

Genetic Basis ◽

Animal Species ◽

Gene Annotation ◽

Enrichment Analysis ◽

Web Interface ◽

Mutation Discovery ◽

Human Genes ◽

Intensive Investigation ◽

User Friendly ◽

Functional Gene Annotation

Abstract The genetic basis of human infertility is currently under intensive investigation. However, only a handful of genes have been validated in animal models as disease-causing genes in infertile men. Thus, to better understand the genetic basis of spermatogenesis in humans and to bridge the knowledge gap between humans and other animal species, we have constructed the FertilityOnline database, a resource that integrates the functional genes reported in the literature related to spermatogenesis into an existing spermatogenic database, SpermatogenesisOnline 1.0. Additional features, such as functional annotation and statistical analysis of genetic variants of human genes, are also incorporated into FertilityOnline. By searching this database, users can focus on the top candidate genes associated with infertility and can perform enrichment analysis to instantly refine the number of candidates in a user-friendly web interface. Clinical validation of this database is established by the identification of novel causative mutations in SYCE1 and STAG3 in azoospermic men. In conclusion, FertilityOnline is not only an integrated resource for the analysis of spermatogenic genes but also a useful tool that facilitates the study of the underlying genetic basis of male infertility. Availability: FertilityOnline can be freely accessed at http://mcg.ustc.edu.cn/bsc/spermgenes2.0/index.html.


GeneMark-HM: improving gene prediction in DNA sequences of human microbiome

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab047 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Alexandre Lomsadze ◽

Christophe Bonny ◽

Francesco Strozzi ◽

Mark Borodovsky

Keyword(s):

Dna Sequences ◽

De Novo ◽

Gene Annotation ◽

Selection Process ◽

Gene Prediction ◽

Bacterial Species ◽

Human Microbiome ◽

Genome Database ◽

Protein Coding ◽

Pan Genome

Abstract Computational reconstruction of nearly complete genomes from metagenomic reads may identify thousands of new uncultured candidate bacterial species. We have shown that reconstructed prokaryotic genomes along with genomes of sequenced microbial isolates can be used to support more accurate gene prediction in novel metagenomic sequences. We have proposed an approach that used three types of gene prediction algorithms and found for all contigs in a metagenome nearly optimal models of protein-coding regions either in libraries of pre-computed models or constructed de novo. The model selection process and gene annotation were done by the new GeneMark-HM pipeline. We have created a database of the species level pan-genomes for the human microbiome. To create a library of models representing each pan-genome we used a self-training algorithm GeneMarkS-2. Genes initially predicted in each contig served as queries for a fast similarity search through the pan-genome database. The best matches led to selection of the model for gene prediction. Contigs not assigned to pan-genomes were analyzed by crude, but still accurate models designed for sequences with particular GC compositions. Tests of GeneMark-HM on simulated metagenomes demonstrated improvement in gene annotation of human metagenomic sequences in comparison with the current state-of-the-art gene prediction tools.


Opposite Polarity Monospore Genome De Novo Sequencing and Comparative Analysis Reveal the Possible Heterothallic Life Cycle of Morchella importuna

International Journal of Molecular Sciences ◽

10.3390/ijms19092525 ◽

2018 ◽

Vol 19 (9) ◽

pp. 2525 ◽

Cited By ~ 8

Author(s):

Wei Liu ◽

LianFu Chen ◽

YingLi Cai ◽

QianQian Zhang ◽

YinBing Bian

Keyword(s):

Mating Type ◽

De Novo ◽

Physical Map ◽

Gene Annotation ◽

Divergence Time ◽

Enrichment Analysis ◽

Single Copy ◽

Opposite Polarity ◽

De Novo Sequencing ◽

Edible Fungus

Morchella is a popular edible fungus worldwide due to its rich nutrition and unique flavor. Many research efforts have been made on the domestication and cultivation of Morchella all over the world. In recent years, the cultivation of Morchella was successfully commercialized in China. However, its biology is not well understood, which restricts the further development of the morel cultivation industry. In this paper, we performed de novo sequencing and assembly of the genomes of two monospores with different mating types (M04M24 and M04M26) isolated from the commercially cultivated strain M04. Gene annotation and comparative genome analysis were performed to study differences in CAZyme (carbohydrate-active enzyme) content, transcription factors, duplicated sequences, structure of mating-type sites, and differences at the gene and functional levels between the two monospore strains of M. importuna. Results showed that the de novo assembled haploid M04M24 and M04M26 genomes were 48.98 and 51.07 Mb, respectively. A complete fine physical map of M. importuna was obtained from genome coverage and gene completeness evaluation. A total of 10,852 and 10,902 common genes and 667 and 868 endemic genes were identified from the two monospore strains, respectively. The Gene Ontology (GO) and KAAS (KEGG Automatic Annotation Server) enrichment analyses showed that the endemic genes performed different functions. The two monospore strains had 99.22% collinearity with each other, accompanied by certain position and rearrangement events. Analysis of complete mating-type loci revealed that the two monospore M. importuna strains contained an independent mating-type structure and remained conserved in sequence and location. The phylogeny and divergence time of M. importuna were analyzed at the whole-genome level for the first time. The bifurcation time of morel and tuber was estimated to be 201.14 million years ago (Mya); the two monospore strains with different mating types represented the evolution of different nuclei, and the single-copy homologous genes between them also differed, with a genetic differentiation distance of about 0.65 Mya. Compared with truffles, M. importuna had an expansion of 28 clusters of orthologous genes (COGs) and a contraction of two COGs. The two different polar nuclei with different degrees of contraction and expansion suggested that they might have undergone different evolutionary processes. The different mating-type structures, together with the functional clustering and enrichment analysis results of the endemic genes of the two different polar nuclei, imply that M. importuna might be a heterothallic fungus and that the interaction between the endemic genes may be necessary for its complete life history. Studies on the genome of M. importuna facilitate a better understanding of morel biology and evolution.


Transcriptome sequencing and drought resistance gene annotation in Quercus liaotungensis leaves

Acta Physiologiae Plantarum ◽

10.1007/s11738-021-03294-2 ◽

2021 ◽

Vol 43 (8) ◽

Author(s):

Guobao Wang ◽

Li Qin

Keyword(s):

Candidate Genes ◽

Drought Resistance ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Pathway Enrichment Analysis ◽

Sequencing Platform ◽

Drought Avoidance ◽

Transcriptomic Sequencing ◽

Genes Encoding

Abstract Q. liaotungensis is an important drought-resistant tree species in Northeast China, where the climate is dry with little rain. In this study, we performed deep transcriptomic sequencing of Q. liaotungensis leaves, including de novo assembly and functional annotation, to screen for candidate genes involved in drought avoidance. A total of 25,593 unigenes were obtained from the Illumina sequencing platform. According to Gene Ontology annotation and KEGG pathway enrichment analysis, we screened a series of candidate genes encoding SOD, POD, CAT, DREB, MYB, WRKY, bZIP, and NAC from the Q. liaotungensis leaf transcriptome, all of which are potentially involved in drought resistance. The results of this study expand the genetic resources of Q. liaotungensis and provide a theoretical basis for further exploring the functional gene information of Q. liaotungensis.


Genome Assembly and Transcriptome Analysis of the Fungus Coniella diplodiella During Infection on Grapevine (Vitis vinifera L.)

Frontiers in Microbiology ◽

10.3389/fmicb.2020.599150 ◽

2021 ◽

Vol 11 ◽

Author(s):

Ruitao Liu ◽

Yiming Wang ◽

Peng Li ◽

Lei Sun ◽

Jianfu Jiang ◽

...

Keyword(s):

Molecular Mechanisms ◽

De Novo ◽

Enrichment Analysis ◽

Effector Proteins ◽

White Rot ◽

Vitis Vinifera L ◽

Protein Coding ◽

Metabolite Synthesis ◽

Serious Disease ◽

Genome Information

Grape white rot caused by Coniella diplodiella (Speg.) affects the production and quality of grapevine in China and other grapevine-growing countries. Despite the importance of C. diplodiella as a serious disease-causing agent in grape, the genome information and molecular mechanisms underlying its pathogenicity are poorly understood. To bridge this gap, 40.93 Mbp of C. diplodiella strain WR01 was de novo assembled. A total of 9,403 putative protein-coding genes were predicted. Among these, 608 and 248 genes potentially encode secreted proteins and candidate effector proteins (CEPs), respectively. Additionally, the transcriptome of C. diplodiella was analyzed after feeding with crude grapevine leaf homogenates, which revealed the transcriptional expression of 9,115 genes. Gene ontology enrichment analysis indicated that the highly enriched genes are related to carbohydrate metabolism and secondary metabolite synthesis. Forty-three putative effectors were cloned from C. diplodiella and used for further functional analysis. Among them, one protein exhibited a strong effect in suppressing the BCL2-associated X (BAX)-induced hypersensitive response when transiently expressed in Nicotiana benthamiana leaves. This work provides a valuable genetic basis for understanding the molecular mechanism underlying the C. diplodiella-grapevine interaction.


A map of constrained coding regions in the human genome

10.1101/220814 ◽

2017 ◽

Cited By ~ 8

Author(s):

James M. Havrilla ◽

Brent S. Pedersen ◽

Ryan M. Layer ◽

Aaron R. Quinlan

Keyword(s):

Human Genome ◽

Developmental Disorders ◽

De Novo ◽

Purifying Selection ◽

Protein Domain ◽

De Novo Mutations ◽

Protein Coding ◽

Constrained Coding ◽

Coding Regions ◽

Pathogenic Variants

ABSTRACT Deep catalogs of genetic variation collected from many thousands of humans enable the detection of intraspecies constraint by revealing coding regions with a scarcity of variation. While existing techniques summarize constraint for entire genes, single metrics cannot capture the fine-scale variability in constraint within each protein-coding gene. To provide greater resolution, we have created a detailed map of constrained coding regions (CCRs) in the human genome by leveraging coding variation observed among 123,136 humans from the Genome Aggregation Database (gnomAD). The most constrained coding regions in our map are enriched for both pathogenic variants in ClinVar and de novo mutations underlying developmental disorders. CCRs also reveal protein domain families under high constraint, suggest unannotated or incomplete protein domains, and facilitate the prioritization of previously unseen variation in studies of disease. Finally, a subset of CCRs with the highest constraint likely exist within genes that cause yet unobserved human phenotypes owing to strong purifying selection.


Genome Sequence of the Asian Honeybee in Pakistan Sheds Light on Its Phylogenetic Relationship with Other Honeybees

Insects ◽

10.3390/insects12070652 ◽

2021 ◽

Vol 12 (7) ◽

pp. 652

Author(s):

Hongwei Tan ◽

Muhammad Naeem ◽

Hussain Ali ◽

Muhammad Shakeel ◽

Haiou Kuang ◽

...

Keyword(s):

Phylogenetic Relationship ◽

Genome Sequence ◽

Apis Cerana ◽

Gc Content ◽

Protein Domain ◽

Pollination Services ◽

Protein Coding ◽

Close Relationship ◽

Genome Scale ◽

Asian Honeybee

In Pakistan, Apis cerana, the Asian honeybee, has been used for honey production and pollination services. However, its genomic makeup and phylogenetic relationship with those in other countries are still unknown. We collected A. cerana samples from the main cerana-keeping region in Pakistan and performed whole genome sequencing. A total of 28 Gb of Illumina shotgun reads were generated, which were used to assemble the genome. The obtained genome assembly had a total length of 214 Mb, with a GC content of 32.77%. The assembly had a scaffold N50 of 2.85 Mb and a BUSCO completeness score of 99%, suggesting a remarkably complete genome sequence for A. cerana in Pakistan. A MAKER pipeline was employed to annotate the genome sequence, and a total of 11,864 protein-coding genes were identified. Of them, 6750 genes were assigned at least one GO term, and 8813 genes were annotated with at least one protein domain. Genome-scale phylogeny analysis indicated an unexpectedly close relationship between A. cerana in Pakistan and those in China, suggesting a potential human introduction of the species between the two countries. Our results will facilitate the genetic improvement and conservation of A. cerana in Pakistan.
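The GC content reported in this abstract is the percentage of G and C bases among the called bases of the assembly. A minimal sketch of that calculation is shown below; the function name is illustrative, and excluding ambiguous bases such as N from the denominator is an assumption, since the abstract does not say how they were handled.

```python
def gc_content(sequence):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T).

    Ambiguous symbols such as N are ignored, which is an assumption made for
    this sketch rather than a detail taken from the paper.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    if called == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / called
```

For a whole assembly, the same counts would be accumulated over all scaffolds before taking the final ratio, rather than averaging per-scaffold percentages.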
