A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa

Mapping Intimacies ◽

10.1101/115345 ◽

2017 ◽

Author(s):

Megan J. Bowman ◽

Jane A. Pulman ◽

Tiffany L. Liu ◽

Kevin L. Childs

Keyword(s):

Oryza Sativa ◽

Gene Annotation ◽

Gene Prediction ◽

Biological Significance ◽

Gc Content ◽

Training Data ◽

Structural Annotation ◽

Gene Variation ◽

A Genome ◽

Grass Genomes

Abstract Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. We find that gene prediction programs trained on genes with random GC content do not completely predict all grass genes with extreme GC content. We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa.
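To make the idea of GC-specific training sets concrete, the following is a minimal sketch of how candidate training genes could be binned by GC content. It is not the authors' MAKER protocol: the function names `gc_content` and `split_by_gc` and the cutoffs 0.45 and 0.65 are hypothetical and chosen only for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def split_by_gc(genes: dict[str, str], low: float = 0.45, high: float = 0.65):
    """Partition gene IDs into low, mid and high GC bins.

    The cutoffs are illustrative placeholders, not values from the paper.
    """
    bins = {"low": [], "mid": [], "high": []}
    for gene_id, seq in genes.items():
        gc = gc_content(seq)
        key = "low" if gc < low else "high" if gc > high else "mid"
        bins[key].append(gene_id)
    return bins


# Toy sequences standing in for high-quality gene models.
genes = {"g1": "ATATATGCAT", "g2": "GCGCGCGCAT", "g3": "ATGCATGCGC"}
bins = split_by_gc(genes)
```

Each bin could then, in principle, be used to train a separate gene predictor, so that genes with extreme GC content are represented in at least one training set.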

GPRED-GC: a Gene PREDiction model accounting for 5′-3′ GC gradient

BMC Bioinformatics ◽

10.1186/s12859-019-3047-3 ◽

2019 ◽

Vol 20 (S15) ◽

Cited By ~ 1

Author(s):

Prapaporn Techa-Angkoon ◽

Kevin L. Childs ◽

Yanni Sun

Keyword(s):

Ab Initio ◽

Gene Annotation ◽

Gene Prediction ◽

Source Code ◽

Gc Content ◽

Prediction Tools ◽

Homologous Sequences ◽

Manual Intervention ◽

Grass Genomes ◽

Gc Contents

Abstract Background Gene prediction is a key step in genome annotation. Ab initio gene prediction enables gene annotation of new genomes regardless of the availability of homologous sequences. A number of ab initio gene prediction tools exist and have been widely used to annotate genes in various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5′-3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy of predicting genes with GC gradients. Results In this work, we designed and implemented a new hidden Markov model (HMM)-based ab initio gene prediction tool, which is optimized for finding genes with highly variable GC content, such as the genes with negative GC gradients in grass genomes. We tested the tool on three datasets from Arabidopsis thaliana and Oryza sativa. The results showed that our tool can identify genes missed by existing tools because of their highly variable GC content. Conclusions GPRED-GC can effectively predict genes with highly variable GC content without manual intervention. It provides a useful complementary tool to existing ones such as Augustus for more sensitive gene discovery. The source code is freely available at https://sourceforge.net/projects/gpred-gc/.
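The 5′-3′ GC gradient mentioned above can be measured descriptively with a sliding window. The sketch below is not GPRED-GC's HMM and does not predict genes. It only estimates whether GC content falls or rises along a sequence, and the names `window_gc` and `gc_slope` and the window and step sizes are assumptions made for this example.

```python
def window_gc(seq: str, window: int = 50, step: int = 25) -> list[float]:
    """GC fraction in sliding windows, ordered from the 5' end to the 3' end."""
    seq = seq.upper()
    values = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        if chunk:
            values.append((chunk.count("G") + chunk.count("C")) / len(chunk))
    return values


def gc_slope(values: list[float]) -> float:
    """Least-squares slope of GC fraction against window index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Toy sequence: GC-rich 5' half followed by an AT-rich 3' half.
toy = "GC" * 50 + "AT" * 50
profile = window_gc(toy)
slope = gc_slope(profile)  # negative for a decreasing 5'-3' gradient
```

A negative slope indicates a decreasing gradient of the kind the abstract describes for some grass genes.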

Novel metrics for quantifying bacterial genome composition skews

10.1101/176370 ◽

2017 ◽

Author(s):

Lena M. Joesch-Cohen ◽

Max Robinson ◽

Neda Jabbari ◽

Christopher Lausted ◽

Gustavo Glusman

Keyword(s):

Gene Annotation ◽

Bacterial Species ◽

Bacterial Genome ◽

Gc Content ◽

Bacterial Genomes ◽

Genome Composition ◽

Single Genome ◽

A Genome ◽

Dna Strands ◽

Interactive Visualizations

Abstract Background Bacterial genomes have characteristic compositional skews, which are differences in nucleotide frequency between the leading and lagging DNA strands across a segment of a genome. It is thought that these strand asymmetries arise as a result of mutational biases and selective constraints, particularly for energy efficiency. Analysis of compositional skews in a diverse set of bacteria provides a comparative context in which mutational and selective environmental constraints can be studied. These analyses typically require finished and well-annotated genomic sequences. Results We present three novel metrics for examining genome composition skews; all three metrics can be computed for unfinished or partially annotated genomes. The first two metrics (dot-skew and cross-skew) depend on the sequence and gene annotation of a single genome, while the third metric (residual skew) highlights unusual genomes by subtracting a GC content-based model of a library of genome sequences. We applied these metrics to all 7738 available bacterial genomes, including partial drafts, and identified outlier species. A number of these outliers (i.e., Borrelia, Ehrlichia, Kinetoplastibacterium, and Phytoplasma) display similar skew patterns despite being only distantly related phylogenetically. While unrelated, some of the outlier bacterial species share lifestyle characteristics, in particular intracellularity and biosynthetic dependence on their hosts. Conclusions Our novel metrics appear to reflect the effects of biosynthetic constraints and adaptations to life within one or more hosts on genome composition. We provide results for each analyzed genome, software and interactive visualizations at http://db.systemsbiology.net/gestalt/skew_metrics.
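For readers unfamiliar with compositional skew, the sketch below computes the classic GC skew, (G − C) / (G + C), over consecutive windows. This is the conventional textbook quantity and is shown only as background. It is not the dot-skew, cross-skew, or residual skew metrics introduced in the paper, whose definitions are not given in the abstract. The function names and window size are illustrative assumptions.

```python
def gc_skew(seq: str) -> float:
    """Classic GC skew (G - C) / (G + C) for one sequence segment."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_skew(seq: str, window: int = 1000) -> list[float]:
    """GC skew in consecutive, non-overlapping windows along seq."""
    return [gc_skew(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy example: a G-rich segment followed by a C-rich segment.
skews = windowed_skew("GGGGCCCC", window=4)
```

In many bacterial genomes, a sign change in windowed skew of this kind tends to coincide with the replication origin or terminus.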

Protein length distribution is remarkably consistent across Life

10.1101/2021.12.03.470944 ◽

2021 ◽

Author(s):

Yannis Nevers ◽

Natasha Glover ◽

Christophe Dessimoz ◽

Odile Lecompte

Keyword(s):

Gene Annotation ◽

Length Distribution ◽

Gc Content ◽

Purifying Selection ◽

Genomic Features ◽

Protein Length ◽

Open Questions ◽

Size Number ◽

A Genome ◽

Living Species

Abstract In every living species, the function of a protein depends on its organisation of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species. Here we evaluated this diversity by comparing protein length distribution across 2,326 species (1,688 bacteria, 153 archaea and 485 eukaryotes). We found that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more consistent than previously thought, and provide evidence for a universal purifying selection on protein length, whose mechanism and fitness effect remain intriguing open questions.

G-OnRamp: Generating genome browsers to facilitate undergraduate-driven collaborative genome annotation

10.1101/781658 ◽

2019 ◽

Author(s):

Luke Sargent ◽

Yating Liu ◽

Wilson Leung ◽

Nathan T. Mortimer ◽

David Lopatto ◽

...

Keyword(s):

Genome Annotation ◽

Gene Annotation ◽

Sequence Similarity ◽

Gene Prediction ◽

Phenotypic Traits ◽

Wasp Species ◽

Major Barrier ◽

Link Type ◽

A Genome ◽

Genome Browsers

Abstract Scientists are sequencing new genomes at an increasing rate with the goal of associating genome contents with phenotypic traits. After a new genome is sequenced and assembled, structural gene annotation is often the first step in analysis. Despite advances in computational gene prediction algorithms, most eukaryotic genomes still benefit from manual gene annotation. Undergraduates can become skilled annotators, and in the process learn both about genes/genomes and about how to utilize large datasets. Data visualizations provided by a genome browser are essential for manual gene annotation, enabling annotators to quickly evaluate multiple lines of evidence (e.g., sequence similarity, RNA-Seq, gene predictions, repeats). However, creating genome browsers requires extensive computational skills; lack of the expertise required remains a major barrier for many biomedical researchers and educators. To address these challenges, the Genomics Education Partnership (GEP; https://gep.wustl.edu/) has partnered with the Galaxy Project (https://galaxyproject.org) to develop G-OnRamp (http://g-onramp.org), a web-based platform for creating UCSC Assembly Hubs and JBrowse genome browsers. G-OnRamp can also convert a JBrowse instance into an Apollo instance for collaborative genome annotations in research and educational settings. G-OnRamp enables researchers to easily visualize their experimental results, educators to create Course-based Undergraduate Research Experiences (CUREs) centered on genome annotation, and students to participate in genomics research. Development of G-OnRamp was guided by extensive user feedback from in-person workshops. Sixty-five researchers and educators from over 40 institutions participated in these workshops, which produced over 20 genome browsers now available for research and education.
For example, genome browsers for four parasitoid wasp species were used in a CURE engaging 142 students taught by 13 faculty members — producing a total of 192 gene models. G-OnRamp can be deployed on a personal computer or on cloud computing platforms, and the genome browsers produced can be transferred to the CyVerse Data Store for long-term access.

A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa

BMC Bioinformatics ◽

10.1186/s12859-017-1942-z ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 3

Author(s):

Megan J. Bowman ◽

Jane A. Pulman ◽

Tiffany L. Liu ◽

Kevin L. Childs

Keyword(s):

Oryza Sativa ◽

Gene Annotation ◽

Gc Content ◽

Novel Gene ◽

Annotation Method

Chromosome-level genome assembly and transcriptome-based annotation of the oleaginous yeast Rhodotorula toruloides CBS 14

10.1101/2021.04.09.439123 ◽

2021 ◽

Author(s):

Giselle De La Caridad Martin Hernandez ◽

Bettina Muller ◽

Mikolaj Chmielarz ◽

Christian Brandt ◽

Martin Hoelzer ◽

...

Keyword(s):

Oleaginous Yeast ◽

Gene Annotation ◽

Gc Content ◽

Lipid Synthesis ◽

Growth Conditions ◽

Specific Gene ◽

Molecular Physiology ◽

Total Size ◽

A Genome ◽

Genome Draft

Rhodotorula toruloides is an oleaginous yeast with high biotechnological potential. In order to understand the molecular physiology of lipid synthesis in R. toruloides and to advance metabolic engineering, a high-resolution genome is required. We constructed a genome draft of R. toruloides CBS 14, using a hybrid assembly approach, consisting of short and long reads generated by Illumina and Nanopore sequencing, respectively. The genome draft consists of 23 contigs and 3 scaffolds, with an N50 length of 1,529,952 bp, thus largely representing chromosomal organization. The total size is 20,534,857 bp with a GC content of 61.83%. Transcriptomic data from different growth conditions were used to aid species-specific gene annotation. In total we annotated 9,464 genes and identified 11,691 transcripts. Furthermore, we demonstrated the presence of a potential plasmid, an extrachromosomal circular structure of about 11 kb with a copy number about three times as high as that of the chromosomes.
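The N50 statistic quoted above is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that standard definition follows. The function name `n50` and the toy lengths are illustrative and are not taken from the paper's assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the loop always returns")


# Toy lengths (bp). Total is 20, and 6 + 5 = 11 covers at least half.
example = n50([2, 3, 4, 5, 6])
```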

Genome-wide association study and transcriptome analysis discover new genes for bacterial leaf blight resistance in rice (Oryza sativa L.)

BMC Plant Biology ◽

10.1186/s12870-021-03041-2 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Xinyue Shu ◽

Aijun Wang ◽

Bo Jiang ◽

Yuqi Jiang ◽

Xing Xiang ◽

...

Keyword(s):

Oryza Sativa ◽

Association Study ◽

Transcriptome Analysis ◽

Genome Wide Association Study ◽

Leaf Blight ◽

Genome Wide Association ◽

Bacterial Leaf Blight ◽

Significant Differential Expression ◽

Genome Wide ◽

A Genome

Abstract Background Rice (Oryza sativa) bacterial leaf blight (BLB), caused by the hemibiotrophic Xanthomonas oryzae pv. oryzae (Xoo), is one of the most devastating diseases affecting rice production worldwide. The development and use of resistant rice varieties or genes is currently the most effective strategy to control BLB. Results Here, we used 259 rice accessions genotyped with 2,888,332 high-confidence single nucleotide polymorphisms (SNPs). Combining resistance variation data of the 259 rice lines for two Xoo races observed over 2 years, we conducted a genome-wide association study (GWAS) to identify quantitative trait loci (QTL) conferring resistance against BLB. The expression levels of genes contained in the GWAS results were also compared between resistant and susceptible rice lines by transcriptome analysis at four time points after pathogen inoculation. From this analysis, 109 candidate resistance genes showing significant differential expression between resistant and susceptible rice lines were uncovered. Furthermore, haplotype block structure analysis predicted 58 candidate genes for BLB resistance based on Chr. 7_707158, which had the minimum P-value (−log10 P = 9.72). Among them, two NLR protein-encoding genes, LOC_Os07g02560 and LOC_Os07g02570, exhibited significantly high expression in the resistant line but low expression in the susceptible line. Conclusions Together, our results reveal novel BLB resistance gene resources and provide an important genetic basis for BLB resistance breeding of rice crops.

ASPic-GeneID: A Lightweight Pipeline for Gene Prediction and Alternative Isoforms Detection

BioMed Research International ◽

10.1155/2013/502827 ◽

2013 ◽

Vol 2013 ◽

pp. 1-11 ◽

Cited By ~ 4

Author(s):

Tyler Alioto ◽

Ernesto Picardi ◽

Roderic Guigó ◽

Graziano Pesole

Keyword(s):

Ab Initio ◽

Gene Annotation ◽

Gene Prediction ◽

Gene Prediction Program ◽

C Elegans ◽

Prediction Program ◽

First Pass ◽

Main Components ◽

Genome Projects ◽

Alternative Isoforms

New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.

The genome sequence of the European peacock butterfly, Aglais io (Linnaeus, 1758)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17204.1 ◽

2021 ◽

Vol 6 ◽

pp. 258

Author(s):

Konrad Lohse ◽

Alexander Mackintosh ◽

Roger Vila ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Protein Coding ◽

Individual Male ◽

Protein Coding Genes ◽

A Genome ◽

Inachis Io

We present a genome assembly from an individual male Aglais io (also known as Inachis io and Nymphalis io) (the European peacock; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 384 megabases in span. The majority (99.91%) of the assembly is scaffolded into 31 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 11,420 protein coding genes.

Draft Genome Sequence of Lactobacillus rhamnosus OSU-PECh-69, a Cheese Isolate with Antibacterial Activity

Microbiology Resource Announcements ◽

10.1128/mra.00803-20 ◽

2020 ◽

Vol 9 (37) ◽

Author(s):

Israel García-Cano ◽

Walaa E. Hussein ◽

Diana Rocha-Mendoza ◽

Ahmed E. Yousef ◽

Rafael Jiménez-Flores

Keyword(s):

Genome Sequence ◽

Antimicrobial Agents ◽

Draft Genome ◽

Lactobacillus Rhamnosus ◽

Gc Content ◽

Gene Clusters ◽

Gram Negative Bacteria ◽

The Novel ◽

Content Type ◽

A Genome

ABSTRACT The novel strain Lactobacillus rhamnosus OSU-PECh-69 was isolated from provolone cheese. It produces antimicrobial agents having a molecular mass of 5 to 10 kDa that are active against Gram-positive and Gram-negative bacteria. The strain has a genome sequence of 3,057,669 bp, a GC content of 46.6%, and up to two gene clusters encoding bacteriocins.
