GFF3sort: a novel tool to sort GFF3 files for tabix indexing


10.1101/145938 ◽

2017 ◽

Author(s):

Tao Zhu ◽

Chengzhen Liang ◽

Zhigang Meng ◽

Sandui Guo ◽

Rui Zhang

Keyword(s):

Data Processing ◽

Genome Annotation ◽

Traditional Method ◽

Gene Annotation ◽

Conversion Process ◽

Start Position ◽

Annotation Data ◽

Novel Strategy ◽

Parent Child

Abstract Background: The traditional method of visualizing gene annotation data in JBrowse is converting GFF3 files to JSON format, which is time-consuming. The latest version of JBrowse supports rendering sorted GFF3 files indexed by tabix, a novel strategy that is more convenient than the original conversion process. However, current tools available for GFF3 file sorting have some limitations and their sorting results would lead to erroneous rendering in JBrowse. Results: We developed GFF3sort, a script to sort GFF3 files for tabix indexing. Specifically designed for JBrowse rendering, GFF3sort can properly deal with the order of features that have the same chromosome and start position, either by remembering their original orders or by conducting parent-child topology sorting. Based on our test datasets from seven species, GFF3sort produced accurate sorting results with acceptable efficiency compared with currently available tools. Conclusions: GFF3sort is a novel tool to sort GFF3 files for tabix indexing. We anticipate that GFF3sort will be useful to help with genome annotation data processing and visualization.
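The ordering problem described in the abstract above can be illustrated with a minimal Python sketch. It is not the GFF3sort implementation: the function names `parse_attrs` and `sort_gff3_lines` are hypothetical, sequence IDs are ordered lexically, and ties on sequence and start position are broken by each feature's depth in the `Parent` hierarchy (which places parents before children) and then by original line order.

```python
def parse_attrs(column9):
    """Parse a GFF3 attribute column ("ID=x;Parent=y") into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def sort_gff3_lines(lines):
    """Return feature lines sorted by (seqid, start, hierarchy depth, input order).

    Comment/directive lines and blank lines are dropped in this sketch.
    """
    feats = []
    for order, raw in enumerate(lines):
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = parse_attrs(cols[8])
        parents = attrs["Parent"].split(",") if "Parent" in attrs else []
        feats.append((cols[0], int(cols[3]), attrs.get("ID"), parents, order, line))

    # Map each ID to the parents declared for it.
    parents_of = {f[2]: f[3] for f in feats if f[2] is not None}
    cache = {}

    def depth(feature_id, parents, active):
        """Longest chain of known ancestors; 0 for top-level features."""
        if feature_id is not None and feature_id in cache:
            return cache[feature_id]
        d = 0
        for p in parents:
            # Skip unknown parents and guard against cyclic Parent references.
            if p in parents_of and p not in active:
                d = max(d, 1 + depth(p, parents_of[p], active | {p}))
        if feature_id is not None:
            cache[feature_id] = d
        return d

    def key(f):
        seqid, start, fid, parents, order, _ = f
        active = frozenset() if fid is None else frozenset({fid})
        return (seqid, start, depth(fid, parents, active), order)

    return [f[5] for f in sorted(feats, key=key)]
```

Sorting by depth is one simple way to obtain a parent-before-child order among features that share a start position; a full topological sort over the `Parent` graph would be an equally valid choice.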


Uncovering transcriptional dark matter via gene annotation independent single-cell RNA sequencing analysis

Nature Communications ◽

10.1038/s41467-021-22496-3 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Michael F. Z. Wang ◽

Madhav Mantri ◽

Shao-Pei Chou ◽

Gaetano J. Scuderi ◽

David W. McKellar ◽

...

Keyword(s):

Single Cell ◽

Genome Annotation ◽

Gene Annotation ◽

Active Regions ◽

Sequencing Analysis ◽

Biologically Relevant ◽

Mole Rat ◽

Genome Annotations ◽

Cell Expression ◽

High Quality Genome

Abstract Conventional scRNA-seq expression analyses rely on the availability of a high quality genome annotation. Yet, as we show here with scRNA-seq experiments and analyses spanning human, mouse, chicken, mole rat, lemur and sea urchin, genome annotations are often incomplete, in particular for organisms that are not routinely studied. To overcome this hurdle, we created a scRNA-seq analysis routine that recovers biologically relevant transcriptional activity beyond the scope of the best available genome annotation by performing scRNA-seq analysis on any region in the genome for which transcriptional products are detected. Our tool generates a single-cell expression matrix for all transcriptionally active regions (TARs), performs single-cell TAR expression analysis to identify biologically significant TARs, and then annotates TARs using gene homology analysis. This procedure uses single-cell expression analyses as a filter to direct annotation efforts to biologically significant transcripts and thereby uncovers biology to which scRNA-seq would otherwise be in the dark.


MADAP, a flexible clustering tool for the interpretation of one-dimensional genome annotation data

Nucleic Acids Research ◽

10.1093/nar/gkm343 ◽

2007 ◽

Vol 35 (Web Server) ◽

pp. W201-W205 ◽

Cited By ~ 4

Author(s):

C. D. Schmid ◽

T. Sengstag ◽

P. Bucher ◽

M. Delorenzi

Keyword(s):

Genome Annotation ◽

One Dimensional ◽

Annotation Data


The use of genome annotation data and its impact on biological conclusions

Nature Genetics ◽

10.1038/ng1004-1028b ◽

2004 ◽

Vol 36 (10) ◽

pp. 1028-1029 ◽

Cited By ~ 4

Author(s):

Hervé Tettelin ◽

Julian Parkhill

Keyword(s):

Genome Annotation ◽

Annotation Data


Mixing genome annotation methods in a comparative analysis inflates the apparent number of lineage-specific genes

10.1101/2022.01.13.476251 ◽

2022 ◽

Author(s):

Caroline M. Weisman ◽

Andrew M. Murray ◽

Sean R Eddy

Keyword(s):

Comparative Analysis ◽

Case Studies ◽

Dna Sequences ◽

Genome Annotation ◽

Gene Annotation ◽

Genome Sequences ◽

Gene Annotations ◽

Genetic Novelty ◽

The Impact ◽

Lineage Specific Genes

Comparisons of genomes of different species are used to identify lineage-specific genes, those genes that appear unique to one species or clade. Lineage-specific genes are often thought to represent genetic novelty that underlies unique adaptations. Identification of these genes depends not only on genome sequences, but also on inferred gene annotations. Comparative analyses typically use available genomes that have been annotated using different methods, increasing the risk that orthologous DNA sequences may be erroneously annotated as a gene in one species but not another, appearing lineage-specific as a result. To evaluate the impact of such 'annotation heterogeneity', we identified four clades of species with sequenced genomes with more than one publicly available gene annotation, allowing us to compare the number of lineage-specific genes inferred when differing annotation methods are used to those resulting when annotation method is uniform across the clade. In these case studies, annotation heterogeneity increases the apparent number of lineage-specific genes by up to 15-fold, suggesting that annotation heterogeneity is a substantial source of potential artifact.


GAD: a Python script for dividing genome annotation files into feature-based files

10.1101/815860 ◽

2019 ◽

Author(s):

Ahmed Karam ◽

Norhan Yasser

Keyword(s):

Data Analysis ◽

Genome Annotation ◽

Gene Annotation ◽

Untranslated Regions ◽

File Formats ◽

Genome Features ◽

Daily Task ◽

Intergenic Regions ◽

Feature Based ◽

Genomic Data Analysis

Abstract Manipulating and analyzing publicly available genomic datasets has become a daily task in bioinformatics and genomics laboratories. The release of several genome sequencing projects has prompted bioinformaticians to develop automated scripts and pipelines that analyze genomic datasets, in particular gene annotation pipelines. Non-developers need fully featured programs for handling genome annotation files; in addition, genomic data analysis can be accelerated by reducing genome annotation and sequence files to specific features. To extract genome features from GTF or GFF3 files precisely, the GAD script (https://github.com/bio-projects/GAD) provides a simple graphical user interface that can be interpreted by all Python versions installed on different operating systems. The script contains unique entry widgets that can analyze multiple genome sequence and annotation files with a single click. Its functions extract genome features such as upstream genes, downstream genes, intergenic regions, genes, transcripts, exons, introns, coding sequences, five prime untranslated regions, three prime untranslated regions, and other ambiguous Sequence Ontology terms. GAD outputs the results in file formats such as BED, GTF/GFF3 and FASTA, which are supported by other bioinformatics programs. The script could be incorporated into various pipelines in genomics laboratories to accelerate data analysis.
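As a rough illustration of feature-based splitting, the following Python sketch groups GFF3 records by the feature type in column 3 and converts a record to a BED line. It is not taken from the GAD code base: `split_by_feature` and `to_bed` are hypothetical names, and the only format facts relied on are that GFF3 has nine tab-separated columns with 1-based inclusive coordinates, while BED uses 0-based half-open coordinates.

```python
from collections import defaultdict


def split_by_feature(gff_lines):
    """Group GFF3 records (as lists of 9 columns) by feature type (column 3)."""
    groups = defaultdict(list)
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # skip malformed records in this sketch
        groups[cols[2]].append(cols)
    return dict(groups)


def to_bed(cols):
    """Convert one GFF3 record to a BED6 line.

    GFF3 start is 1-based inclusive, BED start is 0-based, so start - 1;
    the end coordinate is numerically unchanged. The feature type is used
    as the BED name and the score is set to 0 as a placeholder.
    """
    seqid, ftype, strand = cols[0], cols[2], cols[6]
    start, end = int(cols[3]), int(cols[4])
    return "\t".join([seqid, str(start - 1), str(end), ftype, "0", strand])
```

Each group returned by `split_by_feature` could then be written to its own file, for example one BED file per feature type.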


Bovine Genome Database: new annotation tools for a new reference genome

Nucleic Acids Research ◽

10.1093/nar/gkz944 ◽

2019 ◽

Cited By ~ 3

Author(s):

Md Shamimuzzaman ◽

Justin J Le Tourneau ◽

Deepak R Unni ◽

Colin M Diesh ◽

Deborah A Triant ◽

...

Keyword(s):

Genome Assembly ◽

Genome Annotation ◽

Reference Genome ◽

Gene Annotation ◽

Bovine Genome ◽

Data Retrieval ◽

Genome Database ◽

Ruminant Species ◽

Search Tool ◽

Efficient Data

Abstract The Bovine Genome Database (BGD) (http://bovinegenome.org) has been the key community bovine genomics database for more than a decade. To accommodate the increasing amount and complexity of bovine genomics data, BGD continues to advance its practices in data acquisition, curation, integration and efficient data retrieval. BGD provides tools for genome browsing (JBrowse), genome annotation (Apollo), data mining (BovineMine) and sequence database searching (BLAST). To augment the BGD genome annotation capabilities, we have developed a new Apollo plug-in, called the Locus-Specific Alternate Assembly (LSAA) tool, which enables users to identify and report potential genome assembly errors and structural variants. BGD now hosts both the newest bovine reference genome assembly, ARS-UCD1.2, as well as the previous reference genome, UMD3.1.1, with cross-genome navigation and queries supported in JBrowse and BovineMine, respectively. Other notable enhancements to BovineMine include the incorporation of genomes and gene annotation datasets for non-bovine ruminant species (goat and sheep), support for multiple assemblies per organism in the Regions Search tool, integration of additional ontologies and development of many new template queries. To better serve the research community, we continue to focus on improving existing tools, developing new tools, adding new datasets and encouraging researchers to use these resources.


Lost and Found: Re-searching and Re-scoring Proteomics Data Aids Genome Annotation and Improves Proteome Coverage

mSystems ◽

10.1128/msystems.00833-20 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Patrick Willems ◽

Igor Fijalkowski ◽

Petra Van Damme

Keyword(s):

Genome Annotation ◽

Deinococcus Radiodurans ◽

Gene Annotation ◽

Bacterial Genome ◽

Prokaryotic Genome ◽

Ribosome Profiling ◽

Great Promise ◽

Data Sets ◽

Proteome Coverage ◽

Content Type

ABSTRACT Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. This data analysis encompasses the searching of cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. When this strategy is applied, an enhanced proteome depth is achieved, as well as greater confidence for unannotated peptide hits. We demonstrate the general applicability of our pipeline by reanalyzing public Deinococcus radiodurans data sets. Taken together, our results show that systematic reanalysis using available prokaryotic (proteome) data sets holds great promise to assist in experimentally based genome annotation. IMPORTANCE Delineation of open reading frames (ORFs) causes persistent inconsistencies in prokaryote genome annotation. We demonstrate that by advanced (re)analysis of omics data, a higher proteome coverage and sensitive detection of unannotated ORFs can be achieved, which can be exploited for conditional bacterial genome (re)annotation, which is especially relevant in view of annotating the wealth of sequenced prokaryotic genomes obtained in recent years.


AnnoGen: annotating genome-wide pragmatic features

Bioinformatics ◽

10.1093/bioinformatics/btaa027 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2899-2901

Author(s):

Quanhu Sheng ◽

Hui Yu ◽

Olufunmilola Oyebamiji ◽

Jiandong Wang ◽

Danqian Chen ◽

...

Keyword(s):

Genome Annotation ◽

Reference Genome ◽

Bioinformatics Analysis ◽

Sequence Information ◽

Genomic Features ◽

Single Base ◽

Annotation Data ◽

Genome Wide ◽

Genomic Regions ◽

First Time

Abstract Motivation: Genome annotation is an important step for all in-depth bioinformatics analysis. It is imperative to augment the quantity and diversity of genome-wide annotation data for the latest reference genome to promote its adoption by ongoing and future impactful studies. Results: We developed a Python toolkit, AnnoGen, which for the first time allows the annotation of three pragmatic genomic features for the GRCh38 genome in enormous base-wise quantities. The three features are chemical binding Energy, sequence information Entropy and Homology Score. The Homology Score is an exceptional feature that captures genome-wide homology through single-base-offset tiling windows of 100 contiguous nucleotide bases. AnnoGen can annotate its proprietary pragmatic features for user-specified genomic regions and optionally compare two parallel sets of genomic regions. AnnoGen offers simple usage modes and a succinct HTML report of informative statistical tables and plots. Availability and implementation: https://github.com/shengqh/annogen.
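The idea of a base-wise feature computed over single-base-offset tiling windows can be sketched in Python as follows. The abstract does not give AnnoGen's formulas, so this example is an assumption throughout: it uses plain Shannon entropy of nucleotide composition as a stand-in for the Entropy feature, borrows the 100-base window size that the abstract mentions for the Homology Score, and the names `shannon_entropy` and `tiling_windows` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol composition of seq.

    Assumption: this is a generic stand-in, not AnnoGen's documented formula.
    """
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def tiling_windows(seq, size=100):
    """Yield (offset, window) pairs, shifting the window one base at a time."""
    for i in range(len(seq) - size + 1):
        yield i, seq[i:i + size]


def windowed_entropy(seq, size=100):
    """Entropy of every single-base-offset window, keyed by 0-based offset."""
    return {i: shannon_entropy(w) for i, w in tiling_windows(seq, size)}
```

A sequence of length n yields n - size + 1 windows, so the output grows linearly with the region being annotated; a sequence shorter than the window yields no windows at all.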
