PuMA: a papillomavirus genome annotation tool

10.1101/736991 ◽

2019 ◽

Author(s):

J. Pace ◽

K. Youens-Clark ◽

C. Freeman ◽

B. Hurwitz ◽

K. Van Doorslaer

Keyword(s):

Genome Annotation ◽

High Throughput Sequencing ◽

Viral Metagenomics ◽

Annotation Tool ◽

General Applicability ◽

Viral Genomes ◽

Link Type ◽

Sequencing Technologies ◽

Reproducible Method ◽

Analytical Approaches

ABSTRACT High-throughput sequencing technologies provide unprecedented power to identify novel viruses from a wide variety of (environmental) samples. The field of ‘viral metagenomics’ has dramatically expanded our understanding of viral diversity. Viral metagenomic approaches imply that many novel viruses will not be described by researchers who are experts on the genomic organization of that virus. There is a need to develop analytical approaches to reconstruct, annotate, and classify viral genomes. We have developed the papillomavirus annotation tool (PuMA) to provide researchers with a convenient and reproducible method to annotate novel papillomaviruses. PuMA provides an accessible method for automated papillomavirus genome annotation. PuMA currently has a 98% accuracy when benchmarked against the 481 reference genomes in the papillomavirus episteme (PaVE). Finally, PuMA was used to annotate 168 newly isolated papillomaviruses, and successfully annotated 1424 viral features. To demonstrate its general applicability, we developed a version of PuMA that can annotate polyomaviruses. PuMA is available on GitHub (https://github.com/KVD-lab/puma) and through the iMicrobe online environment (https://www.imicrobe.us/#/apps/puma).

PuMA: A papillomavirus genome annotation tool

Virus Evolution ◽

10.1093/ve/veaa068 ◽

2020 ◽

Vol 6 (2) ◽

Author(s):

Josh Pace ◽

Ken Youens-Clark ◽

Cordell Freeman ◽

Bonnie Hurwitz ◽

Koenraad Van Doorslaer

Keyword(s):

High Throughput Sequencing ◽

Viral Metagenomics ◽

Annotation Tool ◽

General Applicability ◽

Virus Family ◽

Sequencing Technologies ◽

Preliminary Version ◽

Reproducible Method ◽

Reference Genomes ◽

Viral Annotation

Abstract High-throughput sequencing technologies provide unprecedented power to identify novel viruses from a wide variety of (environmental) samples. The field of ‘viral metagenomics’ has dramatically expanded our understanding of viral diversity. Viral metagenomic approaches imply that many novel viruses will not be described by researchers who are experts on (the genomic organization of) that virus family. We have developed the papillomavirus annotation tool (PuMA) to provide researchers with a convenient and reproducible method to annotate and report novel papillomaviruses. PuMA currently correctly annotates 99% of the papillomavirus genes when benchmarked against the 655 reference genomes in the papillomavirus episteme. Compared to another viral annotation pipeline, PuMA annotates more viral features while being more accurate. To demonstrate its general applicability, we also developed a preliminary version of PuMA that can annotate polyomaviruses. PuMA is available on GitHub (https://github.com/KVD-lab/puma) and through the iMicrobe online environment (https://www.imicrobe.us/#/apps/puma).

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in the studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.

PhageTerm: a Fast and User-friendly Software to Determine Bacteriophage Termini and Packaging Mode using randomly fragmented NGS data

10.1101/108100 ◽

2017 ◽

Cited By ~ 2

Author(s):

Julian Garneau ◽

Florence Depardieu ◽

Louis-Charles Fortier ◽

David Bikard ◽

Marc Monot

Keyword(s):

High Throughput Sequencing ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Link Type ◽

Sequencing Technologies ◽

Statistical Framework ◽

Fastq Format ◽

Viral Particles ◽

User Friendly ◽

Ngs Data

ABSTRACT Bacteriophages are the most abundant viruses on earth and display an impressive genetic as well as morphologic diversity. Among those, the most common order of phages is the Caudovirales, whose viral particles package linear double-stranded DNA (dsDNA). In this study we investigated how the information gathered by high-throughput sequencing technologies can be used to determine the DNA termini and packaging mechanisms of dsDNA phages. The wet-lab procedures traditionally used for this purpose rely on the identification and cloning of restriction fragments, which can be delicate and cumbersome. Here, we developed a theoretical and statistical framework to analyze DNA termini and phage packaging mechanisms using next-generation sequencing data. Our methods, implemented in the PhageTerm software, work with sequencing reads in fastq format and the corresponding assembled phage genome. PhageTerm was validated on a set of phages with well-established packaging mechanisms representative of the termini diversity: 5’cos (lambda), 3’cos (HK97), pac (P1), headful without a pac site (T4), DTR (T7) and host fragment (Mu). In addition, we determined the termini of 9 Clostridium difficile phages and 6 phages whose sequences were retrieved from the sequence read archive (SRA). A direct graphical interface is available as a Galaxy wrapper version at https://galaxy.pasteur.fr and a standalone version is accessible at https://sourceforge.net/projects/phageterm/.

PgRC: Pseudogenome based Read Compressor

10.1101/710822 ◽

2019 ◽

Author(s):

Tomasz Kowalski ◽

Szymon Grabowski

Keyword(s):

High Throughput ◽

Compression Ratio ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Quality ◽

Link Type ◽

Sequencing Technologies ◽

Significant Interest ◽

The One ◽

Shortest Common Superstring

Abstract Motivation: The amount of sequencing data from High-Throughput Sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression. Availability: PgRC can be downloaded from https://github.com/kowallus/[email protected]
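
The phrase "an approximation of the shortest common superstring" can be made concrete with the textbook greedy heuristic, which repeatedly merges the two reads with the largest suffix-prefix overlap. The Python sketch below is only an illustration of that general idea; it is not PgRC's algorithm, and the function names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy approximation of a shortest common superstring (illustrative only)."""
    # Drop exact duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        # Find the ordered pair with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge that pair and put the merged string back in the pool.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


toy_reads = ["ACGTAC", "TACGGA", "GGATTT"]
pseudogenome = greedy_superstring(toy_reads)
print(pseudogenome, len(pseudogenome), sum(len(r) for r in toy_reads))
```

On these three toy reads the 18 input bases collapse into a 12-base superstring that still contains every read, which is the intuition behind storing reads as positions in a shared pseudogenome rather than as independent strings.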

INsPeCT: INtegrative Platform for Cancer Transcriptomics

Cancer Informatics ◽

10.4137/cin.s13630 ◽

2014 ◽

Vol 13 ◽

pp. CIN.S13630 ◽

Cited By ~ 3

Author(s):

Piyush B. Madhamshettiwar ◽

Stefan R. Maetschke ◽

Melissa J. Davis ◽

Antonio Reverter ◽

Mark A. Ragan

Keyword(s):

Cancer Biology ◽

Network Inference ◽

High Throughput Sequencing ◽

Analytical Framework ◽

Data Infrastructure ◽

Sequencing Technologies ◽

Regulatory Module ◽

Highly Correlated ◽

User Friendly ◽

Analytical Approaches

The emergence of transcriptomics, fuelled by high-throughput sequencing technologies, has changed the nature of cancer research and resulted in a massive accumulation of data. Computational analysis, integration, and data visualization are now major bottlenecks in cancer biology and translational research. Although many tools have been brought to bear on these problems, their use remains unnecessarily restricted to computational biologists, as many tools require scripting skills, data infrastructure, and powerful computational facilities. New user-friendly, integrative, and automated analytical approaches are required to make computational methods more generally useful to the research community. Here we present INsPeCT (INtegrative Platform for Cancer Transcriptomics), which allows users with basic computer skills to perform comprehensive in-silico analyses of microarray, ChIP-seq, and RNA-seq data. INsPeCT supports the selection of interesting genes for advanced functional analysis. Included in its automated workflows are (i) a novel analytical framework, RMaNI (regulatory module network inference), which supports the inference of cancer subtype-specific transcriptional module networks and the analysis of modules; and (ii) WGCNA (weighted gene co-expression network analysis), which infers modules of highly correlated genes across microarray samples, associated with sample traits, e.g., survival time. INsPeCT is available free of cost from Bioinformatics Resource Australia-EMBL and can be accessed at http://inspect.braembl.org.au.

Comparative analysis of gene prediction tools for viral genome annotation

10.1101/2021.12.11.472104 ◽

2021 ◽

Author(s):

Enrique González-Tortuero ◽

Revathy Krishnamurthi ◽

Heather E. Allison ◽

Ian B. Goodhead ◽

Chloe E. James

Keyword(s):

Genome Annotation ◽

Viral Genome ◽

High Throughput Sequencing ◽

Rna Viruses ◽

Gene Prediction ◽

Dna Viruses ◽

Reading Frame ◽

Viral Genomes ◽

Acid Type ◽

Sequencing Platforms

The number of newly available viral genomes and metagenomes has increased exponentially since the development of high-throughput sequencing platforms and genome analysis tools. Bioinformatic annotation pipelines are largely based on open reading frame (ORF) calling software, which identifies genes independently of the sequence taxonomical background. Although ORF-calling programs provide a rapid genome annotation, they can misidentify ORFs and start codons; errors that might be perpetuated and propagated over time. This study evaluated the performance of multiple ORF-calling programs for viral genome annotation against the complete RefSeq viral database. Program outputs varied when considering the viral nucleic acid type versus the viral host. According to the number of ORFs, Prodigal and Metaprodigal were the most accurate programs for DNA viruses, while FragGeneScan and Prodigal generated the most accurate outputs for RNA viruses. Similarly, Prodigal outperformed the benchmark for viruses infecting prokaryotes, and GLIMMER and GeneMarkS produced the most accurate annotations for viruses infecting eukaryotes. When the coordinates of the ORFs were considered, Prodigal scored high for all scenarios except for RNA viruses, where GeneMarkS generated the most reliable results. Overall, the quality of the coordinates predicted for RNA viruses was poorer than for DNA viruses, suggesting the need for improved ORF-calling programs to deal with RNA viruses. Moreover, none of the ORF-calling programs reached 90% accuracy for annotation of DNA viruses. Any automatic annotation can still be improved by manual curation, especially when the presence of ORFs is validated with wet-lab experiments. However, our evaluation of the current ORF-calling programs is expected to be useful for the improvement of viral genome annotation pipelines and highlights the need for more expression data to improve the rigor of reference genomes.
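
To illustrate what "ORF calling" means and why start codons are easy to misidentify, here is a deliberately naive forward-strand ORF scanner in Python. It is a sketch under simplifying assumptions (standard stop codons, ATG as the only start, no reverse strand, no statistical model) and does not reproduce Prodigal, GLIMMER, GeneMarkS, FragGeneScan, or any other program named above; `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ATG..stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    The first in-frame ATG is always taken as the start, which is exactly the
    kind of simplification that can produce a wrong start coordinate.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```

In this toy sequence a second ATG begins at position 7 in a different frame but never reaches a stop codon, so only the ORF starting at position 2 is reported. Real ORF callers add codon-usage models, ribosome-binding-site signals, and both strands, which is where their outputs begin to diverge.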

metabolisHMM: Phylogenomic analysis for exploration of microbial phylogenies and metabolic pathways

10.1101/2019.12.20.884627 ◽

2019 ◽

Cited By ~ 1

Author(s):

E.A. McDaniel ◽

K. Anantharaman ◽

K.D. McMahon

Keyword(s):

High Throughput Sequencing ◽

Markov Models ◽

Marker Gene ◽

Phylogenomic Analysis ◽

Metagenomic Sequencing ◽

Metabolic Characteristics ◽

Link Type ◽

Sequencing Technologies ◽

Single Marker ◽

User Friendly

Abstract Summary: Advances in high-throughput sequencing technologies and bioinformatic pipelines have exponentially increased the amount of data that can be obtained from uncultivated microbial lineages inhabiting diverse ecosystems. Various annotation tools and databases currently exist for predicting the functional potential of sequenced genomes or microbial communities based upon sequence identity. However, intuitive, reproducible, and user-friendly tools for further exploring and visualizing functional guilds of microbial community metagenomic sequencing datasets remain lacking. Here, we present metabolisHMM, a series of workflows for visualizing the distribution of curated and user-provided Hidden Markov Models (HMMs) to understand metabolic characteristics and evolutionary histories of microbial lineages. metabolisHMM performs functional annotations with a set of curated or user-defined HMMs to 1) construct ribosomal protein and single marker gene phylogenies, 2) summarize the presence/absence of metabolic pathway markers, and 3) create heatmap visualizations of presence/absence summaries. Availability and Implementation: metabolisHMM is freely available on GitHub at https://github.com/elizabethmcd/metabolisHMM and on PyPI at https://pypi.org/project/metabolisHMM/ under the GNU General Public License v3.0.

Standards for Sequencing Viral Genomes in the Era of High-Throughput Sequencing

mBio ◽

10.1128/mbio.01360-14 ◽

2014 ◽

Vol 5 (3) ◽

Cited By ~ 59

Author(s):

Jason T. Ladner ◽

Brett Beitzel ◽

Patrick S. G. Chain ◽

Matthew G. Davenport ◽

Eric Donaldson ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Cost Benefit ◽

Genome Sequences ◽

Common Component ◽

Viral Genomes ◽

Genome Finishing ◽

Sequencing Technologies ◽

Trade Offs ◽

Sequencing Platforms

ABSTRACT Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. Here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.

VIGA: a sensitive, precise and automatic de novo VIral Genome Annotator

10.1101/277509 ◽

2018 ◽

Cited By ~ 8

Author(s):

Enrique González-Tortuero ◽

Thomas David Sean Sutton ◽

Vimalkumar Velayudhan ◽

Andrey Nikolaevich Shkoporov ◽

Lorraine Anne Draper ◽

...

Keyword(s):

Genome Annotation ◽

Viral Genome ◽

De Novo ◽

Lower Number ◽

Viral Metagenomics ◽

Viral Genomes ◽

Genomic Studies ◽

Bioinformatic Approaches ◽

Viral Sequences

Abstract Viral (meta)genomics is a rapidly growing field of study that is hampered by an inability to annotate the majority of viral sequences; therefore, the development of new bioinformatic approaches is very important. Here, we present a new automatic de novo genome annotation pipeline, called VIGA, to annotate prokaryotic and eukaryotic viral sequences from (meta)genomic studies. VIGA was benchmarked on a database of known viral genomes and a viral metagenomics case study. VIGA generated the most accurate outputs according to the number of coding sequences and their coordinates; its outputs also had a lower number of non-informative annotations compared to other programs.

Plant Virus Vectors 3.0: Transitioning into Synthetic Genomics

Annual Review of Phytopathology ◽

10.1146/annurev-phyto-082718-100301 ◽

2019 ◽

Vol 57 (1) ◽

pp. 211-230 ◽

Cited By ~ 17

Author(s):

Will B. Cody ◽

Herman B. Scholthof

Keyword(s):

Reverse Genetics ◽

High Throughput Sequencing ◽

Viral Vector ◽

Heterologous Gene Expression ◽

Plant Viruses ◽

Expression Vectors ◽

Virus Induced Gene Silencing ◽

Viral Genomes ◽

Sequencing Technologies ◽

Synthetic Genomics

Plant viruses were first implemented as heterologous gene expression vectors more than three decades ago. Since then, the methodology for their use has varied, but we propose it was the merging of technologies with virology tools, which occurred in three defined steps discussed here, that has driven viral vector applications to date. The first was the advent of molecular biology and reverse genetics, which enabled the cloning and manipulation of viral genomes to express genes of interest (vectors 1.0). The second stems from the discovery of RNA silencing and the development of high-throughput sequencing technologies that allowed the convenient and widespread use of virus-induced gene silencing (vectors 2.0). Here, we briefly review the events that led to these applications, but this treatise mainly concentrates on the emerging versatility of gene-editing tools, which has enabled the emergence of virus-delivered genetic queries for functional genomics and virology (vectors 3.0).
