PuMA: a papillomavirus genome annotation tool

2019 ◽  
Author(s):  
J. Pace ◽  
K. Youens-Clark ◽  
C. Freeman ◽  
B. Hurwitz ◽  
K. Van Doorslaer

ABSTRACT High-throughput sequencing technologies provide unprecedented power to identify novel viruses from a wide variety of (environmental) samples. The field of ‘viral metagenomics’ has dramatically expanded our understanding of viral diversity. Viral metagenomic approaches imply that many novel viruses will not be described by researchers who are experts on the genomic organization of that virus. There is a need to develop analytical approaches to reconstruct, annotate, and classify viral genomes. We have developed the papillomavirus annotation tool (PuMA) to provide researchers with a convenient and reproducible method to annotate novel papillomaviruses. PuMA provides an accessible method for automated papillomavirus genome annotation. PuMA currently has a 98% accuracy when benchmarked against the 481 reference genomes in the papillomavirus episteme (PaVE). Finally, PuMA was used to annotate 168 newly isolated papillomaviruses, and successfully annotated 1424 viral features. To demonstrate its general applicability, we developed a version of PuMA that can annotate polyomaviruses. PuMA is available on GitHub (https://github.com/KVD-lab/puma) and through the iMicrobe online environment (https://www.imicrobe.us/#/apps/puma).

2020 ◽  
Vol 6 (2) ◽  
Author(s):  
Josh Pace ◽  
Ken Youens-Clark ◽  
Cordell Freeman ◽  
Bonnie Hurwitz ◽  
Koenraad Van Doorslaer

Abstract High-throughput sequencing technologies provide unprecedented power to identify novel viruses from a wide variety of (environmental) samples. The field of ‘viral metagenomics’ has dramatically expanded our understanding of viral diversity. Viral metagenomic approaches imply that many novel viruses will not be described by researchers who are experts on (the genomic organization of) that virus family. We have developed the papillomavirus annotation tool (PuMA) to provide researchers with a convenient and reproducible method to annotate and report novel papillomaviruses. PuMA currently correctly annotates 99% of the papillomavirus genes when benchmarked against the 655 reference genomes in the papillomavirus episteme. Compared to another viral annotation pipeline, PuMA annotates more viral features while being more accurate. To demonstrate its general applicability, we also developed a preliminary version of PuMA that can annotate polyomaviruses. PuMA is available on GitHub (https://github.com/KVD-lab/puma) and through the iMicrobe online environment (https://www.imicrobe.us/#/apps/puma).


Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, naturally much less being represented in the genomic databases. High-throughput sequencing technologies develop rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in the studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.


2017 ◽  
Author(s):  
Julian Garneau ◽  
Florence Depardieu ◽  
Louis-Charles Fortier ◽  
David Bikard ◽  
Marc Monot

ABSTRACT Bacteriophages are the most abundant viruses on earth and display an impressive genetic as well as morphologic diversity. Among those, the most common order of phages is the Caudovirales, whose viral particles package linear double-stranded DNA (dsDNA). In this study we investigated how the information gathered by high-throughput sequencing technologies can be used to determine the DNA termini and packaging mechanisms of dsDNA phages. The wet-lab procedures traditionally used for this purpose rely on the identification and cloning of restriction fragments, which can be delicate and cumbersome. Here, we developed a theoretical and statistical framework to analyze DNA termini and phage packaging mechanisms using next-generation sequencing data. Our methods, implemented in the PhageTerm software, work with sequencing reads in fastq format and the corresponding assembled phage genome. PhageTerm was validated on a set of phages with well-established packaging mechanisms representative of the termini diversity: 5’cos (lambda), 3’cos (HK97), pac (P1), headful without a pac site (T4), DTR (T7) and host fragment (Mu). In addition, we determined the termini of 9 Clostridium difficile phages and 6 phages whose sequences were retrieved from the sequence read archive (SRA). A direct graphical interface is available as a Galaxy wrapper version at https://galaxy.pasteur.fr and a standalone version is accessible at https://sourceforge.net/projects/phageterm/.


2019 ◽  
Author(s):  
Tomasz Kowalski ◽  
Szymon Grabowski

Abstract Motivation: The amount of sequencing data from High-Throughput Sequencing technologies grows at a pace exceeding the one predicted by Moore’s law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. Results: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression. Availability: PgRC can be downloaded from https://github.com/kowallus/[email protected]


2014 ◽  
Vol 13 ◽  
pp. CIN.S13630 ◽  
Author(s):  
Piyush B. Madhamshettiwar ◽  
Stefan R. Maetschke ◽  
Melissa J. Davis ◽  
Antonio Reverter ◽  
Mark A. Ragan

The emergence of transcriptomics, fuelled by high-throughput sequencing technologies, has changed the nature of cancer research and resulted in a massive accumulation of data. Computational analysis, integration, and data visualization are now major bottlenecks in cancer biology and translational research. Although many tools have been brought to bear on these problems, their use remains unnecessarily restricted to computational biologists, as many tools require scripting skills, data infrastructure, and powerful computational facilities. New user-friendly, integrative, and automated analytical approaches are required to make computational methods more generally useful to the research community. Here we present INsPeCT (INtegrative Platform for Cancer Transcriptomics), which allows users with basic computer skills to perform comprehensive in-silico analyses of microarray, ChIP-seq, and RNA-seq data. INsPeCT supports the selection of interesting genes for advanced functional analysis. Included in its automated workflows are (i) a novel analytical framework, RMaNI (regulatory module network inference), which supports the inference of cancer subtype-specific transcriptional module networks and the analysis of modules; and (ii) WGCNA (weighted gene co-expression network analysis), which infers modules of highly correlated genes across microarray samples, associated with sample traits, e.g., survival time. INsPeCT is available free of cost from Bioinformatics Resource Australia-EMBL and can be accessed at http://inspect.braembl.org.au.


2021 ◽  
Author(s):  
Enrique González-Tortuero ◽  
Revathy Krishnamurthi ◽  
Heather E. Allison ◽  
Ian B. Goodhead ◽  
Chloe E. James

The number of newly available viral genomes and metagenomes has increased exponentially since the development of high-throughput sequencing platforms and genome analysis tools. Bioinformatic annotation pipelines are largely based on open reading frame (ORF) calling software, which identifies genes independently of the sequence taxonomical background. Although ORF-calling programs provide a rapid genome annotation, they can misidentify ORFs and start codons; errors that might be perpetuated and propagated over time. This study evaluated the performance of multiple ORF-calling programs for viral genome annotation against the complete RefSeq viral database. Program outputs varied when considering the viral nucleic acid type versus the viral host. According to the number of ORFs, Prodigal and Metaprodigal were the most accurate programs for DNA viruses, while FragGeneScan and Prodigal generated the most accurate outputs for RNA viruses. Similarly, Prodigal outperformed the benchmark for viruses infecting prokaryotes, and GLIMMER and GeneMarkS produced the most accurate annotations for viruses infecting eukaryotes. When the coordinates of the ORFs were considered, Prodigal scored high for all scenarios except for RNA viruses, where GeneMarkS generated the most reliable results. Overall, the quality of the coordinates predicted for RNA viruses was poorer than for DNA viruses, suggesting the need for improved ORF-calling programs to deal with RNA viruses. Moreover, none of the ORF-calling programs reached 90% accuracy for annotation of DNA viruses. Any automatic annotation can still be improved by manual curation, especially when the presence of ORFs is validated with wet-lab experiments. However, our evaluation of the current ORF-calling programs is expected to be useful for the improvement of viral genome annotation pipelines and highlights the need for more expression data to improve the rigor of reference genomes.
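
As a rough illustration of why ORF callers can disagree on start codons, the following is a minimal sketch of a naive forward-strand ORF scan. It is not how Prodigal, GeneMarkS, GLIMMER, or FragGeneScan work; the function name `find_orfs`, the `min_codons` parameter, and the choice of the standard stop codons with ATG as the only start are assumptions made for this example.

```python
# Naive forward-strand ORF scan (illustrative assumption, not any published tool's method).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs, 0-based and end-exclusive.

    For each in-frame stop codon, only the most upstream ATG is reported.
    This is one common convention and can pick the wrong start when a
    downstream in-frame ATG is the true initiation site.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # remember the first ATG seen in this frame
            elif start is not None and codon in STOPS:
                # Length check counts the stop codon as part of the ORF.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume the search after this stop
    return sorted(orfs)


if __name__ == "__main__":
    # Two in-frame ATGs (positions 2 and 8) precede the same TAG stop.
    print(find_orfs("CCATGAAAATGCCCTAGGG"))
```

In the example sequence, the scan reports a single ORF beginning at the upstream ATG, although a caller using different evidence (for instance, a ribosome-binding-site model) could reasonably choose the downstream one. That ambiguity is one plausible source of the coordinate disagreements the abstract describes.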


Author(s):  
E.A. McDaniel ◽  
K. Anantharaman ◽  
K.D. McMahon

Abstract Summary: Advances in high-throughput sequencing technologies and bioinformatic pipelines have exponentially increased the amount of data that can be obtained from uncultivated microbial lineages inhabiting diverse ecosystems. Various annotation tools and databases currently exist for predicting the functional potential of sequenced genomes or microbial communities based upon sequence identity. However, intuitive, reproducible, and user-friendly tools for further exploring and visualizing functional guilds of microbial community metagenomic sequencing datasets remain lacking. Here, we present metabolisHMM, a series of workflows for visualizing the distribution of curated and user-provided Hidden Markov Models (HMMs) to understand metabolic characteristics and evolutionary histories of microbial lineages. metabolisHMM performs functional annotations with a set of curated or user-defined HMMs to 1) construct ribosomal protein and single marker gene phylogenies, 2) summarize the presence/absence of metabolic pathway markers, and 3) create heatmap visualizations of presence/absence summaries. Availability and Implementation: metabolisHMM is freely available on GitHub at https://github.com/elizabethmcd/metabolisHMM and on PyPI at https://pypi.org/project/metabolisHMM/ under the GNU General Public License v3.0.


mBio ◽  
2014 ◽  
Vol 5 (3) ◽  
Author(s):  
Jason T. Ladner ◽  
Brett Beitzel ◽  
Patrick S. G. Chain ◽  
Matthew G. Davenport ◽  
Eric Donaldson ◽  
...  

ABSTRACT Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. Here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.


2018 ◽  
Author(s):  
Enrique González-Tortuero ◽  
Thomas David Sean Sutton ◽  
Vimalkumar Velayudhan ◽  
Andrey Nikolaevich Shkoporov ◽  
Lorraine Anne Draper ◽  
...  

Abstract Viral (meta)genomics is a rapidly growing field of study that is hampered by an inability to annotate the majority of viral sequences; therefore, the development of new bioinformatic approaches is very important. Here, we present a new automatic de novo genome annotation pipeline, called VIGA, to annotate prokaryotic and eukaryotic viral sequences from (meta)genomic studies. VIGA was benchmarked on a database of known viral genomes and a viral metagenomics case study. VIGA generated the most accurate outputs according to the number of coding sequences and their coordinates; its outputs also had a lower number of non-informative annotations compared to other programs.


2019 ◽  
Vol 57 (1) ◽  
pp. 211-230 ◽  
Author(s):  
Will B. Cody ◽  
Herman B. Scholthof

Plant viruses were first implemented as heterologous gene expression vectors more than three decades ago. Since then, the methodology for their use has varied, but we propose it was the merging of technologies with virology tools, which occurred in three defined steps discussed here, that has driven viral vector applications to date. The first was the advent of molecular biology and reverse genetics, which enabled the cloning and manipulation of viral genomes to express genes of interest (vectors 1.0). The second stems from the discovery of RNA silencing and the development of high-throughput sequencing technologies that allowed the convenient and widespread use of virus-induced gene silencing (vectors 2.0). Here, we briefly review the events that led to these applications, but this treatise mainly concentrates on the emerging versatility of gene-editing tools, which has enabled the emergence of virus-delivered genetic queries for functional genomics and virology (vectors 3.0).

