Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale

Mapping Intimacies ◽

10.1101/146514 ◽

2017 ◽

Author(s):

Alex V. Kotlar ◽

Cristina E. Trevino ◽

Michael E. Zwick ◽

David J. Cutler ◽

Thomas S. Wingo

Keyword(s):

Natural Language ◽

Search Engine ◽

Processing Time ◽

General Purpose ◽

Whole Genome ◽

Command Line ◽

Variant Annotation ◽

Link Type ◽

Key Innovation ◽

Genome Scale

AbstractAccurately selecting relevant alleles in large sequencing experiments remains technically challenging. Bystro (https://bystro.io/) is the first online, cloud-based application that makes variant annotation and filtering accessible to all researchers for terabyte-sized whole-genome experiments containing thousands of samples. Its key innovation is a general-purpose, natural-language search engine that enables users to identify and export alleles and samples of interest in milliseconds. The search engine dramatically simplifies complex filtering tasks that previously required programming experience or specialty command-line programs. Critically, Bystro’s annotation and filtering capabilities are orders of magnitude faster than previous solutions, saving weeks of processing time for large experiments.

Download Full-text

Bystro: rapid online variant annotation and natural-language filtering at whole-genome scale

Genome Biology ◽

10.1186/s13059-018-1387-3 ◽

2018 ◽

Vol 19 (1) ◽

Cited By ~ 13

Author(s):

Alex V. Kotlar ◽

Cristina E. Trevino ◽

Michael E. Zwick ◽

David J. Cutler ◽

Thomas S. Wingo

Keyword(s):

Natural Language ◽

Whole Genome ◽

Variant Annotation ◽

Genome Scale

Download Full-text

Vapur: A Search Engine to Find Related Protein - Compound Pairs in COVID-19 Literature

10.1101/2020.09.05.284224 ◽

2020 ◽

Cited By ~ 1

Author(s):

Abdullatif Köksal ◽

Hilal Dönmez ◽

Rıza Özçelik ◽

Elif Ozkirimli ◽

Arzucan Özgür

Keyword(s):

User Interface ◽

Search Engine ◽

Current Literature ◽

Search Engines ◽

General Purpose ◽

Inverted Index ◽

Related Protein ◽

Domain Specific ◽

Link Type

AbstractCoronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an intense scientific effort from different domains. The resulting publications created a huge text collection in which finding the studies related to a biomolecule of interest is challenging for general purpose search engines because the publications are rich in domain specific terminology. Here, we present Vapur: an online COVID-19 search engine specifically designed to find related protein - chemical pairs. Vapur is empowered with a relation-oriented inverted index that is able to retrieve and group studies for a query biomolecule with respect to its related entities. The inverted index of Vapur is automatically created with a BioNLP pipeline and integrated with an online user interface. The online interface is designed for the smooth traversal of the current literature by domain researchers and is publicly available at https://tabilab.cmpe.boun.edu.tr/vapur/.

Download Full-text

Rapid whole genome sequence typing reveals multiple waves of SARS-CoV-2 spread

10.1101/2020.06.08.139055 ◽

2020 ◽

Cited By ~ 1

Author(s):

Ahmed M. Moustafa ◽

Paul J. Planet

Keyword(s):

Genome Sequence ◽

Whole Genome Sequence ◽

Whole Genome ◽

Command Line ◽

Genomic Information ◽

Typing Scheme ◽

Link Type ◽

Geographic Spread ◽

Sequence Types ◽

Geographical Locations

AbstractAs the pandemic SARS-CoV-2 virus has spread globally its genome has diversified to an extent that distinct clones can now be recognized, tracked, and traced. Identifying clonal groups allows for assessment of geographic spread, transmission events, and identification of new or emerging strains that may be more virulent or more transmissible. Here we present a rapid, whole genome, allele-based method (GNUVID) for assigning sequence types to sequenced isolates of SARS-CoV-2 sequences. This sequence typing scheme can be updated with new genomic information extremely rapidly, making our technique continually adaptable as databases grow. We show that our method is consistent with phylogeny and recovers waves of expansion and replacement of sequence types/clonal complexes in different geographical locations.GNUVID is available as a command line application (https://github.com/ahmedmagds/GNUVID).

Download Full-text

GCViT: a method for interactive, genome-wide visualization of resequencing and SNP array data

BMC Genomics ◽

10.1186/s12864-020-07217-2 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Andrew P. Wilkey ◽

Anne V. Brown ◽

Steven B. Cannon ◽

Ethalinda K. S. Cannon

Keyword(s):

Snp Array ◽

Whole Genome ◽

Array Data ◽

Link Type ◽

Snp Data ◽

Genome Wide ◽

Genome Visualization ◽

Data Points ◽

Genomic Regions ◽

Genome Scale

Abstract Background Large genotyping datasets have become commonplace due to efficient, cheap methods for SNP identification. Typical genotyping datasets may have thousands to millions of data points per accession, across tens to thousands of accessions. There is a need for tools to help rapidly explore such datasets, to assess characteristics such as overall differences between accessions and regional anomalies across the genome. Results We present GCViT (Genotype Comparison Visualization Tool), for visualizing and exploring large genotyping datasets. GCViT can be used to identify introgressions, conserved or divergent genomic regions, pedigrees, and other features for more detailed exploration. The program can be used online or as a local instance for whole genome visualization of resequencing or SNP array data. The program performs comparisons of variants among user-selected accessions to identify allele differences and similarities between accessions and a user-selected reference, providing visualizations through histogram, heatmap, or haplotype views. The resulting analyses and images can be exported in various formats. Conclusions GCViT provides methods for interactively visualizing SNP data on a whole genome scale, and can produce publication-ready figures. It can be used in online or local installations. GCViT enables users to confirm or identify genomics regions of interest associated with particular traits. GCViT is freely available at https://github.com/LegumeFederation/gcvit. The 1.0 version described here is available at 10.5281/zenodo.4008713.

Download Full-text

Easy phylotyping of Escherichia coli via the EzClermont web app and command-line tool

Access Microbiology ◽

10.1099/acmi.0.000143 ◽

2020 ◽

Vol 2 (9) ◽

Cited By ~ 2

Author(s):

Nicholas R. Waters ◽

Florence Abram ◽

Fiona Brennan ◽

Ashleigh Holmes ◽

Leighton Pritchard

Keyword(s):

Escherichia Coli ◽

Type Species ◽

Whole Genome ◽

Command Line ◽

Content Type ◽

Link Type ◽

Command Line Tool ◽

Pcr Method ◽

Web App ◽

Genome Assemblies

The Clermont PCR method for phylotyping Escherichia coli remains a useful classification scheme even though genome sequencing is now routine, and higher-resolution sequence typing schemes are now available. Relating present-day whole-genome E. coli classifications to legacy phylotyping is essential for harmonizing the historical literature and understanding of this important organism. Therefore, we present EzClermont – a novel in silico Clermont PCR phylotyping tool to enable ready application of this phylotyping scheme to whole-genome assemblies. We evaluate this tool against phylogenomic classifications, and an alternative software implementation of Clermont typing. EzClermont is available as a web app at www.ezclermont.org, and as a command-line tool at https://nickp60.github.io/EzClermont/.

Download Full-text

BuddySuite: Command-line toolkits for manipulating sequences, alignments, and phylogenetic trees

10.1101/040675 ◽

2016 ◽

Author(s):

Stephen R. Bond ◽

Karl E. Keat ◽

Sofia N. Barreira ◽

Andreas D. Baxevanis

Keyword(s):

Sequence Alignment ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

General Purpose ◽

Command Line ◽

Link Type ◽

File Formats ◽

Downstream Analysis ◽

Python Package ◽

Common Sequence

AbstractThe ability to manipulate sequence, alignment, and phylogenetic tree files has become an increasingly important skill in the life sciences, whether to generate summary information or to prepare data for further downstream analysis. The command line can be an extremely powerful environment for interacting with these resources, but only if the user has the appropriate general-purpose tools on hand. BuddySuite is a collection of four independent yet interrelated command-line toolkits that facilitate each step in the workflow of sequence discovery, curation, alignment, and phylogenetic reconstruction. Most common sequence, alignment, and tree file formats are automatically detected and parsed, and over 100 tools have been implemented for manipulating these data. The project has been engineered to easily accommodate the addition of new tools, it is written in the popular programming language Python, and is hosted on the Python Package Index and GitHub to maximize accessibility. Documentation for each BuddySuite tool, including usage examples, is available at http://tiny.cc/buddysuite_wiki. All software is open source and freely available through http://research.nhgri.nih.gov/software/BuddySuite.

Download Full-text

Natural-Language-based Generation of Search Results in a Meaning-based Multilingual Search Engine

IETE Technical Review ◽

10.1080/02564602.2006.11657960 ◽

2006 ◽

Vol 23 (5) ◽

pp. 313-319

Author(s):

Yogesh P Awate ◽

Jagger Bodas ◽

Sachin Deshpande ◽

Pushpak Bhattacharyya

Keyword(s):

Natural Language ◽

Search Engine ◽

Search Results

Download Full-text

chewBBACA: A complete suite for gene-by-gene schema creation and strain identification

10.1101/173146 ◽

2017 ◽

Cited By ~ 5

Author(s):

Mickael Silva ◽

Miguel Machado ◽

Diogo N. Silva ◽

Mirko Rossi ◽

Jacob Moran-Gilad ◽

...

Keyword(s):

Open Source ◽

Core Genome ◽

Bacterial Species ◽

Outbreak Detection ◽

Strain Identification ◽

List Type ◽

Whole Genome ◽

Link Type ◽

The Creation ◽

Allele Calling

ABSTRACTGene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. The software can run in a laptop or in high performance clusters making it useful for both small laboratories and large reference centers. ChewBBACA is available athttps://github.com/B-UMMI/chewBBACAor as a docker image athttps://hub.docker.com/r/ummidock/chewbbaca/.DATA SUMMARYAssembled genomes used for the tutorial were downloaded from NCBI in August 2016 by selecting those submitted asStreptococcus agalactiaetaxon or sub-taxa. All the assemblies have been deposited as a zip file in FigShare (https://figshare.com/s/9cbe1d422805db54cd52), where a file with the original ftp link for each NCBI directory is also available.Code for the chewBBACA suite is available athttps://github.com/B-UMMI/chewBBACAwhile the tutorial example is found athttps://github.com/B-UMMI/chewBBACA_tutorial.I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ⊠IMPACT STATEMENTThe chewBBACA software offers a computational solution for the creation, evaluation and use of whole genome (wg) and core genome (cg) multilocus sequence typing (MLST) schemas. It allows researchers to develop wg/cgMLST schemes for any bacterial species from a set of genomes of interest. The alleles identified by chewBBACA correspond to potential coding sequences, possibly offering insights into the correspondence between the genetic variability identified and phenotypic variability. The software performs allele calling in a matter of seconds to minutes per strain in a laptop but is easily scalable for the analysis of large datasets of hundreds of thousands of strains using multiprocessing options. The chewBBACA software thus provides an efficient and freely available open source solution for gene-by-gene methods. Moreover, the ability to perform these tasks locally is desirable when the submission of raw data to a central repository or web services is hindered by data protection policies or ethical or legal concerns.

Download Full-text

idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽

2020 ◽

Author(s):

Xun Zhu ◽

Ti-Cheng Chang ◽

Richard Webby ◽

Gang Wu

Keyword(s):

Personal Computer ◽

Source Code ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Public Dataset ◽

Virus Isolates

AbstractidCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV can make equivalent calls as annotated by Nextstrain.org on all three common clade systems using user uploaded FastQ files directly. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including personal computer, HPC and the cloud. The source code is available at https://github.com/xz-stjude/idcov. A documentation for installation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.

Download Full-text

Microbiome Search Engine 2: a Platform for Taxonomic and Functional Search of Global Microbiomes on the Whole-Microbiome Level

mSystems ◽

10.1128/msystems.00943-20 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Gongchao Jing ◽

Lu Liu ◽

Zengbin Wang ◽

Yufeng Zhang ◽

Li Qian ◽

...

Keyword(s):

Big Data ◽

User Interface ◽

Search Engine ◽

Functional Similarity ◽

Metagenomic Data ◽

Data Sets ◽

Data Space ◽

Link Type ◽

Database Platform ◽

Microbiome Data

ABSTRACT Metagenomic data sets from diverse environments have been growing rapidly. To ensure accessibility and reusability, tools that quickly and informatively correlate new microbiomes with existing ones are in demand. Here, we introduce Microbiome Search Engine 2 (MSE 2), a microbiome database platform for searching query microbiomes in the global metagenome data space based on the taxonomic or functional similarity of a whole microbiome to those in the database. MSE 2 consists of (i) a well-organized and regularly updated microbiome database that currently contains over 250,000 metagenomic shotgun and 16S rRNA gene amplicon samples associated with unified metadata collected from 798 studies, (ii) an enhanced search engine that enables real-time and fast (<0.5 s per query) searches against the entire database for best-matched microbiomes using overall taxonomic or functional profiles, and (iii) a Web-based graphical user interface for user-friendly searching, data browsing, and tutoring. MSE 2 is freely accessible via http://mse.ac.cn. For standalone searches of customized microbiome databases, the kernel of the MSE 2 search engine is provided at GitHub (https://github.com/qibebt-bioinfo/meta-storms). IMPORTANCE A search-based strategy is useful for large-scale mining of microbiome data sets, such as a bird’s-eye view of the microbiome data space and disease diagnosis via microbiome big data. Here, we introduce Microbiome Search Engine 2 (MSE 2), a microbiome database platform for searching query microbiomes against the existing microbiome data sets on the basis of their similarity in taxonomic structure or functional profile. Key improvements include database extension, data compatibility, a search engine kernel, and a user interface. The new ability to search the microbiome space via functional similarity greatly expands the scope of search-based mining of the microbiome big data.

Download Full-text