Alfred: interactive multi-sample BAM alignment statistics, feature counting and feature annotation for long- and short-read sequencing

2018 ◽  
Vol 35 (14) ◽  
pp. 2489-2491 ◽  
Author(s):  
Tobias Rausch ◽  
Markus Hsi-Yang Fritz ◽  
Jan O Korbel ◽  
Vladimir Benes

Abstract
Summary: Harmonizing quality control (QC) of large-scale second- and third-generation sequencing datasets is key to enabling downstream computational and biological analyses. We present Alfred, an efficient and versatile command-line application that computes multi-sample QC metrics in a read-group-aware manner across a wide variety of sequencing assays and technologies. In addition to standard QC metrics such as GC bias, base composition, insert size and sequencing coverage distributions, it supports haplotype-aware and allele-specific feature counting and feature annotation. The versatility of Alfred allows for easy pipeline integration in high-throughput settings, including DNA sequencing facilities and large-scale research initiatives, enabling continuous monitoring of sequence data quality and characteristics across samples. Alfred supports haplo-tagging of BAM/CRAM files to conduct haplotype-resolved analyses in conjunction with a variety of next-generation sequencing based assays. Alfred's companion web application enables interactive exploration of results and comparison to public datasets.
Availability and implementation: Alfred is open-source and freely available at https://tobiasrausch.com/alfred/.
Supplementary information: Supplementary data are available at Bioinformatics online.
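To make the notion of a "read-group-aware" QC metric concrete, here is a minimal sketch of one such statistic, mean GC content stratified by read group. This is not Alfred's code (Alfred is a C++ tool operating on BAM/CRAM records); reads are plain tuples here so the sketch needs no BAM parser.

```python
# Hedged sketch: per-read-group GC-content summary, loosely in the spirit of
# Alfred's multi-sample QC metrics. Reads are (read_group, sequence) tuples,
# not BAM records, so no alignment-file dependency is needed.
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G/C bases in a read sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def per_read_group_gc(reads):
    """reads: iterable of (read_group, sequence); returns mean GC per read group."""
    totals = defaultdict(lambda: [0.0, 0])
    for rg, seq in reads:
        totals[rg][0] += gc_fraction(seq)
        totals[rg][1] += 1
    return {rg: s / n for rg, (s, n) in totals.items()}

reads = [("rg1", "ACGT"), ("rg1", "GGCC"), ("rg2", "AATT")]
print(per_read_group_gc(reads))  # {'rg1': 0.75, 'rg2': 0.0}
```

The same accumulate-then-summarize pattern extends to insert-size and coverage distributions by swapping the per-read statistic.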

Author(s):  
Ting-Hsuan Wang ◽  
Cheng-Ching Huang ◽  
Jui-Hung Hung

Abstract
Motivation: Cross-sample comparisons and large-scale meta-analyses based on next-generation sequencing (NGS) require replicable and universal data preprocessing, including the removal of adapter fragments from contaminated reads (i.e. adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, but these are sometimes unavailable or falsely documented in repositories (such as GEO or SRA), so large-scale meta-analyses are jeopardized by suboptimal adapter trimming.
Results: Here we introduce a set of fast and accurate adapter detection and trimming algorithms that require no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading to accelerate processing. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of the adapter sequences, reaches accuracy comparable to, and throughput higher than, existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of large batches of datasets and can be incorporated into sequence analysis pipelines at any scale.
Availability and implementation: EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS.
Supplementary information: Supplementary data are available at Bioinformatics online.
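The key idea, detecting an adapter with no a priori sequence, can be illustrated with a deliberately naive vote: the adapter, being ligated to every fragment, dominates the reads' 3' ends. This toy sketch is not the EARRINGS algorithm (which uses more sophisticated, SIMD-accelerated detection); it only conveys the "no candidate sequence needed" principle.

```python
# Hedged sketch of a priori-free adapter detection: vote for the most frequent
# k-mer at the reads' 3' ends, then trim each read from its first occurrence.
from collections import Counter

def infer_adapter(reads, k=6):
    """Guess the adapter prefix as the most common 3'-terminal k-mer."""
    votes = Counter(r[-k:] for r in reads if len(r) >= k)
    return votes.most_common(1)[0][0]

def trim(read, adapter):
    """Cut the read at the first adapter occurrence, if any."""
    i = read.find(adapter)
    return read[:i] if i != -1 else read

reads = ["ACGTACGTAGATCG", "TTTTGGGGAGATCG", "CCCCAAAAAGATCG"]
adapter = infer_adapter(reads)            # "AGATCG" wins the 3'-end vote
print([trim(r, adapter) for r in reads])  # ['ACGTACGT', 'TTTTGGGG', 'CCCCAAAA']
```

A real trimmer must additionally handle partial adapter occurrences, sequencing errors and paired-end overlap, which is where the paper's algorithms do the heavy lifting.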


2020 ◽  
Vol 79 (2) ◽  
pp. 105-113
Author(s):  
Abdul Bari Muneera Parveen ◽  
Divya Lakshmanan ◽  
Modhumita Ghosh Dasgupta

The advent of next-generation sequencing (NGS) has facilitated large-scale discovery and mapping of genomic variants for high-throughput genotyping. Several research groups working on tree species presently employ NGS platforms for marker discovery, since this is a cost-effective and time-saving strategy. However, most trees lack a chromosome-level genome map, and validation of variants for downstream applications therefore becomes obligatory. The cost associated with identifying potential variants from the enormous amount of sequence data is a major limitation. In the present study, high-resolution melting (HRM) analysis was optimized for rapid validation of single nucleotide polymorphisms (SNPs), insertions or deletions (InDels) and simple sequence repeats (SSRs) predicted from exome sequencing of parents and hybrids of Eucalyptus tereticornis Sm. × Eucalyptus grandis Hill ex Maiden generated from controlled hybridization. The cost per data point was less than 0.5 USD, providing great flexibility in terms of cost and sensitivity when compared to other validation methods. The sensitivity of this technology in variant detection can be extended to other applications, including Bar-HRM for species authentication and TILLING for detection of mutants.
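HRM distinguishes alleles because a single base change shifts an amplicon's melting behaviour. As a purely illustrative stand-in for the fluorescence melt curves HRM actually records, the crude Wallace rule (Tm = 2(A+T) + 4(G+C), valid only for short oligonucleotides) shows how a SNP produces a melting-temperature offset; the sequences below are made up.

```python
# Illustrative only: Wallace-rule melting temperature for short sequences,
# used here to show why a SNP allele melts at a different temperature.
def wallace_tm(seq):
    """Tm = 2*(A+T) + 4*(G+C), in degrees C; short-oligo approximation."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

ref = "ACGTACGTAC"  # hypothetical reference fragment
alt = "ACGTACATAC"  # same fragment carrying a G->A SNP
print(wallace_tm(ref) - wallace_tm(alt))  # 2 degC shift between alleles
```

Real HRM instruments resolve far subtler differences over whole amplicons, which is what makes the assay sensitive enough for SNP, InDel and SSR validation.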


2020 ◽  
Vol 36 (12) ◽  
pp. 3874-3876 ◽  
Author(s):  
Sergio Arredondo-Alonso ◽  
Martin Bootsma ◽  
Yaïr Hein ◽  
Malbert R C Rogers ◽  
Jukka Corander ◽  
...  

Abstract
Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish plasmids from each other within a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and network partitioning based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data.
Availability and implementation: Gplas is written in R and Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/gplas.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
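The partitioning step can be pictured with the simplest possible graph operation: once the unitig network has been pruned, contigs that remain connected are assigned to the same putative plasmid. Connected-component labelling below is a minimal stand-in for gplas's partitioning (gplas itself also weighs coverage and sequence composition, omitted here; contig names are invented).

```python
# Hedged sketch: group plasmid contigs into discrete components by
# connected-component labelling on a pruned unitig network.
from collections import deque

def components(edges, nodes):
    """edges: (a, b) pairs; nodes: all contig names; returns sorted components."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, queue = [], deque([n])
        seen.add(n)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(sorted(comp))
    return comps

# Two putative plasmids: contigs c1-c3 linked together, c4-c5 separately.
print(components([("c1", "c2"), ("c2", "c3"), ("c4", "c5")],
                 ["c1", "c2", "c3", "c4", "c5"]))  # [['c1', 'c2', 'c3'], ['c4', 'c5']]
```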


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract
Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption.
Results: We report CRAFT, a general genomic/metagenomic search engine that learns compact representations of sequences and performs fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best-matching genome in archived massive sequence repositories. With a 10^2- to 10^4-fold reduction in storage space, CRAFT performs fast queries over gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures.
Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/
Contact: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
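The memory problem the abstract describes, and one crude way around it, can both be shown in a few lines: instead of storing a full 4^k-dimensional k-mer frequency vector, project the counts into a small fixed-width vector and compare with cosine similarity. The hashing trick below is only a stand-in for CRAFT's learned embedding; it illustrates the storage reduction, not the method.

```python
# Hedged sketch: compact k-mer profiles via bucketed counts (a hash projection,
# not CRAFT's learned embedding), compared by cosine similarity.
import math

def hashed_kmer_profile(seq, k=4, dim=16):
    """Fold the 4^k k-mer space into `dim` deterministic buckets."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    vec = [0.0] * dim
    for i in range(len(seq) - k + 1):
        idx = 0
        for ch in seq[i:i + k]:
            idx = idx * 4 + code[ch]
        vec[idx % dim] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

a = hashed_kmer_profile("ACGTACGTACGTACGT")
b = hashed_kmer_profile("ACGTACGTACGTACGA")  # one substitution vs. a
c = hashed_kmer_profile("GGGGCCCCGGGGCCCC")  # unrelated composition
print(cosine(a, b) > cosine(a, c))  # True: near-identical sequences score higher
```

With k = 12, the full profile would have 4^12 ≈ 17 million entries per sequence; a 16-entry bucketed vector shows the same comparison at a fraction of the storage, at the cost of hash collisions.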


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 291 ◽  
Author(s):  
Darawan Rinchai ◽  
Sabri Boughorbel ◽  
Scott Presnell ◽  
Charlie Quinn ◽  
Damien Chaussabel

Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp.
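The "ranked gene lists based on relevant group comparisons" step can be sketched as an ordering of genes by the gap in mean expression between two sample groups. This is illustrative only, not GXB code; the gene names, samples and values are invented, and GXB's actual ranking statistics may differ.

```python
# Illustrative sketch: rank genes by absolute difference in mean expression
# between two sample groups, the kind of comparison behind a ranked gene list.
def rank_genes(expr, group_a, group_b):
    """expr: {gene: {sample: value}}; returns genes sorted by |mean(A)-mean(B)| desc."""
    def mean(gene, samples):
        return sum(expr[gene][s] for s in samples) / len(samples)
    scores = {g: abs(mean(g, group_a) - mean(g, group_b)) for g in expr}
    return sorted(scores, key=scores.get, reverse=True)

expr = {
    "IL6":  {"s1": 9.0, "s2": 8.0, "s3": 2.0, "s4": 1.0},  # group-dependent
    "ACTB": {"s1": 5.0, "s2": 5.0, "s3": 5.1, "s4": 4.9},  # roughly constant
}
print(rank_genes(expr, ["s1", "s2"], ["s3", "s4"]))  # ['IL6', 'ACTB']
```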


2016 ◽  
Author(s):  
Paolo Devanna ◽  
Xiaowei Sylvia Chen ◽  
Joses Ho ◽  
Dario Gajewski ◽  
Alessandro Gialluisi ◽  
...  

Abstract
Next-generation sequencing has opened the way for the large-scale interrogation of cohorts at the whole-exome or whole-genome level. Currently, the field largely focuses on potential disease-causing variants that fall within coding sequences and are predicted to cause protein sequence changes, generally discarding non-coding variants. However, non-coding DNA makes up ~98% of the genome and contains a range of sequences essential for controlling the expression of protein-coding genes. Thus, potentially causative non-coding variation is currently being overlooked. To address this, we have designed an approach to assess variation in one class of non-coding regulatory DNA: the 3′UTRome. Variants in the 3′UTR region of genes are of particular interest because 3′UTRs are responsible for modulating protein expression levels via their interactions with microRNAs. Furthermore, they are amenable to large-scale analysis, as 3′UTR-microRNA interactions are based on complementary base pairing and as such can be predicted in silico at the genome-wide level. We report a strategy for identifying and functionally testing variants in microRNA binding sites within the 3′UTRome and demonstrate the efficacy of this pipeline in a cohort of language-impaired children. Using whole-exome sequence data from 43 probands, we extracted variants that lay within 3′UTR microRNA binding sites. We identified a common variant (SNP) in a microRNA binding site and found this SNP to be associated with an endophenotype of language impairment (non-word repetition). We showed that this variant disrupted microRNA regulation in cells and was linked to altered gene expression in the brain, suggesting it may represent a risk factor contributing to SLI. This work demonstrates that biologically relevant variants are currently under-investigated despite the wealth of next-generation sequencing data available, and presents a simple strategy for interrogating non-coding regions of the genome.
We propose that this strategy should be routinely applied to whole exome and whole genome sequence data in order to broaden our understanding of how non-coding genetic variation underlies complex phenotypes such as neurodevelopmental disorders.
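Because 3′UTR-microRNA interactions rest on complementary base pairing, the core in-silico test is simple: does the 3′UTR contain the reverse complement of the microRNA seed (bases 2-8), and does a SNP allele destroy that match? The toy scan below is not the authors' pipeline; sequences are written in the DNA alphabet for simplicity, and the UTR sequences are invented (the microRNA is a let-7-like sequence used only for illustration).

```python
# Hedged sketch: seed-site complementarity check for a 3'UTR SNP.
def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def has_seed_site(utr, mirna):
    """mirna in DNA alphabet; seed = positions 2-8 (1-based, 0-based slice 1:8)."""
    seed = mirna[1:8]
    return revcomp(seed) in utr

mirna = "TGAGGTAGTAGGTTG"   # let-7-like sequence, DNA alphabet, illustration only
ref_utr = "AAAACTACCTCAAAA"  # carries the seed match CTACCTC
alt_utr = "AAAACTACTTCAAAA"  # a C->T SNP breaks the match
print(has_seed_site(ref_utr, mirna), has_seed_site(alt_utr, mirna))  # True False
```

A genome-wide version of this check, applied to every variant falling inside a predicted binding site, is the kind of filter the strategy describes before any functional testing in cells.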


2018 ◽  
Author(s):  
Lucas Czech ◽  
Alexandros Stamatakis

Abstract
Motivation: In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or evolutionary) placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do, however, face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; and the number of taxa in the reference phylogeny should be small enough to allow for visual inspection of the results.
Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results.
Implementation: Freely available under GPLv3 at http://github.com/lczech/
Contact: [email protected]
Supplementary Information: Supplementary data are available at Bioinformatics online.
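One elementary way to "automatically construct representative sequences" from a database is a per-column majority consensus over aligned sequences. This is a simplification shown only to illustrate the idea of collapsing many database sequences into one representative; the paper's actual construction method is more involved.

```python
# Hedged sketch: majority-rule consensus as a stand-in for automatic
# construction of a representative sequence from aligned database entries.
from collections import Counter

def consensus(aligned):
    """aligned: equal-length sequences; returns the per-column majority base."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

seqs = ["ACGT", "ACGA", "ACCT"]
print(consensus(seqs))  # "ACGT"
```

In a placement workflow, such representatives shrink the reference phylogeny to a size that is both cheaper to infer and small enough to inspect visually, which is exactly the trade-off the Motivation describes.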


Author(s):  
Alba Gutiérrez-Sacristán ◽  
Carlos De Niz ◽  
Cartik Kothari ◽  
Sek Won Kong ◽  
Kenneth D Mandl ◽  
...  

Abstract
Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient's individual genomic profile, lifestyle and environmental exposures. Yet, to advance the main objective of precision medicine (ensuring the optimum diagnosis, treatment and prognosis for each individual), investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data, and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review conform to six specific inclusion parameters: they (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole-genome or whole-exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) are accessible through a website or through collaboration with investigators; and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).
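The six inclusion parameters translate directly into a screening predicate. The sketch below is illustrative only: the field names are hypothetical, not the catalog's actual schema, and the candidate records are invented.

```python
# Illustrative sketch of the six inclusion criteria as a filter function.
# Field names are hypothetical stand-ins for a dataset's metadata record.
def meets_criteria(ds):
    return (ds["n_subjects"] > 500                      # (i)  > 500 subjects
            and ds["has_genotype"] and ds["has_phenotype"]  # (ii) both data types
            and ds["sequencing"] in {"WGS", "WES"}      # (iii) WGS or WES
            and ds["n_phenotype_vars"] >= 100           # (iv) >= 100 variables
            and ds["accessible"]                        # (v)  website/collaboration
            and ds["access_info_in_english"])           # (vi) English access info

candidates = [
    {"name": "dsA", "n_subjects": 1200, "has_genotype": True, "has_phenotype": True,
     "sequencing": "WGS", "n_phenotype_vars": 250, "accessible": True,
     "access_info_in_english": True},
    {"name": "dsB", "n_subjects": 300, "has_genotype": True, "has_phenotype": True,
     "sequencing": "WES", "n_phenotype_vars": 250, "accessible": True,
     "access_info_in_english": True},  # fails criterion (i)
]
print([d["name"] for d in candidates if meets_criteria(d)])  # ['dsA']
```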


2019 ◽  
Vol 35 (24) ◽  
pp. 5354-5356 ◽  
Author(s):  
Zhaojun Li ◽  
Xutong Li ◽  
Xiaohong Liu ◽  
Zunyun Fu ◽  
Zhaoping Xiong ◽  
...  

Abstract
Motivation: Large-scale kinome-wide virtual profiling of small molecules is a daunting task for experimental and traditional in silico drug design approaches. Recent advances in deep learning algorithms have brought about new opportunities for accelerating this process.
Results: KinomeX is an online platform for predicting the kinome-wide polypharmacology of small molecules based solely on their chemical structures. Predictions are made by a multi-task deep neural network model trained with over 140,000 bioactivity data points for 391 kinases. Extensive computational and experimental validations have been performed. Overall, KinomeX enables users to create a comprehensive kinome interaction network for designing novel chemical modulators, and is of practical value in exploring previously understudied or untargeted kinases.
Availability and implementation: KinomeX is available at https://kinome.dddc.ac.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
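The multi-task structure is the interesting part: one shared representation of the compound feeds a separate output head per kinase, so a single forward pass yields kinome-wide scores. The sketch below is not KinomeX's model; the tiny hand-picked weights, two-feature "compound descriptor" and two kinase names exist purely to show the shared-trunk/per-task-head shape.

```python
# Hedged sketch of a multi-task forward pass: shared hidden layer, one
# sigmoid output head per kinase. Weights and features are toy values.
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multitask_forward(features, shared_w, head_ws):
    """features: compound descriptors; shared_w: rows of hidden-unit weights;
    head_ws: {kinase: head weights}; returns {kinase: activity score in (0,1)}."""
    hidden = [relu(sum(w * x for w, x in zip(row, features))) for row in shared_w]
    return {k: sigmoid(sum(w * h for w, h in zip(ws, hidden)))
            for k, ws in head_ws.items()}

features = [1.0, 0.5]                   # toy compound descriptors
shared_w = [[0.8, -0.2], [0.1, 0.9]]    # two shared hidden units
heads = {"ABL1": [1.5, -0.5], "EGFR": [-1.0, 1.0]}
scores = multitask_forward(features, shared_w, heads)
print({k: round(v, 3) for k, v in scores.items()})
```

Sharing the trunk is what lets data-rich kinases inform predictions for data-poor ones, which is presumably why the multi-task setup helps with less-studied targets.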


2021 ◽  
Author(s):  
Yang Young Lu ◽  
Yiwen Wang ◽  
Fang Zhang ◽  
Jiaxing Bai ◽  
Ying Wang

Abstract
Motivation: Understanding the phylogenetic relationships among organisms is key to contemporary evolutionary studies, and sequence analysis is the workhorse towards this goal. Conventional approaches to sequence analysis are based on sequence alignment, which is neither scalable to large-scale datasets, due to computational inefficiency, nor adaptive to next-generation sequencing (NGS) data. Alignment-free approaches are typically used as computationally effective alternatives, yet they still suffer from high memory consumption. A desirable method for sequence comparison at large scale requires succinctly organized sequence data management, as well as prompt sequence retrieval given a never-before-seen sequence as a query.
Results: In this paper, we propose a novel approach, referred to as SAINT, for efficient and accurate alignment-free sequence comparison. Compared to existing alignment-free sequence comparison methods, SAINT offers advantages in two aspects: (1) SAINT is a weakly supervised learning method where the embedding function is learned automatically from easily acquired data; (2) SAINT utilizes a non-linear deep learning-based model which potentially better captures the complicated relationships among genome sequences. We have applied SAINT to real-world datasets to demonstrate its empirical utility, both qualitatively and quantitatively. Considering the extensive applicability of alignment-free sequence comparison methods, we expect SAINT to motivate a more extensive set of applications in sequence comparison at large scale.
Availability: The open-source, Apache-licensed, Python-implemented code will be available upon acceptance.
Supplementary information: Supplementary data are available at Bioinformatics online.
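The retrieval setting the abstract describes (embed sequences once, then answer never-before-seen queries by nearest-neighbor search in the embedding space) can be sketched with a fixed nonlinear projection standing in for SAINT's learned deep model. Everything below is a stand-in: the sine-based projection is not learned, and the archive sequences are invented.

```python
# Hedged sketch: embed k-mer counts through a fixed nonlinear map (a stand-in
# for SAINT's learned embedding), then retrieve the nearest archived sequence.
import math

def kmer_counts(seq, k=3):
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    vec = [0.0] * (4 ** k)
    for i in range(len(seq) - k + 1):
        idx = 0
        for ch in seq[i:i + k]:
            idx = idx * 4 + code[ch]
        vec[idx] += 1.0
    return vec

# Deterministic pseudo-random projection, scaled to keep tanh near-linear.
PROJ = [[0.05 * math.sin(i * 97 + j) for j in range(64)] for i in range(8)]

def embed(vec):
    """Project a 64-dim k-mer profile to an 8-dim nonlinear embedding."""
    return [math.tanh(sum(p * v for p, v in zip(row, vec))) for row in PROJ]

def nearest(query, archive):
    """Return the archived sequence closest to the query in embedding space."""
    q = embed(kmer_counts(query))
    def dist(s):
        e = embed(kmer_counts(s))
        return sum((a - b) ** 2 for a, b in zip(q, e))
    return min(archive, key=dist)

archive = ["ACGTACGTACGT", "GGGGCCCCGGGG", "ATATATATATAT"]
print(nearest("ACGTACGTACGT", archive))  # an archived sequence retrieves itself
print(nearest("ACGTACGTACGA", archive))  # a near-identical query, for comparison
```

In SAINT the projection is learned from weak supervision rather than fixed, but the query path (embed once, compare in a small space) is the same shape.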

