Read mapping

Steve Hoffmann; Peter F. Stadler

doi:10.1515/itit-2015-0046

Factorbook Motif Pipeline: A de novo motif discovery and filtering web server for ChIP-seq peaks

10.1101/033670 ◽

2015 ◽

Cited By ~ 1

Author(s):

Bong-Hyun Kim ◽

Jiali Zhuang ◽

Jie Wang ◽

Zhiping Weng

Keyword(s):

Motif Discovery ◽

High Throughput Sequencing ◽

De Novo ◽

Statistical Tests ◽

Web Server ◽

Biological Processes ◽

Web Based ◽

Sequencing Technologies ◽

De Novo Motif Discovery

Summary: High-throughput sequencing technologies such as ChIP-seq have deepened our understanding of many biological processes. De novo motif search is one of the key downstream computational analyses following ChIP-seq experiments, and several algorithms have been proposed for this purpose. However, most web-based systems do not perform independent filtering or enrichment analyses to ensure the quality of the discovered motifs. Here, we developed a web server, Factorbook Motif Pipeline, based on an algorithm used in analyzing ENCODE consortium ChIP-seq datasets. It performs a comprehensive analysis of the set of peaks detected from a ChIP-seq experiment: (i) de novo motif discovery; (ii) independent composition and bias analyses; and (iii) matching to the annotated motifs. The statistical tests employed in our pipeline provide a reliable measure of confidence in the significance of the motifs reported in the discovery step. Availability: Factorbook Motif Pipeline source code is accessible at https://github.com/joshuabhk/factorbook-motif-pipeline


Bats, bacteria and their role in health and disease

Microbiology Australia ◽

10.1071/ma17009 ◽

2017 ◽

Vol 38 (1) ◽

pp. 28 ◽

Cited By ~ 1

Author(s):

Kristin Mühldorfer

Keyword(s):

High Throughput Sequencing ◽

Habitat Preferences ◽

Data Sets ◽

Infectious Agents ◽

Emerging Pathogens ◽

Microbial Detection ◽

White Nose Syndrome ◽

Sequencing Technologies ◽

Health And Disease ◽

Reservoir Hosts

Bats are ancient and among the most diverse mammals in terms of species richness, diet and habitat preferences, characteristics that may contribute to a high diversity of infectious agents. During the past two decades, interest in bats and their microorganisms has increased considerably because of their role as reservoir hosts or carriers of important pathogens. Rapid advances in microbial detection and characterisation by high-throughput sequencing technologies have produced large genetic data sets and have also made it possible to identify unknown infectious agents more readily and more quickly. Assessing the risk of infectious diseases in bats and their pathological manifestation, however, is still challenging because of limited access to appropriate material and field data, and continuing limitations in wildlife diagnostics and the interpretation of genetic results. As a consequence, emerging pathogens can suddenly appear with devastating effects, as happened with white nose syndrome. To date, much research on bats and infectious agents still focusses on viruses, whilst knowledge of bacteria and their role in disease is comparatively limited.


Comparison of Read Mapping and Variant Calling Tools for the Analysis of Plant NGS Data

Plants ◽

10.3390/plants9040439 ◽

2020 ◽

Vol 9 (4) ◽

pp. 439 ◽

Cited By ~ 3

Author(s):

Hanna Marie Schilbert ◽

Andreas Rempel ◽

Boas Pucker

Keyword(s):

High Throughput Sequencing ◽

Performance Metrics ◽

Model Organism ◽

Variant Calling ◽

Reference Sequence ◽

Read Mapping ◽

The Past ◽

Sequencing Technologies ◽

Plant Sciences ◽

Ngs Data

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
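
The abstract above names sensitivity and specificity as evaluation parameters but does not say how they were computed. The following is a minimal sketch of one common way to derive both from a set of called variants and a validated truth set, not the authors' implementation. The function name, the tuple layout of a variant, and the `total_sites` parameter (the number of positions considered, needed to count true negatives) are all assumptions made for illustration.

```python
def sensitivity_specificity(called, truth, total_sites):
    """Return (sensitivity, specificity) for one variant call set.

    called, truth: sets of hashable variant keys, here assumed to be
                   (chromosome, position, ref, alt) tuples.
    total_sites:   assumed number of evaluated positions; every site that is
                   neither called nor in the truth set is a true negative.
    """
    tp = len(called & truth)   # variants called and present in the truth set
    fp = len(called - truth)   # variants called but absent from the truth set
    fn = len(truth - called)   # truth variants that were missed
    tn = total_sites - tp - fp - fn

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical toy data: four true variants, three recovered, one false call.
truth = {
    ("Chr1", 100, "A", "T"),
    ("Chr1", 250, "G", "C"),
    ("Chr2", 40, "T", "G"),
    ("Chr3", 900, "C", "A"),
}
called = {
    ("Chr1", 100, "A", "T"),
    ("Chr1", 250, "G", "C"),
    ("Chr2", 40, "T", "G"),
    ("Chr4", 7, "A", "G"),
}

sens, spec = sensitivity_specificity(called, truth, total_sites=1000)
print(f"sensitivity={sens:.3f} specificity={spec:.4f}")
```

Because true negatives vastly outnumber variants in a genome, specificity computed this way is usually close to 1, which is one reason benchmarks often report several metrics side by side.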


Multi-omics annotation of human long non-coding RNAs

Biochemical Society Transactions ◽

10.1042/bst20191063 ◽

2020 ◽

Vol 48 (4) ◽

pp. 1545-1556 ◽

Cited By ~ 2

Author(s):

Qianpeng Li ◽

Zhao Li ◽

Changrui Feng ◽

Shuai Jiang ◽

Zhang Zhang ◽

...

Keyword(s):

Human Genome ◽

Functional Annotation ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Rna World ◽

Omics Data ◽

Biological Processes ◽

Sequencing Technologies ◽

Non Coding Rnas ◽

Powerful Strategy

LncRNAs (long non-coding RNAs) are pervasively transcribed in the human genome and also extensively involved in a variety of essential biological processes and human diseases. The comprehensive annotation of human lncRNAs is of great significance in navigating the functional landscape of the human genome and deepening the understanding of the multi-featured RNA world. However, the unique characteristics of lncRNAs as well as their enormous quantity have complicated and challenged the annotation of lncRNAs. Advances in high-throughput sequencing technologies give rise to a large volume of omics data that are generated at an unprecedented rate and scale, providing possibilities in the identification, characterization and functional annotation of lncRNAs. Here, we review the recent important discoveries of human lncRNAs through analysis of various omics data and summarize specialized lncRNA database resources. Moreover, we highlight the multi-omics integrative analysis as a powerful strategy to efficiently discover and characterize the functional lncRNAs and elucidate their potential molecular mechanisms.


Tools and best practices for retrotransposon analysis using high-throughput sequencing data

Mobile DNA ◽

10.1186/s13100-019-0192-1 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 4

Author(s):

Aurélie Teissandier ◽

Nicolas Servant ◽

Emmanuel Barillot ◽

Deborah Bourc’his

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Reference Genome ◽

Repetitive Sequences ◽

Simulated Data ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

Background: Sequencing technologies give access to a precise picture of the molecular mechanisms acting upon genome regulation. One of the biggest technical challenges with sequencing data is to map millions of reads to a reference genome. This problem is exacerbated when dealing with repetitive sequences such as transposable elements that occupy half of the mammalian genome mass. Sequenced reads coming from these regions introduce ambiguities in the mapping step. Therefore, applying dedicated parameters and algorithms has to be taken into consideration when transposable elements regulation is investigated with sequencing datasets. Results: Here, we used simulated reads on the mouse and human genomes to define the best parameters for aligning transposable element-derived reads on a reference genome. The efficiency of the most commonly used aligners was compared and we further evaluated how transposable element representation should be estimated using available methods. The mappability of the different transposon families in the mouse and the human genomes was calculated giving an overview into their evolution. Conclusions: Based on simulated data, we provided recommendations on the alignment and the quantification steps to be performed when transposon expression or regulation is studied, and identified the limits in detecting specific young transposon families of the mouse and human genomes. These principles may help the community to adopt standard procedures and raise awareness of the difficulties encountered in the study of transposable elements.


Comparison of read mapping and variant calling tools for the analysis of plant NGS data

10.1101/2020.03.10.986059 ◽

2020 ◽

Author(s):

Hanna Marie Schilbert ◽

Andreas Rempel ◽

Boas Pucker

Keyword(s):

High Throughput Sequencing ◽

Model Organism ◽

Variant Calling ◽

Reference Sequence ◽

Read Mapping ◽

The Past ◽

Sequencing Technologies ◽

Plant Sciences ◽

Ngs Data ◽

Real Plant

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.


EMBED: a low dimensional reconstruction of gut microbiome dynamics based on ecological normal modes

10.1101/2021.03.18.436036 ◽

2021 ◽

Author(s):

Mayar Shahin ◽

Brian W Ji ◽

Purushottam D Dixit

Keyword(s):

Dimensionality Reduction ◽

Gut Microbiome ◽

High Throughput Sequencing ◽

Complex Dynamics ◽

Normal Modes ◽

Ecological Factors ◽

Data Sets ◽

Sequencing Technologies ◽

Dimensional Reconstruction ◽

Low Dimensional

The gut microbiome is well-established to be a significant driver of host health and disease. Longitudinal studies involving high-throughput sequencing technologies have begun to unravel the complex dynamics of these ecosystems, and quantitative frameworks are now being developed to better understand their organizing principles. Dimensionality reduction can offer unique insights into gut bacterial dynamics by leveraging collective abundance fluctuations of multiple bacteria driven by similar underlying ecological factors. However, methods providing lower-dimensional representations of gut microbial dynamics both at the community and individual taxa level are currently missing. To that end, we develop EMBED: Essential Microbiome Dynamics. Similar to normal modes in structural biology, EMBED infers ecological normal modes (ECNs), which represent the unique set of orthogonal dynamical trajectories capturing the collective behavior of a community. We show that a small number of ECNs accurately describe gut microbiome dynamics across data sets that encompass dietary changes and antibiotic-related perturbations. Importantly, we find that ECNs often reflect specific ecological behaviors, providing natural templates along which the dynamics of individual bacteria may be partitioned. Collectively, our results highlight the utility of dimensionality reduction approaches to understanding the dynamics of the gut microbiome and provide a framework to study the dynamics of other high-dimensional systems as well.


Reconstructing evolutionary timescales using phylogenomics

10.7287/peerj.preprints.2403v1 ◽

2016 ◽

Author(s):

K. Jun Tong ◽

Nathan Lo ◽

Simon Y W Ho

Keyword(s):

High Throughput ◽

Molecular Clock ◽

Evolutionary Biology ◽

High Throughput Sequencing ◽

Genetic Data ◽

Biological Information ◽

Data Sets ◽

Sequencing Technology ◽

Genome Scale ◽

Scale Data

Reconstructing the timescale of the Tree of Life is one of the principal aims of evolutionary biology. This has been greatly aided by the development of the molecular clock, which enables evolutionary timescales to be estimated from genetic data. In recent years, high-throughput sequencing technology has led to an increase in the feasibility and availability of genome-scale data sets. These represent a rich source of biological information, but they also bring a set of analytical challenges. In this review, we provide an overview of phylogenomic dating and describe the challenges associated with analysing genome-scale data. We also report on recent phylogenomic estimates of the evolutionary timescales of mammals, birds, and insects.


An Open-Source Toolkit To Expand Bioinformatics Training in Infectious Diseases

mBio ◽

10.1128/mbio.01214-21 ◽

2021 ◽

Author(s):

Alexander S. F. Berry ◽

Camila Farias Amorim ◽

Corbett L. Berry ◽

Camille M. Syrett ◽

Elise D. English ◽

...

Keyword(s):

Infectious Disease ◽

Data Analysis ◽

High Throughput Sequencing ◽

Data Sets ◽

Data Generation ◽

Sequencing Data ◽

Sequencing Technology ◽

Didactic Instruction ◽

Host Parasite Interactions ◽

Host Parasite

As access to high-throughput sequencing technology has increased, the bottleneck in biomedical research has shifted from data generation to data analysis. Here, we describe a modular and extensible framework for didactic instruction in bioinformatics using publicly available RNA sequencing data sets from infectious disease studies, with a focus on host-parasite interactions.


Reconstructing evolutionary timescales using phylogenomics

10.7287/peerj.preprints.2403 ◽

2016 ◽

Author(s):

K. Jun Tong ◽

Nathan Lo ◽

Simon Y W Ho

Keyword(s):

High Throughput ◽

Molecular Clock ◽

Evolutionary Biology ◽

High Throughput Sequencing ◽

Genetic Data ◽

Biological Information ◽

Data Sets ◽

Sequencing Technology ◽

Genome Scale ◽

Scale Data

Reconstructing the timescale of the Tree of Life is one of the principal aims of evolutionary biology. This has been greatly aided by the development of the molecular clock, which enables evolutionary timescales to be estimated from genetic data. In recent years, high-throughput sequencing technology has led to an increase in the feasibility and availability of genome-scale data sets. These represent a rich source of biological information, but they also bring a set of analytical challenges. In this review, we provide an overview of phylogenomic dating and describe the challenges associated with analysing genome-scale data. We also report on recent phylogenomic estimates of the evolutionary timescales of mammals, birds, and insects.
