Read mapping

2016 ◽  
Vol 58 (3) ◽  
Author(s):  
Steve Hoffmann ◽  
Peter F. Stadler

Abstract: The Read Mapping problem asks for the exact origin of a nucleotide sequence in a reference genome. It translates to a conceptually simple approximate string matching problem. The practical difficulty, however, arises from the typical size of the data sets produced by modern high-throughput sequencing technologies, from the biological processes involved in the derivation of the query molecule from its genomic source, and from the technical processes of the sequencing technology itself.
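To make the reduction to approximate string matching concrete, here is a minimal sketch, not the authors' method: a brute-force scan that reports every reference position where a read matches with at most a given number of substitutions. The function name `map_read` and the parameter `max_mismatches` are hypothetical, and the sketch ignores insertions and deletions. Its running time grows with the product of read length and reference length, which is why this naive approach does not scale to the data set sizes the abstract describes.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for each reference window that
    matches the read with at most max_mismatches substitutions.

    Illustrative brute-force scan (Hamming distance only, no indels).
    """
    hits = []
    m = len(read)
    for pos in range(len(reference) - m + 1):
        mismatches = 0
        for a, b in zip(read, reference[pos:pos + m]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, skip this window
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTAGC"
print(map_read("TAGC", reference))  # a read with a single candidate origin
print(map_read("ACGA", reference))  # a read that fits two repeated windows
```

The second call returns two candidate positions because the reference contains a repeat, a small-scale version of the ambiguity that repetitive sequence causes during mapping.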

2015 ◽  
Author(s):  
Bong-Hyun Kim ◽  
Jiali Zhuang ◽  
Jie Wang ◽  
Zhiping Weng

Summary: High-throughput sequencing technologies such as ChIP-seq have deepened our understanding of many biological processes. De novo motif search is one of the key downstream computational analyses following ChIP-seq experiments, and several algorithms have been proposed for this purpose. However, most web-based systems do not perform independent filtering or enrichment analyses to ensure the quality of the discovered motifs. Here, we developed a web server, Factorbook Motif Pipeline, based on an algorithm used in analyzing ENCODE consortium ChIP-seq datasets. It performs comprehensive analysis on the set of peaks detected from a ChIP-seq experiment: (i) de novo motif discovery; (ii) independent composition and bias analyses; and (iii) matching to the annotated motifs. The statistical tests employed in our pipeline provide a reliable measure of confidence in the significance of the motifs reported in the discovery step. Availability: Factorbook Motif Pipeline source code is available at https://github.com/joshuabhk/factorbook-motif-pipeline


2017 ◽  
Vol 38 (1) ◽  
pp. 28 ◽  
Author(s):  
Kristin Mühldorfer

Bats are ancient and among the most diverse mammals in terms of species richness, diet and habitat preferences, characteristics that may contribute to a high diversity of infectious agents. During the past two decades, interest in bats and their microorganisms has increased considerably because of their role as reservoir hosts or carriers of important pathogens. Rapid advances in microbial detection and characterisation by high-throughput sequencing technologies have led to large genetic data sets and have also improved the possibilities for, and the speed of, identifying unknown infectious agents. Assessing the risk of infectious diseases in bats and their pathological manifestation, however, is still challenging because of limited access to appropriate material and field data, and continuing limitations in wildlife diagnostics and the interpretation of genetic results. As a consequence, emerging pathogens can suddenly appear with devastating effects, as happened with white nose syndrome. To date, much research on bats and infectious agents still focusses on viruses, whilst knowledge of bacteria and their role in disease is comparatively limited.


Plants ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 439 ◽  
Author(s):  
Hanna Marie Schilbert ◽  
Andreas Rempel ◽  
Boas Pucker

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.
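As a rough illustration of how a set of called variants can be scored for sensitivity and specificity, the sketch below compares a call set with a truth set. It is not the evaluation code used in the study: the function name `evaluate_calls`, the representation of variants as plain positions, and the `n_sites` parameter (the total number of sites considered, needed to count true negatives) are all assumptions made for this example.

```python
def evaluate_calls(called, truth, n_sites):
    """Return (sensitivity, specificity) for a set of called variant
    positions against a truth set, over n_sites examined positions.

    Hypothetical helper; variants are simplified to integer positions.
    """
    tp = len(called & truth)        # variants called and real
    fp = len(called - truth)        # variants called but not real
    fn = len(truth - called)        # real variants that were missed
    tn = n_sites - tp - fp - fn     # sites correctly left uncalled
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


called = {10, 20, 30}
truth = {10, 20, 40, 50}
sens, spec = evaluate_calls(called, truth, n_sites=100)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Because variant sites are rare relative to genome size, specificity computed this way tends to sit close to 1 for most pipelines, so differences between tools often show up more clearly in sensitivity.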


2020 ◽  
Vol 48 (4) ◽  
pp. 1545-1556 ◽  
Author(s):  
Qianpeng Li ◽  
Zhao Li ◽  
Changrui Feng ◽  
Shuai Jiang ◽  
Zhang Zhang ◽  
...  

LncRNAs (long non-coding RNAs) are pervasively transcribed in the human genome and also extensively involved in a variety of essential biological processes and human diseases. The comprehensive annotation of human lncRNAs is of great significance in navigating the functional landscape of the human genome and deepening the understanding of the multi-featured RNA world. However, the unique characteristics of lncRNAs as well as their enormous quantity have complicated and challenged the annotation of lncRNAs. Advances in high-throughput sequencing technologies give rise to a large volume of omics data that are generated at an unprecedented rate and scale, providing possibilities in the identification, characterization and functional annotation of lncRNAs. Here, we review the recent important discoveries of human lncRNAs through analysis of various omics data and summarize specialized lncRNA database resources. Moreover, we highlight the multi-omics integrative analysis as a powerful strategy to efficiently discover and characterize the functional lncRNAs and elucidate their potential molecular mechanisms.


Mobile DNA ◽  
2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Aurélie Teissandier ◽  
Nicolas Servant ◽  
Emmanuel Barillot ◽  
Deborah Bourc’his

Background: Sequencing technologies give access to a precise picture of the molecular mechanisms acting upon genome regulation. One of the biggest technical challenges with sequencing data is to map millions of reads to a reference genome. This problem is exacerbated when dealing with repetitive sequences such as transposable elements, which occupy half of the mammalian genome mass. Sequenced reads coming from these regions introduce ambiguities in the mapping step. Therefore, dedicated parameters and algorithms have to be considered when transposable element regulation is investigated with sequencing datasets. Results: Here, we used simulated reads on the mouse and human genomes to define the best parameters for aligning transposable element-derived reads to a reference genome. The efficiency of the most commonly used aligners was compared, and we further evaluated how transposable element representation should be estimated using available methods. The mappability of the different transposon families in the mouse and human genomes was calculated, giving an overview of their evolution. Conclusions: Based on simulated data, we provided recommendations on the alignment and quantification steps to be performed when transposon expression or regulation is studied, and identified the limits in detecting specific young transposon families of the mouse and human genomes. These principles may help the community to adopt standard procedures and raise awareness of the difficulties encountered in the study of transposable elements.


2020 ◽  
Author(s):  
Hanna Marie Schilbert ◽  
Andreas Rempel ◽  
Boas Pucker

Abstract: High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants. Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research. When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.


2021 ◽  
Author(s):  
Mayar Shahin ◽  
Brian W Ji ◽  
Purushottam D Dixit

The gut microbiome is well-established to be a significant driver of host health and disease. Longitudinal studies involving high-throughput sequencing technologies have begun to unravel the complex dynamics of these ecosystems, and quantitative frameworks are now being developed to better understand their organizing principles. Dimensionality reduction can offer unique insights into gut bacterial dynamics by leveraging collective abundance fluctuations of multiple bacteria driven by similar underlying ecological factors. However, methods providing lower-dimensional representations of gut microbial dynamics both at the community and individual taxa level are currently missing. To that end, we develop EMBED: Essential Microbiome Dynamics. Similar to normal modes in structural biology, EMBED infers ecological normal modes (ECNs), which represent the unique set of orthogonal dynamical trajectories capturing the collective behavior of a community. We show that a small number of ECNs accurately describe gut microbiome dynamics across data sets that encompass dietary changes and antibiotic-related perturbations. Importantly, we find that ECNs often reflect specific ecological behaviors, providing natural templates along which the dynamics of individual bacteria may be partitioned. Collectively, our results highlight the utility of dimensionality reduction approaches to understanding the dynamics of the gut microbiome and provide a framework to study the dynamics of other high-dimensional systems as well.


2016 ◽  
Author(s):  
K. Jun Tong ◽  
Nathan Lo ◽  
Simon Y W Ho

Reconstructing the timescale of the Tree of Life is one of the principal aims of evolutionary biology. This has been greatly aided by the development of the molecular clock, which enables evolutionary timescales to be estimated from genetic data. In recent years, high-throughput sequencing technology has led to an increase in the feasibility and availability of genome-scale data sets. These represent a rich source of biological information, but they also bring a set of analytical challenges. In this review, we provide an overview of phylogenomic dating and describe the challenges associated with analysing genome-scale data. We also report on recent phylogenomic estimates of the evolutionary timescales of mammals, birds, and insects.


mBio ◽  
2021 ◽  
Author(s):  
Alexander S. F. Berry ◽  
Camila Farias Amorim ◽  
Corbett L. Berry ◽  
Camille M. Syrett ◽  
Elise D. English ◽  
...  

As access to high-throughput sequencing technology has increased, the bottleneck in biomedical research has shifted from data generation to data analysis. Here, we describe a modular and extensible framework for didactic instruction in bioinformatics using publicly available RNA sequencing data sets from infectious disease studies, with a focus on host-parasite interactions.

