read pair
Recently Published Documents



2018 ◽  
Author(s):  
Whitney Whitford ◽  
Klaus Lehnert ◽  
Russell G. Snell ◽  
Jessie C. Jacobsen

Abstract
Background: Whole genome sequencing (WGS) has increased in popularity and decreased in cost over the past decade, making it a viable and sensitive method for variant detection. In addition to its utility for single nucleotide variant detection, WGS data has the potential to detect copy number variants (CNVs) at fine resolution. Many CNV detection software packages have been developed, exploiting four main types of data: read pair, split read, read depth, and assembly-based methods. The aim of this study was to evaluate the efficiency of each of these main approaches in detecting deletions.
Methods: WGS data and high-confidence deletion calls for the individual NA12878 from the Genome in a Bottle consortium were the benchmark dataset. The performance of Breakdancer, CNVnator, Delly, FermiKit, and Pindel was assessed by comparing the accuracy and sensitivity of each software package in detecting deletions exceeding 1 kb.
Results: There was considerable variability in the outputs of the different WGS CNV detection programs. The best performance was seen from Breakdancer and Delly, with sensitivities of 92.6% and 96.7% and false discovery rates (FDR) of 34.5% and 68.5%, respectively. In comparison, Pindel, CNVnator, and FermiKit were less effective, with sensitivities of 69.1%, 66.0%, and 15.8% and FDRs of 91.3%, 69.0%, and 31.7%, respectively. Concordance across software packages was poor, with only 27 of the total 612 benchmark deletions identified by all five methodologies.
Conclusions: The WGS-based CNV detection tools evaluated show disparate performance in identifying deletions ≥1 kb, particularly those utilising different input data characteristics. Software that exploits read-pair data, namely Breakdancer and Delly, had the highest sensitivity. Breakdancer also had the second lowest false discovery rate. Therefore, in this analysis read-pair methods (Breakdancer in particular) were the best performing approaches for the identification of deletions ≥1 kb, balancing accuracy and sensitivity. There is potential for improvement in the detection algorithms, particularly for reducing FDR. This analysis has validated the utility of WGS-based CNV detection software to reliably identify deletions, and these findings will be of use when choosing appropriate software for deletion detection, in both research and diagnostic medicine.
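
The sensitivity and FDR figures quoted in this abstract can be illustrated with a minimal sketch. The abstract does not say how a call is matched to a benchmark deletion, so the 50% reciprocal-overlap rule used here is an assumption, and the names `reciprocal_overlap` and `evaluate` are hypothetical rather than taken from the study.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates
Interval = Tuple[str, int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def evaluate(calls: List[Interval], truth: List[Interval],
             min_overlap: float = 0.5) -> Tuple[float, float]:
    """Return (sensitivity, FDR) for a call set against a benchmark set.

    Assumption: a call matches a benchmark deletion when their
    reciprocal overlap is at least min_overlap.
    """
    # Benchmark deletions recovered by at least one call
    found = sum(
        1 for t in truth
        if any(reciprocal_overlap(c, t) >= min_overlap for c in calls)
    )
    # Calls that match no benchmark deletion
    false_calls = sum(
        1 for c in calls
        if not any(reciprocal_overlap(c, t) >= min_overlap for t in truth)
    )
    sensitivity = found / len(truth) if truth else 0.0
    fdr = false_calls / len(calls) if calls else 0.0
    return sensitivity, fdr
```

Under this definition a tool can pair high sensitivity with a high FDR, as reported for Delly, because the two measures use different denominators: benchmark deletions for sensitivity and emitted calls for FDR.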


2018 ◽  
Vol 34 (21) ◽  
pp. 3631-3637
Author(s):  
Anish M S Shrestha ◽  
Naruki Yoshikawa ◽  
Kiyoshi Asai

2017 ◽  
Vol 24 (6) ◽  
pp. 581-589 ◽  
Author(s):  
Kristoffer Sahlin ◽  
Mattias Frånberg ◽  
Lars Arvestad

2016 ◽  
Author(s):  
Michael G. Nelson ◽  
Raquel S. Linheiro ◽  
Casey M. Bergman

Abstract
Background: Transposable element (TE) insertions are among the most challenging types of variants to detect in genomic data because of their repetitive nature and complex mechanisms of replication. Nevertheless, the recent availability of large resequencing datasets has spurred the development of many new methods to detect TE insertions in whole genome shotgun sequences. These methods generate output in diverse formats and have a large number of software and data dependencies, making their comparative evaluation challenging for potential users.
Results: Here we develop an integrated bioinformatics pipeline for the detection of TE insertions in whole genome shotgun data, called McClintock (https://github.com/bergmanlab/mcclintock), that automatically runs and generates standardized output for multiple TE detection methods. We demonstrate the utility of the McClintock system by performing comparative evaluation of six TE detection methods using simulated and real genome data from the model microbial eukaryote, Saccharomyces cerevisiae. We find substantial variation among McClintock component methods in their ability to detect non-reference insertions in the yeast genome, but show that non-reference TEs at nearly all biologically realistic locations can be detected in simulated data by combining multiple methods that use split-read and read-pair evidence. In general, our results reveal that split-read methods detect fewer non-reference TE insertions than read-pair methods, but generally have much higher positional accuracy. Analysis of a large sample of real yeast genomes reveals that most, but not all, McClintock component methods can recover known aspects of TE biology in yeast such as the transpositional activity status of families, tRNA gene target preferences, and target site duplication structure, albeit with varying levels of positional accuracy.
Conclusions: Our results suggest that no single TE detection method currently provides comprehensive detection of non-reference TEs, even in the context of a simplified model eukaryotic genome like S. cerevisiae. In spite of these limitations, the McClintock system provides a framework for testing, developing and integrating results from multiple TE detection methods to achieve this ultimate aim, as well as useful guidance for yeast researchers to select appropriate TE detection tools.


2016 ◽  
Author(s):  
Kristoffer Sahlin ◽  
Mattias Frånberg ◽  
Lars Arvestad

Abstract. Reads from paired-end and mate-pair libraries are often utilized to find structural variation in genomes, and one common approach is to use their fragment length for detection. After aligning read pairs to the reference, read-pair distances are analyzed for statistically significant deviations. However, previously proposed methods are based on a simplified model of observed fragment lengths that does not agree with data. We show how this model limits the statistical analysis used to identify variants and propose a new model, adapted from one we previously introduced for contig scaffolding, which agrees with data. From this model we derive an improved null hypothesis that, when applied in the variant caller CLEVER, reduces the number of false positives and corrects a bias that contributes to more deletion calls than insertion calls. A reference implementation is freely available at https://github.com/ksahlin/GetDistr.
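
For orientation, the fragment-length test described in this abstract can be sketched in its simplified form: treat the distances of the read pairs spanning a locus as independent draws from the library distribution and compute a z-score for their mean. This is the kind of simplified null model the abstract argues does not agree with data, not the corrected model from GetDistr or the CLEVER implementation. The function names and the cutoff of 3 are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def fragment_length_z(observed: Sequence[float],
                      library_mean: float,
                      library_sd: float) -> float:
    """z-score of the mean observed read-pair distance at one locus.

    Assumption (simplified model): observed distances are i.i.d. with
    the library-wide mean and standard deviation.
    """
    n = len(observed)
    if n == 0 or library_sd <= 0:
        raise ValueError("need at least one observation and a positive sd")
    standard_error = library_sd / math.sqrt(n)
    return (mean(observed) - library_mean) / standard_error


def classify(z: float, cutoff: float = 3.0) -> str:
    """Label a locus from its z-score; the cutoff is an illustrative choice."""
    if z > cutoff:
        return "deletion"   # pairs map farther apart than expected
    if z < -cutoff:
        return "insertion"  # pairs map closer together than expected
    return "none"
```

A positive score means the pairs map farther apart on the reference than the library predicts, which is consistent with a deletion in the sample. A negative score points to an insertion.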


2015 ◽  
Author(s):  
Kristoffer Sahlin ◽  
Rayan Chikhi ◽  
Lars Arvestad

Scaffolding is often an essential step in a genome assembly process, in which contigs are ordered and oriented using read pairs from a combination of paired-end libraries and longer-range mate-pair libraries. Although a simple idea, scaffolding is unfortunately hard to get right in practice. One source of problems is so-called PE-contamination in mate-pair libraries, in which a non-negligible fraction of the read pairs get the wrong orientation and a much smaller insert size than expected. This contamination has been discussed in previous work on integrated scaffolders in end-to-end assemblers such as Allpaths-LG and MaSuRCA, but those methods rely on the orientation being observable, e.g., by finding the junction adapter sequence in the reads. This is not always the case, making the orientation and insert size of a read pair stochastic. Furthermore, work on modeling PE-contamination has so far been disregarded in stand-alone scaffolders, and the effect that PE-contamination has on scaffolding quality has not been examined before. We have addressed PE-contamination in an update of our scaffolder BESST. We formulate the problem as an Integer Linear Program (ILP) and use characteristics of the problem, such as contig lengths and insert size, to efficiently solve the ILP using a number of Linear Programs that is linear in the number of contigs. Our results show significant improvement over both integrated and stand-alone scaffolders. The impact of modeling PE-contamination is quantified by comparison with the previous BESST model. We also show how other scaffolders are vulnerable to PE-contaminated libraries, resulting in an increased number of misassemblies, more conservative scaffolding, and inflated assembly sizes. The model is implemented in BESST. Source code and usage instructions are found at https://github.com/ksahlin/BESST. BESST can also be downloaded using PyPI.

