HapCHAT: Adaptive haplotype assembly for efficiently leveraging high coverage in long reads

2017 ◽  
Author(s):  
Stefano Beretta ◽  
Murray D Patterson ◽  
Simone Zaccaria ◽  
Gianluca Della Vedova ◽  
Paola Bonizzoni

Abstract
Background: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes thanks to their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, by larger sets of reads when the error rate is uniform across genome positions. Unfortunately, current state-of-the-art dynamic programming approaches designed for long reads deal only with limited coverages.
Results: Here, we propose a new method for assembling haplotypes which combines and extends the features of previous approaches to deal with long reads and higher coverages. In particular, our algorithm is able to dynamically adapt the estimated number of errors at each variant site, while minimizing the total number of error corrections necessary for finding a feasible solution. This allows our method to significantly reduce the required computational resources and to handle datasets with higher coverages. The algorithm has been implemented in a freely available tool, HapCHAT: Haplotype Assembly Coverage Handling by Adapting Thresholds. An experimental analysis of sequencing reads with up to 60× coverage shows that considering a higher coverage improves accuracy and recall while lowering runtimes.
Conclusions: Our method leverages the long-range information of sequencing reads to obtain assembled haplotypes that are fragmented into fewer unphased haplotype blocks. At the same time, it is able to deal with higher coverages, better correcting the errors in the original reads and obtaining more accurate haplotypes as a result.
Availability: HapCHAT is available at http://hapchat.algolab.eu under the GPL license.
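
To make the optimization target concrete, the following is a minimal sketch of the minimum error correction (MEC) objective that underlies HapCHAT-style haplotype assembly: reads are rows over heterozygous variant sites, and we seek the bipartition of reads into two haplotype classes requiring the fewest allele corrections. The toy fragment matrix is invented for illustration, and the brute-force search merely stands in for HapCHAT's dynamic program with adaptive per-site error thresholds.

```python
# A minimal sketch of the minimum error correction (MEC) objective that
# HapCHAT-style haplotype assembly optimizes. Brute force, for illustration
# only; HapCHAT itself uses a dynamic program over variant sites that adapts
# the number of allowed corrections per site.
from itertools import product

# Toy fragment matrix: one row per read, one column per heterozygous site.
# 0/1 are observed alleles, None marks a site the read does not cover.
READS = [
    [0, 0, 1, None],
    [0, 0, 1, 1],
    [1, 1, None, 0],
    [None, 1, 0, 0],
]

def consensus_cost(rows, n_sites):
    """Corrections needed to make all rows agree with a column-wise majority."""
    cost = 0
    for j in range(n_sites):
        col = [r[j] for r in rows if r[j] is not None]
        if col:
            ones = sum(col)
            cost += min(ones, len(col) - ones)  # flip the minority alleles
    return cost

def mec(reads):
    """Exhaustively split reads into two haplotype classes, return best cost."""
    n_sites = len(reads[0])
    best = None
    for assign in product([0, 1], repeat=len(reads)):
        groups = ([r for r, a in zip(reads, assign) if a == 0],
                  [r for r, a in zip(reads, assign) if a == 1])
        cost = sum(consensus_cost(g, n_sites) for g in groups)
        best = cost if best is None or cost < best else best
    return best

print(mec(READS))  # -> 0: the toy reads split cleanly into two haplotypes
```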

2019 ◽  
Author(s):  
Alberto Magi

Abstract
Background: Human genomes are diploid, meaning they carry two homologous copies of each chromosome, and the assignment of heterozygous variants to each chromosome copy, the haplotype assembly problem, is of fundamental importance for medical and population genetics. While short reads from second-generation sequencing platforms drastically limit haplotype reconstruction, since the great majority of reads cannot link many variants together, long reads from third-generation sequencing can span several variants along the genome, allowing much longer haplotype blocks to be inferred. However, most haplotype assembly algorithms, originally devised for short sequences, fail when applied to noisy long-read data, and although new algorithms have been developed to deal with the properties of this generation of sequences, these methods can manage only datasets with limited coverage.
Results: To overcome the limits of currently available algorithms, I propose a novel formulation of the single-individual haplotype assembly problem based on maximum allele co-occurrence (MAC), and I develop an ultra-fast algorithm that can reconstruct the haplotype structure of a diploid genome from low- and high-coverage long-read datasets with high accuracy. I test my algorithm (MAtCHap) on synthetic and real PacBio and Nanopore human datasets and compare its results with eight other state-of-the-art algorithms. These analyses show that MAtCHap outperforms the other methods in terms of accuracy, contiguity, completeness, and computational speed.
Availability: MAtCHap is publicly available at https://sourceforge.net/projects/matchap/.
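
As a rough illustration of phasing driven by allele co-occurrence, the sketch below counts how often allele pairs at adjacent heterozygous sites travel together on the same read and greedily keeps the better-supported orientation. The toy reads and the greedy left-to-right strategy are assumptions made for the example; this is not the published MAtCHap algorithm.

```python
# A minimal sketch of phasing by allele co-occurrence between adjacent
# heterozygous sites, in the spirit of the MAC formulation described above.
from collections import Counter

# Each read reports its allele (0/1) at the sites it covers: {site: allele}.
READS = [
    {0: 0, 1: 0},
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0},
    {1: 0, 2: 1, 3: 1},
    {2: 0, 3: 0},
]
N_SITES = 4

def phase_by_cooccurrence(reads, n_sites):
    """Greedy left-to-right phasing: for each adjacent site pair, keep the
    orientation (same-allele vs opposite-allele) with more read support."""
    hap = [0]  # haplotype 1 allele at site 0, fixed arbitrarily
    for j in range(n_sites - 1):
        pairs = Counter(
            (r[j], r[j + 1]) for r in reads if j in r and j + 1 in r
        )
        same = pairs[(0, 0)] + pairs[(1, 1)]      # alleles co-travel
        opposite = pairs[(0, 1)] + pairs[(1, 0)]  # alleles alternate
        hap.append(hap[-1] if same >= opposite else 1 - hap[-1])
    return hap, [1 - a for a in hap]  # second haplotype is the complement

h1, h2 = phase_by_cooccurrence(READS, N_SITES)
print(h1, h2)  # -> [0, 0, 1, 1] [1, 1, 0, 0] on this toy input
```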


2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Stefano Beretta ◽  
Murray D. Patterson ◽  
Simone Zaccaria ◽  
Gianluca Della Vedova ◽  
Paola Bonizzoni

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Shaya Akbarinejad ◽  
Mostafa Hadadian Nejad Yousefi ◽  
Maziar Goudarzi

Abstract
Background: Once aligned, long reads can be a useful source of information for identifying the type and position of structural variations. However, due to the high sequencing error rate of long reads, long-read structural variation detection methods are far from precise in low-coverage cases. To be accurate, they need high-coverage data, which in turn results in an extremely time-consuming pipeline, especially in the alignment phase. It is therefore of utmost importance to have a structural variation calling pipeline that is both fast and precise for low-coverage data.
Results: In this paper, we present SVNN, a fast yet accurate structural variation calling pipeline for PacBio long reads that takes raw reads as input and detects structural variants larger than 50 bp. Our pipeline utilizes state-of-the-art long-read aligners, namely NGMLR and Minimap2, and structural variation callers, videlicet Sniffles and SVIM. We found that by using a neural network we can extract features from Minimap2 output to detect a subset of reads that provide useful information for structural variation detection. By mapping only this subset with NGMLR, which is far slower than Minimap2 but better serves downstream structural variation detection, we can increase sensitivity in an efficient way. As a result of using multiple tools intelligently, SVNN achieves up to 20 percentage points of sensitivity improvement over state-of-the-art methods and is three times faster than a naive combination of state-of-the-art tools achieving almost the same accuracy.
Conclusion: Since the prohibitive cost of using high-coverage data has impeded long-read applications, SVNN provides users with a much faster structural variation detection platform for PacBio reads, with high precision and sensitivity in low-coverage scenarios.
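
The core trick, re-aligning only the informative reads with the slower aligner, can be sketched as follows. The per-read features (soft-clipped bases, supplementary alignments, largest indel) are classic structural-variant hints one could extract from alignment records, and the hand-rolled logistic scorer merely stands in for SVNN's trained network; the weights and threshold are invented for the demo.

```python
# A minimal sketch of the read-selection step at the heart of an SVNN-style
# pipeline: after a fast Minimap2 pass, score each read for SV signal and
# send only high-scoring reads to the slower NGMLR re-alignment. Features
# and scorer are illustrative assumptions, not SVNN's actual code.
import math

def sv_signal_features(read):
    """Per-read SV hints extractable from an alignment record: soft-clipped
    bases, supplementary alignments, largest indel in the CIGAR."""
    return [read["soft_clip"], read["supplementary"], read["max_indel"]]

def score(features, weights, bias):
    """Logistic score standing in for SVNN's trained neural network."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def select_for_realignment(reads, weights, bias, threshold=0.5):
    """Read IDs worth re-aligning with NGMLR before SV calling."""
    return [r["id"] for r in reads
            if score(sv_signal_features(r), weights, bias) > threshold]

reads = [
    {"id": "r1", "soft_clip": 900, "supplementary": 2, "max_indel": 120},
    {"id": "r2", "soft_clip": 10, "supplementary": 0, "max_indel": 3},
]
# Toy weights chosen by hand for the demo, not learned.
print(select_for_realignment(reads, weights=[0.01, 1.0, 0.05], bias=-8.0))
# -> ['r1']: only the read with strong SV signal goes through NGMLR.
```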


Author(s):  
Yaser Jararweh ◽  
Moath Jarrah ◽  
Abdelkader Bousselham

Current state-of-the-art GPU-based systems offer unprecedented performance advantages by accelerating the most compute-intensive portions of applications by an order of magnitude. GPU computing presents a viable solution for the ever-increasing complexities of applications and the growing demand for immense computational resources. In this paper the authors investigate different GPU-based platforms, from Personal Supercomputing (PSC) to cloud-based GPU systems, evaluate them, and compare them against conventional high-performance cluster-based computing systems. Their evaluation shows potential advantages of using GPU-based systems for high-performance computing applications while meeting different scaling granularities.


Author(s):  
David Radke ◽  
Anna Hessler ◽  
Dan Ellsworth

Destructive wildfires result in billions of dollars in damage each year and are expected to increase in frequency, duration, and severity due to climate change. Current state-of-the-art wildfire spread models rely on mathematical growth predictions and physics-based models, which are difficult and computationally expensive to run. We present and evaluate a novel system, FireCast, which combines artificial intelligence (AI) techniques with data collection strategies from geographic information systems (GIS). FireCast predicts which areas surrounding a burning wildfire are at high risk of near-future wildfire spread, based on historical fire data and using modest computational resources. FireCast is compared to a random prediction model and a commonly used wildfire spread model, Farsite, and outperforms both with respect to total accuracy, recall, and F-score.
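
As a rough sketch of this kind of grid-based risk prediction, the example below attaches GIS-derived features to cells around a fire perimeter and thresholds a logistic risk score. The feature names, weights, and cutoff are illustrative assumptions, not FireCast internals.

```python
# A minimal sketch of FireCast-style spread-risk prediction: cells near the
# active fire perimeter carry GIS-derived features, and a trained model
# labels each cell high- or low-risk. The scorer here is a stand-in.
from dataclasses import dataclass
import math

@dataclass
class Cell:
    slope: float           # terrain slope (degrees), from a GIS elevation layer
    wind_alignment: float  # cosine of angle between wind and fire-to-cell bearing
    fuel_dryness: float    # normalized fuel moisture deficit, 0..1
    dist_to_fire: float    # km from current fire perimeter

def risk(cell, w=(0.05, 1.5, 2.0, -0.8), bias=-1.0):
    """Logistic risk score standing in for FireCast's trained network."""
    z = (w[0] * cell.slope + w[1] * cell.wind_alignment
         + w[2] * cell.fuel_dryness + w[3] * cell.dist_to_fire + bias)
    return 1.0 / (1.0 + math.exp(-z))

grid = [Cell(20, 0.9, 0.8, 0.5), Cell(5, -0.6, 0.3, 3.0)]
print([risk(c) > 0.5 for c in grid])  # -> [True, False]
```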


Author(s):  
Christian Meilicke ◽  
Melisachew Wudage Chekol ◽  
Daniel Ruffinelli ◽  
Heiner Stuckenschmidt

We propose an anytime bottom-up technique for learning logical rules from large knowledge graphs. We apply the learned rules to predict candidates in the context of knowledge graph completion. Our approach outperforms other rule-based approaches and is competitive with the current state of the art, which is based on latent representations. Moreover, our approach is significantly faster, requires fewer computational resources, and yields an explanation in terms of the rules that propose a candidate.
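
To ground the rule-application step, here is a minimal sketch of how a learned Horn rule proposes candidate triples for knowledge graph completion: the body atoms are joined over their shared variable, and each match yields a candidate scored by the rule's confidence. The toy graph, the rule, its confidence, and the max-aggregation across rules are illustrative assumptions.

```python
# A minimal sketch of applying a learned Horn rule for knowledge graph
# completion, in the spirit of the bottom-up approach described above.
from collections import defaultdict

triples = {
    ("anna", "lives_in", "france"),
    ("france", "language_of", "french"),
    ("marco", "lives_in", "italy"),
    ("italy", "language_of", "italian"),
}

def index_by_relation(triples):
    """Group (head, tail) pairs by relation for fast joins."""
    idx = defaultdict(list)
    for h, r, t in triples:
        idx[r].append((h, t))
    return idx

# Rule: speaks(X, Y) <- lives_in(X, Z), language_of(Z, Y), confidence 0.85.
def apply_rule(idx, confidence=0.85):
    """Join the two body atoms on Z; each match proposes a candidate triple."""
    candidates = {}
    for x, z in idx["lives_in"]:
        for z2, y in idx["language_of"]:
            if z == z2:
                key = (x, "speaks", y)
                # Keep the best score if several rules fire (max-aggregation).
                candidates[key] = max(candidates.get(key, 0.0), confidence)
    return candidates

for triple, conf in apply_rule(index_by_relation(triples)).items():
    print(triple, conf)  # e.g. ('anna', 'speaks', 'french') 0.85
```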


1995 ◽  
Vol 38 (5) ◽  
pp. 1126-1142 ◽  
Author(s):  
Jeffrey W. Gilger

This paper is an introduction to behavioral genetics for researchers and practitioners in language development and disorders. The specific aims are to illustrate some essential concepts and to show how behavioral genetic research can be applied to the language sciences. Past genetic research on language-related traits has tended to focus on simple etiology (i.e., the heritability or familiality of language skills). The current state of the art, however, suggests that great promise lies in addressing more complex questions through behavioral genetic paradigms. In terms of future goals, it is suggested that: (a) more behavioral genetic work of all types should be done, including replications and expansions of preliminary studies already in print; (b) work should focus on fine-grained, theory-based phenotypes with research designs that can address complex questions in language development; and (c) work in this area should utilize a variety of samples and methods (e.g., twin and family samples, heritability and segregation analyses, linkage and association tests, etc.).


1976 ◽  
Vol 21 (7) ◽  
pp. 497-498
Author(s):  
Stanley Grand

10.37236/24 ◽  
2002 ◽  
Vol 1000 ◽  
Author(s):  
A. Di Bucchianico ◽  
D. Loeb

We survey the mathematical literature on umbral calculus (otherwise known as the calculus of finite differences) from its roots in the 19th century (and earlier) as a set of “magic rules” for lowering and raising indices, through its rebirth in the 1970s as Rota’s school set it on a firm logical foundation using operator methods, to the current state of the art with numerous generalizations and applications. The survey itself is complemented by a fairly complete bibliography (over 500 references), which we expect to update regularly.
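
A classic worked instance of the “magic rules” the survey opens with, lowering exponents to indices, is the umbral derivation of the Bernoulli number recurrence; the LaTeX fragment below spells it out.

```latex
% Writing B^k umbrally for the Bernoulli number B_k, the symbolic identity
% (B + 1)^n = B^n for n >= 2 unpacks, via the binomial theorem and the
% lowering rule B^k -> B_k, into the standard recurrence:
\[
  (B+1)^n = B^n
  \;\xrightarrow{\;B^k \,\mapsto\, B_k\;}\;
  \sum_{k=0}^{n-1} \binom{n}{k} B_k = 0 \qquad (n \ge 2),
\]
% which, starting from B_0 = 1, yields B_1 = -1/2, B_2 = 1/6, B_3 = 0, ...
```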

