Genome scaffolding with PE-contaminated mate-pair libraries

2015 ◽  
Author(s):  
Kristoffer Sahlin ◽  
Rayan Chikhi ◽  
Lars Arvestad

Scaffolding is often an essential step in a genome assembly process, in which contigs are ordered and oriented using read pairs from a combination of paired-end libraries and longer-range mate-pair libraries. Although a simple idea, scaffolding is unfortunately hard to get right in practice. One source of problems is so-called PE-contamination in mate-pair libraries, in which a non-negligible fraction of the read pairs get the wrong orientation and a much smaller insert size than expected. This contamination has been discussed in previous work on integrated scaffolders in end-to-end assemblers such as Allpaths-LG and MaSuRCA, but those methods rely on the orientation being observable, e.g., by finding the junction adapter sequence in the reads. This is not always the case, making the orientation and insert size of a read pair stochastic. Furthermore, work on modeling PE-contamination has so far been disregarded in stand-alone scaffolders, and the effect that PE-contamination has on scaffolding quality has not been examined before. We have addressed PE-contamination in an update of our scaffolder BESST. We formulate the problem as an Integer Linear Program (ILP) and use characteristics of the problem, such as contig lengths and insert size, to efficiently solve the ILP using a linear number (with respect to the number of contigs) of Linear Programs. Our results show significant improvement over both integrated and standalone scaffolders. The impact of modeling PE-contamination is quantified by comparison with the previous BESST model. We also show how other scaffolders are vulnerable to PE-contaminated libraries, resulting in an increased number of misassemblies, more conservative scaffolding, and inflated assembly sizes. The model is implemented in BESST. Source code and usage instructions are found at https://github.com/ksahlin/BESST. BESST can also be downloaded using PyPI.

2021 ◽  
Author(s):  
Matthias Zytnicki ◽  
Christine Gaspin

Abstract Motivation Sequencing is the key method to study the impact of short RNAs, which include micro RNAs, tRNA-derived RNAs, and piwi-interacting RNAs, among others. The first step to make use of these reads is to map them to a genome. Existing mapping tools have been developed with long RNAs in mind, and, so far, no tool has been conceived for short RNAs. However, short RNAs have several distinctive features which make them different from messenger RNAs: they are shorter (not greater than 200bp), they are often redundant, they can be produced by duplicated loci, and they may be edited at their ends. Results In this work, we present a new tool, srnaMapper, that maps these reads with all these objectives in mind. We show on two data sets that srnaMapper is more efficient in terms of computation time and edition error handling: it quickly retrieves all the hits, with an arbitrary number of errors. Availability srnaMapper source code is available at https://github.com/mzytnicki/[email protected]


Author(s):  
Tran Thanh Luong ◽  
Le My Canh

JavaScript has become increasingly popular in recent years because of its rich feature set: it is dynamic, interpreted, and object-oriented, with first-class functions. Furthermore, JavaScript is designed around an event-driven, non-blocking I/O model that boosts overall application performance, especially in the case of Node.js. To take advantage of these characteristics, many design patterns that implement asynchronous programming in JavaScript have been proposed. However, choosing the right pattern and writing good asynchronous source code is a challenge, and mistakes easily lead to less robust applications and low-quality source code. Extending our previous work on exception handling code smells in JavaScript and on exception handling code smells in JavaScript asynchronous programming with promises, this research studies the impact of three JavaScript asynchronous programming patterns on the quality of source code and applications.


Author(s):  
Daniella F Lato ◽  
G Brian Golding

Abstract Increasing evidence supports the notion that different regions of a genome have unique rates of molecular change. This variation is particularly evident in bacterial genomes, where previous studies have reported that gene expression and essentiality tend to decrease, while substitution rates usually increase, with increasing distance from the origin of replication. Genomic reorganizations such as rearrangements occur frequently in bacteria and allow for the introduction and restructuring of genetic content, creating gradients of molecular traits along genomes. Here, we explore the interplay of these phenomena by mapping substitutions to the genomes of Escherichia coli, Bacillus subtilis, Streptomyces, and Sinorhizobium meliloti, quantifying how many substitutions have occurred at each position in the genome. Preceding work indicates that substitution rate significantly increases with distance from the origin. Using a larger sample size and accounting for genome rearrangements through ancestral reconstruction, our analysis demonstrates that the correlation between the number of substitutions and distance from the origin of replication is often significant but small and inconsistent in direction. Some replicons had a significantly decreasing trend (E. coli and the chromosome of S. meliloti), while others showed the opposite significant trend (B. subtilis, Streptomyces, pSymA and pSymB in S. meliloti). dN, dS and ω were examined across all genes and there was no significant correlation between those values and distance from the origin. This study highlights the impact that genomic rearrangements and location have on molecular trends in some bacteria, illustrating the importance of considering spatial trends in molecular evolutionary analysis. Assuming that molecular trends are exclusively in one direction can be problematic.


Author(s):  
Agnieszka B. Wegrzyn ◽  
Sarah Stolle ◽  
Rienk A. Rienksma ◽  
Vítor A.P. Martins dos Santos ◽  
Barbara M. Bakker ◽  
...  
Keyword(s):  
A Genome ◽  

2020 ◽  
Vol 36 (12) ◽  
pp. 3687-3692 ◽  
Author(s):  
Christopher Pockrandt ◽  
Mai Alzamel ◽  
Costas S Iliopoulos ◽  
Knut Reinert

Abstract Motivation Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as the reciprocal value of how often this k-mer occurs approximately in the genome, i.e. with up to e mismatches. Results We present a fast method GenMap to compute the (k, e)-mappability. We extend the mappability algorithm, such that it can also be computed across multiple genomes where a k-mer occurrence is only counted once per genome. This allows for the computation of marker sequences or finding candidates for probe design by identifying approximate k-mers that are unique to a genome or that are present in all genomes. GenMap supports different formats such as binary output, wig and bed files as well as csv files to export the location of all approximate k-mers for each genomic position. Availability and implementation GenMap can be installed via bioconda. Binaries and C++ source code are available on https://github.com/cpockrandt/genmap.
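The definition of (k, e)-mappability given in this abstract can be illustrated with a brute-force sketch. This is not GenMap's algorithm, which is designed to be fast; it is only a direct, quadratic-time reading of the definition, useful for checking small examples by hand. The function names `hamming` and `mappability` are hypothetical, and the sketch assumes that only the forward strand of a single genome is searched and that a k-mer counts as an occurrence of itself.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mappability(genome: str, k: int, e: int) -> List[float]:
    """Naive (k, e)-mappability: for each k-mer start position, return
    1 / (number of k-mers in the genome within e mismatches of it).

    Assumption: forward strand only; each k-mer matches itself,
    so the count is always at least 1.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    values = []
    for query in kmers:
        # Count every k-mer (including the query itself) within e mismatches.
        count = sum(1 for other in kmers if hamming(query, other) <= e)
        values.append(1.0 / count)
    return values


if __name__ == "__main__":
    # k-mers of "AAAAT" for k=2 are AA, AA, AA, AT.
    print(mappability("AAAAT", 2, 0))  # AA occurs three times, AT once
    print(mappability("AAAAT", 2, 1))  # with one mismatch, all four k-mers match each other
```

A value of 1.0 means the k-mer is unique under the chosen mismatch budget; smaller values mean it occurs approximately in several places.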


2016 ◽  
Author(s):  
Bethany Signal ◽  
Brian S Gloss ◽  
Marcel E Dinger ◽  
Timothy R Mercer

Abstract Background The branchpoint element is required for the first lariat-forming reaction in splicing. However, due to the difficulty of experimentally mapping branchpoints at a genome-wide scale, current catalogues are incomplete. Results We have developed a machine-learning algorithm trained with empirical human branchpoint annotations to identify branchpoint elements from primary genome sequence alone. Using this approach, we can accurately locate branchpoint elements in 85% of introns in current gene annotations. Consistent with branchpoints as basal genetic elements, we find our annotation is unbiased towards gene type and expression levels. A major fraction of introns was found to encode multiple branchpoints, raising the prospect that mutational redundancy is encoded in key genes. We also confirmed all deleterious branchpoint mutations annotated in clinical variant databases, and further identified thousands of clinical and common genetic variants with similar predicted effects. Conclusions We propose that the broad annotation of branchpoints constitutes a valuable resource for further investigations into the genetic encoding of splicing patterns, and for interpreting the impact of common and disease-causing human genetic variation on gene splicing.


2021 ◽  
Author(s):  
Marco Luca Sbodio ◽  
Natasha Mulligan ◽  
Stefanie Speichert ◽  
Vanessa Lopez ◽  
Joao Bettencourt-Silva

There is a growing trend in building deep learning patient representations from health records to obtain a comprehensive view of a patient’s data for machine learning tasks. This paper proposes a reproducible approach to generate patient pathways from health records and to transform them into a machine-processable image-like structure useful for deep learning tasks. Based on this approach, we generated over a million pathways from FAIR synthetic health records and used them to train a convolutional neural network (CNN). Our initial experiments show that the accuracy of the CNN on a prediction task is comparable to or better than that of other autoencoders trained on the same data, while requiring significantly fewer computational resources for training. We also assess the impact of the size of the training dataset on autoencoder performance. The source code for generating pathways from health records is provided as open source.


2021 ◽  
Vol 17 (9) ◽  
pp. e1009317
Author(s):  
Ilario De Toma ◽  
Cesar Sierra ◽  
Mara Dierssen

Trisomy of human chromosome 21 (HSA21) causes Down syndrome (DS). The trisomy does not simply result in the upregulation of HSA21-encoded genes but also leads to a genome-wide transcriptomic deregulation, which affects each tissue and cell type differently as a result of epigenetic mechanisms and protein-protein interactions. We performed a meta-analysis integrating the differential expression (DE) analyses of all publicly available transcriptomic datasets, both in human and mouse, comparing trisomic and euploid transcriptomes from different sources. We integrated all these data in a “DS network”. We found that genome-wide deregulation as a consequence of trisomy 21 is not arbitrary, but involves deregulation of specific molecular cascades in which both HSA21 genes and HSA21 interactors are more consistently deregulated compared to other genes. In fact, gene deregulation happens in “clusters”, so that groups of 2 to 13 genes are found consistently deregulated. Most of these events of “co-deregulation” involve genes belonging to the same GO category, and genes associated with the same disease class. The most consistent changes are enriched in interferon-related categories and neutrophil activation, reinforcing the concept that DS is an inflammatory disease. Our results also suggest that the impact of the trisomy might diverge in each tissue due to the different gene set deregulation, even though the triplicated genes are the same. Our original method to integrate transcriptomic data not only confirmed the importance of known genes, such as SOD1, but also detected new ones that could be extremely useful for generating or confirming hypotheses and supporting new putative therapeutic candidates. We created “metaDEA”, an R package that uses our method to integrate every kind of transcriptomic data and therefore could be used with other complex disorders, such as cancer.
We also created a user-friendly web application to query Ensembl gene IDs and retrieve all the information of their differential expression across the datasets.


2019 ◽  
Vol 8 (3) ◽  
pp. 332 ◽  
Author(s):  
Chia-Shan Hsieh ◽  
Pang-Shuo Huang ◽  
Sheng-Nan Chang ◽  
Cho-Kai Wu ◽  
Juey-Jen Hwang ◽  
...  

Atrial fibrillation (AF) is a common cardiac arrhythmia and is one of the major causes of ischemic stroke. In addition to clinical factors such as the CHADS2 or CHA2DS2-VASc score, the impact of genetic factors on the risk of thromboembolic stroke in patients with AF has been largely unknown. Single-nucleotide polymorphisms in several genomic regions have been found to be associated with AF. However, these loci do not account for all the genetic risks of AF or AF-related thromboembolic risks, suggesting that there are other genetic factors or variants not yet discovered. In the human genome, copy number variations (CNVs) could also contribute to disease susceptibility. In the present study, we sought to identify CNVs determining the AF-related thromboembolic risk. Using a genome-wide approach in 109 patients with AF and thromboembolic stroke and 14,666 controls from the Taiwanese general population (Taiwan Biobank), we first identified deletions in chromosomal regions 1p36.32-1p36.33, 5p15.33, 8q24.3 and 19p13.3 and amplifications in 14q11.2 that were significantly associated with AF-related stroke in the Taiwanese population. In these regions, 148 genes were involved, including several microRNAs and long non-coding RNAs. Using a pathway analysis, we found deletions in GNB1, PRKCZ, and GNG7 genes related to the alpha-adrenergic receptor signaling pathway that play a major role in determining the risk of an AF-related stroke. In conclusion, CNVs may be genetic predictors of the risk of thromboembolic stroke for patients with AF, possibly pointing to an impaired alpha-adrenergic signaling pathway in the mechanism of AF-related thromboembolism.


Author(s):  
Toby E. Newman ◽  
Silke Jacques ◽  
Christy Grime ◽  
Fiona L. Kamphuis ◽  
Robert C. Lee ◽  
...  

Chickpea production is constrained worldwide by the necrotrophic fungal pathogen Ascochyta rabiei, the causal agent of ascochyta blight (AB). In order to reduce the impact of this disease, novel sources of resistance are required in chickpea cultivars. Here, we screened a new collection of wild Cicer accessions for AB resistance and identified accessions resistant to multiple, highly pathogenic isolates. In addition to this, analyses demonstrated that some collection sites of Cicer echinospermum harbour predominantly resistant accessions, knowledge that can inform future collection missions. Furthermore, a genome-wide association study identified regions of the Cicer reticulatum genome associated with AB resistance and investigation of these regions identified candidate resistance genes. Taken together, these results can be utilised to enhance the resistance of chickpea cultivars to this globally yield-limiting disease.

