A comparison of three programming languages for a full-fledged next-generation sequencing tool

2019 ◽  
Author(s):  
Costanza Pascal ◽  
Herzeel Charlotte ◽  
Verachtert Wilfried

Abstract Background elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use. Results The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case. Conclusions Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

2022 ◽  
Vol 1 ◽  
pp. 01003 ◽  
Author(s):  
Aleksandr P. Polishchuk ◽  
Sergei A. Semerikov

The tasks for which computers were created - routine calculations of an industrial, scientific and military nature - required the creation of a whole class of new methods focused not on manual but on machine calculations. The first programming languages did not have convenient means for reflecting such objects often used in computational mathematics as matrices, vectors, polynomials, etc. Further development of programming languages followed the path of embedding mathematical objects into languages as data types, which led to their complication. So, for example, an attempt to make a universal language Ada, in which there are even such data types as dictionaries and queues, led to the fact that the number of keywords in it exceeded 350, making it almost unusable for learning and use. The compromise solution between these two extremes can be the following: let the programmer himself create the data types that he needs in his professional work. Programming languages that implement this approach are called object-oriented. This, on the one hand, makes it possible to make the language quite easy by reducing the number of keywords, and on the other, expandable, adapting to specific tasks by introducing keywords for creating and using new data types.


2020 ◽  
Vol 15 ◽  
Author(s):  
Hongdong Li ◽  
Wenjing Zhang ◽  
Yuwen Luo ◽  
Jianxin Wang

Aims: Accurately detect isoforms from third generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms such as humans is far from complete, due partly to the challenge in the identification of isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms due to their long length, which exceeds the length of most isoforms. One limitation of current TGS reads-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as the “short feature sequence”, which is used to distinguish different splice isoforms. Second, these feature sequences are aligned to long reads, and the long reads are divided into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.
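The grouping step described in this abstract can be illustrated with a minimal sketch. It is not the IsoDetect implementation: exact substring matching stands in for the alignment step, and the function name `group_reads_by_features`, the junction names, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def group_reads_by_features(reads, features):
    """Group read IDs by the set of feature sequences each read contains.

    Assumption: exact substring matching replaces the alignment step
    that the abstract describes.
    """
    groups = defaultdict(list)
    for read_id, seq in reads.items():
        # The group key is the set of feature names found in this read.
        key = frozenset(name for name, feat in features.items() if feat in seq)
        groups[key].append(read_id)
    return dict(groups)


# Hypothetical junction feature sequences and long reads.
features = {"j1": "ACGTAC", "j2": "TTGACA"}
reads = {
    "r1": "GGACGTACGGTTGACAGG",
    "r2": "CCACGTACCC",
    "r3": "AAACGTACAATTGACAAA",
    "r4": "GGGGGGGG",
}

groups = group_reads_by_features(reads, features)
for key, members in groups.items():
    print(sorted(key), members)
```

Reads that share a key would go on to clustering and consensus generation together, while reads under the empty key correspond to the case the abstract handles by similarity-based clustering alone. Each read is compared against the features once, so no pair-wise comparison of reads is needed at this stage.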


Biomedicines ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 719
Author(s):  
Julia Hupfeld ◽  
Maximilian Ernst ◽  
Maria Knyrim ◽  
Stephanie Binas ◽  
Udo Kloeckner ◽  
...  

MicroRNAs (miRs) contribute to different aspects of cardiovascular pathology, among them cardiac hypertrophy and atrial fibrillation. Cardiac miR expression was analyzed in a mouse model with structural and electrical remodeling. Next-generation sequencing revealed that miR-208b-3p was ~25-fold upregulated. Therefore, the aim of our study was to evaluate the impact of miR-208b on cardiac protein expression. First, an undirected approach comparing whole RNA sequencing data to miR-walk 2.0 miR-208b 3′-UTR targets revealed 58 potential targets of miR-208b being regulated. We were able to show that miR-208b mimics bind to the 3′ untranslated region (UTR) of voltage-gated calcium channel subunit alpha1 C and Kcnj5, two predicted targets of miR-208b. Additionally, we demonstrated that miR-208b mimics reduce GIRK1/4 channel-dependent thallium ion flux in HL-1 cells. In a second undirected approach we performed mass spectrometry to identify the potential targets of miR-208b. We identified 40 potential targets by comparison to miR-walk 2.0 3′-UTR, 5′-UTR and CDS targets. Among those targets, Rock2 and Ran were upregulated in Western blots of HL-1 cells by miR-208b mimics. In summary, miR-208b targets the mRNAs of proteins involved in the generation of cardiac excitation and propagation, as well as of proteins involved in RNA translocation (Ran) and cardiac hypertrophic response (Rock2).


Author(s):  
Anne Krogh Nøhr ◽  
Kristian Hanghøj ◽  
Genis Garcia Erill ◽  
Zilong Li ◽  
Ida Moltke ◽  
...  

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub https://github.com/KHanghoj/NGSremix.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Panagiotis Moulos

Abstract Background The relentless continuing emergence of new genomic sequencing protocols and the resulting generation of ever larger datasets continue to challenge the meaningful summarization and visualization of the underlying signal generated to answer important qualitative and quantitative biological questions. As a result, the need for novel software able to reliably produce quick, comprehensive, and easily repeatable genomic signal visualizations in a user-friendly manner is rapidly re-emerging. Results recoup is a Bioconductor package for quick, flexible, versatile, and accurate visualization of genomic coverage profiles generated from Next Generation Sequencing data. Coupled with a database of precalculated genomic regions for multiple organisms, recoup offers processing mechanisms for quick, efficient, and multi-level data interrogation with minimal effort, while at the same time creating publication-quality visualizations. Special focus is given to plot reusability, reproducibility, and real-time exploration and formatting options, operations rarely supported in similar visualization tools in a profound way. recoup was assessed using several qualitative user metrics and found to balance the tradeoff between important package features, including speed, visualization quality, overall friendliness, and the reusability of the results with minimal additional calculations. Conclusion While some existing solutions for the comprehensive visualization of NGS data signal offer satisfying results, they are often compromised regarding issues such as effortless tracking of processing and preparation steps under a common computational environment, visualization quality and user friendliness. recoup is a unique package presenting a balanced tradeoff for a combination of assessment criteria while remaining fast and friendly.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Takumi Miura ◽  
Satoshi Yasuda ◽  
Yoji Sato

Abstract Background Next-generation sequencing (NGS) has profoundly changed the approach to genetic/genomic research. Particularly, the clinical utility of NGS in detecting mutations associated with disease risk has contributed to the development of effective therapeutic strategies. Recently, comprehensive analysis of somatic genetic mutations by NGS has also been used as a new approach for controlling the quality of cell substrates for manufacturing biopharmaceuticals. However, the quality evaluation of cell substrates by NGS largely depends on the limit of detection (LOD) for rare somatic mutations. The purpose of this study was to develop a simple method for evaluating the ability of whole-exome sequencing (WES) by NGS to detect mutations with low allele frequency. To estimate the LOD of WES for low-frequency somatic mutations, we repeatedly and independently performed WES of a reference genomic DNA using the same NGS platform and assay design. LOD was defined as the allele frequency with a relative standard deviation (RSD) value of 30% and was estimated by a moving average curve of the relation between RSD and allele frequency. Results Allele frequencies of 20 mutations in the reference material that had been pre-validated by droplet digital PCR (ddPCR) were obtained from 5, 15, 30, or 40 G base pair (Gbp) sequencing data per run. There was a significant association between the allele frequencies measured by WES and those pre-validated by ddPCR, whose p-value decreased as the sequencing data size increased. By this method, the LOD of allele frequency in WES with the sequencing data of 15 Gbp or more was estimated to be between 5 and 10%. Conclusions For properly interpreting the WES data of somatic genetic mutations, it is necessary to have a cutoff threshold of low allele frequencies. The in-house LOD estimated by the simple method shown in this study provides a rationale for setting the cutoff.
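The LOD definition in this abstract (the allele frequency at which the RSD reaches 30%, read off a moving average curve) can be sketched as follows. This is an illustrative reading of the description, not the authors' procedure: the window size, the linear interpolation between smoothed points, the use of the sample standard deviation, and the toy replicate values are all assumptions.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation (sample SD / mean), in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def moving_average(values, window):
    """Simple unweighted moving average over a sliding window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def estimate_lod(replicates, threshold=30.0, window=3):
    """Estimate the allele frequency at which the smoothed RSD crosses threshold.

    replicates: one list per mutation, holding the allele frequency (%)
    measured in each repeated run. Returns None if no crossing is found.
    """
    # One (mean allele frequency, RSD) point per mutation, sorted by frequency.
    points = sorted((statistics.mean(r), rsd_percent(r)) for r in replicates)
    afs = moving_average([p[0] for p in points], window)
    rsds = moving_average([p[1] for p in points], window)
    # Walk from high to low allele frequency and interpolate at the crossing.
    for i in range(len(afs) - 1, 0, -1):
        hi_af, hi_rsd = afs[i], rsds[i]
        lo_af, lo_rsd = afs[i - 1], rsds[i - 1]
        if hi_rsd <= threshold < lo_rsd:
            frac = (threshold - hi_rsd) / (lo_rsd - hi_rsd)
            return hi_af - frac * (hi_af - lo_af)
    return None


# Hypothetical replicate measurements (allele frequency in %) for two mutations.
toy_replicates = [[9.0, 10.0, 11.0], [3.0, 5.0, 7.0]]
lod = estimate_lod(toy_replicates, threshold=30.0, window=1)
print(lod)
```

In this toy input the first mutation has an RSD of 10% at a mean frequency of 10%, and the second an RSD of 40% at a mean frequency of 5%, so the 30% threshold is crossed between them. A larger `window` smooths the curve more, at the cost of needing more mutations to produce any smoothed points.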


2011 ◽  
Vol 9 (6) ◽  
pp. 238-244 ◽  
Author(s):  
Tongwu Zhang ◽  
Yingfeng Luo ◽  
Kan Liu ◽  
Linlin Pan ◽  
Bing Zhang ◽  
...  
