fastp: an ultra-fast all-in-one FASTQ preprocessor

2018
Author(s): Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu

Abstract

Motivation: Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming, and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g., Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient.

Results: We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality cutting, and many other operations with a single scan of the FASTQ data. It also supports unique molecular identifier preprocessing, poly tail trimming, output splitting, and base correction for paired-end data. It can automatically detect adapters for single-end and paired-end FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2–5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools.

Availability and Implementation: The open-source code and corresponding instructions are available at https://github.com/OpenGene/[email protected]
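To make the "single scan" idea concrete, here is a minimal Python sketch in which each read is trimmed, filtered, and counted in one pass. It is not fastp's implementation (fastp is written in C++ and does far more). The function names, the window size, the thresholds, and the Phred+33 quality encoding are all assumptions chosen for illustration.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def phred_scores(quality):
    """Decode an ASCII quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


def cut_tail(seq, qual, window=4, min_mean_q=20):
    """Drop trailing bases while the last `window` bases average below the threshold."""
    scores = phred_scores(qual)
    end = len(scores)
    while end >= window and mean(scores[end - window:end]) < min_mean_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(records, min_len=5, min_mean_q=20):
    """Trim, filter, and count reads in a single pass over (name, seq, qual) tuples."""
    stats = {"reads_in": 0, "reads_out": 0, "bases_out": 0}
    kept = []
    for name, seq, qual in records:
        stats["reads_in"] += 1
        seq, qual = cut_tail(seq, qual, min_mean_q=min_mean_q)
        # Discard reads that are too short or too low in quality after trimming.
        if len(seq) < min_len or mean(phred_scores(qual)) < min_mean_q:
            continue
        stats["reads_out"] += 1
        stats["bases_out"] += len(seq)
        kept.append((name, seq, qual))
    return kept, stats
```

Because trimming, filtering, and statistics all happen inside one loop, the input is read only once, which is the I/O saving the abstract describes.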

2021
Author(s): Soohyun Lee, Carl Vitzthum, Burak H. Alver, Peter J. Park

Abstract

Summary: As the amount of three-dimensional chromosomal interaction data continues to increase, storing and accessing such data efficiently becomes paramount. We introduce Pairs, a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix, an open-source C application to index and query Pairs files. Pairix (also available in Python and R) extends the functionalities of Tabix to paired coordinates data. We have also developed PairsQC, a collapsible HTML quality control report generator for Pairs files.

Availability: The format specification and source code are available at https://github.com/4dn-dcic/pairix, https://github.com/4dn-dcic/Rpairix and https://github.com/4dn-dcic/[email protected] or [email protected]
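A query over paired coordinates asks for records whose first position falls in one region and whose second position falls in another. The Python sketch below shows that idea with a plain linear scan over uncompressed text lines. It does not use Pairix or its index, and the column order (read ID, chr1, pos1, chr2, pos2), the `#` comment prefix, and the function names are assumptions made for illustration.

```python
def parse_region(region):
    """Parse a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def query_pairs(lines, region1, region2):
    """Return read IDs whose first mate lies in region1 and second mate in region2."""
    c1, s1, e1 = parse_region(region1)
    c2, s2, e2 = parse_region(region2)
    hits = []
    for line in lines:
        # Skip blank lines and header/comment lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed column order: read_id, chr1, pos1, chr2, pos2, ...
        read_id, chr1, pos1, chr2, pos2 = (
            fields[0], fields[1], int(fields[2]), fields[3], int(fields[4])
        )
        if chr1 == c1 and s1 <= pos1 <= e1 and chr2 == c2 and s2 <= pos2 <= e2:
            hits.append(read_id)
    return hits
```

An indexed tool avoids this full scan by jumping directly to the compressed blocks that can contain matching records.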


2021
Vol 263 (5)
pp. 1164-1175
Author(s): Roberto San Millán-Castillo, Eduardo Latorre-Iglesias, Martin Glesser, Salomé Wanty, Daniel Jiménez-Caminero, ...

Sound quality metrics provide an objective assessment of the psychoacoustics of sounds. A wide range of metrics has already been standardised, while others remain active research topics. Calculation algorithms are available in commercial equipment or Matlab scripts. However, these tools may lack publicly available documentation and validation procedures. Moreover, they might be unaffordable for some students and independent researchers. In recent years, the scientific and technical community has been developing countless open-source software projects in several fields of knowledge. The permission to use, study, modify, improve and distribute open-source software makes it extremely valuable. It encourages collaboration and sharing, and thus transparency and continuous improvement of the code. The Modular Sound Quality Integrated Toolbox (MOSQITO) project relies on one of the most popular free high-level programming languages: Python. The main objective of MOSQITO is to provide a unified, modular, free and open-source framework of key sound quality and psychoacoustics metrics that supports reproducible testing. Moreover, open-source projects can be efficient learning tools in university degree programmes. This paper presents the current structure of the toolbox from a technical point of view. It also discusses the contribution of open-source development to the training of graduates.


2019
Author(s): Budiman

Computer software has continued to develop over time, and programming languages are no exception. The era of low-level programming languages was followed by the development of high-level programming languages, marked by the appearance of new programming methods such as object-oriented programming (OOP). An IDE (Integrated Development Environment) is a computer program that provides the facilities required in software development; its purpose is to provide all the utilities necessary for building software. Text editors that can be used to manipulate programming-language source code include Ultraedit, JediEdit, ClearEdit, cEdit, the Golden Pen, and so on. PuniEdit is a text-based editor that simplifies the correction, insertion, and modification of source code. It is built using Borland Delphi 7.0 and the SynEdit component, and it can be used for Pascal, C++ and HTML. In addition, PuniEdit can perform token management: the user can clearly see every occurrence of each type of token, such as keywords (reserved words), identifiers, operators, and so on.

Keywords: Source code, programming language, source code scanning.


2011
Vol 7
pp. 93-106
Author(s): Robert Szczepanek

Hydrologists need a simple yet powerful open-source framework for developing and testing mathematical models. Such a framework should ensure long-term interoperability and high scalability, which can be achieved by implementing existing, already tested standards. At the moment two interesting options exist: Open Modelling Interface (OpenMI) and Object Modeling System (OMS). OpenMI was developed within the Fifth European Framework Programme for integrated watershed management, described in the Water Framework Directive. OpenMI interfaces are available for the C# and Java programming languages. The OpenMI Association is now in the process of reaching an agreement with the Open Geospatial Consortium (OGC), so the spatial standards existing in OpenMI 2.0 should be better implemented in the future. The OMS project is a pure Java, object-oriented modeling framework coordinated by the U.S. Department of Agriculture. A big advantage of OMS compared to OpenMI is its simplicity of implementation. On the other hand, OpenMI seems to be more powerful and better suited to hydrological models. Finally, the OpenMI model was selected as the base interface for the proposed open-source hydrological framework. Existing hydrological libraries and models usually focus on just one GIS package (HydroFOSS – GRASS) or one operating system (HydroDesktop – Microsoft Windows). The new hydrological framework should break those limitations. To make the implementation of hydrological models as easy as possible, the framework should be based on a simple, high-level computer language. Low- and mid-level languages, like Java (SEXTANTE) or C (GRASS, SAGA), were excluded as too complicated for a typical hydrologist. Among popular high-level languages, Python seems to be a good choice. The leading GIS desktop applications, GRASS and QGIS, use Python as a second native language and provide a well-documented API. This way, a Python-based hydrological library could be easily integrated with any GIS package supporting this programming language. As the OpenMI 2.0 standard supported interfaces only for Java and C#, the Python interface for the OpenMI standard presented in this paper is the first step towards an open and interoperable hydrological framework. GIS-related issues of the OpenMI 2.0 standard are also outlined and discussed.


2012
Vol 20 (4)
pp. 359-377
Author(s): Mikołaj Baranowski, Adam Belloum, Marian Bubak, Maciej Malawski

For programming and executing complex applications on grid infrastructures, scientific workflows have been proposed as a convenient high-level alternative to solutions based on general-purpose programming languages, APIs and scripts. GridSpace is a collaborative programming and execution environment that is based on a scripting approach and extends the Ruby language with a high-level API for invoking operations on remote resources. In this paper we describe a tool that converts GridSpace application source code into a workflow representation which, in turn, may be used for scheduling, provenance, or visualization. We describe how we addressed the issues of analyzing Ruby source code, resolving variable and method dependencies, and building the workflow representation. The solutions to these problems were developed and evaluated by testing them on complex grid application workflows such as CyberShake, Epigenomics and Montage. The evaluation is enriched by representing typical workflow control flow patterns.


2017
Author(s): Caleb Lareau, Martin Aryee

Mumbach et al. recently described HiChIP, a novel protein-mediated chromatin conformation assay that lowers cellular input requirements while simultaneously increasing the yield of informative reads compared to previous methods. To facilitate the dissemination and adoption of this assay, we introduce hichipper (http://aryeelab.org/hichipper), an open-source HiChIP data preprocessing tool, with features that include bias-corrected peak calling, library quality control, DNA loop calling, and output of processed data for downstream analysis and visualization.


2019
Vol 1 (1)
pp. 46-56
Author(s): Victor R. L. Shen

Students who major in computer science and/or engineering are required to write programs in a variety of programming languages. However, many students submit source code obtained from the Internet or from friends with few or no modifications. Detecting code plagiarism by students is very time-consuming, and undetected plagiarism leads to unfair evaluation of learning performance. This paper proposes a novel method to detect source code plagiarism by using a high-level fuzzy Petri net (HLFPN) based on the abstract syntax tree (AST). First, the AST of each source code is generated after lexical and syntactic analysis. Second, a token sequence is generated based on the AST. Using the AST makes it possible to detect plagiarism that has been disguised by changing identifiers or the order of program statements. Finally, the generated token sequences are compared with one another using an HLFPN to determine whether code has been plagiarised. The experimental results indicate that the method improves the detection of code plagiarism.
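The first two steps (build an AST, then derive a token sequence from it) can be sketched with Python's standard `ast` module. The comparison step below uses `difflib.SequenceMatcher` as a simple stand-in and is not the high-level fuzzy Petri net described in the paper. The function names are hypothetical.

```python
import ast
from difflib import SequenceMatcher


def token_sequence(source):
    """Return AST node-type names in depth-first order, ignoring identifier names."""
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def similarity(source_a, source_b):
    """Score two programs in [0, 1] by comparing their AST token sequences."""
    matcher = SequenceMatcher(None, token_sequence(source_a), token_sequence(source_b))
    return matcher.ratio()
```

Because the token sequence records node types rather than names, two programs that differ only in their identifiers produce identical sequences and a similarity of 1.0.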


2019
Author(s): Melissa Y Yan, Betsy Ferguson, Benjamin N Bimber

Abstract

Summary: Large-scale genomic studies produce millions of sequence variants, generating datasets far too massive for manual inspection. To ensure variant and genotype data are consistent and accurate, it is necessary to evaluate variants prior to downstream analysis using quality control (QC) reports. Variant call format (VCF) files are the standard format for representing variant data; however, generating summary statistics from these files is not always straightforward. While tools to summarize variant data exist, they generally produce simple text file tables, which still require additional processing and interpretation. VariantQC fills this gap as a user-friendly, interactive visual QC report that generates and concisely summarizes statistics from VCF files. The report aggregates and summarizes variants by dataset, chromosome, sample and filter type. The VariantQC report is useful for high-level dataset summary and quality control, and helps flag outliers. Furthermore, VariantQC operates on VCF files, so it can be easily integrated into many existing variant pipelines.

Availability and implementation: DISCVRSeq's VariantQC tool is freely available as a Java program, with the compiled JAR and source code available from https://github.com/BimberLab/DISCVRSeq/. Documentation and example reports are available at https://bimberlab.github.io/DISCVRSeq/.
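As a rough illustration of aggregating variants by chromosome and filter type, the Python sketch below counts VCF data lines using the standard fixed columns (CHROM is the first column, FILTER the seventh). It is not VariantQC, which is a Java tool that produces an interactive report. The function name is hypothetical.

```python
from collections import Counter


def summarize_vcf(lines):
    """Count variant records per chromosome and per FILTER value."""
    by_chrom = Counter()
    by_filter = Counter()
    for line in lines:
        # Skip blank lines, meta-information lines ('##') and the '#CHROM' header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        by_chrom[fields[0]] += 1   # CHROM column
        by_filter[fields[6]] += 1  # FILTER column
    return by_chrom, by_filter
```

Counts like these are the kind of raw tables that a QC report would then turn into plots and outlier flags.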


Author(s): Shuai Zhang, Yi Tay, Lina Yao, Bin Wu, Aixin Sun

Deep learning based recommender systems have been extensively explored in recent years. However, the large number of models proposed each year poses a big challenge for both researchers and practitioners who want to reproduce the results for further comparison. Although some papers provide source code, they adopt different programming languages or different deep learning packages, which raises the bar for grasping the ideas. To alleviate this problem, we released the open-source project DeepRec. In this toolkit, we have implemented a number of deep learning based recommendation algorithms using Python and the widely used deep learning package TensorFlow. Three major recommendation scenarios are considered: rating prediction, top-N recommendation (item ranking) and sequential recommendation. DeepRec also maintains good modularity and extensibility so that new models can easily be incorporated into the framework. It is distributed under the terms of the GNU General Public License. The source code is available on GitHub: https://github.com/cheungdaven/DeepRec


A tool that can search a large code corpus directly and return ranked snippets can be an invaluable resource for programmers looking for similar code snippets using natural language queries. It must have a deep understanding of the semantics of source code and queries to evaluate their intent correctly. Over the years, many tools that rely on the textual similarity between source code and query have proven ineffective because they fail to learn a high-level semantic understanding of source code and query. Previous code search models based on deep neural networks perform well, but most of them are evaluated on only a single programming language, mostly Java. In this paper, we propose a novel deep neural network model called Unified Code Net that can handle the intricacies of different programming languages. The model borrows several vital features from previous models and builds on those ideas to create a unified model that generates document vector embeddings from source code and, using similarity search with the query vector embedding, returns the most similar code snippets in any language. This tool can drastically reduce the effort a programmer spends looking for an efficient and viable code snippet for the problem at hand, and ideally it can replace the use of search engines for this purpose.
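The retrieval step described above (embed the query, embed each snippet, rank by vector similarity) can be sketched in Python as follows. The embedding here is a toy bag-of-words count vector, used only as a stand-in for the learned neural embeddings the paper proposes. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, snippets, top_k=1):
    """Return the top_k snippets ranked by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(query_vec, embed(s)), reverse=True)
    return ranked[:top_k]
```

Replacing `embed` with a model that maps code and natural language into a shared vector space leaves the ranking logic unchanged, which is why similarity search separates cleanly from the embedding model.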

