adapter trimming
Recently Published Documents


2021 ◽  
Author(s):  
Sudipta Sankar Bora ◽  
Kuntal Kumar Dey ◽  
Madhusmita Borah ◽  
Mominur Rahman ◽  
Manuranjan Gogoi ◽  
...  

Abstract We employed an Illumina-based high-throughput metagenomic sequencing approach to characterize the rhizospheric and endophytic microbial communities associated with an organically grown Camellia population located at the Experimental Tea Garden, Assam Agricultural University, Assam (India). Quality control (i.e. adapter trimming and duplicate removal) followed by de novo assembly revealed that the tea endophytic metagenome contained 24,231 contigs (7,771,089 base pairs in total, with an average length of 321 bp), while the tea rhizospheric soil metagenome contained 261,965 sequences (230,537,174 base pairs in total, average length 846 bp). The most prominent rhizobacteria belonged to the genera Bacillus (10.34%), Candidatus Koribacter (8.0%), Candidatus Solibacter (6.35%), Burkholderia (5.18%), Acidobacterium (4.08%), Pseudomonas (3.9%), Streptomyces (3.52%), Bradyrhizobium (2.76%) and Enterobacter (2.56%), while the endosphere was dominated by the bacterial genera Serratia (42.3%), Methylobacterium (7.6%), Yersinia (5.4%) and Burkholderia (2.2%), among others. A few agronomically important bacterial genera, such as Bradyrhizobium (1.18%), Rhizobium (0.8%), Sinorhizobium (0.34%), and Azorhizobium and Flavobacterium (0.17% each), were also detected in the endosphere. KEGG pathway mapping highlighted the presence of microbial metabolite pathway genes related to tyrosine metabolism, tryptophan metabolism, glyoxylate and dicarboxylate metabolism and amino sugar metabolism, which play important roles in endophytic activities including survival, growth promotion and host adaptation.


Author(s):  
Ting-Hsuan Wang ◽  
Cheng-Ching Huang ◽  
Jui-Hung Hung

Abstract
Motivation: Cross-sample comparisons and large-scale meta-analyses based on next-generation sequencing (NGS) require replicable and universal data preprocessing, including the removal of adapter fragments from contaminated reads (i.e. adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, but these are sometimes unavailable or falsely documented in repositories such as GEO or SRA, so large-scale meta-analyses are jeopardized by suboptimal adapter trimming.
Results: Here we introduce a set of fast and accurate adapter detection and trimming algorithms that require no a priori adapter sequences. These algorithms were implemented in modern C++ with SIMD and multithreading for speed. Our experiments and benchmarks show that the implementation (i.e. EARRINGS), without being given any hint of the adapter sequences, reaches accuracy comparable to, and throughput higher than, that of existing adapter trimmers. EARRINGS is particularly useful in meta-analyses of large batches of datasets and can be incorporated into sequence analysis pipelines at any scale.
Availability and implementation: EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS.
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Shifu Chen ◽  
Changshou He ◽  
Yingqiang Li ◽  
Zhicheng Li ◽  
Charles E Melançon

Abstract In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction and other preprocessing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data; and can distinguish SARS-CoV-2 from SARS, Middle East respiratory syndrome and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.
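
The unique k-mer idea described in this abstract can be illustrated with a minimal Python sketch. This is not the UniqueKMER or fastv implementation: the function names, the toy genomes and the choice of k = 4 are assumptions made for illustration, and reverse complements are ignored for brevity.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each genome name to the k-mers found in that genome and in no other."""
    owners = defaultdict(set)  # k-mer -> names of the genomes that contain it
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    result = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:  # keep only k-mers owned by exactly one genome
            result[next(iter(names))].add(kmer)
    return result


def count_hits(read: str, unique: dict[str, set[str]], k: int) -> dict[str, int]:
    """Count how many distinct k-mers of a read are unique to each genome."""
    read_kmers = kmers(read, k)
    return {name: len(read_kmers & ks) for name, ks in unique.items()}


# Hypothetical toy genomes; real inputs would be full viral or microbial genomes.
toy_genomes = {
    "genome_a": "ACGTACGGA",
    "genome_b": "ACGTTTGCA",
}
toy_unique = unique_kmers(toy_genomes, k=4)
toy_hits = count_hits("TACGGA", toy_unique, k=4)
```

In this toy case the shared k-mer ACGT is discarded, and the read TACGGA only hits k-mers unique to genome_a, which is the signal a k-mer based detector would rely on.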


2020 ◽  
Author(s):  
Stephen J. Bush

Abstract
Read alignment is the central step of many analytic pipelines that perform SNP calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as ‘trimming’. Trimming is widely assumed to increase the accuracy of SNP calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required.
Using a curated set of 17 Gram-negative bacterial genomes, this study evaluated the impact of four read trimming utilities (Atropos, fastp, Trim Galore, and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP calling pipelines. We found that read trimming made only small, and statistically insignificant, increases in SNP calling accuracy even when using the highest-performing pre-processor, fastp.
To extend these findings, we re-analysed > 6500 publicly-archived sequencing datasets from E. coli, M. tuberculosis and S. aureus. Of the approximately 125 million SNPs called across all samples, the same bases were called in 98.8% of cases, irrespective of whether raw reads or trimmed reads were used. However, when using trimmed reads, the proportion of non-homozygous calls (a proxy of false positives) was significantly reduced by approximately 1%. This suggests that trimming rarely alters the set of variant bases called but can affect their level of support. We conclude that read quality- and adapter-trimming add relatively little value to a SNP calling pipeline and may only be necessary if small differences in the absolute number of SNP calls are critical.
Read trimming remains routinely performed prior to SNP calling, likely out of concern that to do otherwise would substantially increase the number of false positive calls. While historically this may have been the case, our data suggest this concern is now unfounded.
Impact Statement
Short-read sequencing data is routinely pre-processed before use, to trim off low-quality regions and remove contaminating sequences introduced during its preparation. This cleaning procedure (‘read trimming’) is widely assumed to increase the accuracy of any later analyses, although there are relatively few systematic evaluations of trimming strategies and no clear consensus on their efficacy. We used real sequencing data from 17 bacterial genomes to show that several commonly-used read trimming tools, used across a range of stringencies, had only a minimal, statistically insignificant, effect on later SNP calling. To extend these results, we re-analysed > 6500 publicly-archived sequencing datasets, calling SNPs both with and without any read trimming. We found that of the approximately 125 million SNPs within this dataset, 98.8% were identically called irrespective of whether raw reads or trimmed reads were used. Taken together, these results question the necessity of read trimming as a routine pre-processing operation.
Data Summary
All analyses conducted in this study use publicly-available third-party software. All data and parameters necessary to replicate these analyses are provided within the article or through supplementary data files. > 6500 SRA sample accessions, representing Illumina paired-end sequencing data from E. coli, M. tuberculosis and S. aureus, and used to evaluate the impact of fastq pre-processing, are listed in Supplementary Tables 3, 5 and 7.


Author(s):  
Shifu Chen ◽  
Changshou He ◽  
Yingqiang Li ◽  
Zhicheng Li ◽  
Charles E Melançon

Abstract In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms, and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input, and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction, and other pre-processing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid SARS-CoV-2 identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data; and can distinguish SARS-CoV-2 from SARS, MERS, and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.


Author(s):  
Tomás C. Rodríguez ◽  
Henry E. Pratt ◽  
PengPeng Liu ◽  
Nadia Amrani ◽  
Lihua Julie Zhu

Abstract RNA-guided nucleases (e.g. CRISPR-Cas) are used in a breadth of clinical and basic scientific subfields for the investigation or modification of biological processes. While these modern platforms for site-specific DNA cleavage are highly accurate, some applications (e.g. gene editing therapeutics) cannot tolerate DNA breaks at off-target sites, even at low levels. Thus, it is critically important to determine the genome-wide targeting profile of candidate RNA-guided nucleases prior to use. GUIDE-seq is a high-quality, easy-to-execute molecular method that detects and quantifies off-target cleavage. However, this method may remain costly or inaccessible to some researchers due to its library sequencing and analysis protocols, which require a MiSeq platform that must be preprogrammed for non-standard output. Here, we present GS-Preprocess, open-source containerized software that can use standard raw data output (BCL file format) from any Illumina sequencer to create input for the Bioconductor GUIDEseq off-target profiling package. With a single command, GS-Preprocess performs FASTQ demultiplexing, adapter trimming, alignment, and UMI reference construction, improving the ease and accessibility of the GUIDE-seq method for a wide range of researchers.


2019 ◽  
Author(s):  
Andrew J. Robinson ◽  
Elizabeth M. Ross

Abstract With the recent torrent of high-throughput sequencing (HTS) data, the need for highly efficient algorithms for common tasks is paramount. One task that forms the basis for all further analysis of HTS data is initial data quality control, that is, the removal or trimming of poor-quality reads from the dataset. Here we present QuAdTrim, a quality control and adapter trimming algorithm for HTS data that is up to 57 times faster and uses less than 0.06% of the memory of other commonly used HTS quality control programs. QuAdTrim will reduce the time and memory required for quality control of HTS data and, in doing so, will reduce the computational demands of a fundamental step in HTS data analysis. Additionally, QuAdTrim implements the removal of homopolymer Gs from the 3′ end of sequence reads, a common error generated on the NovaSeq, NextSeq and iSeq100 platforms.
Availability and Implementation: The source code is freely available on Bitbucket under a BSD licence; see the COPYING file for details: https://bitbucket.org/arobinson/quadtrim
Contact: Andrew Robinson, andrewjrobinson at gmail dot com
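
The removal of homopolymer Gs from the 3′ end mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea, not QuAdTrim's algorithm: the function name `trim_poly_g` and the `min_run` threshold of 10 bases are assumptions, and the sketch tolerates no non-G bases inside the run.

```python
def trim_poly_g(seq: str, qual: str, min_run: int = 10) -> tuple[str, str]:
    """Remove a trailing run of G bases if it is at least min_run long.

    seq and qual are the sequence and quality strings of one FASTQ record
    and are assumed to have equal length.
    """
    stripped = seq.rstrip("G")        # sequence without its trailing Gs
    run = len(seq) - len(stripped)    # length of the trailing G run
    if run >= min_run:
        return stripped, qual[:len(stripped)]
    return seq, qual                  # short runs are left untouched


# Hypothetical reads: one with a long poly-G tail, one with two genuine Gs.
tailed = trim_poly_g("ACGTACGT" + "G" * 12, "I" * 20)
genuine = trim_poly_g("ACGTGG", "IIIIII")
```

A threshold is needed because a few terminal Gs are often real sequence; only long runs are treated as the artefact.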


2019 ◽  
Vol 5 (4) ◽  
pp. 49
Author(s):  
Xiangfu Zhong ◽  
Fatima Heinicke ◽  
Benedicte A. Lie ◽  
Simon Rayner

A necessary pre-processing step in data analysis is the removal of adapter sequences from the raw reads. While most adapter trimming tools require the adapter sequence as an essential input, adapter information is often incomplete or missing. This can impact quantification of features and reproducibility of the study, and might even lead to erroneous conclusions. Here, we provide examples to highlight the importance of specifying the adapter sequence by demonstrating the effect of using similar but different adapter sequences, and we identify additional potential sources of error in the adapter trimming step. Finally, we propose solutions by which users can ensure their small RNA-seq data is fully annotated with adapter information.
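
The effect of supplying a similar but different adapter sequence can be illustrated with a minimal Python sketch of 3′ adapter trimming. It is not taken from the paper or from any specific trimming tool: the function `trim_adapter`, its parameters and all sequences below are toy assumptions chosen for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 5,
                 max_mismatch: int = 0) -> str:
    """Cut the read at the first position where the rest of the read
    matches the start of the adapter with at most max_mismatch mismatches."""
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(read) - start, len(adapter))
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= max_mismatch:
            return read[:start]
    return read  # no adapter found, read is returned unchanged


# Toy sequences (assumptions): a 22-nt insert followed by 12 adapter bases.
insert = "TGAGGTAGTAGGTTGTATAGTT"
true_adapter = "TGGAATTCTCGGGTGCC"
similar_adapter = "TGGAATTCACGGGTGCC"  # differs from true_adapter at one base
read = insert + true_adapter[:12]

trimmed_ok = trim_adapter(read, true_adapter)
trimmed_bad = trim_adapter(read, similar_adapter)
```

With exact matching, the correct adapter recovers the 22-nt insert, whereas the near-identical adapter leaves the read untrimmed, so the residual adapter bases would be carried into downstream quantification.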


2018 ◽  
Author(s):  
Erik L. Clarke ◽  
Louis J. Taylor ◽  
Chunyu Zhao ◽  
Andrew Connell ◽  
Jung-Jin Lee ◽  
...  

Abstract
Background: Analysis of mixed microbial communities using metagenomic sequencing experiments requires multiple preprocessing and analytical steps to interpret the microbial and genetic composition of samples. Analytical steps include quality control, adapter trimming, host decontamination, metagenomic classification, read assembly, and alignment to reference genomes.
Results: We present a modular and user-extensible pipeline called Sunbeam that performs these steps in a consistent and reproducible fashion. It can be installed in a single step, does not require administrative access to the host computer system, and can work with most cluster computing frameworks. We also introduce Komplexity, a software tool to eliminate potentially problematic, low-complexity nucleotide sequences from metagenomic data. Unique components of the Sunbeam pipeline include direct analysis of data from NCBI SRA and an easy-to-use extension framework that enables users to add custom processing or analysis steps directly to the workflow. The pipeline and its extension framework are well documented, in routine use, and regularly updated.
Conclusions: Sunbeam provides a foundation to build more in-depth analyses and to enable comparisons in metagenomic sequencing experiments by removing problematic low-complexity reads and standardizing post-processing and analytical steps. Sunbeam is written in Python using the Snakemake workflow management software and is freely available at github.com/sunbeam-labs/sunbeam under the GPLv3.
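
One simple way to flag low-complexity sequences of the kind mentioned above is to count distinct k-mers relative to read length. The Python sketch below shows that general idea only; it is not claimed to be Komplexity's scoring method, and the function names, k = 4 and the 0.5 cut-off are assumptions.

```python
def kmer_complexity(seq: str, k: int = 4) -> float:
    """Number of distinct k-mers in seq divided by the length of seq."""
    if len(seq) < k:
        return 0.0
    distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(distinct) / len(seq)


def is_low_complexity(seq: str, k: int = 4, min_score: float = 0.5) -> bool:
    """Flag a read whose complexity score falls below a hypothetical cut-off."""
    return kmer_complexity(seq, k) < min_score


# Toy reads (assumptions): a homopolymer and a more varied sequence.
low = kmer_complexity("AAAAAAAAAAAAAAAA")
high = kmer_complexity("ACGTTGCAAGCTTAGC")
```

A homopolymer contributes a single distinct k-mer regardless of its length, so its score shrinks as the read grows, while a varied read scores close to 1.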


2018 ◽  
Author(s):  
Shifu Chen ◽  
Yanqing Zhou ◽  
Yaru Chen ◽  
Jia Gu

Abstract
Motivation: Quality control and preprocessing of FASTQ files are essential to providing clean data for downstream analysis. Traditionally, a different tool is used for each operation, such as quality control, adapter trimming, and quality filtering. These tools are often insufficiently fast as most are developed using high-level programming languages (e.g., Python and Java) and provide limited multi-threading support. Reading and loading data multiple times also renders preprocessing slow and I/O inefficient.
Results: We developed fastp as an ultra-fast FASTQ preprocessor with useful quality control and data-filtering features. It can perform quality control, adapter trimming, quality filtering, per-read quality cutting, and many other operations with a single scan of the FASTQ data. It also supports unique molecular identifier preprocessing, poly tail trimming, output splitting, and base correction for paired-end data. It can automatically detect adapters for single-end and paired-end FASTQ data. This tool is developed in C++ and has multi-threading support. Based on our evaluation, fastp is 2–5 times faster than other FASTQ preprocessing tools such as Trimmomatic or Cutadapt despite performing far more operations than similar tools.
Availability and Implementation: The open-source code and corresponding instructions are available at https://github.com/OpenGene/[email protected]
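
Per-read quality cutting, as listed among the operations above, is commonly done with a sliding window over the quality string. The Python sketch below shows that general technique under stated assumptions (Phred+33 encoding, a window of 4 bases, a mean-quality cut-off of 20); it is not fastp's implementation, and the function names are hypothetical.

```python
def phred_scores(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in qual]


def cut_right(seq: str, qual: str, window: int = 4,
              min_mean_q: float = 20.0) -> tuple[str, str]:
    """Scan from 5' to 3'; at the first window whose mean quality is below
    min_mean_q, drop that window and everything after it."""
    scores = phred_scores(qual)
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual  # no window fell below the threshold


# Toy read (assumption): eight Q40 bases ('I') followed by four Q2 bases ('#').
cut = cut_right("ACGTACGTACGT", "IIIIIIII####")
```

In the toy read the cut lands one base before the low-quality tail, because the window starting there is the first whose mean drops below 20.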

