ngsReports: An R Package for managing FastQC reports and other NGS related log files

2018 ◽  
Author(s):  
Christopher M. Ward ◽  
Hein To ◽  
Stephen M Pederson

Abstract Motivation High throughput next generation sequencing (NGS) has become exceedingly cheap, facilitating studies with large sample numbers. Quality control (QC) is an essential stage of analytic pipelines, and QC information can be found in the outputs of popular bioinformatics tools such as FastQC and Picard. Although these tools provide considerable power when carrying out QC, large sample numbers can make identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import into R of FastQC output as well as that from aligners such as HISAT2, STAR and Bowtie2. Visualization can be carried out across many samples using heatmaps rendered using ggplot2 and plotly. Moreover, these can be displayed in an interactive shiny app or an HTML report. We also provide methods to assess observed GC content in an organism-dependent manner for both transcriptomic and genomic datasets. Importantly, hierarchical clustering can be carried out on heatmaps with large sample sizes to quickly identify outliers and batch effects. Availability and Implementation ngsReports is available at https://github.com/UofABioinformaticsHub/ngsReports.

2019 ◽  
Vol 36 (8) ◽  
pp. 2587-2588 ◽  
Author(s):  
Christopher M Ward ◽  
Thu-Hien To ◽  
Stephen M Pederson

Abstract Motivation High throughput next generation sequencing (NGS) has become exceedingly cheap, facilitating studies to be undertaken containing large sample numbers. Quality control (QC) is an essential stage during analytic pipelines and the outputs of popular bioinformatics tools such as FastQC and Picard can provide information on individual samples. Although these tools provide considerable power when carrying out QC, large sample numbers can make inspection of all samples and identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import into R of FastQC reports along with outputs from other tools. Visualization can be carried out across many samples using default, highly customizable plots with options to perform hierarchical clustering to quickly identify outlier libraries. Moreover, these can be displayed in an interactive shiny app or HTML report for ease of analysis. Availability and implementation The ngsReports package is available on Bioconductor and the GUI shiny app is available at https://github.com/UofABioinformaticsHub/shinyNgsreports. Supplementary information Supplementary data are available at Bioinformatics online.
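
The cross-sample view described in this abstract can be illustrated with a minimal Python sketch. It is not the ngsReports implementation (which is an R package) and does not use its API; it only assumes the tab-separated `summary.txt` layout that FastQC writes for each report (status, module name, file name). The sample names, file names and helper functions below are hypothetical.

```python
from collections import Counter

# Hypothetical summary.txt contents for three samples. Each line follows the
# FastQC layout: status <TAB> module name <TAB> file name.
SUMMARIES = {
    "sampleA": "PASS\tBasic Statistics\tsampleA.fastq.gz\n"
               "WARN\tPer sequence GC content\tsampleA.fastq.gz\n",
    "sampleB": "PASS\tBasic Statistics\tsampleB.fastq.gz\n"
               "PASS\tPer sequence GC content\tsampleB.fastq.gz\n",
    "sampleC": "PASS\tBasic Statistics\tsampleC.fastq.gz\n"
               "FAIL\tPer sequence GC content\tsampleC.fastq.gz\n",
}


def parse_summary(text):
    """Return {module: status} for the text of one summary.txt file."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses


def status_table(summaries):
    """Build a sample -> {module: status} table across all samples."""
    return {sample: parse_summary(text) for sample, text in summaries.items()}


def flag_counts(table):
    """Count PASS/WARN/FAIL flags per sample."""
    return {sample: Counter(modules.values()) for sample, modules in table.items()}


def failing_samples(table, module):
    """List the samples whose status for the given module is FAIL."""
    return sorted(s for s, m in table.items() if m.get(module) == "FAIL")


if __name__ == "__main__":
    table = status_table(SUMMARIES)
    for sample, counts in flag_counts(table).items():
        print(sample, dict(counts))
    print(failing_samples(table, "Per sequence GC content"))
```

A table of this shape is the kind of input a heatmap or clustering step would work from; with real data the strings would be read from each report's `summary.txt` rather than defined inline.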


2019 ◽  
Author(s):  
Christina J. Castro ◽  
Rachel L. Marine ◽  
Edward Ramos ◽  
Terry Fei Fan Ng

Abstract Viruses have high mutation rates and generally exist as a mixture of variants in biological samples. The next-generation sequencing (NGS) approach has surpassed Sanger sequencing for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored. Our results from >15,000 simulated experiments showed that the presence of variants can turn an assembly of one genome into tens to thousands of contigs. This “variant interference” (VI) is highly consistent and reproducible with the ten most used de novo assemblers, and occurs independently of genome length, read length, and GC content. The main driver of VI is the pairwise identity between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allows the “rescue” of full viral genomes from fragmented contigs. These results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.
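
Because pairwise identity between variants is named as the main driver, a small sketch of how such an identity can be computed may help. This is an illustration only, not the authors' pipeline: it assumes the two sequences are already aligned to equal length, with `-` marking gaps, and it ignores gapped columns. The function name and example sequences are hypothetical.

```python
def pairwise_identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # gapped column: not counted
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared


if __name__ == "__main__":
    # One mismatch in eight compared positions.
    print(pairwise_identity("ACGTACGT", "ACGTACGA"))
```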


Author(s):  
Dragana Dudić ◽  
Bojana Banović Đeri ◽  
Vesna Pajić ◽  
Gordana Pavlović-Lažetić

Next Generation Sequencing (NGS) analysis has become a widely used method for studying the structure of DNA and RNA, but the complexity of the procedure yields error-prone datasets which need to be cleansed in order to avoid misinterpretation of data. We address the usage and proper interpretation of characteristic metrics for RNA sequencing (RNAseq) quality control, implemented in and reported by FastQC, and provide comprehensive guidance for their assessment in the context of total RNAseq quality control of Illumina raw reads. Additionally, we give recommendations on how to adequately perform the quality control preprocessing step of raw total RNAseq Illumina reads according to the results of the quality control evaluation step; the aim is to provide the best dataset for downstream analysis, rather than to get better FastQC results. We also tested the effects of different preprocessing approaches on the downstream analysis and recommended the most suitable approach.


2021 ◽  
Author(s):  
Yanjiang Liu ◽  
Xiao Zhu ◽  
Mingli Wu ◽  
Xue Xu ◽  
Zhaoxia Dai ◽  
...  

Abstract Chimonobambusa hirtinoda is a threatened species that is naturally distributed only in Doupeng Mountain, Duyun, Guizhou, China. Next-generation sequencing (NGS) was used to obtain the complete chloroplast (cp) genome sequence of C. hirtinoda, and the sequence was then assembled and analyzed for phylogenetic and evolutionary purposes. We also compared the cp genome with previously published cp genomes of Chimonobambusa species. The complete cp genome of C. hirtinoda has a total length of 139,561 bp and a GC content of 38.90%. A total of 130 genes were found in the cp genome, including 85 protein-coding genes, 37 tRNA genes and 8 rRNA genes. Some genes are missing and intron loss has occurred in the cp genome of C. hirtinoda. A total of 48 simple sequence repeats (SSRs) were detected and, by measuring the codon usage frequency of amino acids, the A/U preference of the third nucleotide in the cp genome of C. hirtinoda was obtained. Furthermore, phylogenetic analysis using complete cp sequences and the matK gene revealed genetic relationships within the Chimonobambusa genus.
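
GC content, as reported for the cp genome here, is the percentage of G and C bases among the counted bases. The sketch below shows the calculation on a toy sequence; it is not the authors' workflow, the function name is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption made for illustration.

```python
def gc_content(seq):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


if __name__ == "__main__":
    # Toy sequence, not taken from the C. hirtinoda genome.
    print(round(gc_content("ATGCGCATTA"), 2))
```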


2019 ◽  
Vol 35 (21) ◽  
pp. 4419-4421 ◽  
Author(s):  
Sun Ah Kim ◽  
Myriam Brossard ◽  
Delnaz Roshandel ◽  
Andrew D Paterson ◽  
Shelley B Bull ◽  
...  

Abstract Summary For the analysis of high-throughput genomic data produced by next-generation sequencing (NGS) technologies, researchers need to identify linkage disequilibrium (LD) structure in the genome. In this work, we developed an R package gpart which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs. The visualization tool in gpart can display the LD structure and gene positions for up to 20 000 SNPs in one image. The gpart functions facilitate construction of LD blocks and SNP partitions for vast amounts of genome sequencing data within reasonable time and memory limits in personal computing environments. Availability and implementation The R package is available at https://bioconductor.org/packages/gpart. Supplementary information Supplementary data are available at Bioinformatics online.
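
For readers unfamiliar with the LD measures that such block-construction methods build on, the following sketch computes the standard pairwise r² statistic between two biallelic SNPs from phased haplotypes. It is not gpart's clustering algorithm and does not reflect its API; the function name and the toy haplotypes are hypothetical.

```python
def ld_r2(haplotypes):
    """Pairwise LD r^2 for two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) pairs,
    with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(a for a, _ in haplotypes) / n        # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n        # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                           # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    print(ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)]))  # alleles always co-occur
    print(ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)]))  # alleles independent
```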


2019 ◽  
Vol 47 (21) ◽  
pp. e135-e135
Author(s):  
Maxim Ivanov ◽  
Mikhail Ivanov ◽  
Artem Kasianov ◽  
Ekaterina Rozhavskaya ◽  
Sergey Musienko ◽  
...  

Abstract As the use of next-generation sequencing (NGS) for the diagnosis of Mendelian diseases is expanding, the performance of this method has to be improved in order to achieve higher quality. Typically, performance measures are considered to be designed in the context of each application and, therefore, account for a spectrum of clinically relevant variants. We present EphaGen, a new computational methodology for bioinformatics quality control (QC). Given a single NGS dataset in BAM format and a pre-compiled VCF file of targeted clinically relevant variants, it associates this dataset with a single arbiter parameter. Intrinsically, EphaGen estimates the probability of missing any variant from the defined spectrum within a particular NGS dataset. Such a performance measure virtually resembles the diagnostic sensitivity of a given NGS dataset. Here we present case studies of the use of EphaGen in the context of BRCA1/2 and CFTR sequencing in a series of 14 runs across 43 blood samples and 504 publicly available NGS datasets. EphaGen is superior to conventional bioinformatics metrics such as coverage depth and coverage uniformity. We recommend using this software as a QC step in NGS studies in the clinical context. Availability: https://github.com/m4merg/EphaGen or https://hub.docker.com/r/m4merg/ephagen.


2019 ◽  
Author(s):  
Charles Karavina ◽  
Jacques Davy Ibaba ◽  
Augustine Gubba

Abstract Objectives: Plant-infecting viruses remain a serious challenge to achieving food security worldwide. In Zimbabwe, as in other parts of the world, cucurbits are used in various ways. A small-scale cucurbit virus survey was conducted in Zimbabwe during the 2014 and 2015 growing seasons. Cucurbit leaf samples displaying virus-like symptoms were collected and stored until analysis. The samples were then subjected to next-generation sequencing (NGS). The data generated from NGS were analysed using genomics technologies. Zucchini shoestring virus (ZSSV), a cucurbit-infecting potyvirus previously described in South Africa, was one of the viruses identified. The genomes of three ZSSV isolates from Zimbabwe are described in this note. Results: The three ZSSV isolates had the same genome size of 10297 bp, excluding the polyA tail, with a 43% GC content. The large open reading frame (ORF) was found at positions 69 to 10106 on the genome and encodes a 3345-amino-acid polyprotein which had the same cleavage site sequences as those described for the South African isolates except for the P1-pro site. The smaller ORF, also called the pretty interesting Potyviridae ORF, was located at positions 3611 to 3793 on the genomes of all three ZSSV isolates.
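
The reported ORF coordinates can be checked with simple arithmetic. The sketch below assumes 1-based, inclusive coordinates and, for the large ORF, that the final codon is a stop codon that is not counted in the polyprotein length; both are assumptions, since the abstract does not state them. The function name is hypothetical.

```python
def orf_summary(start, end):
    """Return (length in nt, whole codons, leftover nt) for an ORF.

    Coordinates are assumed to be 1-based and inclusive.
    """
    length = end - start + 1
    codons, remainder = divmod(length, 3)
    return length, codons, remainder


if __name__ == "__main__":
    # Large ORF at positions 69 to 10106.
    length, codons, remainder = orf_summary(69, 10106)
    # Assumption: the last codon is a stop codon, so it is not translated.
    print(length, codons, remainder, codons - 1)

    # Smaller ORF at positions 3611 to 3793.
    print(orf_summary(3611, 3793))
```

Under these assumptions the large ORF spans 10038 nt, that is 3346 codons, which is consistent with a 3345-amino-acid polyprotein followed by a stop codon; the smaller ORF spans 183 nt, or 61 codons.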


2021 ◽  
Vol 43 (2) ◽  
pp. 845-867
Author(s):  
Goldin John ◽  
Nikhil Shri Sahajpal ◽  
Ashis K. Mondal ◽  
Sudha Ananth ◽  
Colin Williams ◽  
...  

This review discusses the current testing methodologies for COVID-19 diagnosis and explores next-generation sequencing (NGS) technology for the detection of SARS-CoV-2 and monitoring phylogenetic evolution in the current COVID-19 pandemic. The review addresses the development, fundamentals, assay quality control and bioinformatics processing of the NGS data. This article provides a comprehensive review of the obstacles and opportunities facing the application of NGS technologies for the diagnosis, surveillance, and study of SARS-CoV-2 and other infectious diseases. Further, we have contemplated the opportunities and challenges inherent in the adoption of NGS technology as a diagnostic test with real-world examples of its utility in the fight against COVID-19.


2018 ◽  
Author(s):  
Jelena Telenius ◽  
Jim R. Hughes ◽  

Abstract With the decreasing cost of next-generation sequencing (NGS), we are observing a rapid rise in the volume of ‘big data’ in the academic research, healthcare and drug discovery sectors. The present bottleneck for extracting value from these ‘big data’ sets is data processing and analysis. Despite this, there is still a lack of reliable, automated and easy-to-use tools that allow experimentalists to assess the quality of the sequenced libraries and explore the data first hand, without the need to invest a lot of computational core analysts' time in the early stages of analysis. NGseqBasic is an easy-to-use single-command analysis tool for chromatin accessibility (ATAC, DNaseI) and ChIP sequencing data, also supporting new techniques such as low cell number sequencing and Cut-and-Run. It takes in fastq, fastq.gz or bam files, conducts all quality control, trimming and mapping steps, along with quality control and data processing statistics, and combines all of this into a single-click loadable UCSC data hub, with an integral statistics HTML page providing detailed reports from the analysis tools and quality control metrics. The tool is easy to set up, and no installation is needed. A wide variety of parameters are provided to fine-tune the analysis, with an optional setting to generate DNase footprint or high resolution ChIP-seq tracks. A tester script is provided to help in the setup, along with a test data set and downloadable example user cases. NGseqBasic has been used in the routine analysis of next generation sequencing (NGS) data in high-impact publications 1,2. The code is actively developed, and accompanied by Git version control and a GitHub code repository.
Here we demonstrate NGseqBasic analysis and features using DNaseI-seq data from GSM689849, and CTCF-ChIP-seq data from GSM2579421, as well as a Cut-and-Run CTCF data set GSM2433142, and provide the one-click loadable UCSC data hubs generated by the tool, allowing for the ready exploration of the run results and quality control files generated by the tool. Availability Download, setup and help instructions are available on the NGseqBasic web site http://userweb.molbiol.ox.ac.uk/public/telenius/NGseqBasicManual/external/ Bioconda users can load the tool as library “ngseqbasic”. The source code with Git version control is available at https://github.com/Hughes-Genome-Group/NGseqBasic/

