A systematic NGS-based approach for contaminant detection and functional inference

Mapping Intimacies ◽

10.1101/741934 ◽

2019 ◽

Author(s):

Sung-Joon Park ◽

Satoru Onizuka ◽

Masahide Seki ◽

Yutaka Suzuki ◽

Takanori Iwata ◽

...

Keyword(s):

Large Scale ◽

Precise Determination ◽

Host Cells ◽

Rna Seq ◽

Functional Inference ◽

Apoptotic Pathways ◽

Multiple Species ◽

Next Generation Sequencing Ngs ◽

Ngs Data

Abstract Background Microbial contamination impedes successful biological and biomedical research. Computational approaches utilizing next-generation sequencing (NGS) data offer promising diagnostics to assess the presence of contaminants. However, as host cells are often contaminated by multiple microorganisms, these approaches require careful attention to intra- and interspecies sequence similarities, which have not yet been fully addressed. Results We present a computational approach that rigorously investigates the genomic origins of sequenced reads, including those mapped to multiple species that have been discarded in previous studies. Through the analysis of large-scale synthetic and public NGS samples, we approximated that 1,000−100,000 microbial reads prevail when one million host reads are sequenced by RNA-seq. The microbe catalog we established included Cutibacterium as a prevalent contaminant, suggesting that contamination mostly originates from the laboratory environment. Importantly, by applying a systematic method to infer the functional impact of contamination, we revealed that host-contaminant interactions cause profound changes in the host molecular landscapes, as exemplified by changes in inflammatory and apoptotic pathways during Mycoplasma infection. Conclusions These findings reinforce the concept that precise determination of the origins and functional impacts of contamination is imperative for quality research and illustrate the usefulness of the proposed approach to comprehensively characterize contamination landscapes.

A systematic sequencing-based approach for microbial contaminant detection and functional inference

BMC Biology ◽

10.1186/s12915-019-0690-0 ◽

2019 ◽

Vol 17 (1) ◽

Cited By ~ 1

Author(s):

Sung-Joon Park ◽

Satoru Onizuka ◽

Masahide Seki ◽

Yutaka Suzuki ◽

Takanori Iwata ◽

...

Keyword(s):

Large Scale ◽

Microbial Contamination ◽

Precise Determination ◽

Computational Method ◽

Host Cells ◽

Functional Inference ◽

Apoptotic Pathways ◽

Molecular Landscape ◽

Multiple Species ◽

Ngs Data

Abstract Background Microbial contamination poses a major difficulty for successful data analysis in biological and biomedical research. Computational approaches utilizing next-generation sequencing (NGS) data offer promising diagnostics to assess the presence of contaminants. However, as host cells are often contaminated by multiple microorganisms, these approaches require careful attention to intra- and interspecies sequence similarities, which have not yet been fully addressed. Results We present a computational approach that rigorously investigates the genomic origins of sequenced reads, including those mapped to multiple species that have been discarded in previous studies. Through the analysis of large-scale synthetic and public NGS samples, we estimate that 1000–100,000 contaminating microbial reads are detected per million host reads sequenced by RNA-seq. The microbe catalog we established included Cutibacterium as a prevalent contaminant, suggesting that contamination mostly originates from the laboratory environment. Importantly, by applying a systematic method to infer the functional impact of contamination, we revealed that host-contaminant interactions cause profound changes in the host molecular landscapes, as exemplified by changes in inflammatory and apoptotic pathways during Mycoplasma infection of lymphoma cells. Conclusions We provide a computational method for profiling microbial contamination on NGS data and suggest that sources of contamination in laboratory reagents and the experimental environment alter the molecular landscape of host cells leading to phenotypic changes. These findings reinforce the concept that precise determination of the origins and functional impacts of contamination is imperative for quality research and illustrate the usefulness of the proposed approach to comprehensively characterize contamination landscapes.
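The estimate of 1000–100,000 contaminating microbial reads per million host reads is a simple normalization, and a minimal sketch of it may help when comparing samples of different depth. The function names and the sample counts below are hypothetical and are not taken from the paper's software.

```python
def contaminant_reads_per_million(microbial_reads: int, host_reads: int) -> float:
    """Normalize a microbial read count to one million host reads."""
    if host_reads <= 0:
        raise ValueError("host_reads must be positive")
    # Multiply before dividing so that integer inputs stay exact for as long as possible.
    return microbial_reads * 1_000_000 / host_reads


def within_reported_range(rate: float, low: float = 1_000, high: float = 100_000) -> bool:
    """Check a rate against the range quoted in the abstract (bounds are inclusive here by assumption)."""
    return low <= rate <= high


# Hypothetical sample: 45,000 microbial reads among 30 million host reads.
rate = contaminant_reads_per_million(microbial_reads=45_000, host_reads=30_000_000)
print(rate, within_reported_range(rate))
```

A sample falling outside the quoted range is not necessarily unusual; the range is the abstract's summary estimate, not a hard threshold.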

Target variant detection in leukemia using unaligned RNA-Seq reads

10.1101/295808 ◽

2018 ◽

Cited By ~ 1

Author(s):

Eric Olivier Audemard ◽

Patrick Gendron ◽

Vincent-Philippe Lavallée ◽

Josée Hébert ◽

Guy Sauvageau ◽

...

Keyword(s):

Variant Calling ◽

The Cancer Genome Atlas ◽

Rna Seq ◽

Read Mapping ◽

Targeted Mutation ◽

Cancer Genome Atlas ◽

Computationally Intensive ◽

And Performance ◽

Next Generation Sequencing Ngs ◽

Ngs Data

Abstract Mutations identified in each acute myeloid leukemia (AML) patient are useful for prognosis and for selecting targeted therapies. Detection of such mutations by the analysis of next-generation sequencing (NGS) data requires a computationally intensive read mapping step and the application of several variant calling methods. Targeted mutation identification drastically shifts the usual tradeoff between accuracy and performance by concentrating all computations over a small portion of sequence space. Here, we present km, an efficient approach leveraging k-mer decomposition of reads to identify targeted mutations. Our approach is versatile, as it can detect single-base mutations, several types of insertions and deletions, as well as fusions. We used two independent AML cohorts (The Cancer Genome Atlas and Leucegene) to show that mutation detection by km is fast, accurate and mainly limited by sequencing depth. Therefore, km allows fast diagnostics to be established from NGS data and could be suitable for clinical applications.
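The idea of k-mer decomposition for targeted detection can be illustrated with a toy sketch: reads are split into overlapping k-mers without any alignment, and a target sequence is considered supported only as far as its least-covered k-mer. This is not km's actual algorithm; the function names, sequences, and the minimum-count support rule are assumptions made for illustration.

```python
from collections import Counter


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k: int) -> Counter:
    """Count k-mers across all reads, with no alignment step."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def min_support(target: str, counts: Counter, k: int) -> int:
    """Smallest count among the target's k-mers; 0 if any k-mer is absent."""
    return min((counts[kmer] for kmer in kmers(target, k)), default=0)


# Hypothetical target region: reference vs. a single-base variant (G>A).
reference = "ACGTACGGTCA"
variant = "ACGTACAGTCA"
reads = [variant, variant, reference]  # toy unaligned reads

counts = count_kmers(reads, k=5)
ref_support = min_support(reference, counts, k=5)
var_support = min_support(variant, counts, k=5)
print(ref_support, var_support)
```

Because only k-mers from the targeted region are inspected, the work scales with the size of the target rather than the genome, which is the tradeoff the abstract describes.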

Accuracy of programs for the determination of HLA alleles from NGS data

10.1101/183038 ◽

2017 ◽

Author(s):

Antti Larjo ◽

Robert Eveleigh ◽

Elina Kilpeläinen ◽

Tony Kwan ◽

Tomi Pastinen ◽

...

Keyword(s):

Human Leukocyte ◽

Hla Typing ◽

Ensemble Prediction ◽

Hla Alleles ◽

Leukocyte Antigen ◽

Peptide Antigens ◽

Hla Genes ◽

Next Generation Sequencing Ngs ◽

Ngs Data

Abstract The human leukocyte antigen (HLA) genes code for proteins that play a central role in the function of the immune system by presenting peptide antigens to T cells. As HLA genes show extremely high genetic polymorphism, HLA typing on the allele level is demanding and is based on DNA sequencing. Determination of HLA alleles is warranted because many HLA alleles are major genetic factors that confer susceptibility to autoimmune diseases, and it is important for the matching of HLA alleles in transplantation. Here, we compared the accuracy of several published HLA-typing algorithms that are based on next-generation sequencing (NGS) data. As genome screens are becoming increasingly routine in research, we wanted to test how well HLA alleles can be deduced from genome screens not designed for HLA typing. The accuracies were assessed using datasets consisting of NGS data produced using the ImmunoSEQ platform, including the full 4 Mbp HLA segment, from 94 stem cell transplantation patients and exome sequences from the 1000 Genomes collection. When used with the default settings, none of the methods gave perfect results for all the genes and samples. However, we found that ensemble prediction of the results or modifications of the settings could be used to improve accuracy. Most of the algorithms did not perform very well for the exome-only data. The results indicate that the use of these algorithms for accurate HLA allele determination based on NGS data is not straightforward.

MegaPath: sensitive and rapid pathogen detection using metagenomic NGS data

BMC Genomics ◽

10.1186/s12864-020-06875-6 ◽

2020 ◽

Vol 21 (S6) ◽

Author(s):

Chi-Ming Leung ◽

Dinghua Li ◽

Yan Xin ◽

Wai-Chun Law ◽

Yifan Zhang ◽

...

Keyword(s):

Pathogen Detection ◽

High Sensitivity ◽

Reference Sequence ◽

Close Relative ◽

Read Mapping ◽

Seeding Strategy ◽

Multiple Species ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Unique Species

Abstract Background Next-generation sequencing (NGS) enables unbiased detection of pathogens by mapping the sequencing reads of a patient sample to the known reference sequence of bacteria and viruses. However, for a new pathogen without a reference sequence of a close relative, or with a high load of mutations compared to its predecessors, read mapping fails due to a low similarity between the pathogen and reference sequence, which in turn leads to insensitive and inaccurate pathogen detection outcomes. Results We developed MegaPath, which runs fast and provides high sensitivity in detecting new pathogens. In MegaPath, we have implemented and tested a combination of polishing techniques to remove non-informative human reads and spurious alignments. MegaPath applies a global optimization to the read alignments and reassigns the reads incorrectly aligned to multiple species to a unique species. The reassignment not only significantly increased the number of reads aligned to distant pathogens, but also significantly reduced incorrect alignments. MegaPath implements an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm to run fast. Conclusions In our benchmarks, MegaPath demonstrated superior sensitivity by detecting eight times more reads from a low-similarity pathogen than other tools. Meanwhile, MegaPath ran much faster than the other state-of-the-art alignment-based pathogen detection tools (and was comparable with the less sensitive profile-based pathogen detection tools). The running time of MegaPath is about 20 min on a typical 1 Gb dataset.
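To make the idea of reassigning multi-mapped reads to a unique species concrete, here is a deliberately simple greedy heuristic: each ambiguous read goes to whichever candidate species has the most uniquely mapped reads. This is not MegaPath's global optimization, whose details the abstract does not give; the function name, the tie-breaking rule, and the species labels are all assumptions.

```python
from collections import Counter


def reassign_multimapped(alignments: dict) -> dict:
    """Map each read id to a single species.

    `alignments` maps read id -> non-empty set of species the read aligned to.
    Ambiguous reads are given to the candidate with the most uniquely mapped
    reads; ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    unique_counts = Counter(
        next(iter(species))
        for species in alignments.values()
        if len(species) == 1
    )
    assignment = {}
    for read_id, species in alignments.items():
        # sorted() fixes the iteration order, so max() returns the
        # alphabetically first species among equally supported candidates.
        assignment[read_id] = max(sorted(species), key=lambda s: unique_counts[s])
    return assignment


# Hypothetical alignments for four reads.
alignments = {
    "r1": {"Mycoplasma hyorhinis"},
    "r2": {"Mycoplasma hyorhinis"},
    "r3": {"Mycoplasma hyorhinis", "Mycoplasma orale"},
    "r4": {"Mycoplasma orale"},
}
result = reassign_multimapped(alignments)
print(result["r3"])
```

A global optimization would consider all reads jointly rather than one at a time, so this sketch should be read only as an illustration of the input and output of such a step.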

Of the Right and Left Hand (1646): Sir Thomas Browne's historical survey of ‘handedness’

Irish Journal of Psychological Medicine ◽

10.1017/s0790966700008788 ◽

2005 ◽

Vol 22 (1) ◽

pp. 30-32

Author(s):

Caoimhghin S Breathnach

Keyword(s):

Left Hemisphere ◽

Large Scale ◽

Precise Determination ◽

Antidepressant Effect ◽

Hemispheric Asymmetries ◽

Left Hand ◽

Left Handedness ◽

The Right ◽

Reproducible Manner

‘Handedness’ as an expression of cerebral lateralisation is valuable in analysis of hemispheric asymmetries, carrying implications for implementation (as well as interpretation) of complex cognitive functions. In recent decades it has become possible to categorise handedness in a reproducible manner and, independently, to estimate accurately the degree of language lateralisation of the brain. These advances have re-focussed attention on cerebral organisation and hemispheric asymmetries, and there is now considerable interest in the neuropsychology of left-handedness. Because of procedural and ethical constraints there are relatively few large scale studies on language dominance, whereas handedness has been studied extensively in recent decades. Language is represented in the left hemisphere in all but 1% of right-handers, and in 60% of left-handers; in 15% of right-handers speech representation was bilateral. Precise determination of handedness or lateralisation does not appear to have been assessed in major studies of electroconvulsive therapy (ECT). Results in 29 reports, when the electrodes were placed over either the non-dominant or both hemispheres, were tabulated and briefly discussed by d'Elia and Raotma, but the criterion of lateral dominance assignment was not clearly specified; the unilateral and bilateral placements were equally efficacious in their antidepressant effect. d'Elia, who introduced unilateral therapy in 1970, accepted the assumption that the left hemisphere was ‘dominant’, but later workers were more circumspect.

SMART-RDA: A Galaxy Workflow for RNA-Seq Data Analysis

KnE Life Sciences ◽

10.18502/kls.v3i4.703 ◽

2017 ◽

Vol 3 (4) ◽

pp. 186

Author(s):

Redi Aditama ◽

Zulfikar Achmad Tanjung ◽

Widyartini Made Sudania ◽

Toni Liwang

Keyword(s):

Gene Expression ◽

Large Scale ◽

Analysis Tool ◽

Rna Seq ◽

Expression Studies ◽

Galaxy Server ◽

Local Facility ◽

Computational Workflow ◽

Gene Expression Studies ◽

Next Generation Sequencing Ngs

RNA-seq using the Next Generation Sequencing (NGS) approach is a common technology to analyze large-scale RNA transcript data for gene expression studies. However, an appropriate bioinformatics tool is needed to analyze the large amount of transcriptome data from RNA-seq experiments. The aim of this study was to construct a system that can be easily applied to analyze RNA-seq data. An RNA-seq analysis tool, SMART-RDA, was constructed in this study. It is a computational workflow based on the Galaxy framework that converts raw RNA-seq data into gene expression information. This workflow was adapted from the well-known Tuxedo Protocol for RNA-seq analysis with some modifications. The expression value of each transcript was quantified as Fragments Per Kilobase of exon per Million fragments (FPKM). RNA-seq data of sterile and fertile oil palm (Pisifera) pollens, obtained from the NCBI Sequence Read Archive (SRA), were used to test this workflow on a local Galaxy server. The results showed that differential gene expression in pollens might be responsible for the sterile and fertile characteristics in oil palm Pisifera. Keywords: FPKM; Galaxy workflow; Gene expression; RNA sequencing.
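The FPKM unit mentioned in this abstract can be written out as a short calculation. The sketch below uses the commonly cited definition, in which the count is scaled by transcript length in kilobases and by millions of mapped fragments; treating the denominator as total mapped fragments is an assumption here, and the function name and numbers are hypothetical rather than taken from SMART-RDA.

```python
def fpkm(fragments: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of exon per Million mapped fragments.

    FPKM = fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2_000, total_mapped_fragments=20_000_000)
print(value)
```

Normalizing by both length and depth is what lets values be compared between transcripts of different sizes and libraries of different sequencing depth.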

Easymap: A User-Friendly Software Package for Rapid Mapping-by-Sequencing of Point Mutations and Large Insertions

Frontiers in Plant Science ◽

10.3389/fpls.2021.655286 ◽

2021 ◽

Vol 12 ◽

Author(s):

Samuel Daniel Lup ◽

David Wilson-Sánchez ◽

Sergio Andreu-Sánchez ◽

José Luis Micol

Keyword(s):

Point Mutations ◽

Rapid Identification ◽

Graphical Interface ◽

Rna Seq ◽

Source Program ◽

Next Generation Sequencing Ngs ◽

User Friendly ◽

Ngs Data ◽

Mapping By Sequencing ◽

Generation Sequencing

Mapping-by-sequencing strategies combine next-generation sequencing (NGS) with classical linkage analysis, allowing rapid identification of the causal mutations of the phenotypes exhibited by mutants isolated in a genetic screen. Computer programs that analyze NGS data obtained from a mapping population of individuals derived from a mutant of interest to identify a causal mutation are available; however, the installation and usage of such programs requires bioinformatic skills, modifying or combining pieces of existing software, or purchasing licenses. To ease this process, we developed Easymap, an open-source program that simplifies the data analysis workflows from raw NGS reads to candidate mutations. Easymap can perform bulked segregant mapping of point mutations induced by ethyl methanesulfonate (EMS) with DNA-seq or RNA-seq datasets, as well as tagged-sequence mapping for large insertions, such as transposons or T-DNAs. The mapping analyses implemented in Easymap have been validated with experimental and simulated datasets from different plant and animal model species. Easymap was designed to be accessible to all users regardless of their bioinformatics skills by implementing a user-friendly graphical interface, a simple universal installation script, and detailed mapping reports, including informative images and complementary data for assessment of the mapping results. Easymap is available at http://genetics.edu.umh.es/resources/easymap; its Quickstart Installation Guide details the recommended procedure for installation.

Genomic evidence for divergent co-infections of SARS-CoV-2 lineages

10.1101/2021.09.03.458951 ◽

2021 ◽

Author(s):

Hang-Yu Zhou ◽

Ye-Xiao Cheng ◽

Lin Xu ◽

Jia-Ying Li ◽

Chen-Yue Tao ◽

...

Keyword(s):

Next Generation Sequencing ◽

Large Scale ◽

Rna Viruses ◽

Average Rate ◽

Genomic Analysis ◽

Distribution Method ◽

Reference Framework ◽

Next Generation Sequencing Ngs ◽

Ngs Data ◽

Generation Sequencing

Recently, patients co-infected by two SARS-CoV-2 lineages have been sporadically reported. Concerns have been raised because previous studies have demonstrated that co-infection may contribute to the recombination of RNA viruses and cause severe clinical symptoms. In this study, we estimated the compositional lineage(s), tendentiousness, and frequency of co-infection events in the population from a large-scale genomic analysis of SARS-CoV-2 patients. The SARS-CoV-2 lineage(s) infecting each sample were recognized by assigning within-host site variations to lineage-defined feature variations using a hypergeometric distribution method. Of all the 29,993 samples, 53 (~0.18%) co-infection events were identified. Apart from 52 co-infections with two SARS-CoV-2 lineages, one sample co-infected with three SARS-CoV-2 lineages was identified for the first time. As expected, the co-infection events mainly happened in regions where more than two dominant SARS-CoV-2 lineages co-existed. However, co-infections of two sub-lineages within the Delta lineage were detected as well. Our results provide a useful reference framework for the high-throughput detection of SARS-CoV-2 co-infection events in Next Generation Sequencing (NGS) data. Although low in average rate, the co-infection events showed an increasing tendency with the increased diversity of SARS-CoV-2. Considering the large base of SARS-CoV-2 infections globally, co-infected patients would be a nonnegligible population. Thus, more clinical research is urgently needed on these patients.
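The abstract does not say how the hypergeometric distribution is parameterized, so the following is only one plausible reading: given N candidate variant sites, K of which define a lineage, and n within-host variations observed in a sample, it computes the probability of seeing at least k lineage-defining sites among them by chance. The function name, the parameter mapping, and the example numbers are assumptions.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n).

    Assumed mapping for this sketch: N candidate sites, K lineage-defining
    sites, n within-host variations in the sample, k of them lineage-defining.
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    lower = max(k, 0)
    upper = min(K, n)
    total = comb(N, n)
    # comb(a, b) returns 0 when b > a, so impossible terms contribute nothing.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, upper + 1))
    return favorable / total


# Hypothetical numbers: 10 candidate sites, 5 lineage-defining, 5 observed, all 5 matching.
p = hypergeom_sf(k=5, N=10, K=5, n=5)
print(p)
```

A small probability would suggest that the overlap with a lineage's feature variations is unlikely to be coincidental; what threshold the authors applied is not stated in the abstract.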

Memory-driven computing accelerates genomic data processing

10.1101/519579 ◽

2019 ◽

Cited By ~ 4

Author(s):

Matthias Becker ◽

Milind Chabbi ◽

Stefanie Warnat-Herresthal ◽

Kathrin Klee ◽

Jonas Schulte-Schrepping ◽

...

Keyword(s):

Energy Consumption ◽

Data Processing ◽

Data Privacy ◽

Large Scale ◽

Transcriptome Assembly ◽

Primary Data ◽

Attractive Alternative ◽

Rna Seq ◽

Local Data ◽

Ngs Data

Next generation sequencing (NGS) is the driving force behind precision medicine and is revolutionizing most, if not all, areas of the life sciences. Particularly when targeting the major common diseases, an exponential growth of NGS data is foreseen for the next decades. This enormous increase of NGS data and the need to process the data quickly for real-world applications require us to rethink our current compute infrastructures. Here we provide evidence that memory-driven computing (MDC), a novel memory-centric hardware architecture, is an attractive alternative to current processor-centric compute infrastructures. To illustrate how MDC can change NGS data handling, we used RNA-seq assembly and pseudoalignment followed by quantification as two first examples. Adapting transcriptome assembly pipelines for MDC reduced compute time by 5.9-fold for the first step (SAMtools). Even more impressively, pseudoalignment by near-optimal probabilistic RNA-seq quantification (kallisto) was accelerated by more than two orders of magnitude with identical accuracy and indicated 66% reduced energy consumption. One billion RNA-seq reads were processed in just 92 seconds. Clearly, MDC simultaneously reduces data processing time and energy consumption. Together with the MDC-inherent solutions for local data privacy, a new compute model can be projected pushing large scale NGS data processing and primary data analytics closer to the edge by directly combining high-end sequencers with local MDC, thereby also reducing movement of large raw data to central cloud storage. We further envision that other data-rich areas will similarly benefit from this new memory-centric compute architecture.

Nuclease-mediated depletion biases in ribosome footprint profiling libraries

10.1101/2020.03.30.017061 ◽

2020 ◽

Cited By ~ 3

Author(s):

Boris Zinshteyn ◽

Jamie R Wangen ◽

Boyang Hua ◽

Rachel Green

Keyword(s):

Ribosomal Rna ◽

High Throughput Sequencing ◽

Precise Determination ◽

Ribosome Profiling ◽

Rna Seq ◽

Rna Sequences ◽

Ribosomal Rna Sequences ◽

Rna Fragments ◽

Commercial Applications

Abstract Ribosome footprint profiling is a high-throughput sequencing-based technique that provides detailed and global views of translation in living cells. An essential part of this technology is removal of unwanted, normally very abundant, ribosomal RNA sequences that dominate libraries and increase sequencing costs. The most effective commercial solution (Ribo-Zero) has been discontinued and a number of new, experimentally distinct commercial applications have emerged on the market. Here we evaluated several commercially available alternatives designed for RNA-seq of human samples and found them unsuitable for ribosome footprint profiling. We instead recommend the use of custom-designed biotinylated oligos, which were widely used in early ribosome profiling studies. Importantly, we warn that depletion solutions based on targeted nuclease cleavage significantly perturb the high-resolution information that can be derived from the data, and thus do not recommend their use for any applications that require precise determination of the ends of RNA fragments.
