Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Shanrong Zhao ◽  
Kurt Prenger ◽  
Lance Smith

RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, open-source tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets.
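Stormbow's own code is not reproduced in the abstract; the per-sample parallelism it describes follows a simple scatter-gather pattern, sketched below with a hypothetical `quantify` stub standing in for the read-mapping and quantification work each cloud worker would perform.

```python
from concurrent.futures import ThreadPoolExecutor

def quantify(sample):
    # Stand-in for the per-sample work Stormbow farms out to a cloud
    # worker (read mapping plus expression quantification); here it
    # simply tags the sample as processed.
    return sample, "processed"

# A toy batch; the paper's benchmark used 178 samples.
samples = [f"sample_{i:03d}" for i in range(8)]

# Scatter one task per sample and gather the results, the same
# embarrassingly parallel structure Stormbow scales out on AWS.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(quantify, samples))
```

Because samples are independent, wall-clock time shrinks roughly linearly with the number of workers, which is why on-demand cloud instances suit this workload.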

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Alisa Alekseenko ◽  
Donal Barrett ◽  
Yerma Pareja-Sanchez ◽  
Rebecca J. Howard ◽  
Emilia Strandback ◽  
...  

Abstract RT-LAMP detection of SARS-CoV-2 has been shown to be a valuable approach to scale up COVID-19 diagnostics and thus contribute to limiting the spread of the disease. Here we present the optimization of highly cost-effective in-house produced enzymes, and we benchmark their performance against commercial alternatives. We explore the compatibility between multiple DNA polymerases with high strand-displacement activity and thermostable reverse transcriptases required for RT-LAMP. We optimize reaction conditions and demonstrate their applicability using both synthetic RNA and clinical patient samples. Finally, we validate the optimized RT-LAMP assay for the detection of SARS-CoV-2 in unextracted heat-inactivated nasopharyngeal samples from 184 patients. We anticipate that optimized and affordable reagents for RT-LAMP will facilitate the expansion of SARS-CoV-2 testing globally, especially in sites and settings where the need for large scale testing cannot be met by commercial alternatives.


2020 ◽  
Author(s):  
Ramon Viñas ◽  
Tiago Azevedo ◽  
Eric R. Gamazon ◽  
Pietro Liò

Abstract A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Networks. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model to several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in terms of predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.
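The GAIN-GTEx model itself is adversarial and not reproduced here; the sketch below only illustrates the masking-based evaluation setup such imputation benchmarks share. Entries of a toy expression matrix are hidden at random, filled in with a naive gene-wise mean baseline (the kind of standard method the paper compares against), and scored on the hidden entries only. All names and numbers are illustrative.

```python
import random

random.seed(0)
n_samples, n_genes = 20, 5

# Toy expression matrix (rows = samples, columns = genes).
X = [[random.gauss(mu=g, sigma=1.0) for g in range(n_genes)]
     for _ in range(n_samples)]

# Mask: True = observed, False = hidden (to be imputed).
mask = [[random.random() < 0.7 for _ in range(n_genes)]
        for _ in range(n_samples)]

# Baseline imputation: replace each hidden value with the observed
# mean of its gene; GAIN-GTEx instead learns this mapping adversarially.
col_means = []
for g in range(n_genes):
    obs = [X[i][g] for i in range(n_samples) if mask[i][g]]
    col_means.append(sum(obs) / len(obs))

imputed = [[X[i][g] if mask[i][g] else col_means[g]
            for g in range(n_genes)] for i in range(n_samples)]

# Score only the entries that were hidden, as imputation benchmarks do.
errs = [abs(imputed[i][g] - X[i][g])
        for i in range(n_samples) for g in range(n_genes)
        if not mask[i][g]]
mae = sum(errs) / len(errs)
```

A learned imputer is judged better than this baseline exactly when its error on the masked entries is lower, which is the "predictive performance" axis the abstract refers to.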


2021 ◽  
Vol 09 ◽  
Author(s):  
Sarvat Zafar ◽  
Aiman Zafar ◽  
Fakhra Jabeen ◽  
Miad Ali Siddiq

Nanotechnology studies the physico-chemical processes and biological properties involved in generating nanosized particles, along with their emerging applications in sectors such as medicine, engineering, agriculture, electronics, and environmental science. Nanosized particles exhibit good anti-microbial, anti-inflammatory, cytotoxic, drug-delivery, anti-parasitic, anti-coagulant and catalytic properties because of their unique dimensions, with large surface area, chemical stability and high binding density for the accumulation of various bio-constituents on their surfaces. Biological approaches for the synthesis of silver nanoparticles (AgNPs) are reviewed here because they offer an easy, single-step protocol and a viable substitute for synthetic chemical-based procedures. Physical and chemical approaches for the production of AgNPs are also mentioned. Biological synthesis has drawn attention because it is cost-effective, faster, non-pathogenic, environment-friendly, easy to scale up for large-scale synthesis, requires no high pressure, energy, temperature, or noxious chemical ingredients, and is safe for human therapeutic use. Therefore, combining nanomaterials with bio-green approaches could extend the use of biological and cytological properties compatible with AgNPs. In this perspective, there is an immediate need to develop ecofriendly and biocompatible techniques that strengthen efficacy against microbes and minimize toxicity for human cells. The present study introduces the biological synthesis of silver nanoparticles and reviews their potential biomedical applications.


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Kaikun Xie ◽  
Yu Huang ◽  
Feng Zeng ◽  
Zehua Liu ◽  
Ting Chen

Abstract Recent advancements in both single-cell RNA-sequencing technology and computational resources facilitate the study of cell types on global populations. Up to millions of cells can now be sequenced in one experiment; thus, accurate and efficient computational methods are needed to provide clustering and post-analysis of assigning putative and rare cell types. Here, we present scAIDE, a novel unsupervised deep-learning clustering framework that is robust and highly scalable. To overcome the high level of noise, scAIDE first incorporates an autoencoder-imputation network with a distance-preserved embedding network (AIDE) to learn a good representation of data, and then applies a random projection hashing based k-means algorithm to accommodate the detection of rare cell types. We analyzed a 1.3 million neural cell dataset within 30 min, obtaining 64 clusters which were mapped to 19 putative cell types. In particular, we further identified three different neural stem cell developmental trajectories in these clusters. We also classified two subpopulations of malignant cells in a small glioblastoma dataset using scAIDE. We anticipate that scAIDE will provide a more in-depth understanding of cell development and diseases.
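The random-projection hashing step scAIDE builds its k-means on can be illustrated with sign-based locality-sensitive hashing: points are projected onto random hyperplanes and bucketed by the sign pattern, so nearby cells tend to share a bucket while distant cell types do not. This is a generic LSH sketch under assumed toy data, not scAIDE's actual implementation.

```python
import random

random.seed(2)
dim, n_planes = 10, 6

# Random hyperplanes; each contributes one bit of the hash.
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_key(x):
    # Hash = sign pattern of the projections onto the random planes.
    return tuple(sum(a * b for a, b in zip(x, w)) > 0 for w in planes)

# Two toy "cell types": tight clouds around two distant centroids.
def cloud(center, n):
    return [[c + random.gauss(0, 0.05) for c in center] for _ in range(n)]

a = cloud([5.0] * dim, 50)
b = cloud([-5.0] * dim, 50)

keys_a = {lsh_key(x) for x in a}
keys_b = {lsh_key(x) for x in b}
```

Because candidate neighbours only need to be compared within a bucket, this kind of hashing keeps the assignment step tractable at the million-cell scale the abstract reports.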


2019 ◽  
Vol 9 (7) ◽  
pp. 155 ◽  
Author(s):  
Hitzemann ◽  
Iancu ◽  
Reed ◽  
Baba ◽  
Lockwood ◽  
...  

Transcriptome profiling can broadly characterize drug effects and risk for addiction in the absence of drug exposure. Modern large-scale molecular methods, including RNA-sequencing (RNA-Seq), have been extensively applied to alcohol-related disease traits, but rarely to risk for methamphetamine (MA) addiction. We used RNA-Seq data from selectively bred mice with high or low risk for voluntary MA intake to construct coexpression and cosplicing networks for differential risk. Three brain reward circuitry regions were explored, the nucleus accumbens (NAc), prefrontal cortex (PFC), and ventral midbrain (VMB). With respect to differential gene expression and wiring, the VMB was more strongly affected than either the PFC or NAc. Coexpression network connectivity was higher in the low MA drinking line than in the high MA drinking line in the VMB, oppositely affected in the NAc, and little impacted in the PFC. Gene modules protected from the effects of selection may help to eliminate certain mechanisms from significant involvement in risk for MA intake. One such module was enriched in genes with dopamine-associated annotations. Overall, the data suggest that mitochondrial function and glutamate-mediated synaptic plasticity have key roles in the outcomes of selective breeding for high versus low levels of MA intake.
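The coexpression-network connectivity compared across brain regions above is, at its core, a sum of absolute pairwise correlations per gene. A minimal sketch with toy expression profiles (all data illustrative, not from the study):

```python
import math
import random

random.seed(3)

def pearson(x, y):
    # Plain Pearson correlation between two expression profiles.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy profiles over 30 samples: g0 and g1 co-regulated, g2 independent.
base = [random.gauss(0, 1) for _ in range(30)]
g0 = base
g1 = [b + random.gauss(0, 0.2) for b in base]
g2 = [random.gauss(0, 1) for _ in range(30)]
genes = [g0, g1, g2]

# Connectivity of gene i = sum of |r| to all other genes; this is the
# quantity whose line differences (high vs. low MA intake) are compared.
conn = [sum(abs(pearson(genes[i], genes[j]))
            for j in range(len(genes)) if j != i)
        for i in range(len(genes))]
```

Changes in this connectivity between selected lines, rather than in expression level alone, are what the abstract calls differences in network "wiring".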


Genes ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 53 ◽  
Author(s):  
Zaid Al-Ars ◽  
Saiyi Wang ◽  
Hamid Mushtaq

The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK provides a popular best-practices pipeline specifically designed for RNA-seq variant calling. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results.
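The speedup figures quoted above compose multiplicatively, which is worth making explicit; taking the "more than 5 h" baseline as 5.0 h:

```python
# Reported timings from the abstract.
single_node_orig = 5.0    # hours, original GATK pipeline, 1 node
single_node_spark = 1.3   # hours, SparkRA on the same node
cluster_factor = 7.7      # further speedup on 16 nodes vs. 1 node

node_speedup = single_node_orig / single_node_spark   # about 3.8x
cluster_time = single_node_spark / cluster_factor     # about 0.17 h
total_speedup = node_speedup * cluster_factor         # about 30x
```

So the 32 GB dataset drops from over 5 hours to roughly ten minutes on the 16-node cluster, an end-to-end speedup of about 30×.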


2017 ◽  
Author(s):  
Payam Emami Khoonsari ◽  
Pablo Moreno ◽  
Sven Bergmann ◽  
Joachim Burman ◽  
Marco Capuccini ◽  
...  

Developing a robust and performant data analysis workflow that integrates all necessary components whilst still being able to scale over multiple compute nodes is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed in parallel using the Kubernetes container orchestrator. The access point is a virtual research environment which can be launched on-demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and established workflows can be re-used effortlessly by any novice user. We validate our method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study and one fluxomics study, showing that the method scales dynamically with increasing availability of computational resources. We achieved a complete integration of the major software suites, resulting in the first turn-key workflow encompassing all steps for mass-spectrometry-based metabolomics, including preprocessing, multivariate statistics, and metabolite identification. Microservices is a generic methodology that can serve any scientific discipline and opens up new types of large-scale integrative science.
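The microservice idea described above, tools as interchangeable containers chained into a workflow and fanned out over samples, can be sketched abstractly. Here `run_container` is a hypothetical stand-in for launching one Docker image (in the paper this is done by Kubernetes), and the stage names mirror the metabolomics steps listed in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

def run_container(image, payload):
    # Hypothetical stand-in for running one containerised tool; it just
    # records which image processed which input.
    return f"{image}({payload})"

# The turn-key metabolomics workflow as a linear chain of stages.
stages = ["preprocessing", "multivariate-stats", "metabolite-id"]

def process(sample):
    out = sample
    for image in stages:
        out = run_container(image, out)
    return out

# Independent samples scale out in parallel as resources allow.
samples = ["run_a", "run_b", "run_c"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process, samples))
```

Because each stage only sees its input and output, any container can be swapped for an alternative tool without touching the rest of the workflow, which is the reusability the microservice architecture buys.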


Author(s):  
Taiga Abe ◽  
Ian Kinsella ◽  
Shreya Saxena ◽  
Liam Paninski ◽  
John P. Cunningham

Abstract A major goal of computational neuroscience is to develop powerful analysis tools that operate on large datasets. These methods provide an essential toolset to unlock scientific insights from new experiments. Unfortunately, a major obstacle currently impedes progress: while existing analysis methods are frequently shared as open source software, the infrastructure needed to deploy these methods – at scale, reproducibly, cheaply, and quickly – remains totally inaccessible to all but a minority of expert users. As a result, many users cannot fully exploit these tools, due to constrained computational resources (limited or costly compute hardware) and/or mismatches in expertise (experimentalists vs. large-scale computing experts). In this work we develop Neuroscience Cloud Analysis As a Service (NeuroCAAS): a fully-managed infrastructure platform, based on modern large-scale computing advances, that makes state-of-the-art data analysis tools accessible to the neuroscience community. We offer NeuroCAAS as an open source service with a drag-and-drop interface, entirely removing the burden of infrastructure expertise, purchasing, maintenance, and deployment. NeuroCAAS is enabled by three key contributions. First, NeuroCAAS cleanly separates tool implementation from usage, allowing cutting-edge methods to be served directly to the end user with no need to read or install any analysis software. Second, NeuroCAAS automatically scales as needed, providing reliable, highly elastic computational resources that are more efficient than personal or lab-supported hardware, without management overhead. Finally, we show that many popular data analysis tools offered through NeuroCAAS outperform typical analysis solutions (in terms of speed and cost) while improving ease of use and maintenance, dispelling the myth that cloud compute is prohibitively expensive and technically inaccessible.
By removing barriers to fast, efficient cloud computation, NeuroCAAS can dramatically accelerate both the dissemination and the effective use of cutting-edge analysis tools for neuroscientific discovery.


2018 ◽  
Author(s):  
Li Chen ◽  
Bai Zhang ◽  
Michael Schnaubelt ◽  
Punit Shah ◽  
Paul Aiyetan ◽  
...  

Abstract Rapid development and wide adoption of mass spectrometry-based proteomics technologies have empowered scientists to study proteins and their modifications in complex samples on a large scale. This progress has also created unprecedented challenges for individual labs to store, manage and analyze proteomics data, both in the cost for proprietary software and high-performance computing, and the long processing time that discourages on-the-fly changes of data processing settings required in explorative and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI) support, for LC-MS/MS data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignment, false discovery rate estimation, protein inference, determination of protein post-translational modifications, and quantitation of specific (modified) peptides and proteins. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly based on analysis demand to achieve fast and efficient performance. Application of the pipeline to the analysis of large-scale iTRAQ/TMT LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at: https://bitbucket.org/mschnau/ms-pycloud/downloads/
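One of the pipeline stages named above, false discovery rate estimation, is commonly done by target-decoy competition: decoy (e.g. reversed-sequence) database hits above a score threshold estimate how many target hits at that threshold are false. A minimal sketch with made-up scores (not MS-PyCloud's actual implementation):

```python
# Toy peptide-spectrum matches: (search score, matched the decoy DB?).
psms = [(12.1, False), (11.5, False), (10.9, True), (10.2, False),
        (9.8, False), (9.1, True), (8.7, False), (8.0, True)]

def fdr_at(threshold, psms):
    # Count target and decoy hits passing the score threshold; decoy
    # hits estimate the number of false positives among the targets.
    targets = sum(1 for s, decoy in psms if s >= threshold and not decoy)
    decoys = sum(1 for s, decoy in psms if s >= threshold and decoy)
    return decoys / max(targets, 1)
```

Sweeping the threshold and keeping the loosest one whose estimated FDR stays under a chosen cutoff (typically 1%) yields the accepted spectrum assignments.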


2017 ◽  
Vol 14 (S339) ◽  
pp. 257-262 ◽  
Author(s):  
C. C. Thöne ◽  
A. de Ugarte Postigo ◽  
A. Mahabal ◽  
D. A. Kann

Abstract Wide-angle surveys at different wavelengths are already providing triggers for very different kinds of transients. The most interesting science is produced when new sources are followed up and characterised using the right instrumentation, telescopes and observing strategies. In the coming years, with new large-scale surveys such as ZTF and LSST, the number of triggers is expected to scale up massively. Furthermore, new observational windows, such as gravitational waves or neutrinos, are now opening and adding complexity to the picture. The instrumentation and strategies that we have been using over recent years may just not be appropriate for those new situations. In this Workshop we discussed the present and projected future of transient discovery, the instrumentation that will be needed for the follow-up of those targets, and the observing strategies, data analysis and community efforts that will be required to tackle the challenges that lie ahead of us.

