Simulation with RADinitio Improves RADseq Experimental Design and Sheds Light on Sources of Missing Data

2019 ◽  
Author(s):  
Angel G. Rivera-Colón ◽  
Nicolas C. Rochette ◽  
Julian M. Catchen

Abstract Restriction-site Associated DNA sequencing (RADseq) has become a powerful and versatile tool in modern population genomics, enabling large-scale genomic analyses in otherwise inaccessible biological systems. With its widespread use, different variants on the protocol have been developed to suit specific experimental needs. Researchers face the challenge of choosing the optimal molecular and sequencing protocols for their experimental design, an often-complicated process. Strategic errors can lead to improper data generation that has reduced power to answer biological questions. Here we present RADinitio, simulation software for the selection and optimization of RADseq experiments via the generation of sequencing data that behaves similarly to empirical sources. RADinitio provides an evolutionary simulation of populations, implementation of various RADseq protocols with customizable parameters, and thorough assessment of missing data. We test the software's efficacy using different RAD protocols across several organisms, highlighting the importance of protocol selection for the magnitude and quality of data acquired. Additionally, we test the effects of RAD library preparation and sequencing on allelic dropout, observing that library preparation and sequencing often contribute more to missing alleles than population-level variation does.
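The abstract's two sources of allelic dropout can be sketched with a toy probability model. The functional form and all parameter values below are illustrative assumptions, not RADinitio's actual simulation:

```python
import math

def expected_dropout(mut_rate, mean_depth):
    """Expected fraction of missing alleles at a RAD locus, combining two
    sources (toy model; rates and functional form are assumptions, not
    RADinitio's actual machinery):
      1. polymorphism: a mutation disrupts the restriction site, so the
         allele is never cut and never enters the library;
      2. library prep/sequencing: an intact allele simply draws zero
         reads. With total locus depth d split over two alleles, a
         Poisson(d/2) read count is zero with probability exp(-d/2).
    """
    p_zero_reads = math.exp(-mean_depth / 2)
    # An allele is missing if either source removes it.
    return mut_rate + (1 - mut_rate) * p_zero_reads

# At a small cut-site mutation rate and a modest 6x locus depth, the
# sequencing term exp(-3) ~ 0.05 dominates the polymorphism term, echoing
# the abstract's observation about library preparation and sequencing.
dropout = expected_dropout(mut_rate=0.005, mean_depth=6)
```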

2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Andreas Friedrich ◽  
Erhan Kenar ◽  
Oliver Kohlbacher ◽  
Sven Nahnsen

Big data bioinformatics aims at drawing biological conclusions from huge and complex biological datasets. Added value from the analysis of big data, however, is only possible if the data is accompanied by accurate metadata annotation. Particularly in high-throughput experiments, intelligent approaches are needed to keep track of the experimental design, including the conditions that are studied as well as information that might be useful for failure analysis or for further experiments in the future. In addition to the management of this information, means for an integrated design and interfaces for structured data annotation are urgently needed by researchers. Here, we propose a factor-based experimental design approach that enables scientists to easily create large-scale experiments with the help of a web-based system. We present a novel implementation of a web-based interface allowing the collection of arbitrary metadata. To exchange and edit information, we provide a spreadsheet-based, human-readable format. Subsequently, sample sheets with identifiers and metainformation for data generation facilities can be created. Data files created after measurement of the samples can be uploaded to a datastore, where they are automatically linked to the previously created experimental design model.
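The factor-based design idea, expanding declared factors into a full sample sheet with unique identifiers, can be sketched in a few lines. The factor names and the identifier scheme below are hypothetical, not the paper's actual schema:

```python
import itertools

# Hypothetical two-level factors for a small design (illustrative only).
factors = {
    "genotype": ["WT", "KO"],
    "treatment": ["control", "drug"],
    "timepoint_h": ["0", "24"],
}

def build_sample_sheet(factors, replicates=2):
    """Expand a factor dict into one row per factor-level combination and
    replicate, assigning each row a unique sample identifier."""
    rows = []
    for i, combo in enumerate(itertools.product(*factors.values())):
        for rep in range(1, replicates + 1):
            row = dict(zip(factors, combo))  # factor name -> chosen level
            row["replicate"] = rep
            row["sample_id"] = f"S{i:03d}R{rep}"
            rows.append(row)
    return rows

# 2 genotypes x 2 treatments x 2 timepoints x 2 replicates = 16 samples.
sheet = build_sample_sheet(factors)
```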


2017 ◽  
Author(s):  
Matthew Amodio ◽  
David van Dijk ◽  
Krishnan Srinivasan ◽  
William S Chen ◽  
Hussein Mohsen ◽  
...  

Abstract Biomedical researchers are generating high-throughput, high-dimensional single-cell data at a staggering rate. As costs of data generation decrease, experimental design is moving towards measurement of many different single-cell samples in the same dataset. These samples can correspond to different patients, conditions, or treatments. While scalability of methods to datasets of these sizes is a challenge on its own, dealing with large-scale experimental design presents a whole new set of problems, including batch effects and sample comparison issues. Currently, there are no computational tools that can both handle large amounts of data in a scalable manner (many cells) and at the same time deal with many samples (many patients or conditions). Moreover, data analysis currently involves the use of different tools that each operate on their own data representation, not guaranteeing a synchronized analysis pipeline. For instance, data visualization methods can be disjoint and mismatched with the clustering method. To address this, we present SAUCIE, a deep neural network that leverages the high degree of parallelization and scalability offered by neural networks, as well as the deep representation of data that they can learn, to perform many single-cell data analysis tasks, all on a unified representation. A well-known limitation of neural networks is their lack of interpretability. Our key contribution here is a set of newly formulated regularizations (penalties) that render features learned in hidden layers of the neural network interpretable. When large multi-patient datasets are fed into SAUCIE, the various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization, unsupervised clustering, as well as other information that can be used to explore the data. We show this capability by analyzing a newly generated 180-sample dataset consisting of T cells from dengue patients in India, measured with mass cytometry.
We show that SAUCIE, for the first time, can batch correct and process this 11-million-cell dataset to identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue on the basis of single-cell measurements.
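SAUCIE's batch-correction penalty is a maximal mean discrepancy (MMD)-style term that measures how far apart two batches are as distributions. A scalar toy version makes the idea concrete; this is an illustrative sketch, not the paper's implementation, which applies the penalty to network activations:

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd2(a, b, sigma=1.0):
    """Squared maximal mean discrepancy between two 1-D samples under a
    Gaussian kernel: small when the samples look alike, large when one
    batch is shifted relative to the other."""
    k_aa = sum(gaussian_kernel(x, y, sigma) for x in a for y in a) / len(a) ** 2
    k_bb = sum(gaussian_kernel(x, y, sigma) for x in b for y in b) / len(b) ** 2
    k_ab = sum(gaussian_kernel(x, y, sigma) for x in a for y in b) / (len(a) * len(b))
    return k_aa + k_bb - 2 * k_ab

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(50)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(50)]   # simulated batch effect
aligned = [x - 2.0 for x in shifted]                 # after removing the shift
# Minimizing an MMD penalty during training drives the shifted batch
# toward the reference, which is the batch-correction effect.
```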


2018 ◽  
Author(s):  
Jordan M Singer ◽  
Darwin Y Fu ◽  
Jacob J Hughey

Simulated data are invaluable for assessing a computational method's ability to distinguish signal from noise. Although many biological systems show rhythmicity, there is no general-purpose tool to simulate large-scale, rhythmic data. Here we present Simphony, an R package for simulating data from experiments in which the abundances of rhythmic and non-rhythmic features (e.g., genes) are measured at multiple time points in multiple conditions. Simphony has parameters for specifying experimental design and each feature's rhythmic properties (e.g., shape, amplitude, and phase). In addition, Simphony can sample measurements from Gaussian and negative binomial distributions, the latter of which approximates read counts from next-generation sequencing data. We show an example of using Simphony to benchmark a method for detecting rhythms. Our results suggest that Simphony can aid experimental design and computational method development. Simphony is thoroughly documented and freely available at https://github.com/hugheylab/simphony.
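A minimal Python analogue of what Simphony does (Simphony itself is an R package; this sketch mirrors only the Gaussian case, and all parameter values are illustrative):

```python
import math
import random

def simulate_feature(times, baseline=5.0, amplitude=2.0, phase=0.0,
                     period=24.0, sd=0.5, seed=1):
    """Simulate one rhythmic feature: expected abundance follows a cosine
    of the given amplitude, phase, and period, and each measurement adds
    Gaussian noise. (Simphony can also draw negative binomial counts to
    approximate sequencing reads; that case is omitted here.)"""
    rng = random.Random(seed)
    expected = [baseline + amplitude * math.cos(2 * math.pi * (t - phase) / period)
                for t in times]
    return [mu + rng.gauss(0.0, sd) for mu in expected]

times = [0, 4, 8, 12, 16, 20]          # sampling times in hours
rhythmic = simulate_feature(times)     # a rhythmic feature
flat = simulate_feature(times, amplitude=0.0)  # a non-rhythmic control
```

Setting amplitude to zero yields the non-rhythmic features needed to benchmark a rhythm-detection method's false-positive rate.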


2017 ◽  
Author(s):  
René Luijk ◽  
Koen F. Dekkers ◽  
Maarten van Iterson ◽  
Wibowo Arindrarto ◽  
Annique Claringbould ◽  
...  

ABSTRACT Identification of the causal drivers behind regulatory gene networks is crucial to understanding gene function. We developed a method for the large-scale inference of gene-gene interactions in observational population genomics data that are both directed (using local genetic instruments as causal anchors, akin to Mendelian randomization) and specific (by controlling for linkage disequilibrium and pleiotropy). Analysis of genotype and whole-blood RNA-sequencing data from 3,072 individuals identified 49 genes as drivers of downstream transcriptional changes (P < 7 × 10⁻¹⁰), among which transcription factors were overrepresented (P = 3.3 × 10⁻⁷). Our analysis suggests new gene functions and targets, including for SENP7 (zinc-finger genes involved in retroviral repression) and BCL2A1 (novel target genes possibly involved in auditory dysfunction). Our work highlights the utility of population genomics data in deriving directed gene expression networks. A resource of trans-effects for all 6,600 genes with a genetic instrument can be explored individually using a web-based browser.
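The "local genetic instrument as causal anchor" idea can be illustrated with a Wald-ratio estimate on synthetic data. This toy omits the paper's controls for linkage disequilibrium and pleiotropy, and the effect sizes are invented:

```python
import random

def wald_ratio(g, x, y):
    """Directed effect of gene x on gene y, anchored by genetic variant g
    (Mendelian-randomization-style sketch): beta = cov(g, y) / cov(g, x).
    Because genotypes are assigned at conception, the ratio is not
    confounded by factors acting downstream of g."""
    def cov(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov(g, y) / cov(g, x)

rng = random.Random(0)
n = 5000
g = [rng.randint(0, 2) for _ in range(n)]      # variant dosage (0/1/2)
x = [0.5 * gi + rng.gauss(0, 1) for gi in g]   # local (cis) effect of g on x
y = [0.8 * xi + rng.gauss(0, 1) for xi in x]   # true causal effect x -> y
beta = wald_ratio(g, x, y)  # should recover roughly 0.8
```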


Author(s):  
Taylor Reiter ◽  
Phillip T. Brooks ◽  
Luiz Irber ◽  
Shannon E.K. Joslin ◽  
Charles M. Reid ◽  
...  

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.

Author Summary

We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.


2017 ◽  
Author(s):  
Riku Katainen ◽  
Iikki Donner ◽  
Tatiana Cajuso ◽  
Eevi Kaasinen ◽  
Kimmo Palin ◽  
...  

Abstract Next-generation sequencing (NGS) is routinely applied in the life sciences and in clinical practice, where interpretation of the resulting massive data has become a critical challenge. Computational workflows, such as the Broad GATK, have been established to take raw sequencing data and produce processed data for downstream analyses. Consequently, the results of these computationally demanding workflows, consisting of e.g. sequence alignment and variant calling, are increasingly being provided to customers by sequencing and bioinformatics facilities. However, downstream variant analysis, at the whole-genome level in particular, has lacked a multi-purpose tool that could take advantage of rapidly growing genomic information and integrate genetic variant, sequence, genomic annotation and regulatory (e.g. ENCODE) data interactively and in a visual fashion. Here we introduce highly efficient and user-friendly software, BasePlayer (http://baseplayer.fi), for biological discovery in large-scale NGS data. BasePlayer enables tightly integrated comparative variant analysis and visualization of thousands of NGS samples and millions of variants, with numerous applications in disease, regulatory and population genomics. Although BasePlayer has been designed primarily for whole-genome and exome sequencing data, it is well suited to various study settings, diseases and organisms, as it supports standard and upcoming file formats. BasePlayer transforms an ordinary desktop computer into a large-scale genomic research platform, enabling even non-technical users to perform complex comparative variant analyses, population frequency filtering and genome-level annotation under an intuitive, scalable and highly responsive user interface, facilitating everyday genetic research as well as the search for novel discoveries.


GigaScience ◽  
2021 ◽  
Vol 10 (1) ◽  
Author(s):  
Taylor Reiter ◽  
Phillip T Brooks† ◽  
Luiz Irber† ◽  
Shannon E K Joslin† ◽  
Charles M Reid† ◽  
...  

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
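The core behavior such workflow systems automate, running a step only when its inputs are available and its outputs are still needed, can be sketched in a few lines. The rule names and file names below are hypothetical, and real systems such as Snakemake or Nextflow add software environments, cluster execution, and logging on top of this idea:

```python
def plan(rules, have, want):
    """Return the rule names to run, in dependency order, to produce the
    files in `want` from the files already in `have`. Each rule maps a
    name to a (inputs, outputs) pair, as in a data-centric workflow."""
    producers = {out: name for name, (ins, outs) in rules.items() for out in outs}
    ordered, done = [], set(have)

    def need(target):
        if target in done:
            return
        rule = producers[target]
        ins, outs = rules[rule]
        for i in ins:
            need(i)  # satisfy upstream requirements first
        if rule not in ordered:
            ordered.append(rule)
        done.update(outs)

    for w in want:
        need(w)
    return ordered

# Hypothetical three-step sequencing analysis.
rules = {
    "trim":  (["reads.fq"], ["trimmed.fq"]),
    "align": (["trimmed.fq", "ref.fa"], ["aln.bam"]),
    "call":  (["aln.bam"], ["variants.vcf"]),
}
steps = plan(rules, have={"reads.fq", "ref.fa"}, want={"variants.vcf"})
```

Because execution order is derived from declared inputs and outputs rather than hand-maintained scripts, inserting a tool or changing a parameter only requires editing one rule; the system re-plans the rest.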


Genes ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 258
Author(s):  
Karim Karimi ◽  
Duy Ngoc Do ◽  
Mehdi Sargolzaei ◽  
Younes Miar

Characterizing the genetic structure and population history can facilitate the development of genomic breeding strategies for the American mink. In this study, we used the whole-genome sequences of 100 mink from the Canadian Centre for Fur Animal Research (CCFAR) at the Dalhousie Faculty of Agriculture (Truro, NS, Canada) and Millbank Fur Farm (Rockwood, ON, Canada) to investigate their population structure, genetic diversity and linkage disequilibrium (LD) patterns. Analysis of molecular variance (AMOVA) indicated that the variation among color-types was significant (p < 0.001) and accounted for 18% of the total variation. The admixture analysis revealed that assuming three ancestral populations (K = 3) provided the lowest cross-validation error (0.49). The effective population size (Ne) five generations ago was estimated to be 99 and 50 for CCFAR and Millbank Fur Farm, respectively. The LD patterns revealed that the average r² fell below 0.2 at genomic distances of >20 kb and >100 kb in CCFAR and Millbank Fur Farm, respectively, suggesting that densities of 120,000 and 24,000 single nucleotide polymorphisms (SNPs) would provide adequate accuracy of genomic evaluation in these populations. These results indicated that accounting for admixture is critical when designing SNP panels for genotype-phenotype association studies of American mink.
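The r² statistic behind the panel-density argument is computed from haplotype and allele frequencies. A minimal sketch, with frequencies that are illustrative rather than taken from the mink data:

```python
def ld_r2(p_ab, p_a, p_b):
    """r-squared linkage disequilibrium between two biallelic loci, from
    the frequency p_ab of the AB haplotype and the allele frequencies
    p_a and p_b:
        D  = p_ab - p_a * p_b
        r2 = D**2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    """
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Perfect coupling (only AB and ab haplotypes) gives r2 = 1; independent
# loci (p_ab = p_a * p_b) give r2 = 0. Averaging r2 within distance bins
# yields the decay curves used to choose SNP panel density.
perfect = ld_r2(p_ab=0.5, p_a=0.5, p_b=0.5)
independent = ld_r2(p_ab=0.25, p_a=0.5, p_b=0.5)
```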


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yanan Ren ◽  
Ting-You Wang ◽  
Leah C. Anderton ◽  
Qi Cao ◽  
Rendong Yang

Abstract Background Long non-coding RNAs (lncRNAs) are a growing focus in cancer research. Deciphering pathways influenced by lncRNAs is important to understand their role in cancer. Although knock-down or overexpression of lncRNAs followed by gene expression profiling in cancer cell lines are established approaches to address this problem, these experimental data are not available for a majority of the annotated lncRNAs. Results As a surrogate, we present lncGSEA, a convenient tool to predict the lncRNA associated pathways through Gene Set Enrichment Analysis of gene expression profiles from large-scale cancer patient samples. We demonstrate that lncGSEA is able to recapitulate lncRNA associated pathways supported by literature and experimental validations in multiple cancer types. Conclusions LncGSEA allows researchers to infer lncRNA regulatory pathways directly from clinical samples in oncology. LncGSEA is written in R, and is freely accessible at https://github.com/ylab-hi/lncGSEA.
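The running-sum statistic at the core of Gene Set Enrichment Analysis can be sketched in its simplest, unweighted form. lncGSEA uses the standard weighted GSEA statistic; the gene names below are placeholders:

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted (Kolmogorov-Smirnov-style) enrichment score: walk down
    the ranked gene list, stepping up at gene-set members and down at
    non-members, and return the signed maximum deviation from zero.
    Members concentrated at the top of the ranking give a positive ES."""
    n = len(ranked_genes)
    hits = sum(g in gene_set for g in ranked_genes)
    misses = n - hits
    up, down = 1.0 / hits, 1.0 / misses
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        if abs(running) > abs(best):
            best = running
    return best

# Genes ranked by, e.g., correlation with an lncRNA's expression across
# patient samples (placeholder names).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})  # set members at the top
```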

