Integrated Genome Browser: visual analytics platform for genomics

Mapping Intimacies ◽

10.1101/026351 ◽

2015 ◽

Author(s):

Nowlan H. Freese ◽

David C. Norris ◽

Ann E. Loraine

Keyword(s):

Open Source ◽

Visual Analytics ◽

Large Scale ◽

High Throughput Sequencing ◽

Genomic Data ◽

Genome Browser ◽

Data Availability ◽

Data Sets ◽

Related Data ◽

Genome Scale

Motivation: Genome browsers that support fast navigation and interactive visual analytics can help scientists achieve deeper insight into large-scale genomic data sets more quickly, thus accelerating the discovery process. Toward this end, we developed Integrated Genome Browser (IGB), a highly configurable, interactive and fast open source desktop genome browser. Results: Here we describe multiple updates to IGB, including all-new capability to display and interact with data from high-throughput sequencing experiments. To demonstrate, we describe example visualizations and analyses of data sets from RNA-Seq, ChIP-Seq, and bisulfite sequencing experiments. Understanding results from genome-scale experiments requires viewing the data in the context of reference genome annotations and other related data sets. To facilitate this, we enhanced IGB's ability to consume data from diverse sources, including Galaxy, Distributed Annotation, and IGB-specific Quickload servers. To support future visualization needs as new genome-scale assays enter wide use, we transformed the IGB codebase into a modular, extensible platform for developers to create and deploy all-new visualizations of genomic data. Availability: IGB is open source and is freely available from http://bioviz.org/igb.

Data safe havens to combine health and genomic data: benefits and challenges

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.348 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Kerina H Jones ◽

Arron S Lacey ◽

Brian L Perkins ◽

Mark I Rees

Keyword(s):

Association Studies ◽

Genomic Data ◽

Population Level ◽

Data Availability ◽

Genome Wide Association Studies ◽

Related Data ◽

Research Areas ◽

Individual Privacy ◽

Access Controls ◽

Health Related

ABSTRACT Objectives: Data safe havens can bring together and combine a rich array of anonymised person-based data for research and policy evaluation within a secure setting. To date, the majority of available datasets have been structured micro-data derived from routine health-related records. Possibilities are opening up for the greater reuse of genomic data such as Genome Wide Association studies (GWAS) and Whole Exome/Genome Sequencing (WES or WGS). However, there are considerable challenges to be addressed if the benefits of using these data in combination with health-related data are to be realized safely. Approach: We explore the benefits and challenges of using genomic datasets with health-related data and, using the Secure Anonymised Information Linkage (SAIL) system as a case study, the implications and way forward for Data Safe Havens in seeking to incorporate genomic data for use with health-related data. Results: The benefits of using GWAS, WES and WGS data in conjunction with health-related data include the potential to explore genetics at a population level and open up novel research areas. These include the ability to increasingly stratify and personalize how medical indications are detected and treated through precision medicine by understanding rare conditions and adding socioeconomic and environmental context to genomic data. Among the challenges are: data availability, computing capacity, technical solutions, legal and regulatory frameworks, public perceptions, individual privacy and organizational risk. Many of the challenges within these areas are common to person-based data in general, and often Data Safe Havens have been designed to address these. But there are also aspects of these challenges, and other challenges, specific to genomic data. These include issues due to the unknown clinical significance of genomic information now or in the future, with corresponding risks for privacy and impact on individuals.
Conclusion: Genomic data sets contain vast amounts of valuable information, some of which is currently undefined, but which may have direct bearing on individual health at some point. The use of these data in combination with health-related data has the potential to bring great benefits, better clinical trial stratification, epidemiology project design and clinical improvements. It is, therefore, essential that such data are surrounded by a properly-designed, robust governance framework including technical and procedural access controls that enable the data to be used safely.

Usage and Scaling of an Open-Source Spiking Multi-Area Model of Monkey Cortex

Lecture Notes in Computer Science - Brain-Inspired Computing ◽

10.1007/978-3-030-82427-3_4 ◽

2021 ◽

pp. 47-59

Author(s):

Sacha J. van Albada ◽

Jari Pronold ◽

Alexander van Meegen ◽

Markus Diesmann

Keyword(s):

Open Source ◽

Large Scale ◽

Network Models ◽

Macaque Monkey ◽

Source Model ◽

Model Specification ◽

Data Sets ◽

Neural Network Models ◽

Wide Range ◽

Ict Infrastructure

Abstract We are entering an age of ‘big’ computational neuroscience, in which neural network models are increasing in size and in numbers of underlying data sets. Consolidating the zoo of models into large-scale models simultaneously consistent with a wide range of data is only possible through the effort of large teams, which can be spread across multiple research institutions. To ensure that computational neuroscientists can build on each other’s work, it is important to make models publicly available as well-documented code. This chapter describes such an open-source model, which relates the connectivity structure of all vision-related cortical areas of the macaque monkey with their resting-state dynamics. We give a brief overview of how to use the executable model specification, which employs NEST as simulation engine, and show its runtime scaling. The solutions found serve as an example for organizing the workflow of future models from the raw experimental data to the visualization of the results, expose the challenges, and give guidance for the construction of an ICT infrastructure for neuroscience.

An evaluation of the accuracy and speed of metagenome analysis tools

Scientific Reports ◽

10.1038/srep19233 ◽

2016 ◽

Vol 6 (1) ◽

Cited By ~ 187

Author(s):

Stinus Lindgreen ◽

Karen L. Adair ◽

Paul P. Gardner

Keyword(s):

Aquatic Ecosystems ◽

Large Scale ◽

High Throughput Sequencing ◽

Data Sets ◽

Metagenome Analysis ◽

Analysis Tools ◽

Sequencing Platforms ◽

Capacity Data ◽

High Degree ◽

Realistic Data

Abstract Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition and functional capacity. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html

Large-Scale Analysis of Genetic and Clinical Patient Data

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-080917-013508 ◽

2018 ◽

Vol 1 (1) ◽

pp. 263-274 ◽

Cited By ~ 6

Author(s):

Marylyn D. Ritchie

Keyword(s):

Clinical Data ◽

Large Scale ◽

Data Science ◽

Genomic Analysis ◽

Genomic Data ◽

Data Sets ◽

Biomedical Data ◽

Data Types ◽

Phenotypic Data ◽

Clinical Patient

Biomedical data science has experienced an explosion of new data over the past decade. Abundant genetic and genomic data are increasingly available in large, diverse data sets due to the maturation of modern molecular technologies. Along with these molecular data, dense, rich phenotypic data are also available on comprehensive clinical data sets from health care provider organizations, clinical trials, population health registries, and epidemiologic studies. The methods and approaches for interrogating these large genetic/genomic and clinical data sets continue to evolve rapidly, as our understanding of the questions and challenges continues to emerge. In this review, the state-of-the-art methodologies for genetic/genomic analysis along with complex phenomics will be discussed. This field is changing and adapting to the novel data types made available, as well as technological advances in computation and machine learning. Thus, I will also discuss the future challenges in this exciting and innovative space. The promises of precision medicine rely heavily on the ability to marry complex genetic/genomic data with clinical phenotypes in meaningful ways.

A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets

BioMed Research International ◽

10.1155/2015/218068 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 4

Author(s):

Yipu Zhang ◽

Ping Wang

Keyword(s):

High Throughput ◽

Motif Discovery ◽

Large Scale ◽

High Throughput Sequencing ◽

Es Cells ◽

Motif Finding ◽

Data Sets ◽

Data Set ◽

Binding Motifs ◽

Motif Finding Algorithm

ChIP-seq, a new high-throughput technique that couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it finds the enriched substrings and then searches the neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. The whole detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
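To make the (l, d) notation concrete, the following is a minimal sketch, not FCmotif itself. It assumes the usual planted-motif reading of the term: a pattern of length l whose instances may differ from it in at most d positions. The function names `hamming` and `find_instances` and the toy sequences are hypothetical. Because it reports every matching window, a sequence may contribute zero, one, or several instances, which is the situation that arises once the OOPS (one occurrence per sequence) constraint is dropped.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(motif: str, sequences: list[str], d: int) -> list[tuple[int, int, str]]:
    """Brute-force scan for (l, d) instances of `motif`, where l = len(motif).

    Returns (sequence index, start position, matched window) for every
    window within Hamming distance d of the motif. This is an illustrative
    baseline, not the enriched-substring strategy described in the abstract.
    """
    l = len(motif)
    hits = []
    for seq_index, seq in enumerate(sequences):
        for pos in range(len(seq) - l + 1):
            window = seq[pos:pos + l]
            if hamming(motif, window) <= d:
                hits.append((seq_index, pos, window))
    return hits


# Toy data (hypothetical): look for (4, 1) instances of "ACGT".
toy_sequences = ["ACGTACGT", "TTACGATT"]
instances = find_instances("ACGT", toy_sequences, d=1)
```

The scan costs time proportional to the total sequence length times l for a single candidate motif, which is the kind of expense that motivates faster strategies on large ChIP-seq data sets.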

WEB MAPPING ARCHITECTURES BASED ON OPEN SPECIFICATIONS AND FREE AND OPEN SOURCE SOFTWARE IN THE WATER DOMAIN

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-annals-iv-2-w4-23-2017 ◽

2017 ◽

Vol IV-2/W4 ◽

pp. 23-30

Author(s):

C. Arias Muñoz ◽

M. A. Brovelli ◽

C. E. Kilsedar ◽

R. Moreno-Sanchez ◽

D. Oxoli

Keyword(s):

Data Collection ◽

Open Source ◽

Open Source Software ◽

Data Availability ◽

End User ◽

Web Based ◽

Web Mapping ◽

Related Data ◽

Data Formats ◽

Information Assets

The availability of water-related data and information across different geographical and jurisdictional scales is of critical importance for the conservation and management of water resources in the 21st century. Today information assets are often found fragmented across multiple agencies that use incompatible data formats and procedures for data collection, storage, maintenance, analysis, and distribution. The growing adoption of Web mapping systems in the water domain is reducing the gap between data availability and its practical use and accessibility. Nevertheless, more attention must be given to the design and development of these systems to achieve high levels of interoperability and usability while fulfilling different end user informational needs. This paper first presents a brief overview of technologies used in the water domain, and then presents three examples of Web mapping architectures based on free and open source software (FOSS) and the use of open specifications (OS) that address different users’ needs for data sharing, visualization, manipulation, scenario simulations, and map production. The purpose of the paper is to illustrate how the latest developments in OS for geospatial and water-related data collection, storage, and sharing, combined with the use of mature FOSS projects facilitate the creation of sophisticated interoperable Web-based information systems in the water domain.

MS-PyCloud: An open-source, cloud computing-based pipeline for LC-MS/MS data analysis

10.1101/320887 ◽

2018 ◽

Cited By ~ 2

Author(s):

Li Chen ◽

Bai Zhang ◽

Michael Schnaubelt ◽

Punit Shah ◽

Paul Aiyetan ◽

...

Keyword(s):

Cloud Computing ◽

Data Analysis ◽

Open Source ◽

High Performance ◽

Large Scale ◽

Rapid Development ◽

Data File ◽

Data Sets ◽

Proteomics Data ◽

Amazon Web Services

ABSTRACT Rapid development and wide adoption of mass spectrometry-based proteomics technologies have empowered scientists to study proteins and their modifications in complex samples on a large scale. This progress has also created unprecedented challenges for individual labs to store, manage and analyze proteomics data, both in the cost for proprietary software and high-performance computing, and the long processing time that discourages on-the-fly changes of data processing settings required in explorative and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI) support, for LC-MS/MS data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignment, false discovery rate estimation, protein inference, determination of protein post-translation modifications, and quantitation of specific (modified) peptides and proteins. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly based on analysis demand to achieve fast and efficient performance. Application of the pipeline to the analysis of large-scale iTRAQ/TMT LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at: https://bitbucket.org/mschnau/ms-pycloud/downloads/

Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics

Molecular Biology and Evolution ◽

10.1093/molbev/msaa130 ◽

2020 ◽

Vol 37 (10) ◽

pp. 3047-3060

Author(s):

Xiang Ji ◽

Zhenyu Zhang ◽

Andrew Holbrook ◽

Akihiko Nishimura ◽

Guy Baele ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Linear Time ◽

Phylogenetic Reconstruction ◽

Fold Increase ◽

Time Algorithm ◽

Data Sets ◽

Lassa Virus ◽

Computational Performance ◽

Computational Bottleneck

Abstract Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. Order O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N²) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
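For orientation, the standard pruning algorithm that the abstract takes as its baseline can be sketched as a single post-order pass that computes the log-likelihood of one alignment site. This is a minimal illustration and not the linear-time gradient method of the paper. The JC69 substitution model, the nested-list tree encoding, and the names `jc69`, `partial`, and `log_likelihood` are all assumptions made here for brevity. JC69 is stationary and reversible, a restriction the paper's method does not need.

```python
import math

NUCS = "ACGT"


def jc69(t: float) -> list[list[float]]:
    """4x4 JC69 transition probabilities for a branch of length t
    (expected substitutions per site). JC69 is an assumption of this sketch."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial(node) -> list[float]:
    """Post-order partial likelihoods: entry x is P(data below node | state x).

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs (a hypothetical encoding).
    """
    if isinstance(node, str):
        return [1.0 if x == node else 0.0 for x in NUCS]
    result = [1.0] * 4
    for child, t in node:
        p = jc69(t)
        below = partial(child)
        for x in range(4):
            result[x] *= sum(p[x][y] * below[y] for y in range(4))
    return result


def log_likelihood(tree) -> float:
    """Site log-likelihood with the uniform JC69 root distribution."""
    return math.log(sum(0.25 * v for v in partial(tree)))


# Toy tree (hypothetical): two leaves, both observed as "A".
toy_tree = [("A", 0.1), ("A", 0.2)]
toy_ll = log_likelihood(toy_tree)
```

One call visits every branch once. Obtaining each of the O(N) branch-specific derivatives by rerunning a pass of this kind is what leads to the quadratic cost that the abstract contrasts with its linear-time algorithm.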

DataMeadow: A Visual Canvas for Analysis of Large-Scale Multivariate Data

Information Visualization ◽

10.1057/palgrave.ivs.9500170 ◽

2008 ◽

Vol 7 (1) ◽

pp. 18-33 ◽

Cited By ~ 34

Author(s):

Niklas Elmqvist ◽

John Stasko ◽

Philippas Tsigas

Keyword(s):

Visual Analytics ◽

Large Scale ◽

Multidimensional Data ◽

Data Sets ◽

Data Set ◽

Data Dependencies ◽

Expert Review ◽

History Of ◽

Multidimensional Data Sets ◽

High Degree

Supporting visual analytics of multiple large-scale multidimensional data sets requires a high degree of interactivity and user control beyond the conventional challenges of visualizing such data sets. We present the DataMeadow, a visual canvas providing rich interaction for constructing visual queries using graphical set representations called DataRoses. A DataRose is essentially a starplot of selected columns in a data set displayed as multivariate visualizations with dynamic query sliders integrated into each axis. The purpose of the DataMeadow is to allow users to create advanced visual queries by iteratively selecting and filtering into the multidimensional data. Furthermore, the canvas provides a clear history of the analysis that can be annotated to facilitate dissemination of analytical results to stakeholders. A powerful direct manipulation interface allows for selection, filtering, and creation of sets, subsets, and data dependencies. We have evaluated our system using a qualitative expert review involving two visualization researchers. Results from this review are favorable for the new method.

An evaluation of the accuracy and speed of metagenome analysis tools

10.1101/017830 ◽

2015 ◽

Cited By ~ 10

Author(s):

Stinus Lindgreen ◽

Karen L Adair ◽

Paul Gardner

Keyword(s):

Aquatic Ecosystems ◽

Large Scale ◽

High Throughput Sequencing ◽

State Of The Art ◽

Data Sets ◽

Metagenome Analysis ◽

Analysis Tools ◽

Sequencing Platforms ◽

High Degree ◽

Realistic Data

Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments from terrestrial and aquatic ecosystems to human skin and gut. With the advent of high-throughput sequencing platforms, the use of large scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools is lacking. Here, we present a benchmark where the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming, and that there is a high degree of variability between available tools. These findings are important as the conclusions of any metagenomics study are affected by errors in the predicted community composition. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html
