MLPAnalyzer: Data Analysis Tool for Reliable Automated Normalization of MLPA Fragment Data

2008 ◽  
Vol 30 (4) ◽  
pp. 323-335
Author(s):  
Jordy Coffa ◽  
Mark A. van de Wiel ◽  
Begoña Diosdado ◽  
Beatriz Carvalho ◽  
Jan Schouten ◽  
...  

Background: Multiplex Ligation-dependent Probe Amplification (MLPA) is a rapid, simple, reliable and customized method for detection of copy number changes of individual genes at high resolution, and it allows for high-throughput analysis. This technique is typically applied for studying specific genes in large sample series. The large amount of data, dissimilarities in PCR efficiency among the different probe amplification products, and sample-to-sample variation pose a challenge to data analysis and interpretation. We therefore set out to develop an MLPA data analysis strategy and tool that is simple to use, while still taking into account the above-mentioned sources of variation. Materials and Methods: MLPAnalyzer was developed in Visual Basic for Applications, and can accept a large number of file formats directly from capillary sequence systems. Sizes of all MLPA probe signals are determined and filtered, quality control steps are performed, and variation in peak intensity related to size is corrected for. DNA copy number ratios of test samples are computed and displayed in a table view, and a set of comprehensive figures is generated. To validate this approach, MLPA reactions were performed using a dedicated MLPA mix on 6 different colorectal cancer cell lines. The generated data were normalized using our program and the results were compared to previously obtained array-CGH results using both statistical methods and visual examination. Results and Discussion: Visual examination of bar graphs and direct ratios for both techniques showed very similar results, while the average Pearson product-moment correlation over all MLPA probes was found to be 0.42. Our results thus show that automated MLPA data processing following our suggested strategy may be of significant use, especially when handling large MLPA data sets, when samples are of different quality, or when interpretation of MLPA electropherograms is too complex. It remains important, however, to recognize that automated MLPA data processing may only be successful when a dedicated experimental setup is also considered.
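A minimal Python sketch of the kind of copy number ratio computation the abstract describes is given below. It is illustrative only: MLPAnalyzer itself is implemented in Visual Basic for Applications and additionally performs peak filtering, quality control, and size-dependent intensity correction, none of which are reproduced here. Probe names and intensities are hypothetical.

```python
# Illustrative sketch of MLPA-style copy number ratio computation.
# Not MLPAnalyzer's actual code; probe names and intensities are hypothetical.

def copy_number_ratios(test_peaks, reference_peaks, control_probe_ids):
    """Compute per-probe DNA copy number ratios for one test sample.

    test_peaks / reference_peaks: dict probe_id -> peak intensity
    control_probe_ids: probes assumed to be copy-number neutral.
    """
    # Intra-sample normalization: scale each sample by its control-probe signal.
    test_norm = sum(test_peaks[p] for p in control_probe_ids)
    ref_norm = sum(reference_peaks[p] for p in control_probe_ids)

    ratios = {}
    for probe, signal in test_peaks.items():
        rel_test = signal / test_norm                 # relative signal in test sample
        rel_ref = reference_peaks[probe] / ref_norm   # relative signal in reference sample
        ratios[probe] = rel_test / rel_ref            # ~1.0 normal, ~0.5 loss, ~1.5 gain
    return ratios


if __name__ == "__main__":
    test = {"TP53_ex1": 900.0, "TP53_ex4": 450.0, "CTRL_1": 1000.0, "CTRL_2": 980.0}
    ref = {"TP53_ex1": 950.0, "TP53_ex4": 940.0, "CTRL_1": 1010.0, "CTRL_2": 990.0}
    print(copy_number_ratios(test, ref, ["CTRL_1", "CTRL_2"]))
```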

2018 ◽  
Vol 10 (3) ◽  
pp. 76-90
Author(s):  
Ye Tao ◽  
Xiaodong Wang ◽  
Xiaowei Xu

This article describes how rapidly growing data volumes require systems that can handle massive, heterogeneous, unstructured data sets. However, most existing mature transaction processing systems are built upon relational databases with structured data. In this article, the authors design a hybrid development framework that offers greater scalability and flexibility for data analysis and reporting, while keeping maximum compatibility and links to the legacy platforms on which transaction business logic runs. Data, service and user interfaces are implemented as a toolset stack for developing applications with information retrieval, data processing, analysis and visualization functionality. A use case of healthcare data integration is presented as an example, where information is collected and aggregated from diverse sources. The workflow and a simulation of data processing and visualization are also discussed to validate the effectiveness of the proposed framework.
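The article's framework is not reproduced here; the sketch below only illustrates the kind of aggregation step described in the healthcare integration use case, pulling records from one structured and one semi-structured source into a common shape. The source formats, field names and helper functions are hypothetical.

```python
# Illustrative aggregation of heterogeneous healthcare records into a common
# structure. Input formats and field names are hypothetical, not the article's schema.
import csv
import json
from typing import Iterable


def from_csv(path: str) -> Iterable[dict]:
    """Structured source, e.g. an export from a relational transaction system."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            yield {"patient_id": row["id"], "event": row["event"], "value": row["value"]}


def from_json_lines(path: str) -> Iterable[dict]:
    """Semi-structured source, e.g. device or log data, one JSON object per line."""
    with open(path) as fh:
        for line in fh:
            rec = json.loads(line)
            yield {"patient_id": rec.get("patientId"),
                   "event": rec.get("type"),
                   "value": rec.get("reading")}


def aggregate(*sources: Iterable[dict]) -> list[dict]:
    """Merge all sources into one analysis-ready list of uniform records."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged
```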


SPE Journal ◽  
2017 ◽  
Vol 23 (03) ◽  
pp. 719-736 ◽  
Author(s):  
Quan Cai ◽  
Wei Yu ◽  
Hwa Chi Liang ◽  
Jenn-Tai Liang ◽  
Suojin Wang ◽  
...  

Summary: The oil-and-gas industry is entering an era of "big data" because of the huge number of wells drilled with the rapid development of unconventional oil-and-gas reservoirs during the past decade. The massive amount of data generated presents a great opportunity for the industry to use data-analysis tools to help make informed decisions. The main challenge is the lack of effective and efficient data-analysis tools for analyzing and extracting useful information for the decision-making process from the enormous amount of data available. In developing tight shale reservoirs, it is critical to have an optimal drilling strategy, thereby minimizing the risk of drilling in areas that would result in low-yield wells. The objective of this study is to develop an effective data-analysis tool capable of dealing with big and complicated data sets to identify hot zones in tight shale reservoirs with the potential to yield highly productive wells. The proposed tool is developed on the basis of nonparametric smoothing models, which are superior to traditional multiple-linear-regression (MLR) models in both predictive power and the ability to deal with nonlinear, higher-order variable interactions. This data-analysis tool is capable of handling one response variable and multiple predictor variables. To validate our tool, we used two real data sets: one with 249 tight oil horizontal wells from the Middle Bakken and the other with 2,064 shale gas horizontal wells from the Marcellus Shale. Results from the two case studies revealed that our tool not only achieves much better predictive power than traditional MLR models in identifying hot zones in tight shale reservoirs but also provides guidance on developing optimal drilling and completion strategies (e.g., well length and depth, amount of proppant and water injected). By comparing results from the two data sets, we found that, with only four predictor variables, our tool achieves model performance on the big data set (2,064 Marcellus wells) similar to that obtained on the small data set (249 Bakken wells) with six predictor variables. This implies that, for big data sets, even with a limited number of available predictor variables, our tool can still be very effective in identifying hot zones that would yield highly productive wells. The data sets that we have access to in this study contain very limited completion, geological, and petrophysical information. Results from this study clearly demonstrate that the data-analysis tool is powerful and flexible enough to take advantage of any additional engineering and geology data, allowing operators to gain insight into the impact of these factors on well performance.
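The sketch below contrasts a simple nonparametric regressor with multiple linear regression on synthetic well data, to illustrate why a smoothing-type model can capture the nonlinear, higher-order interactions the summary mentions. It is not the authors' statistical model; the k-nearest-neighbors smoother, the four predictors and the synthetic response are stand-ins chosen for brevity.

```python
# Compare a nonparametric smoother with multiple linear regression on synthetic
# well-productivity data. Features and response are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical predictors: lateral length, depth, proppant mass, water volume (scaled 0-1).
X = rng.uniform(0.0, 1.0, size=(n, 4))
# Nonlinear response with an interaction term plus noise stands in for first-year production.
y = 3 * X[:, 0] * X[:, 2] + np.sin(4 * X[:, 1]) + 0.5 * X[:, 3] + rng.normal(0, 0.1, n)

mlr = make_pipeline(StandardScaler(), LinearRegression())
smoother = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15))

print("MLR           R^2:", cross_val_score(mlr, X, y, cv=5, scoring="r2").mean())
print("Nonparametric R^2:", cross_val_score(smoother, X, y, cv=5, scoring="r2").mean())
```

On data like this the nonparametric fit typically scores a noticeably higher cross-validated R^2, since the linear model cannot represent the interaction and sine terms.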


2018 ◽  
Author(s):  
Teimuraz Matcharashvili ◽  
Takahiro Hatano ◽  
Tamaz Chelidze ◽  
Natalia Zhukova

Abstract. Here we investigated a statistical feature of earthquake time distribution in the southern California earthquake catalogue. As the main data analysis tool, we used a simple statistical approach based on the calculation of integral deviation times (IDT) from the time distribution of regular markers. The research objective is to define whether the process of earthquake time distribution approaches randomness. The effectiveness of the IDT calculation method was tested on a set of simulated color noise data sets with different extents of regularity. Standard methods of complex data analysis have also been used, such as power spectrum regression, Lempel and Ziv complexity, and recurrence quantification analysis, as well as multi-scale entropy calculation. After testing the IDT calculation method on simulated model data sets, we analyzed the variation in the extent of regularity in the southern California earthquake catalogue. The analysis was carried out for different time periods and at different magnitude thresholds. It was found that the extent of order in earthquake time distribution fluctuates over the catalogue. In particular, we show that the process of earthquake time distribution becomes most random-like in periods of decreased local seismic activity.


2018 ◽  
Vol 25 (3) ◽  
pp. 497-510 ◽  
Author(s):  
Teimuraz Matcharashvili ◽  
Takahiro Hatano ◽  
Tamaz Chelidze ◽  
Natalia Zhukova

Abstract. Here we investigated a statistical feature of earthquake time distributions in the southern California earthquake catalog. As the main data analysis tool, we used a simple statistical approach based on the calculation of integral deviation times (IDT) from the time distribution of regular markers. The research objective is to define whether and when the process of earthquake time distribution approaches randomness. The effectiveness of the IDT calculation method was tested on a set of simulated color noise data sets with different extents of regularity, as well as on Poisson process data sets. Standard methods of complex data analysis have also been used, such as power spectrum regression, Lempel and Ziv complexity, and recurrence quantification analysis, as well as multiscale entropy calculations. After testing the IDT calculation method on simulated model data sets, we analyzed the variation in the extent of regularity in the southern California earthquake catalog. The analysis was carried out for different periods and at different magnitude thresholds. It was found that the extent of order in earthquake time distributions fluctuates over the catalog. In particular, we show that in most cases the process of earthquake time distribution is less random in periods of strong earthquake occurrence than in periods of relatively decreased local seismic activity. We also noticed that the strongest earthquakes occur in periods when IDT values increase.
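The sketch below shows one plausible reading of an integral-deviation-times style statistic: the cumulative deviation of observed event times from equally spaced "regular" markers spanning the same interval, so a perfectly periodic sequence scores near zero and a random one scores higher. This is an assumption for illustration, not the authors' exact definition, and the test data are synthetic.

```python
# Illustrative integral-deviation-times style statistic (one plausible reading,
# not the authors' exact IDT definition). Event sequences here are synthetic.
import numpy as np


def integral_deviation_time(event_times: np.ndarray) -> float:
    """Sum of absolute deviations between sorted event times and a regular grid."""
    t = np.sort(np.asarray(event_times, dtype=float))
    markers = np.linspace(t[0], t[-1], len(t))   # perfectly regular occurrence times
    return float(np.sum(np.abs(t - markers)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    span = 1000.0
    regular = np.linspace(0.0, span, 200)               # periodic process
    random_like = rng.uniform(0.0, span, 200)           # Poisson-like process
    print("regular :", integral_deviation_time(regular))      # ~0
    print("random  :", integral_deviation_time(random_like))  # substantially larger
```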


CiteSpace, a visualization-based analysis tool, has been used to analyze literature data by visualizing the patterns and potential trends of a field. Previous studies show that, when applied to literature in Chinese, CiteSpace could only conduct very basic analyses, unlike its use with literature data in English. To address this limitation, this study presents an approach to improving the use of CiteSpace for effective analysis of literature data in Chinese. The approach employs data-processing and data-analysis scripts in the data collection, knowledge map generation, and interpretation steps to improve the accuracy and comprehensiveness of the analysis of literature data in Chinese. An empirical evaluation has been conducted to demonstrate the effectiveness of the approach.
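The study's actual scripts and input formats are not reproduced here; the sketch below only illustrates the kind of preprocessing such an approach might apply to Chinese-language bibliographic records before visualization, such as unifying mixed keyword separators. The CSV layout and field names are hypothetical.

```python
# Illustrative preprocessing of Chinese-language bibliographic records prior to
# visualization-based analysis. Input format and field names are hypothetical.
import csv


def clean_record(row: dict) -> dict:
    """Normalize one record: trim whitespace, unify keyword separators."""
    keywords = row.get("keywords", "")
    # Chinese exports often mix separators (；, ，, ;); unify them before analysis.
    for sep in ("；", "，", ","):
        keywords = keywords.replace(sep, ";")
    return {
        "title": row.get("title", "").strip(),
        "authors": [a.strip() for a in row.get("authors", "").split(";") if a.strip()],
        "year": row.get("year", "").strip(),
        "keywords": [k.strip() for k in keywords.split(";") if k.strip()],
    }


def preprocess(in_path: str, out_path: str) -> None:
    """Read a raw export, clean each record, and write an analysis-ready CSV."""
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["title", "authors", "year", "keywords"])
        for row in csv.DictReader(src):
            rec = clean_record(row)
            writer.writerow([rec["title"], ";".join(rec["authors"]),
                             rec["year"], ";".join(rec["keywords"])])
```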


2018 ◽  
Vol 2 ◽  
pp. 31 ◽  
Author(s):  
Greg Finak ◽  
Bryan Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released that facilitate such workflows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research workflows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual workflow. Ideally, a reproducible research workflow should fit naturally into an individual's existing workflow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research workflow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this workflow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
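DataPackageR is an R package, so its API is not shown here; the Python sketch below only illustrates the general checksum-verification idea the abstract describes, letting collaborators confirm they are analyzing the same processed data objects. The manifest layout and file naming are hypothetical.

```python
# Illustrative checksum verification of processed data objects, showing the
# general idea behind the workflow described above (not DataPackageR's R API).
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 digest of one processed data file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(data_dir: Path, manifest: Path) -> None:
    """Record a digest for every packaged data object (here, CSV files)."""
    digests = {p.name: file_digest(p) for p in sorted(data_dir.glob("*.csv"))}
    manifest.write_text(json.dumps(digests, indent=2))


def verify_manifest(data_dir: Path, manifest: Path) -> list[str]:
    """Return names of data objects whose contents no longer match the manifest."""
    recorded = json.loads(manifest.read_text())
    return [name for name, digest in recorded.items()
            if file_digest(data_dir / name) != digest]
```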


2018 ◽  
Author(s):  
Greg Finak ◽  
Bryan T. Mayer ◽  
William Fulp ◽  
Paul Obrecht ◽  
Alicia Sato ◽  
...  

Abstract. A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software packages have been released that facilitate such workflows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research workflows for large 'omics' or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst's usual workflow. Ideally, a reproducible research workflow should fit naturally into an individual's existing workflow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R's package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research workflow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this workflow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Ohgew Kweon ◽  
Seong-Jae Kim ◽  
Jae Hyun Kim ◽  
Seong Won Nho ◽  
Dongryeoul Bae ◽  
...  

2019 ◽  
Author(s):  
Anders Brahme ◽  
Maj Hultén ◽  
Carin Bengtsson ◽  
Andreas Hultgren ◽  
Anders Zetterberg

Abstract. Eight different data sets covering the whole human genome are compared with regard to their genomic distribution. A close correlation was found between cytologically detected chiasma and MLH1 immunofluorescence sites and the recombination density distribution from the HapMap project. Sites with a high probability of chromatid breakage after exposure to low and high ionization density radiation are often located inside common and rare Fragile Sites (FSs), indicating that the common Radiation-Induced Breakpoint sites (RIBs) may be a new kind of more local fragility. Furthermore, oncogenes and other cancer-related genes are commonly located in regions with an increased probability of rearrangements during genomic recombination, or in regions with a high probability of copy number changes, possibly because these processes may be involved in oncogene activation and cancer induction. An increased CpG density is linked to regions of high gene density to secure high-fidelity reproduction and survival. To minimize cancer induction, these genes are often located in regions of decreased recombination density and/or higher-than-average CpG density. Interestingly, copy number changes occur predominantly at common RIBs and/or FSs, at least for breast cancers with poor prognosis, and they decrease weakly but significantly in regions with increasing recombination density and CpG density. It is compelling that all these data sets are influenced by the cell's handling of double-strand breaks and, more generally, of DNA damage to its genome. In fact, DNA repair genes systematically avoid regions with a high recombination density. This may be a consequence of natural selection, as they need to be intact to accurately handle repairable DNA lesions.

