Data-Driven Approach for Metabolite Relationship Recovery in Biological ¹H NMR Data Sets Using Iterative Statistical Total Correlation Spectroscopy

Caroline J. Sands; Muireann Coen; Timothy M. D. Ebbels; Elaine Holmes; John C. Lindon; Jeremy K. Nicholson

doi:10.1021/ac102870u

Statistical Total Correlation Spectroscopy: An Exploratory Approach for Latent Biomarker Identification from Metabolic ¹H NMR Data Sets

Analytical Chemistry ◽

10.1021/ac048630x ◽

2005 ◽

Vol 77 (5) ◽

pp. 1282-1289 ◽

Cited By ~ 581

Author(s):

Olivier Cloarec ◽

Marc-Emmanuel Dumas ◽

Andrew Craig ◽

Richard H. Barton ◽

Johan Trygg ◽

...

Keyword(s):

Correlation Spectroscopy ◽

Data Sets ◽

Biomarker Identification ◽

Nmr Data ◽

Total Correlation ◽

Exploratory Approach


Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis

Algorithms ◽

10.3390/a14050154 ◽

2021 ◽

Vol 14 (5) ◽

pp. 154

Author(s):

Marcus Walldén ◽

Masao Okita ◽

Fumihiko Ino ◽

Dimitris Drikakis ◽

Ioannis Kokkinakis

Keyword(s):

Large Scale ◽

Data Driven ◽

Data Sets ◽

Output Constraints ◽

Data Driven Approach ◽

Scientific Simulations ◽

Multiple Metrics ◽

In Transit ◽

Multiple Compression ◽

Large Scale Simulations

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to apply multiple compression methods simultaneously to different data regions, which accelerates the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.
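
The idea of choosing a compression method per region from an importance score can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the importance metric (mean absolute gradient), the threshold, the block size, and the two compression paths (lossless `zlib` on float64 versus a lossy float16 cast followed by `zlib`) are hypothetical stand-ins, not the metrics or compressors used in the paper.

```python
import zlib
import numpy as np

def region_importance(block):
    """Hypothetical importance metric: mean absolute gradient of the block."""
    gy, gx = np.gradient(block.astype(np.float64))
    return float(np.mean(np.abs(gx)) + np.mean(np.abs(gy)))

def compress_region(block, important):
    """Keep full precision for important regions; cast the rest to float16 (lossy)."""
    data = block.astype(np.float64) if important else block.astype(np.float16)
    return zlib.compress(data.tobytes())

def compress_field(field, block_size, threshold):
    """Split a 2D field into square blocks and compress each according to its importance."""
    out = {}
    for i in range(0, field.shape[0], block_size):
        for j in range(0, field.shape[1], block_size):
            block = field[i:i + block_size, j:j + block_size]
            important = region_importance(block) > threshold
            out[(i, j)] = (important, compress_region(block, important))
    return out

# Toy field: flat on the left half, noisy on the right half.
rng = np.random.default_rng(0)
field = np.zeros((16, 16))
field[:, 8:] = rng.random((16, 8))
compressed = compress_field(field, block_size=8, threshold=0.01)
```

In this toy setup the flat blocks fall below the threshold and take the lossy path, while the noisy blocks are stored losslessly and can be recovered exactly with `zlib.decompress` and `np.frombuffer`.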


A robust data-driven approach identifies four personality types across four large data sets

Nature Human Behaviour ◽

10.1038/s41562-018-0419-z ◽

2018 ◽

Vol 2 (10) ◽

pp. 735-742 ◽

Cited By ~ 41

Author(s):

Martin Gerlach ◽

Beatrice Farb ◽

William Revelle ◽

Luís A. Nunes Amaral

Keyword(s):

Large Data ◽

Personality Types ◽

Large Data Sets ◽

Data Driven ◽

Data Sets ◽

Data Driven Approach


Outlying Sequence Detection in Large Data Sets: A data-driven approach

IEEE Signal Processing Magazine ◽

10.1109/msp.2014.2329428 ◽

2014 ◽

Vol 31 (5) ◽

pp. 44-56 ◽

Cited By ~ 26

Author(s):

Ali Tajer ◽

Venugopal V. Veeravalli ◽

H. Vincent Poor

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Driven ◽

Data Sets ◽

Sequence Detection ◽

Data Driven Approach


A Method for Identifying Environmental Stimuli and Genes Responsible for Genotype-by-Environment Interactions From a Large-Scale Multi-Environment Data Set

Frontiers in Genetics ◽

10.3389/fgene.2021.803636 ◽

2021 ◽

Vol 12 ◽

Author(s):

Akio Onogi ◽

Daisuke Sekine ◽

Akito Kaga ◽

Satoshi Nakano ◽

Tetsuya Yamada ◽

...

Keyword(s):

Large Scale ◽

Genetic Correlations ◽

Data Driven ◽

Data Sets ◽

Data Set ◽

Environmental Stimuli ◽

Genotype By Environment ◽

Genome Wide ◽

Sowing Dates ◽

Data Driven Approach

It is not yet fully understood which environmental stimuli cause genotype-by-environment (G × E) interactions in real fields, when they occur, and which genes react to them. Large-scale multi-environment data sets are attractive data sources for these purposes because they potentially span a wide range of environmental conditions. Here we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for the G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set that consisted of 25,158 records collected at 52 environments. ECGC illustrated what meteorological factors shaped the G × E interactions in six traits including yield, flowering time, and protein content and when these factors were involved in the interactions. For example, it illustrated the relevance of precipitation around sowing dates and hours of sunshine just before maturity to the interactions observed for yield. Moreover, genome-wide association mapping on the sensitivities to the identified stimuli discovered candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to bring novel insights on the G × E interactions observed in fields.


A Data-Driven Approach to Product Usage Context Identification From Online Customer Reviews

Journal of Mechanical Design ◽

10.1115/1.4044523 ◽

2019 ◽

Vol 141 (12) ◽

Author(s):

Dedy Suryadi ◽

Harrison M. Kim

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

Consumer Satisfaction ◽

Data Driven ◽

Data Sets ◽

Product Usage ◽

Customer Reviews ◽

Online Customer Reviews ◽

Data Driven Approach ◽

Usage Context

This paper proposes a data-driven methodology to automatically identify product usage contexts from online customer reviews. Product usage context is one of the factors that affect product design, consumer behavior, and consumer satisfaction. Previous works identify usage contexts using survey-based methods or determine them subjectively. The proposed methodology, on the other hand, uses machine learning and Natural Language Processing tools to identify and cluster usage contexts from a large volume of customer reviews. Furthermore, aspect sentiment analysis is applied to capture the sentiment toward a particular usage context in a sentence. The methodology is applied to two product data sets, i.e., laptops and tablets. The results show that the methodology is able to capture relevant product usage contexts and cluster bigrams that refer to similar usage contexts. The aspect sentiment analysis enables the observation of a product’s position with respect to its competitors for a particular usage context. For a product designer, the observation may indicate a requirement to improve the product. It may also indicate a possible market opportunity in a usage context in which most of the current products are perceived negatively by customers. Finally, it is shown that overall rating might not be a strong indicator of customer sentiment toward a particular usage context, because the linear correlation between the two is only moderate for most usage contexts in the case study.


Cognitive Profiles in Parkinson’s Disease and Their Relation to Dementia: A Data-Driven Approach

International Journal of Alzheimer’s Disease ◽

10.1155/2012/910757 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11 ◽

Cited By ~ 4

Author(s):

Inga Liepelt-Scarfone ◽

Susanne Gräber ◽

Monika Fruhmann Berger ◽

Anne Feseker ◽

Gülsüm Baysal ◽

...

Keyword(s):

Parkinson’s Disease ◽

Cognitive Impairment ◽

Hierarchical Cluster ◽

Individual Performance ◽

Data Driven ◽

Data Sets ◽

Cognitive Profiles ◽

Cluster Analyses ◽

Data Driven Approach

Parkinson’s disease is characterized by a substantial cognitive heterogeneity, which is apparent in different profiles and levels of severity. To date, a distinct clinical profile for patients with a potential risk of developing dementia still has to be identified. We introduce a data-driven approach to detect different cognitive profiles and stages. Comprehensive neuropsychological data sets from a cohort of 121 Parkinson’s disease patients with and without dementia were explored by a factor analysis to characterize different cognitive domains. Based on the factor scores that represent individual performance in each domain, hierarchical cluster analyses determined whether subgroups of Parkinson’s disease patients show varying cognitive profiles. A six-factor solution accounting for 65.2% of total variance fitted best to our data and revealed high internal consistencies (Cronbach’s alpha coefficients > 0.6). The cluster analyses suggested two independent patient clusters with different cognitive profiles. They differed only in severity of cognitive impairment and self-reported limitation of activities of daily living function but not in motor performance, disease duration, or dopaminergic medication. Based on a data-driven approach, diverse cognitive profiles were identified, which separated early and more advanced stages of cognitive impairment in Parkinson’s disease without dementia. Importantly, these profiles were independent of motor progression.
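
For readers unfamiliar with the internal-consistency measure mentioned in this abstract, the following minimal Python sketch computes Cronbach’s alpha from a subjects-by-items score matrix using the standard formula, alpha = k/(k-1) × (1 − Σ item variances / variance of the summed score). The function name and the toy data are illustrative assumptions; the abstract does not describe how the study computed its coefficients.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (subjects x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_var = scores.var(axis=0, ddof=1)        # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # sample variance of the summed score
    return k / (k - 1) * (1 - item_var.sum() / total_var)

# Toy data (hypothetical): three items that agree perfectly across three subjects.
toy_scores = [[1, 1, 1],
              [2, 2, 2],
              [3, 3, 3]]
alpha = cronbach_alpha(toy_scores)
```

With perfectly consistent items, as in the toy matrix, alpha reaches its maximum of 1; less consistent items yield lower values.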


A method for identifying environmental stimuli and genes responsible for genotype-by-environment interactions from a large-scale multi-environment data set

10.1101/2021.10.25.465681 ◽

2021 ◽

Author(s):

Akio Onogi ◽

Daisuke Sekine ◽

Akito Kaga ◽

Satoshi Nakano ◽

Tetsuya Yamada ◽

...

Keyword(s):

Large Scale ◽

Genetic Correlations ◽

Data Driven ◽

Data Sets ◽

Data Set ◽

Environmental Stimuli ◽

Genotype By Environment ◽

Genome Wide ◽

Sowing Dates ◽

Data Driven Approach

It is not yet fully understood which environmental stimuli cause genotype-by-environment (G × E) interactions in real fields, when they occur, and which genes react to them. Large-scale multi-environment data sets are attractive data sources for these purposes because they potentially span a wide range of environmental conditions. Here we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for the G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set that consisted of 25,158 records collected at 52 environments. ECGC illustrated what meteorological factors shaped the G × E interactions in six traits including yield, flowering time, and protein content and when they were involved. For example, it illustrated the relevance of precipitation around sowing dates and hours of sunshine just before maturity to the interactions observed for yield. Moreover, genome-wide association mapping on the sensitivities to the identified stimuli discovered candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to bring novel insights on the G × E interactions observed in fields.


A Data-Driven Approach to Evaluate Fracturing Practice in Tight Sandstone in Changqing Field

10.2523/iptc-21821-ms ◽

2021 ◽

Author(s):

Tao Wu ◽

Hanzhi Fang ◽

Hu Sun ◽

Feifei Zhang ◽

Xi Wang ◽

...

Keyword(s):

Oil Production ◽

Gas Production ◽

Oil Field ◽

Data Driven ◽

Data Sets ◽

Effective Parameters ◽

Tight Sandstone ◽

Well Productivity ◽

Input Parameters ◽

Data Driven Approach

Unconventional reservoirs with ultra-low permeability, such as shale and tight sandstones, are becoming increasingly significant in global energy structures (Pejman T, et al., 2017). For these reservoirs, successful hydraulic fracturing is the key to extracting the hydrocarbon resources efficiently and economically. However, the intrinsic mechanisms of fracture growth in tight formations are still unclear. In practice, fracturing design mainly depends on hypothetical models and previous experience, which leads to difficulties in evaluating the performance of the fracturing jobs. Therefore, an improved method to optimize parameters for fracturing is necessary and beneficial to the industry. In this paper, a data-driven approach is used to evaluate the factors that dominate the production rate from the tight sandstone formation in Changqing Field, which is the largest oil field in China. In the model, the input parameters are classified into two categories: controllable parameters (e.g. stage numbers, fracturing fluid volume) and uncontrollable parameters (e.g. formation properties), and the output parameter is the accumulated oil production of the wells. Data for more than 100 wells from different formations and zones in Changqing Field are collected for this study. First, a stepwise data mining method is used to identify the correlations between the target parameter and all the available input parameters. Then, a machine learning model is developed to accurately predict the well productivity for a given set of input parameters. The model is validated by using separate data sets from the same field. An optimization algorithm is combined with the data-driven model to maximize the cumulative oil production for wells by tuning the controllable parameters, which provides the optimized fracturing design. By using the developed model, low-productivity wells are identified and new fracturing designs are recommended to improve the well productivity.
This paper is useful for understanding the effects of designed fracturing parameters on well productivity in Changqing Oilfield. Furthermore, it can be extended to other unconventional oil fields by training the model with the corresponding data sets. The method helps operators select more effective parameters for fracturing design, and therefore reduce the operation costs for fracturing and improve oil and gas production.


Analytic Properties of Statistical Total Correlation Spectroscopy Based Information Recovery in ¹H NMR Metabolic Data Sets

Analytical Chemistry ◽

10.1021/ac801982h ◽

2009 ◽

Vol 81 (6) ◽

pp. 2075-2084 ◽

Cited By ~ 44

Author(s):

Alexessander Couto Alves ◽

Mattias Rantalainen ◽

Elaine Holmes ◽

Jeremy K. Nicholson ◽

Timothy M. D. Ebbels

Keyword(s):

Correlation Spectroscopy ◽

Data Sets ◽

Information Recovery ◽

Total Correlation ◽

Metabolic Data
