Semi-automatic exploratory data analytics for actionable discoveries through subgroup mining

2017
Author(s): Danlu Liu

People are born with the curiosity to see differences between groups. These differences are useful for understanding the root causes of certain discrepancies, such as those between populations or among diseases. However, without prior knowledge of the data, it is extremely challenging to identify which groups differ most, let alone to discover which associations contribute to the differences. The challenges stem mainly from the large search space over complex data structures, as well as the lack of efficient quantitative measurements closely related to the meaning of the differences. To tackle these issues, we developed a novel exploratory data mining method to identify ranked subgroups that are highly contrasted for further in-depth analyses. The underpinning components of this method include (1) a semi-greedy forward floating selection algorithm to reduce the search space, (2) a deep-exploring approach that uses in-memory computing techniques to aggregate a collection of sizable and credible candidate feature sets for subgroup identification, (3) a G-index contrast measurement to guide the exploratory process and to evaluate the patterns of subgroup pairs, and (4) a ranking method to surface the most highly contrasted subgroups among the mined results. Computational experiments were conducted on both synthesized and real data. The algorithm performed adequately in recognizing known subgroups and in discovering new and unexpected subgroups. This exploratory data analysis method provides a new paradigm for selecting data-driven hypotheses that can produce actionable outcomes tailored to subpopulations of individuals, such as consumers in e-commerce and patients in clinical trials.
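The paper's semi-greedy forward floating selection and G-index measure are not specified in this abstract; as a rough illustration of contrast-guided forward feature selection, here is a minimal Python sketch in which a toy mean-difference score stands in for the G-index (all names and data are hypothetical):

```python
def contrast(a, b):
    # Toy contrast score: absolute difference of group means.
    # This is only a stand-in for the paper's G-index measurement.
    return abs(sum(a) / len(a) - sum(b) / len(b))

def forward_select(features, group_a, group_b, k):
    """Greedily pick up to k features, each time adding the feature whose
    values differ most between the two groups; stop when no remaining
    feature adds any contrast."""
    chosen, remaining = [], list(features)
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda f: contrast(group_a[f], group_b[f]))
        if contrast(group_a[best], group_b[best]) <= 0:
            break  # no remaining feature separates the groups
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: feature name -> per-group observations
ga = {"age": [30, 32, 31], "dose": [1, 1, 2]}
gb = {"age": [60, 62, 61], "dose": [1, 2, 1]}
print(forward_select(ga.keys(), ga, gb, 1))  # → ['age']
```

The floating variant described in the paper would additionally try removing previously chosen features after each addition; this sketch keeps only the forward step.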

2021
Author(s): Panagiotis Bouros, Nikos Mamoulis, Dimitrios Tsitsigkos, Manolis Terrovitis

The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk and so their focus is to minimize I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed, which greatly outperforms previous work. However, this approach relies on a complex data structure, and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with or even faster than the state-of-the-art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins that can be scheduled across the available CPU threads, and an adaptive domain partitioning aimed at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
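The article's four optimizations are not reproduced here, but the core forward-scan plane-sweep idea can be sketched: walk both start-sorted inputs, and whenever one interval begins first, scan forward in the other list to report every interval that starts before it ends. A minimal single-threaded sketch (function and variable names are mine):

```python
def fs_interval_join(R, S):
    """Forward-scan plane sweep: report all pairs (r, s) of overlapping
    closed intervals. Both inputs must be sorted by start point."""
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            # R[i] starts first: pair it with every S interval that
            # begins before R[i] ends, then advance the sweep on R.
            k = j
            while k < len(S) and S[k][0] <= R[i][1]:
                out.append((R[i], S[k]))
                k += 1
            i += 1
        else:
            # Symmetric case: S[j] starts first.
            k = i
            while k < len(R) and R[k][0] <= S[j][1]:
                out.append((R[k], S[j]))
                k += 1
            j += 1
    return out

R = [(1, 5), (6, 9)]
S = [(3, 7), (8, 8)]
print(fs_interval_join(R, S))
# → [((1, 5), (3, 7)), ((6, 9), (3, 7)), ((6, 9), (8, 8))]
```

Each overlapping pair is reported exactly once, by whichever interval starts first; that duplicate-free property is also what a domain-partitioned parallel join needs to preserve.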


2008
pp. 2566-2582
Author(s): Jeff Zeanah

This chapter discusses impediments to exploratory data mining success. These impediments were identified from anecdotal observations across multiple projects either reviewed or undertaken by the author, and are classified into four main areas: data quality; lack of secondary or supporting data; insufficient analysis manpower; and lack of openness to new results. Each is explained, and recommendations are made to prevent the impediment from interfering with an organization's data mining efforts. The intent of the chapter is to provide an organization with a structure to anticipate and prevent these problems.


2014
Vol 3 (1)
pp. 1-9
Author(s): Sandra Elizabeth González Císaro, Héctor Oscar Nigro

Standard data mining techniques no longer adequately represent the complexity of the world, so a new paradigm is necessary. Symbolic Data Analysis (SDA), developed by Diday (2003), is a new type of data analysis that represents the complexity of reality while maintaining its internal variation and structure. This paradigm is based on the concept of the symbolic object, a mathematical model of a concept. In this article the authors present the fundamentals of the symbolic data analysis paradigm and the symbolic object concept. Theoretical aspects and examples help the reader understand the SDA paradigm as a tool for mining complex data.
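Diday's formal definition of a symbolic object is richer than can be shown here, but a toy interval-valued version conveys the idea: a description over symbolic (interval-valued) variables, plus a membership test that determines which individuals fall in the object's extent. All names and data below are illustrative, not from the article:

```python
class SymbolicObject:
    """Toy symbolic object: a description given as interval-valued
    variables, with a membership test and an extent over individuals."""

    def __init__(self, description):
        # description: variable name -> (low, high) interval
        self.description = description

    def holds(self, individual):
        # An individual satisfies the object if every described
        # variable falls inside its interval (missing values fail).
        return all(lo <= individual.get(var, float("nan")) <= hi
                   for var, (lo, hi) in self.description.items())

    def extent(self, individuals):
        # The extent is the set of individuals satisfying the description.
        return [x for x in individuals if self.holds(x)]

young_tall = SymbolicObject({"age": (18, 30), "height": (175, 200)})
people = [{"age": 25, "height": 180}, {"age": 40, "height": 178}]
print(young_tall.extent(people))  # → [{'age': 25, 'height': 180}]
```

Interval values are only one kind of symbolic variable; SDA also covers multi-valued and modal (probability-weighted) descriptions, which this sketch omits.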


1987
Vol 26 (02)
pp. 77-88
Author(s): K. Abt

Confirmatory Data Analysis (CDA) in randomized comparative (“controlled”) studies with many variables and/or time points of interest finds its limitations in the multiplicity of desired inferential statements, which leads to unfeasibly small adjusted significance levels (“Bonferronization”) and, thereby, to unduly increased risks of not rejecting false hypotheses. In general, analytical models adequate for such complex data structures and suitable for practical use do not yet exist. Exploratory Data Analysis (EDA), on the other hand, is usually intended to generate hypotheses rather than to lead to final conclusions based on the results of the study. In this paper, it is proposed to fill the conceptual gap between CDA and EDA with “Descriptive Data Analysis” (DDA), a concept mainly based on descriptive inferential statements. The results of a DDA in a controlled study are interpreted simultaneously on the basis of the investigator’s experience with respect to numerically relevant treatment effect differences and on “descriptive significances” as they appear in “near regular” patterns corresponding to the resulting relevant effect differences. A DDA may also contain confirmatory parts and/or tests of global hypotheses at a prechosen maximum risk α of erroneously rejecting true hypotheses. The paper is partly expository and is addressed to investigators as well as statisticians.
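The “Bonferronization” problem the summary refers to is easy to quantify: keeping the family-wise error rate at α across m inferential statements forces each single test down to level α/m, which quickly becomes impractically small. A short numerical illustration (the values of m are arbitrary):

```python
# Bonferroni adjustment: to keep the family-wise error rate at alpha
# across m inferential statements, each single test is run at alpha/m.

def bonferroni_alpha(alpha, m):
    return alpha / m

for m in (1, 10, 50, 200):
    print(f"{m:4d} tests -> per-test alpha = {bonferroni_alpha(0.05, m):.5f}")
```

At 200 statements the per-test level is 0.00025, illustrating why the risk of failing to reject false hypotheses grows with multiplicity.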


Genome
2020
Author(s): Sarah J MacEachern, Nils Daniel Forkert

Precision medicine is an emerging approach to clinical research and patient care that focuses on understanding and treating disease by integrating multimodal or ‘multi-omics’ data from an individual to make patient-tailored decisions. With the large and complex datasets generated using precision medicine diagnostic approaches, novel techniques to process and understand these complex data were needed. At the same time, computer science has progressed rapidly to develop techniques that enable the storage, processing, and analysis of these complex datasets, a feat that traditional statistics and early computing technologies could not accomplish. Machine learning, a branch of artificial intelligence, is a computer science methodology that aims to identify complex patterns in data that can be used to make predictions or classifications on new unseen data or for advanced exploratory data analysis. Machine learning analysis of precision medicine’s multimodal data allows for broad analysis of large datasets and ultimately a greater understanding of human health and disease. This review focuses on machine learning utilization for precision medicine’s “big data”, in the context of genetics, genomics, and beyond.


2005
Vol 98 (6)
pp. 2298-2303
Author(s): Michele R. Norton, Richard P. Sloan, Emilia Bagiella

Fourier-based approaches to analysis of variability of R-R intervals or blood pressure typically compute power in a given frequency band (e.g., 0.01–0.07 Hz) by aggregating the power at each constituent frequency within that band. This paper describes a new approach to the analysis of these data. We propose to partition the blood pressure variability spectrum into more narrow components by computing power in 0.01-Hz-wide bands. Therefore, instead of a single measure of variability in a specific frequency interval, we obtain several measurements. The approach generates a more complex data structure that requires a careful account of the nested repeated measures. We briefly describe a statistical methodology based on generalized estimating equations that suitably handles this more complex data structure. To illustrate the methods, we consider systolic blood pressure data collected during psychological and orthostatic challenge. We compare the results with those obtained using the conventional methods to compute blood pressure variability, and we show that our approach yields more efficient results and more powerful statistical tests. We conclude that this approach may allow a more thorough analysis of cardiovascular parameters that are measured under different experimental conditions, such as blood pressure or heart rate variability.
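The partitioning step can be sketched numerically: instead of summing spectral power over one wide band (e.g., 0.01–0.07 Hz), sum it within each 0.01-Hz-wide sub-band. A minimal NumPy sketch using a plain periodogram (the paper's estimation details and the generalized-estimating-equations modeling are not reproduced):

```python
import numpy as np

def narrow_band_powers(signal, fs, lo=0.01, hi=0.07, width=0.01):
    """Partition the periodogram of `signal` (sampled at `fs` Hz) into
    `width`-Hz-wide bands spanning [lo, hi] and sum power in each band,
    yielding several measurements instead of one wide-band value."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    edges = np.arange(lo, hi + width / 2, width)
    powers = []
    for a, b in zip(edges[:-1], edges[1:]):
        powers.append(float(psd[(freqs >= a) & (freqs < b)].sum()))
    return edges, powers

# Toy signal: a 0.035-Hz oscillation sampled at 1 Hz for 1000 s, whose
# power should land in the 0.03–0.04 Hz band.
x = np.sin(2 * np.pi * 0.035 * np.arange(1000))
edges, powers = narrow_band_powers(x, fs=1.0)
print(powers.index(max(powers)))  # → 2, i.e. the 0.03–0.04 Hz band
```

Each element of `powers` then enters the analysis as one of the nested repeated measures that the generalized-estimating-equations approach is designed to handle.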


1994
Vol 1 (2)
pp. 166-172
Author(s): Christine A. Browning, Dwayne E. Channell, Ruth A. Meyer

Why study statistics? We are bombarded every day with an overwhelming amount of information presented in various forms. If we are to interpret and understand that information, we must be familiar with the methods and tools of statistics. Developing an understanding and an appreciation of statistics should begin in the elementary school classroom. The National Council of Teachers of Mathematics' document Curriculum and Evaluation Standards for School Mathematics (NCTM 1989) states that the mathematics curricula for grades K-4 and 5-8 should include experiences with data analysis that involve students in collecting, organizing, describing, and interpreting data. Burrill (1990) suggests that such experiences should use real data whenever possible, progress from the concrete to the pictorial to the abstract, and use calculators and computers whenever appropriate.

