Semi-automatic exploratory data analytics for actionable discoveries through subgroup mining

2017
Author(s): Danlu Liu

People are born with the curiosity to see differences between groups. These differences are useful for understanding the root causes of certain discrepancies, such as those between populations or among diseases. However, without prior knowledge of the data, it is extremely challenging to identify which groups differ most, let alone to discover which associations contribute to the differences. The challenges stem mainly from the large search space over complex data structures, as well as the lack of efficient quantitative measurements closely related to the meaning of the differences. To tackle these issues, we developed a novel exploratory data mining method to identify ranked subgroups that are highly contrasted for further in-depth analyses. The underpinning components of this method include (1) a semi-greedy forward floating selection algorithm to reduce the search space, (2) a deep-exploring approach that uses in-memory computing techniques to aggregate a collection of sizable and credible candidate feature sets for subgroup identification, (3) a G-index contrast measurement to guide the exploratory process and to evaluate the patterns of subgroup pairs, and (4) a ranking method to surface the most highly contrasted subgroups among the mined results. Computational experiments were conducted on both synthesized and real data. The algorithm performed adequately in recognizing known subgroups and in discovering new and unexpected subgroups. This exploratory data analysis method provides a new paradigm for selecting data-driven hypotheses that can produce actionable outcomes tailored to subpopulations of individuals, such as consumers in e-commerce and patients in clinical trials.
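The paper's semi-greedy forward floating selection and G-index measure are not specified in this abstract; as a rough illustration of contrast-guided forward feature selection, here is a minimal Python sketch in which a toy mean-difference score stands in for the G-index (all names and data are hypothetical):

```python
def contrast(a, b):
    # Toy contrast score: absolute difference of group means.
    # This is only a stand-in for the paper's G-index measurement.
    return abs(sum(a) / len(a) - sum(b) / len(b))

def forward_select(features, group_a, group_b, k):
    """Greedily pick up to k features, each time adding the feature whose
    values differ most between the two groups; stop when no remaining
    feature adds any contrast."""
    chosen, remaining = [], list(features)
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda f: contrast(group_a[f], group_b[f]))
        if contrast(group_a[best], group_b[best]) <= 0:
            break  # no remaining feature separates the groups
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: feature name -> per-group observations
ga = {"age": [30, 32, 31], "dose": [1, 1, 2]}
gb = {"age": [60, 62, 61], "dose": [1, 2, 1]}
print(forward_select(ga.keys(), ga, gb, 1))  # → ['age']
```

The floating variant described in the paper would additionally try removing previously chosen features after each addition; this sketch keeps only the forward step.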

2021
Author(s): Panagiotis Bouros, Nikos Mamoulis, Dimitrios Tsitsigkos, Manolis Terrovitis

The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk and so their focus is to minimize I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed, which greatly outperforms previous work. However, this approach relies on a complex data structure, and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with or even faster than the state-of-the-art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins that can be scheduled across the available CPU threads, and an adaptive domain partitioning aimed at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
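The article's four optimizations are not reproduced here, but the core forward-scan plane-sweep idea can be sketched: walk both start-sorted inputs, and whenever one interval begins first, scan forward in the other list to report every interval that starts before it ends. A minimal single-threaded sketch (function and variable names are mine):

```python
def fs_interval_join(R, S):
    """Forward-scan plane sweep: report all pairs (r, s) of overlapping
    closed intervals. Both inputs must be sorted by start point."""
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            # R[i] starts first: pair it with every S interval that
            # begins before R[i] ends, then advance the sweep on R.
            k = j
            while k < len(S) and S[k][0] <= R[i][1]:
                out.append((R[i], S[k]))
                k += 1
            i += 1
        else:
            # Symmetric case: S[j] starts first.
            k = i
            while k < len(R) and R[k][0] <= S[j][1]:
                out.append((R[k], S[j]))
                k += 1
            j += 1
    return out

R = [(1, 5), (6, 9)]
S = [(3, 7), (8, 8)]
print(fs_interval_join(R, S))
# → [((1, 5), (3, 7)), ((6, 9), (3, 7)), ((6, 9), (8, 8))]
```

Each overlapping pair is reported exactly once, by whichever interval starts first; that duplicate-free property is also what a domain-partitioned parallel join needs to preserve.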


2008
pp. 2566-2582
Author(s): Jeff Zeanah

This chapter discusses impediments to exploratory data mining success. These impediments were identified from anecdotal observations across multiple projects either reviewed or undertaken by the author, and are classified into four main areas: data quality; lack of secondary or supporting data; insufficient analysis manpower; and lack of openness to new results. Each is explained, and recommendations are made to prevent the impediment from interfering with an organization's data mining efforts. The intent of the chapter is to provide an organization with a structure to anticipate and prevent these problems.


2014
Vol 3 (1)
pp. 1-9
Author(s): Sandra Elizabeth González Císaro, Héctor Oscar Nigro

Standard data mining techniques no longer adequately represent the complexity of the world, so a new paradigm is necessary. Symbolic Data Analysis (SDA), developed by Diday (2003), is a new type of data analysis that represents the complexity of reality while maintaining its internal variation and structure. This paradigm is based on the concept of the symbolic object, a mathematical model of a concept. In this article the authors present the fundamentals of the symbolic data analysis paradigm and the symbolic object concept. Theoretical aspects and examples help the reader understand the SDA paradigm as a tool for mining complex data.
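Diday's formal definition of a symbolic object is richer than can be shown here, but a toy interval-valued version conveys the idea: a description over symbolic (interval-valued) variables, plus a membership test that determines which individuals fall in the object's extent. All names and data below are illustrative, not from the article:

```python
class SymbolicObject:
    """Toy symbolic object: a description given as interval-valued
    variables, with a membership test and an extent over individuals."""

    def __init__(self, description):
        # description: variable name -> (low, high) interval
        self.description = description

    def holds(self, individual):
        # An individual satisfies the object if every described
        # variable falls inside its interval (missing values fail).
        return all(lo <= individual.get(var, float("nan")) <= hi
                   for var, (lo, hi) in self.description.items())

    def extent(self, individuals):
        # The extent is the set of individuals satisfying the description.
        return [x for x in individuals if self.holds(x)]

young_tall = SymbolicObject({"age": (18, 30), "height": (175, 200)})
people = [{"age": 25, "height": 180}, {"age": 40, "height": 178}]
print(young_tall.extent(people))  # → [{'age': 25, 'height': 180}]
```

Interval values are only one kind of symbolic variable; SDA also covers multi-valued and modal (probability-weighted) descriptions, which this sketch omits.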


1987
Vol 26 (02)
pp. 77-88
Author(s): K. Abt

Confirmatory Data Analysis (CDA) in randomized comparative (“controlled”) studies with many variables and/or time points of interest finds its limitations in the multiplicity of desired inferential statements, which leads to unfeasibly small adjusted significance levels (“Bonferronization”) and, thereby, to unduly increased risks of not rejecting false hypotheses. In general, analytical models adequate for such complex data structures and suitable for practical use do not yet exist. Exploratory Data Analysis (EDA), on the other hand, is usually intended to generate hypotheses rather than to lead to final conclusions based on the results of the study. In this paper, it is proposed to fill the conceptual gap between CDA and EDA with “Descriptive Data Analysis” (DDA), a concept mainly based on descriptive inferential statements. The results of a DDA in a controlled study are interpreted simultaneously on the basis of the investigator’s experience with respect to numerically relevant treatment effect differences and on “descriptive significances” as they appear in “near regular” patterns corresponding to the resulting relevant effect differences. A DDA may also contain confirmatory parts and/or tests of global hypotheses at a prechosen maximum risk α of erroneously rejecting true hypotheses. The paper is partly expository and is addressed to investigators as well as statisticians.
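The “Bonferronization” problem the summary refers to is easy to quantify: keeping the family-wise error rate at α across m inferential statements forces each single test down to level α/m, which quickly becomes impractically small. A short numerical illustration (the values of m are arbitrary):

```python
# Bonferroni adjustment: to keep the family-wise error rate at alpha
# across m inferential statements, each single test is run at alpha/m.

def bonferroni_alpha(alpha, m):
    return alpha / m

for m in (1, 10, 50, 200):
    print(f"{m:4d} tests -> per-test alpha = {bonferroni_alpha(0.05, m):.5f}")
```

At 200 statements the per-test level is 0.00025, illustrating why the risk of failing to reject false hypotheses grows with multiplicity.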


Genome
2020
Author(s): Sarah J MacEachern, Nils Daniel Forkert

Precision medicine is an emerging approach to clinical research and patient care that focuses on understanding and treating disease by integrating multimodal or ‘multi-omics’ data from an individual to make patient-tailored decisions. With the large and complex datasets generated using precision medicine diagnostic approaches, novel techniques to process and understand these complex data were needed. At the same time, computer science has progressed rapidly to develop techniques that enable the storage, processing, and analysis of these complex datasets, a feat that traditional statistics and early computing technologies could not accomplish. Machine learning, a branch of artificial intelligence, is a computer science methodology that aims to identify complex patterns in data that can be used to make predictions or classifications on new unseen data or for advanced exploratory data analysis. Machine learning analysis of precision medicine’s multimodal data allows for broad analysis of large datasets and ultimately a greater understanding of human health and disease. This review focuses on machine learning utilization for precision medicine’s “big data”, in the context of genetics, genomics, and beyond.


2005
Vol 98 (6)
pp. 2298-2303
Author(s): Michele R. Norton, Richard P. Sloan, Emilia Bagiella

Fourier-based approaches to analysis of variability of R-R intervals or blood pressure typically compute power in a given frequency band (e.g., 0.01–0.07 Hz) by aggregating the power at each constituent frequency within that band. This paper describes a new approach to the analysis of these data. We propose to partition the blood pressure variability spectrum into more narrow components by computing power in 0.01-Hz-wide bands. Therefore, instead of a single measure of variability in a specific frequency interval, we obtain several measurements. The approach generates a more complex data structure that requires a careful account of the nested repeated measures. We briefly describe a statistical methodology based on generalized estimating equations that suitably handles this more complex data structure. To illustrate the methods, we consider systolic blood pressure data collected during psychological and orthostatic challenge. We compare the results with those obtained using the conventional methods to compute blood pressure variability, and we show that our approach yields more efficient results and more powerful statistical tests. We conclude that this approach may allow a more thorough analysis of cardiovascular parameters that are measured under different experimental conditions, such as blood pressure or heart rate variability.
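The partitioning step can be sketched numerically: instead of summing spectral power over one wide band (e.g., 0.01–0.07 Hz), sum it within each 0.01-Hz-wide sub-band. A minimal NumPy sketch using a plain periodogram (the paper's estimation details and the generalized-estimating-equations modeling are not reproduced):

```python
import numpy as np

def narrow_band_powers(signal, fs, lo=0.01, hi=0.07, width=0.01):
    """Partition the periodogram of `signal` (sampled at `fs` Hz) into
    `width`-Hz-wide bands spanning [lo, hi] and sum power in each band,
    yielding several measurements instead of one wide-band value."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    edges = np.arange(lo, hi + width / 2, width)
    powers = []
    for a, b in zip(edges[:-1], edges[1:]):
        powers.append(float(psd[(freqs >= a) & (freqs < b)].sum()))
    return edges, powers

# Toy signal: a 0.035-Hz oscillation sampled at 1 Hz for 1000 s, whose
# power should land in the 0.03–0.04 Hz band.
x = np.sin(2 * np.pi * 0.035 * np.arange(1000))
edges, powers = narrow_band_powers(x, fs=1.0)
print(powers.index(max(powers)))  # → 2, i.e. the 0.03–0.04 Hz band
```

Each element of `powers` then enters the analysis as one of the nested repeated measures that the generalized-estimating-equations approach is designed to handle.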


1994
Vol 1 (2)
pp. 166-172
Author(s): Christine A. Browning, Dwayne E. Channell, Ruth A. Meyer

Why study statistics? We are bombarded every day with an overwhelming amount of information presented in various forms. If we are to interpret and understand that information, we must be familiar with the methods and tools of statistics. Developing an understanding and an appreciation of statistics should begin in the elementary school classroom. The National Council of Teachers of Mathematics' document Curriculum and Evaluation Standards for School Mathematics (NCTM 1989) states that the mathematics curricula for grades K-4 and 5-8 should include experiences with data analysis that involve students in collecting, organizing, describing, and interpreting data. Burrill (1990) suggests that such experiences should use real data whenever possible, progress from the concrete to the pictorial to the abstract, and use calculators and computers whenever appropriate.

