Reading Profiles in Multi-site Data with Missingness

Mapping Intimacies ◽

10.1101/269555 ◽

2018 ◽

Author(s):

Mark A. Eckert ◽

Kenneth I. Vaden ◽

Mulugeta Gebregziabher ◽

Keyword(s):

Missing Data ◽

Reading Disability ◽

Phonological Processing ◽

Oral Language ◽

Cognitive Abilities ◽

Missing Values ◽

Independent Set ◽

Classification Error ◽

Reading Profiles

Abstract: Children with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ~5% classification error for 30% missingness across the sample). The application of missForest to a real multi-site dataset (n = 924) showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data) exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.
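The iterative idea behind missForest-style imputation can be illustrated with a small sketch. It is not the authors' pipeline: a least-squares line stands in for the random forest, the data have only two columns, and the names `fit_line` and `iterative_impute` are hypothetical. The loop starts from column means and then repeatedly re-predicts each column's missing cells from the other column.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # constant predictor: fall back to the mean of ys
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def iterative_impute(rows, n_iter=10):
    """rows: list of [a, b] pairs, with None marking a missing value.

    Assumes each column has at least one observed value.
    """
    missing = [[v is None for v in row] for row in rows]
    filled = [list(row) for row in rows]  # copy, the input is left untouched
    # Step 1: initial guess is the column mean of the observed values.
    for j in range(2):
        col_mean = mean(r[j] for r in rows if r[j] is not None)
        for i, row in enumerate(filled):
            if missing[i][j]:
                row[j] = col_mean
    # Step 2: re-predict each column's missing cells from the other column.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j
            train = [r for i, r in enumerate(filled) if not missing[i][j]]
            slope, intercept = fit_line([r[k] for r in train],
                                        [r[j] for r in train])
            for i, row in enumerate(filled):
                if missing[i][j]:
                    row[j] = intercept + slope * row[k]
    return filled

rows = [[1.0, 3.0], [2.0, None], [3.0, 7.0],
        [None, 9.0], [5.0, 11.0], [6.0, None]]
for row in iterative_impute(rows):
    print(row)
```

Observed cells are never overwritten; only the cells that were `None` change between iterations. A classifier would then be trained on the completed rows.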


The Effects of Missing Data Characteristics on the Choice of Imputation Techniques

Vietnam Journal of Computer Science ◽

10.1142/s2196888820500098 ◽

2020 ◽

Vol 07 (02) ◽

pp. 161-177

Author(s):

Oyekale Abel Alade ◽

Ali Selamat ◽

Roselina Sallehuddin

Keyword(s):

Missing Data ◽

Missing Values ◽

Health Management ◽

Support Vector ◽

Multiple Imputations ◽

Original Dataset ◽

Learning Machine ◽

Elm Classifier ◽

The Right

One major characteristic of data is completeness. Missing data is a significant problem in medical datasets. It leads to incorrect classification of patients and is dangerous to the health management of patients. Many factors lead to missing values in medical datasets. In this paper, we argue that the causes of missing data in a medical dataset should be examined to ensure that the right imputation method is used. The mechanism of missingness was studied to identify the missing pattern of the dataset and to determine a suitable imputation technique for generating complete datasets. The pattern shows that the missingness of the dataset used in this study is not monotone. Also, single imputation techniques underestimate variance and ignore relationships among the variables; therefore, we used a multiple imputation technique that runs in five iterations for the imputation of each missing value. All missing values in the dataset were regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show improvement in the accuracy of the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset using different classifiers such as support vector machine (SVM), radial basis function (RBF), and ELMs.
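The claim that single imputation underestimates variance can be checked with a few lines. This is a minimal sketch with made-up numbers, using mean imputation as the single-imputation method; the helper `mean_impute` is a hypothetical name, not something from the paper.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace None with the mean of the observed values (single imputation)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

values = [4.0, None, 7.0, 9.0, None, 12.0, 3.0, None]
observed = [v for v in values if v is not None]
imputed = mean_impute(values)

print(variance(observed))  # sample variance of the observed values only
print(variance(imputed))   # smaller, because imputed cells sit exactly at the mean
```

Every imputed cell adds a zero deviation from the mean, so the sum of squares stays the same while the denominator grows. Multiple imputation addresses this by drawing several plausible values per missing cell instead of one fixed value.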


Missing data: the impact of what is not there

Acta Endocrinologica ◽

10.1530/eje-20-0732 ◽

2020 ◽

Vol 183 (4) ◽

pp. E7-E9

Author(s):

Rolf H H Groenwold ◽

Olaf M Dekkers

Keyword(s):

Missing Data ◽

Clinical Research ◽

Missing Values ◽

The Impact

The validity of clinical research is potentially threatened by missing data. Any variable measured in a study can have missing values, including the exposure, the outcome, and confounders. When missing values are ignored in the analysis, only those subjects with complete records will be included in the analysis. This may lead to biased results and loss of power. We explain why missing data may lead to bias and discuss a commonly used classification of missing data.
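A small simulation can make the bias and the loss of sample size concrete. The sketch below is an assumed scenario, not taken from the article: the outcome `y` is much more likely to be missing when an observed covariate `x` is high, and the analysis keeps complete records only. The function name `simulate` and all probabilities are illustrative.

```python
import random
from statistics import mean

def simulate(n=10_000, seed=0):
    rng = random.Random(seed)
    full, complete_cases = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed covariate
        y = x + rng.gauss(0, 1)    # outcome, true mean 0
        full.append(y)
        # The outcome is far more likely to be missing when x is high,
        # so missingness depends on an observed variable.
        p_missing = 0.8 if x > 0 else 0.1
        if rng.random() >= p_missing:
            complete_cases.append(y)
    return mean(full), mean(complete_cases), len(complete_cases)

full_mean, cc_mean, n_cc = simulate()
print("mean of y, all subjects:       ", round(full_mean, 3))
print("mean of y, complete cases only:", round(cc_mean, 3))
print("complete cases kept:           ", n_cc)
```

The complete-case mean is pulled well below the full-sample mean because subjects with high `x` (and therefore high `y`) are under-represented, and roughly half the sample is discarded.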


An Efficient and Effective Model to Handle Missing Data in Classification

BioMed Research International ◽

10.1155/2020/8810143 ◽

2020 ◽

Vol 2020 ◽

pp. 1-11

Author(s):

Kamran Mehrabani-Zeinabad ◽

Marziyeh Doostfatemeh ◽

Seyyed Mohammad Taghi Ayatollahi

Keyword(s):

Missing Data ◽

Incomplete Data ◽

Missing Values ◽

Computational Time ◽

Medical Sciences ◽

Imputation Methods ◽

Simulation Based ◽

Additive Regression ◽

Incomplete Datasets

Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on dataset conditions such as sample size, missing percentage, and missing mechanism. Therefore, the better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to solve its inefficiency in handling the missingness problem. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m can be considered a high-accuracy model for classifying incomplete datasets that avoids additional assumptions and preprocessing steps.
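The "Missingness Incorporated in Attributes" idea can be sketched for a single tree split. For one feature and one threshold, MIA considers three routings: missing values go left with the low values, missing values go right with the high values, or missing values are separated from all observed values. The sketch below scores the three options with Gini impurity, which is an assumption made for illustration; BART.m itself is a Bayesian tree ensemble and does not choose splits this way. The function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def mia_split(xs, ys, threshold):
    """Score the three MIA routings for one feature and one threshold.

    xs may contain None for missing values; ys are class labels.
    """
    routes = {
        "missing_left": ([], []),
        "missing_right": ([], []),
        "missing_vs_observed": ([], []),
    }
    for x, y in zip(xs, ys):
        if x is None:
            routes["missing_left"][0].append(y)
            routes["missing_right"][1].append(y)
            routes["missing_vs_observed"][0].append(y)
        else:
            side = 0 if x <= threshold else 1
            routes["missing_left"][side].append(y)
            routes["missing_right"][side].append(y)
            routes["missing_vs_observed"][1].append(y)
    scores = {name: weighted_gini(l, r) for name, (l, r) in routes.items()}
    best = min(scores, key=scores.get)  # lowest impurity wins
    return best, scores

xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0, 0, 1, 1, 1, 1]
best, scores = mia_split(xs, ys, threshold=5.0)
print(best, scores)
```

In this toy data the missing cases share the label of the high values, so routing them right gives a pure split. No value is imputed at any point; the missingness itself is used as information.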


Linguistic Deficits in Children With Reading Disabilities

American Journal of Speech-Language Pathology ◽

10.1044/1058-0360.0603.71 ◽

1997 ◽

Vol 6 (3) ◽

pp. 71-78 ◽

Cited By ~ 37

Author(s):

Linda J. Lombardino ◽

Cynthia A. Riccio ◽

George W. Hynd ◽

Shireen B. Pinheiro

Keyword(s):

Reading Disability ◽

Phonological Processing ◽

Oral Language ◽

Developmental Reading ◽

Primary Diagnosis ◽

Control Group ◽

Adhd Group ◽

Phonological Coding ◽

Core Deficit ◽

Contrast Group

Although recent research into the nature of linguistic abilities and disabilities in children with developmental reading disorders points to phonological processing difficulties as the core deficit in this population, broader-based linguistic deficits have been described in several studies. In this study, children with a primary diagnosis of specific reading disability (RD) were compared on measures of oral language, phonological coding, reading, and spelling with a clinical contrast group of children diagnosed with attention deficit hyperactivity disorder (ADHD) and with a control group of children developing normally. The results of this study revealed that the RD group showed relatively depressed scores on measures of oral language and phonemic processing when compared with children in the ADHD group. The pattern of language deficits observed in this study clearly contributes to the converging evidence that deficient linguistic processes as measured by both phonological coding tasks and formal tests of oral language characterize the language of children with severe reading disability.


Evidential classification of incomplete instance based on K-nearest centroid neighbor

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-210991 ◽

2021 ◽

pp. 1-16

Author(s):

Zong-fang Ma ◽

Zhe Liu ◽

Chan Luo ◽

Lin Song

Keyword(s):

Missing Values ◽

Evidence Theory ◽

Error Rates ◽

Classification Error ◽

Specific Class ◽

Fusion Method ◽

Challenging Problem ◽

Classification Result ◽

Multiple Classification

Classification of incomplete instances is a challenging problem because missing features generally cause uncertainty in the classification result. A new evidential classification method for incomplete instances, based on adaptive imputation, is proposed within the framework of evidence theory. Specifically, the missing values of different incomplete instances in the test set are adaptively estimated based on Shannon entropy and K-nearest centroid neighbors (KNCNs) technology. The single or multiple edited instances (with estimations) are then classified by the chosen classifier to get single or multiple classification results for the instances with different discounting (weighting) factors, and a new adaptive global fusion method is finally proposed to unify the different discounted results. The proposed method can capture the imprecision degree of classification by assigning instances that are difficult to classify into a specific class to an associated meta-class, and it effectively reduces the classification error rates. The effectiveness and robustness of the proposed method have been tested through four experiments with artificial and real datasets.


Missing Data - Better "Not to Have Them", but What If You Do? (Part 1)

Marketing ZFP ◽

10.15358/0344-1369-2019-4-21 ◽

2019 ◽

Vol 41 (4) ◽

pp. 21-32

Author(s):

Dirk Temme ◽

Sarah Jensen

Keyword(s):

Missing Data ◽

Statistical Power ◽

Missing Values ◽

Graphical Representation ◽

Marketing Research ◽

Likelihood Estimation ◽

Parameter Estimates ◽

Full Information Maximum Likelihood ◽

Definition Of ◽

Traditional Approaches

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation analysis or full-information maximum likelihood estimation. Due to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of the prerequisites and limitations of these methods as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1 and first introduces Rubin’s classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Secondly, a selection of visualization tools available in different R packages for the description and exploration of missing data structures is presented.


Dynamic model updating (DMU) approach for statistical learning model building with missing data

BMC Bioinformatics ◽

10.1186/s12859-021-04138-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Rahi Jain ◽

Wei Xu

Keyword(s):

Missing Data ◽

Dynamic Model ◽

Statistical Models ◽

Missing Values ◽

Model Building ◽

Model Updating ◽

Biological Data ◽

Bayesian Regression ◽

Biological Research ◽

Original Dataset

Abstract. Background: Developing statistical and machine learning methods on studies with missing information is a ubiquitous challenge in real-world biological research. The strategy in literature relies on either removing the samples with missing values like complete case analysis (CCA) or imputing the information in the samples with missing values like predictive mean matching (PMM) such as MICE. Some limitations of these strategies are information loss and closeness of the imputed values with the missing values. Further, in scenarios with piecemeal medical data, these strategies have to wait to complete the data collection process to provide a complete dataset for statistical models. Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy to develop statistical models with missing data. DMU uses only the information available in the dataset to prepare the statistical models. DMU segments the original dataset into small complete datasets. The study uses hierarchical clustering to segment the original dataset into small complete datasets followed by Bayesian regression on each of the small complete datasets. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated using both simulated data and real studies and shows results that are better than or on par with other approaches like CCA and PMM. Conclusion: The DMU approach provides an alternative to the existing approaches of information elimination and imputation in processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, the approach can be applied to longitudinal, categorical and time-to-event biological data.
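The updating step, in which the posterior from one complete segment becomes the prior for the next, can be sketched with a conjugate normal model. This is a simplified illustration, not the DMU implementation: it assumes a single slope with no intercept, a known noise variance, and segments that are already given rather than produced by hierarchical clustering. The function name `update` and the data are hypothetical.

```python
def update(prior_mean, prior_var, xs, ys, noise_var=1.0):
    """Conjugate update of a single slope beta in y = beta * x + noise."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    post_mean = (prior_mean / prior_var
                 + sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return post_mean, 1.0 / precision

# Three small complete segments of the same dataset (made-up values).
segments = [
    ([1.0, 2.0], [2.1, 3.9]),
    ([3.0], [6.2]),
    ([4.0, 5.0], [7.8, 10.1]),
]

beta_mean, beta_var = 0.0, 100.0   # vague prior on the slope
for xs, ys in segments:
    # The posterior from one segment is the prior for the next.
    beta_mean, beta_var = update(beta_mean, beta_var, xs, ys)
    print(round(beta_mean, 4), round(beta_var, 6))
```

With this conjugate model, processing the segments one at a time gives the same posterior as a single update on the pooled data, which is why estimates can be refreshed as new complete segments arrive.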


Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18031333 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1333

Author(s):

Ahmad R. Alsaber ◽

Jiazhu Pan ◽

Adeeba Al-Hurban

Keyword(s):

Air Quality ◽

Missing Data ◽

Random Forest ◽

Missing Values ◽

Imputation Method ◽

Environmental Data ◽

Environmental Research ◽

Quality Data ◽

Data Set ◽

Air Quality Data

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.
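The evaluation design described here (hide a known fraction of values, impute, then score with RMSE and MAE) can be sketched as follows. Two assumptions are worth stating plainly: the data are synthetic, and simple mean imputation stands in for missForest, so the numbers say nothing about the paper's results. The helper names `mask_mcar` and `evaluate` are hypothetical.

```python
import math
import random
from statistics import mean

def mask_mcar(values, rate, seed=0):
    """Hide round(rate * n) values chosen uniformly at random (MCAR)."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(values)), round(rate * len(values))))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)

def rmse(truth, guess):
    return math.sqrt(mean((t - g) ** 2 for t, g in zip(truth, guess)))

def mae(truth, guess):
    return mean(abs(t - g) for t, g in zip(truth, guess))

def evaluate(truth, rate):
    masked, hidden = mask_mcar(truth, rate)
    fill = mean(v for v in masked if v is not None)  # stand-in imputer
    held_out = [truth[i] for i in hidden]
    return rmse(held_out, [fill] * len(hidden)), mae(held_out, [fill] * len(hidden))

rng = random.Random(1)
truth = [rng.gauss(3.0, 1.0) for _ in range(200)]  # synthetic log-scale readings
for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    r, m = evaluate(truth, rate)
    print(f"{rate:.0%} missing: RMSE={r:.3f} MAE={m:.3f}")
```

Errors are computed only on the cells that were deliberately hidden, since those are the only cells where the true value is known. Swapping in another imputer means replacing the `fill` line.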


Bootstrap joint prediction regions for sequences of missing values in spatio-temporal datasets

Computational Statistics ◽

10.1007/s00180-021-01099-y ◽

2021 ◽

Author(s):

Maria Lucia Parrella ◽

Giuseppina Albano ◽

Cira Perna ◽

Michele La Rocca

Keyword(s):

Missing Data ◽

Missing Values ◽

Sample Selection ◽

Sample Path ◽

Temporal Relationships ◽

Empirical Performance ◽

Spatio Temporal ◽

Joint Prediction ◽

Point Forecast ◽

Prediction Regions

Abstract: Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection and spatio-temporal relationships. To take into account the uncertainty in the point forecast, some prediction intervals may be of interest. In particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain missing paths of a time component in a spatio-temporal framework, with global probability $1-\alpha$. In many applications, considering the coverage of the whole missing sample-path might appear too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that only contain all elements of missing paths up to a small number k of them with probability $1-\alpha$. A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction and to compare it with some alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios.
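One generic way to turn bootstrap paths into a joint region is to calibrate a single band width on each path's largest standardized deviation, and to relax it by using the (k+1)-th largest deviation instead. The sketch below illustrates that general construction only; it is not the resampling scheme proposed in the paper. The simulated random-walk paths stand in for bootstrap replicates of a missing sequence, and `joint_band` is a hypothetical name.

```python
import math
import random
from statistics import mean, pstdev

def joint_band(paths, alpha=0.1, k=0):
    """Band meant to contain all but at most k points of a path
    for about a (1 - alpha) share of the supplied paths."""
    horizon = len(paths[0])
    centre = [mean(p[h] for p in paths) for h in range(horizon)]
    spread = [pstdev([p[h] for p in paths]) for h in range(horizon)]
    stats = []
    for p in paths:
        # Standardized deviations along the path, largest first.
        devs = sorted((abs(p[h] - centre[h]) / spread[h]
                       for h in range(horizon)), reverse=True)
        stats.append(devs[k])  # k = 0 uses the maximum
    stats.sort()
    c = stats[math.ceil((1 - alpha) * len(paths)) - 1]  # empirical quantile
    return [(m - c * s, m + c * s) for m, s in zip(centre, spread)]

rng = random.Random(0)
paths = []
for _ in range(200):
    level, path = 0.0, []
    for _ in range(5):            # five consecutive missing time points
        level += rng.gauss(0, 1)
        path.append(level)
    paths.append(path)

strict = joint_band(paths, alpha=0.1, k=0)
relaxed = joint_band(paths, alpha=0.1, k=1)
for (lo0, hi0), (lo1, hi1) in zip(strict, relaxed):
    print(f"k=0: [{lo0:.2f}, {hi0:.2f}]   k=1: [{lo1:.2f}, {hi1:.2f}]")
```

Allowing one point to fall outside (k = 1) never widens the band, which matches the motivation for the smaller regions described in the abstract.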


A Comparative Study of Various Methods of Handling Missing Data in UNSODA

Agriculture ◽

10.3390/agriculture11080727 ◽

2021 ◽

Vol 11 (8) ◽

pp. 727

Author(s):

Yingpeng Fu ◽

Hongjian Liao ◽

Longlong Lv

Keyword(s):

Missing Data ◽

Missing Values ◽

Soil Property ◽

Particle Density ◽

Organic Matter Content ◽

Nonparametric Tests ◽

Matter Content ◽

Support Vector ◽

Soil Database ◽

Property Data

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) for the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of data. The standard error, coefficient of variation, and standard deviation decreased significantly after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in BD using RF, SVR, ANN, mean, and MI, respectively; and this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better for imputing the missing data in UNSODA.
