Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

Mapping Intimacies ◽

10.1101/260281 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kieu Trinh Do ◽

Simone Wahl ◽

Johannes Raffler ◽

Sophie Molnos ◽

Michael Laimighofer ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Statistical Power ◽

Missing Values ◽

Biological Evaluation ◽

List Type ◽

Robust Performance ◽

Metabolomics Data ◽

Imputation Methods ◽

Biochemical Pathways

Abstract

BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.

METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and its ability to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

RESULTS: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.

CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend performing KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.

Key messages:

Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects.

Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets.

Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and correctly estimate effects of genetic variants on metabolite levels.

KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
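
The idea of KNN imputation "on observations" can be illustrated with a minimal sketch: a missing entry in one sample is filled with the mean of that variable across the K most similar samples. This is not the authors' implementation. The function name `knn_impute`, the distance measure (root mean squared difference over jointly observed variables), and the unweighted mean are assumptions made for illustration, and the variable pre-selection step mentioned in the abstract is deliberately left out.

```python
import math


def knn_impute(rows, k=10):
    """Fill None entries with the mean of the k nearest rows that observed
    the same column. Hypothetical sketch; no variable pre-selection."""
    n_cols = len(rows[0])

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows
        # (an assumed distance; other choices are possible).
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Donor candidates: other rows in which column j is observed
            # and which share at least one observed column with this row.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # if no donor exists, the entry stays None
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


samples = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],  # closest to the first sample
    [5.0, 6.0, 7.0],
    [5.1, 6.1, 7.2],
]
filled = knn_impute(samples, k=1)
```

With `k=1` the missing entry is copied from the single most similar sample; larger values of `k` average over more donors, which trades local fidelity for stability.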

Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.339.05 ◽

2019 ◽

Vol 6 (339) ◽

pp. 73-98

Author(s):

Małgorzata Aleksandra Misztal

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Data Sets ◽

Continuous Variables ◽

Imputation Methods ◽

Study Results ◽

Almost All

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not specific to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete objects. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, selected multiple imputation methods were considered, with special attention paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
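
The evaluation loop described above (mask known values, impute, score with NRMSE) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the helper names are hypothetical, mean imputation stands in for MICE, missForest and MIPCA, and the error is normalised by the variance of the true values at the masked positions, which is one common convention and may differ from the one used in the study.

```python
import math
import random


def mask_mcar(values, proportion, seed=0):
    """Hide a fixed proportion of entries completely at random."""
    rng = random.Random(seed)
    n_missing = round(proportion * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Stand-in imputer: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def nrmse(true_values, masked, imputed):
    """RMSE over masked positions, normalised by the standard deviation of
    the true values there (assumed convention). Requires at least two
    distinct true values among the masked positions."""
    idx = [i for i, v in enumerate(masked) if v is None]
    truth = [true_values[i] for i in idx]
    mse = sum((imputed[i] - true_values[i]) ** 2 for i in idx) / len(idx)
    mean_truth = sum(truth) / len(truth)
    var_truth = sum((t - mean_truth) ** 2 for t in truth) / len(truth)
    return math.sqrt(mse / var_truth)


complete = [1.0, 2.0, 3.0, 4.0]
masked = [None, 2.0, None, 4.0]   # hand-made mask for a checkable example
imputed = mean_impute(masked)     # observed mean is 3.0
score = nrmse(complete, masked, imputed)
```

A score of 0 means the imputed values match the hidden truth exactly; values near or above 1 indicate that the imputer does no better than the spread of the true values themselves.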

Characterizing and Managing Missing Structured Data in Electronic Health Records

10.1101/167858 ◽

2017 ◽

Author(s):

Brett K. Beaulieu-Jones ◽

Daniel R. Lavage ◽

John W. Snyder ◽

Jason H. Moore ◽

Sarah A Pendergrass ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Structured Data ◽

Health Record ◽

Data Types ◽

Health Records ◽

Imputation Methods ◽

Electronic Health ◽

Evaluation Of Methods

Abstract: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR) based analyses. Failure to appropriately consider missing data can lead to biased results. Here, we provide detailed procedures for when and how to conduct imputation of EHR data. We demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical lab measures from 602,366 patients in the Geisinger Health System EHR. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness. Our results show that several methods including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute consistently imputed missing values with low error; however, only a subset of the MICE methods were suitable for multiple imputation. The analyses described provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs and all of our methods and code are publicly available.

Multiple Imputation for Missing Values in Homicide Incident Data: An Evaluation Using Unique Test Data

Homicide Studies ◽

10.1177/1088767918778309 ◽

2018 ◽

Vol 22 (4) ◽

pp. 391-409

Author(s):

John M. Roberts ◽

Aki Roberts ◽

Tim Wadsworth

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Actual Data ◽

Regression Coefficients ◽

Similar Data ◽

Missing Information ◽

Imputation Methods ◽

Unique Data ◽

Incident Reports

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (that produce multiple completed datasets, across which imputed values may vary) via unique data that included actual values, from police agency incident reports, of seemingly missing SHR data. This permitted evaluation under a real, not assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation rather successfully recovered victim–offender relationship distributions and regression coefficients that hold in the actual data. Results are encouraging for users of multiple imputation, though it is still important to minimize the extent of missing information in SHR and similar data.
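
Because multiple imputation yields several completed datasets, a regression coefficient is estimated once per dataset and the results are then combined. The abstract does not say how the authors pooled their estimates; the sketch below shows the standard combining rules (Rubin's rules) as one common way to do it, with a hypothetical function name and made-up input numbers.

```python
import math


def pool_rubin(estimates, variances):
    """Combine one coefficient across m completed datasets.

    estimates: per-dataset point estimates of the coefficient
    variances: per-dataset squared standard errors
    Returns (pooled estimate, pooled standard error). Needs m >= 2.
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m  # average sampling variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between  # total variance
    return pooled, math.sqrt(total)


# Illustrative numbers only, not taken from the study.
coef, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what separates this from simply averaging: when imputed values vary a lot across datasets, the pooled standard error grows to reflect the uncertainty caused by the missing data.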

Recursive Partitioning Methods for Data Imputation in the Context of Item Response Theory: A Monte Carlo Simulation

Psicológica Journal ◽

10.2478/psicolj-2018-0005 ◽

2018 ◽

Vol 39 (1) ◽

pp. 88-117 ◽

Cited By ~ 1

Author(s):

Julianne M. Edwards ◽

W. Holmes Finch

Keyword(s):

Missing Data ◽

Item Response Theory ◽

Multiple Imputation ◽

Item Response ◽

Missing Values ◽

Recursive Partitioning ◽

Data Imputation ◽

Response Theory ◽

Imputation Methods ◽

Missing Responses

Abstract: Missing data is a common problem faced by psychometricians and measurement professionals. A number of techniques have been proposed to handle missing data in the context of Item Response Theory. These include several data imputation methods (corrected item mean substitution imputation, response function imputation, multiple imputation, and the EM algorithm) as well as approaches that do not rely on the imputation of missing values (treating the item as not presented, or coding missing responses as incorrect or as fractionally correct). Although multiple imputation has demonstrated the best performance of these methods in prior research, its mean absolute error (MAE) in model parameter estimation remained relatively high. Given this, the goal of this simulation study was to explore the performance of a set of potentially promising data imputation methods based on recursive partitioning. The results demonstrated that approaches combining multivariate imputation by chained equations with recursive partitioning algorithms yield data with relatively low estimation MAE for both item difficulty and item discrimination. Implications of these findings are discussed.

Assessment of label-free quantification and missing value imputation for proteomics in non-human primates

10.1101/2021.07.30.454221 ◽

2021 ◽

Author(s):

Zeeshan Hamid ◽

Kip D. Zimmerman ◽

Hector Guillen-Ahlers ◽

Cun Li ◽

Peter Nathanielsz ◽

...

Keyword(s):

Missing Data ◽

Search Engine ◽

Statistical Power ◽

Missing Values ◽

Biological Information ◽

Label Free ◽

Proteomics Data ◽

Imputation Methods ◽

Label Free Quantification ◽

Free Quantification

Introduction: Reliable and effective label-free quantification (LFQ) analyses depend not only on the method of data acquisition in the mass spectrometer, but also on the downstream data processing, including software tools, query database, data normalization and imputation. In non-human primates (NHP), LFQ is challenging because the query databases for NHP are limited, since the genomes of these species are not comprehensively annotated. This invariably results in limited discovery of proteins and associated Post Translational Modifications (PTMs) and a higher fraction of missing data points. While identification of fewer proteins and PTMs due to database limitations can hinder the discovery of important and meaningful biological information, missing data also limits downstream analyses (e.g., multivariate analyses), decreases statistical power, biases statistical inference, and makes biological interpretation of the data more challenging. In this study we attempted to address both issues: first, we used the MetaMorpheus proteomics search engine to counter the limits of NHP query databases and maximize the discovery of proteins and associated PTMs, and second, we evaluated different imputation methods for accurate data inference.

Results: Using the MetaMorpheus proteomics search engine we obtained quantitative data for 1,622 proteins and 10,634 peptides including 58 different PTMs (biological, metal and artifacts) across a diverse age range of NHP brain frontal cortex. However, among the 1,622 proteins identified, only 293 proteins were quantified across all samples with no missing values, emphasizing the importance of implementing an accurate and statistically valid imputation method to fill in missing data. In our imputation analysis we demonstrate that single imputation methods that borrow information from correlated proteins, such as Generalized Ridge Regression (GRR), Random Forest (RF), local least squares (LLS), and Bayesian Principal Component Analysis (BPCA), are able to estimate missing protein abundance values with great accuracy.

Conclusions: Overall, this study offers a detailed comparative analysis of LFQ data generated in NHP and proposes strategies for improved LFQ in NHP proteomics data.

The effects of model based missing data methods on guessing parameter in case of ignorable missing data

Pegem Eğitim ve Öğretim Dergisi ◽

10.14527/pegegog.2018.007 ◽

2017 ◽

Vol 8 (1) ◽

pp. 155-172

Author(s):

Duygu Koçak

Keyword(s):

Missing Data ◽

Em Algorithm ◽

Multiple Imputation ◽

Reference Point ◽

Missing Values ◽

Data Sets ◽

Sample Sizes ◽

Imputation Methods ◽

Model Based ◽

Random Mechanism

The present study investigates the effects of model-based missing data methods on the guessing parameter in the case of ignorable missing data. For this purpose, data based on the three-parameter logistic model of Item Response Theory were generated with sample sizes of 500, 1000 and 3000; values missing at random (MAR) and missing completely at random (MCAR) were then created at rates of 2.00%, 5.00% and 10.00%. These missing values were completed using the expectation-maximization (EM) algorithm and multiple imputation. For data sets with values missing completely at random, the EM algorithm and multiple imputation performed effectively, depending on the rate of missing values. When the missing value rate was 2.00%, both methods performed well at all sample sizes; however, the estimates moved away from the reference point as the rate of missing values increased. On the other hand, when the sample size was 3000, the estimates were closer to the reference point even when the rate of missing values was high. Under the missing at random mechanism, both methods estimated the guessing parameter well when the rate of missing values was low, yet this performance deteriorated considerably as the rate increased. Overall, neither the EM algorithm nor multiple imputation performed effectively on the guessing parameter under the missing at random mechanism.
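
The two ignorable mechanisms named above differ in what drives the missingness: under MCAR every value is equally likely to be missing, while under MAR the chance of being missing depends on another, fully observed variable. The sketch below shows one way such data could be generated at a fixed rate. It is an assumption-laden illustration, not the study's procedure: the function names are hypothetical, and MAR is implemented with rank-based weights on a "driver" column using weighted sampling without replacement.

```python
import random


def _mask(rows, target, chosen):
    # Copy the rows, setting the target column to None for chosen indices.
    return [[None if (i in chosen and j == target) else v
             for j, v in enumerate(row)]
            for i, row in enumerate(rows)]


def make_mcar(rows, target, rate, seed=0):
    """MCAR: every row has the same chance of losing its target value."""
    rng = random.Random(seed)
    n_missing = round(rate * len(rows))
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return _mask(rows, target, chosen)


def make_mar(rows, target, driver, rate, seed=0):
    """MAR: rows with larger observed driver values are more likely to
    lose their target value (assumed rank-based weighting)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(rows))
    order = sorted(range(len(rows)), key=lambda i: rows[i][driver])
    weights = {idx: rank + 1 for rank, idx in enumerate(order)}
    # Weighted sampling without replacement: draw a key u ** (1 / w) for
    # each row and keep the rows with the largest keys.
    keys = {i: rng.random() ** (1.0 / weights[i]) for i in range(len(rows))}
    chosen = set(sorted(keys, key=keys.get, reverse=True)[:n_missing])
    return _mask(rows, target, chosen)


data = [[float(i), float(i) * 2] for i in range(100)]
mcar = make_mcar(data, target=1, rate=0.10)
mar = make_mar(data, target=1, driver=0, rate=0.10)
```

Both functions remove exactly the requested share of target values, so the two mechanisms can be compared at matched missingness rates such as 2%, 5% and 10%.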

Missing Data - Better "Not to Have Them", but What If You Do? (Part 1)

Marketing ZFP ◽

10.15358/0344-1369-2019-4-21 ◽

2019 ◽

Vol 41 (4) ◽

pp. 21-32

Author(s):

Dirk Temme ◽

Sarah Jensen

Keyword(s):

Missing Data ◽

Statistical Power ◽

Missing Values ◽

Graphical Representation ◽

Marketing Research ◽

Likelihood Estimation ◽

Parameter Estimates ◽

Full Information Maximum Likelihood ◽

Definition Of ◽

Traditional Approaches

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation analysis or full-information maximum likelihood estimation. Due to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of the prerequisites and limitations of these methods as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1 and first introduces Rubin’s classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Secondly, a selection of visualization tools available in different R packages for the description and exploration of missing data structures is presented.

Kernel weighted least square approach for imputing missing values of metabolomics data

Scientific Reports ◽

10.1038/s41598-021-90654-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nishith Kumar ◽

Md. Aminul Hoque ◽

Masahiro Sugimoto

Keyword(s):

Missing Data ◽

Large Scale ◽

Missing Values ◽

Kernel Weight ◽

Least Square ◽

Data Matrix ◽

Data Imputation ◽

Metabolomics Data ◽

Missing Value ◽

Missing Data Imputation

Abstract: Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, the conventional techniques address only the missing value problem and do not handle outliers. Outliers in the dataset therefore decrease the accuracy of the imputation. We developed a new missing data imputation technique based on a kernel weight function that addresses both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed imputation techniques using both artificially generated and experimentally measured data, in both the absence and presence of different rates of outliers. Results on both artificial data and real metabolomics data indicate the superiority of the proposed kernel weight-based technique over the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.

Comparison of Missing Data Infilling Mechanisms for Recovering a Real-World Single Station Streamflow Observation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18168375 ◽

2021 ◽

Vol 18 (16) ◽

pp. 8375

Author(s):

Thelma Dede Baddoo ◽

Zhijia Li ◽

Samuel Nii Odai ◽

Kenneth Rodolphe Chabi Boni ◽

Isaac Kwesi Nooni ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Real World ◽

Missing Values ◽

Total Error ◽

Extensive Study ◽

Error Measurement ◽

Missing Data Imputation ◽

Single Station ◽

Real World Datasets

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute missing data in real-world datasets in order to ascertain the accuracy of imputation algorithms for such datasets are lacking. This study investigated how complex a missing data reconstruction scheme needs to be to obtain relevant results for a real-world single station streamflow observation and so facilitate its further use. The investigation applied different missing data infilling mechanisms, ranging from univariate algorithms to multiple imputation methods suited to multivariate data, taking time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones which provide the best results are usually time and computationally intensive. Also, multiple imputation algorithms which consider the surrounding observed values and/or which can capture the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM would be especially useful when the missing data are in specific portions of the dataset or where very large gaps of ‘missingness’ occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on extensive study of the particular dataset to be imputed.
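
As a point of reference for what a simple univariate infilling scheme looks like, the sketch below fills gaps in a single series by linear interpolation between the nearest observed neighbours. The abstract does not name the specific algorithms compared, so linear interpolation is an assumed example rather than one of the study's methods, and the function name is hypothetical.

```python
def interpolate_gaps(series):
    """Fill None entries by linear interpolation between the surrounding
    observed values. Leading and trailing gaps copy the nearest observed
    value (an assumed edge rule)."""
    filled = list(series)
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        return filled  # nothing to anchor on
    # Edges: carry the first/last observation outward.
    for i in range(observed[0]):
        filled[i] = series[observed[0]]
    for i in range(observed[-1] + 1, len(series)):
        filled[i] = series[observed[-1]]
    # Interior gaps: straight line between the two bounding observations.
    for left, right in zip(observed, observed[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


flow = [None, 1.0, None, None, 4.0, None]
filled_flow = interpolate_gaps(flow)
```

A scheme like this uses only the station's own record, which is why it remains available when no additional data exist; its weakness is long gaps, where a straight line cannot reproduce peaks or recessions.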

Missing data in longitudinal studies: Comparison of multiple imputation methods in a real clinical setting

Journal of Evaluation in Clinical Practice ◽

10.1111/jep.13376 ◽

2020 ◽

Author(s):

Rosalba Rosato ◽

Eva Pagano ◽

Silvia Testa ◽

Paolo Zola ◽

Daniela di Cuonzo

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Longitudinal Studies ◽

Clinical Setting ◽

Imputation Methods
