Missing Values in Monotone Data Sets

A context-intensive approach to imputation of missing values in data sets from networks of environmental monitors

Journal of the Air & Waste Management Association ◽

10.1080/10962247.2015.1108251 ◽

2015 ◽

Vol 66 (1) ◽

pp. 38-52 ◽

Cited By ~ 2

Author(s):

Lawrence C. Larsen ◽

Mena Shah

Keyword(s):

Missing Values ◽

Data Sets ◽

Intensive Approach

Distance estimation in numerical data sets with missing values

Information Sciences ◽

10.1016/j.ins.2013.03.043 ◽

2013 ◽

Vol 240 ◽

pp. 115-128 ◽

Cited By ~ 19

Author(s):

Emil Eirola ◽

Gauthier Doquire ◽

Michel Verleysen ◽

Amaury Lendasse

Keyword(s):

Missing Values ◽

Distance Estimation ◽

Numerical Data ◽

Data Sets

Interactive Graphics for Data Sets with Missing Values: MANET

Journal of Computational and Graphical Statistics ◽

10.2307/1390776 ◽

1996 ◽

Vol 5 (2) ◽

pp. 113 ◽

Cited By ~ 28

Author(s):

Antony Unwin ◽

George Hawkins ◽

Heike Hofmann ◽

Bernd Siegl

Keyword(s):

Missing Values ◽

Interactive Graphics ◽

Data Sets

Image Inpainting for Missing Values in Observational Climate Datasets Using Partial Convolutions in a cuDNN

10.5194/egusphere-egu2020-21555 ◽

2020 ◽

Author(s):

Christopher Kadow ◽

David Hall ◽

Uwe Ulbrich

Keyword(s):

Missing Values ◽

Coupled Model ◽

Image Inpainting ◽

Climate Science ◽

Data Sets ◽

Learning Technology ◽

Climate Change Research ◽

The Past ◽

Combining Data ◽

High Degree

Climate change research relies on climate information from the past. Historic temperature observations form global gridded datasets such as HadCRUT4, which is examined, for example, in the IPCC reports. However, datasets that combine these records are sparse in the past, and even today they contain missing values. Here we show that machine learning technology can be applied to refill these missing climate values in observational datasets. We found that image inpainting using partial convolutions in a CUDA-accelerated deep neural network can be trained on large Earth system model experiments from the NOAA reanalysis (20CR) and the Coupled Model Intercomparison Project phase 5 (CMIP5). The resulting deep neural networks are capable of independently refilling missing values that were added to these experiments. The analysis shows a very high degree of reconstruction, even when a network trained on one dataset is used to reconstruct the other. The network reconstruction evaluates better than other methods typical in climate science. Finally, we will show the newly reconstructed observational dataset HadCRUT4 and discuss further investigations.
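To make the idea of a partial convolution concrete, the following is a minimal one-dimensional sketch, not the authors' network. It assumes the usual formulation in which the kernel is applied only to observed entries, the result is rescaled by the ratio of window size to observed count, and the mask is updated wherever at least one entry was observed. The function name `partial_conv1d`, the averaging kernel, and the toy series are illustrative assumptions, and the bias term is omitted.

```python
from typing import List, Tuple


def partial_conv1d(
    values: List[float],
    mask: List[int],
    kernel: List[float],
) -> Tuple[List[float], List[int]]:
    """One partial-convolution pass over a 1-D series (no padding, stride 1).

    mask[i] is 1 where values[i] is observed and 0 where it is missing.
    Returns the convolved values and the updated mask.
    """
    k = len(kernel)
    out_values: List[float] = []
    out_mask: List[int] = []
    for start in range(len(values) - k + 1):
        window_mask = mask[start:start + k]
        observed = sum(window_mask)
        if observed == 0:
            # Nothing observed under the kernel: the output stays a hole.
            out_values.append(0.0)
            out_mask.append(0)
            continue
        # Apply the kernel to observed entries only.
        acc = sum(
            w * v
            for w, v, m in zip(kernel, values[start:start + k], window_mask)
            if m
        )
        # Rescale so that windows with fewer observations are not damped.
        out_values.append(acc * k / observed)
        out_mask.append(1)
    return out_values, out_mask


# Toy series: zeros at masked positions are placeholders, not data.
series = [1.0, 2.0, 0.0, 4.0, 0.0, 0.0, 0.0]
mask = [1, 1, 0, 1, 0, 0, 0]
smoothed, new_mask = partial_conv1d(series, mask, kernel=[1 / 3, 1 / 3, 1 / 3])
```

After one pass the hole at the third position is covered by neighbouring observations, while the final window, which contains no observations at all, remains masked. Stacking such layers is what lets the holes shrink progressively.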

Single imputation method of missing values in environmental pollution data sets

Atmospheric Environment ◽

10.1016/j.atmosenv.2006.06.040 ◽

2006 ◽

Vol 40 (38) ◽

pp. 7316-7330 ◽

Cited By ~ 40

Author(s):

A. Plaia ◽

A. Bondi

Keyword(s):

Environmental Pollution ◽

Missing Values ◽

Imputation Method ◽

Data Sets ◽

Single Imputation

DBSCANI: Noise-Resistant Method for Missing Value Imputation

Journal of Intelligent Systems ◽

10.1515/jisys-2014-0172 ◽

2016 ◽

Vol 25 (3) ◽

pp. 431-440 ◽

Cited By ~ 1

Author(s):

Archana Purwar ◽

Sandeep Kumar Singh

Keyword(s):

Spatial Data ◽

Missing Values ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Data Sets ◽

Quality Of Data ◽

Data Set ◽

Dbscan Clustering ◽

Density Based Clustering

Abstract Data quality is an important concern in data mining. The validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MV, but earlier work has paid little attention to noise. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases. High-density regions are known as clusters, and low-density regions correspond to the noise objects in the data set. Numerous experiments have been performed on the Iris data set from the life science domain and Jain’s (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that our method is more noise resistant than KMI on the data sets under study.
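The abstract does not spell out the DBSCANI procedure, so the sketch below is a simplified stand-in rather than the published method. It imputes from complete rows inside an `eps` neighbourhood when that neighbourhood is dense enough (at least `min_pts` rows) and otherwise falls back to the column mean, then scores the result with RMSE. The function names, the parameters `eps` and `min_pts`, and the toy data are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence

Row = List[Optional[float]]


def rmse(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Root mean square error between true and imputed values."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def impute_from_dense_neighbours(rows: List[Row], eps: float, min_pts: int) -> List[List[float]]:
    """Fill each None with the mean of complete rows lying within eps.

    Distances use only the coordinates observed in the incomplete row.
    If fewer than min_pts complete rows are that close, the row is treated
    as lying in a sparse (noise-like) region and the column mean over all
    complete rows is used instead.
    Assumes at least one complete row exists and min_pts >= 1.
    """
    complete = [r for r in rows if None not in r]
    n_cols = len(rows[0])
    col_means = [sum(r[j] for r in complete) / len(complete) for j in range(n_cols)]
    filled: List[List[float]] = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        seen = [j for j, v in enumerate(row) if v is not None]
        near = [
            c for c in complete
            if math.sqrt(sum((row[j] - c[j]) ** 2 for j in seen)) <= eps
        ]
        if len(near) >= min_pts:
            source = [sum(c[j] for c in near) / len(near) for j in range(n_cols)]
        else:
            source = col_means
        filled.append([source[j] if v is None else v for j, v in enumerate(row)])
    return filled


# Three points form a dense group near (1, 1); (8, 8) plays the role of noise.
data: List[Row] = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [1.1, None]]
completed = impute_from_dense_neighbours(data, eps=0.5, min_pts=2)
error = rmse([1.0], [completed[4][1]])  # 1.0 is a hypothetical true value
```

Because the outlier at (8, 8) lies outside the neighbourhood, it does not influence the imputed value, whereas a plain column mean would be pulled towards it. This is the intuition behind calling a density-based approach noise resistant.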

Comparison of Algorithms for Clustering Incomplete Data

Foundations of Computing and Decision Sciences ◽

10.2478/fcds-2014-0007 ◽

2014 ◽

Vol 39 (2) ◽

pp. 107-127 ◽

Cited By ~ 6

Author(s):

Artur Matyja ◽

Krzysztof Siminski

Keyword(s):

Data Analysis ◽

Incomplete Data ◽

Missing Values ◽

Real Data ◽

Complete Data ◽

The Other ◽

Data Sets ◽

Missing Value ◽

Comparison Of Algorithms ◽

New Algorithms

Abstract Missing values are not uncommon in real data sets. Algorithms and methods designed for the analysis of complete data sets cannot always be applied to data with missing values. To use the existing methods for complete data, data sets with missing values are preprocessed. The other solution is to create new algorithms dedicated to data sets with missing values. The objective of our research is to compare the preprocessing techniques with the specialised algorithms and to find where each is most advantageous.

DreamAI: algorithm for the imputation of proteomics data

10.1101/2020.07.21.214205 ◽

2020 ◽

Author(s):

Weiping Ma ◽

Sunkyu Kim ◽

Shrabanti Chowdhury ◽

Zhi Li ◽

Mi Yang ◽

...

Keyword(s):

Missing Values ◽

Imputation Accuracy ◽

High Dimensional ◽

Data Sets ◽

Proteomics Data ◽

Dynamic Nature ◽

Substantial Fraction ◽

Imputation Methods ◽

Data Analyses ◽

Proteomics Research

Abstract Deep proteomics profiling using labelled LC-MS/MS experiments has proven powerful for studying complex diseases. However, due to the dynamic nature of discovery mass spectrometry, the generated data contain a substantial fraction of missing values. This poses great challenges for data analyses, as many tools, especially those for high-dimensional data, cannot deal with missing values directly. To address this problem, the NCI-CPTAC Proteogenomics DREAM Challenge was carried out to develop effective imputation algorithms for labelled LC-MS/MS proteomics data through crowd learning. The final resulting algorithm, DreamAI, is based on an ensemble of six different imputation methods. The imputation accuracy of DreamAI, as measured by correlation, is about 15%-50% greater than that of existing tools among less abundant proteins, which are more likely to be missing in proteomics data sets. This new tool enhances data analysis capabilities in proteomics research.

Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.339.05 ◽

2019 ◽

Vol 6 (339) ◽

pp. 73-98

Author(s):

Małgorzata Aleksandra Misztal

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Data Sets ◽

Continuous Variables ◽

Imputation Methods ◽

Study Results ◽

Almost All

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not tied to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable included in the analysis, and the typical approach to missing data is simply to delete the incomplete cases. However, this leads to inefficient and biased results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. The paper considers selected multiple imputation methods, with special attention paid to principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations compared with two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed using MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the analyses conducted, missForest can be recommended as a multiple imputation method providing the lowest imputation error rates for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
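The evaluation protocol described above (hide known values, impute them, score with NRMSE) can be sketched as follows. This is a minimal illustration under stated assumptions: values are hidden completely at random, plain mean imputation stands in for MICE, missForest and MIPCA, and the error is normalised by the variance of the full true column, since the abstract does not give the exact normalisation. All function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def mask_completely_at_random(
    column: List[float], proportion: float, seed: int = 0
) -> Tuple[List[Optional[float]], List[int]]:
    """Hide round(proportion * n) entries chosen uniformly at random."""
    rng = random.Random(seed)
    n_hidden = round(proportion * len(column))
    hidden = sorted(rng.sample(range(len(column)), n_hidden))
    masked: List[Optional[float]] = list(column)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def mean_impute(column: List[Optional[float]]) -> List[float]:
    """Replace every None with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def nrmse(truth: List[float], imputed: List[float], hidden: List[int]) -> float:
    """RMSE over the hidden entries, normalised by the spread of the true column.

    Assumption: normalisation uses the population variance of the full
    true column; other definitions of NRMSE exist.
    """
    mse = sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden)
    mu = sum(truth) / len(truth)
    variance = sum((t - mu) ** 2 for t in truth) / len(truth)
    return math.sqrt(mse / variance)


truth = [float(i) for i in range(1, 11)]
masked, hidden = mask_completely_at_random(truth, proportion=0.3, seed=42)
imputed = mean_impute(masked)
score = nrmse(truth, imputed, hidden)
```

Repeating this over several proportions and several imputation methods, and comparing the resulting scores, mirrors the structure of the simulation study. A score of 0 means the hidden values were recovered exactly.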

Nursing Home Factors and Their Impact on COVID-19 Cases: A Study of Wisconsin State

Innovation in Aging ◽

10.1093/geroni/igab046.3192 ◽

2021 ◽

Vol 5 (Supplement_1) ◽

pp. 884-884

Author(s):

Mohammed Abahussain ◽

Priya Nambisan ◽

Colleen Galambos ◽

Bo Zhang ◽

Elizabeth Bukowy

Keyword(s):

Nursing Home ◽

Nursing Homes ◽

Missing Values ◽

Organizational Factors ◽

Data Sets ◽

Occupancy Rate ◽

Free Standing ◽

Staff Shortage ◽

For Profit ◽

Nurse Aides

Abstract COVID-19 has been devastating for Nursing Homes (NHs). The concentration of older adults with underlying chronic conditions inevitably made the setting highly vulnerable, leading to high rates of mortality for residents. However, some nursing homes fared better than others. This study examines several quality measures and organizational factors to understand whether these factors are associated with COVID-19 cases in Wisconsin. We combined three datasets from the Centers for Medicare & Medicaid Services (CMS): the Star Rating dataset, the Provider Information dataset and the COVID-19 Nursing Home dataset. The data cover the period Jan 1 – Oct 25, 2020 for the state of Wisconsin. The analysis includes 331 free-standing NHs with no missing values in the data sets. The variables used were self-reported information on nursing home ratings, staff shortage, staff reported hours, occupancy rate, number of beds and ownership. Of the 331 NHs examined, shortages were reported for licensed nurse staff (25.4%), nurse aides (31.1%), clinical staff (3.2%) and other staff (15.6%). Additionally, there was a significant (p<.05) positive correlation between number of beds and COVID-19 cases, and there was no statistically significant association between occupancy rate and COVID-19 cases. NHs with better star ratings were also found to have fewer COVID-19 cases. Interestingly, private NHs had significantly higher COVID-19 cases than for-profit and government-owned NHs, a finding that is congruent with other studies in this area. Recommendations for practice will be discussed.
