Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets

2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Sreevidya Sadananda Sadasiva Rao ◽  
Lori A. Shepherd ◽  
Andrew E. Bruno ◽  
Song Liu ◽  
Jeffrey C. Miecznikowski

Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision and comparability of microarrays, as well as of various microarray analysis methods. However, to date no studies that we are aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the MAQC Affymetrix datasets to evaluate several imputation procedures for Affymetrix microarrays. Results. We evaluated several cutting-edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the data and imputed the missing values using each imputation method. We performed 1000 simulations and averaged the results. The results for 5% and 10% deletion are similar. Among the imputation methods, we observe that the local least squares method is the most accurate under the error measures considered. The k-nearest neighbor method has the highest error rate across imputation methods and error measures. Conclusions. We conclude that for imputing missing values in Affymetrix microarray datasets preprocessed with the MAS 5.0 scheme, the local least squares method has the best overall performance and the k-nearest neighbor method has the worst overall performance. These results hold for both 5% and 10% missing values.
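The evaluation protocol this abstract describes (random MCAR deletion, imputation, error measurement over the deleted entries only) can be sketched in a few lines of NumPy. The matrix size, deletion fraction, and the simple row-mean imputer below are illustrative stand-ins, not the study's actual data or its local least squares / k-nearest neighbor imputers:

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_mcar(X, frac, rng):
    """Mask a random fraction of entries (missing completely at random)."""
    M = X.copy()
    mask = rng.random(X.shape) < frac
    M[mask] = np.nan
    return M, mask

def impute_row_mean(M):
    """Fill each missing entry with the mean of its row's observed values."""
    out = M.copy()
    row_means = np.nanmean(M, axis=1)
    idx = np.where(np.isnan(out))
    out[idx] = row_means[idx[0]]
    return out

def nrmse(X, imputed, mask):
    """Normalized RMSE computed over the artificially deleted entries only."""
    err = X[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(X[mask])

X = rng.normal(size=(200, 20))       # stand-in for an expression matrix
M, mask = delete_mcar(X, 0.05, rng)  # 5% deletion, as in the study
imp = impute_row_mean(M)
print(round(nrmse(X, imp, mask), 3))
```

Averaging this error over many simulated deletions, as the study does with 1000 repetitions, gives a stable ranking of imputers.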

2021 ◽  
Vol 8 (3) ◽  
pp. 215-226
Author(s):  
Parisa Saeipourdizaj ◽  
Parvin Sarbakhsh ◽  
Akbar Gholampour

Background: In air quality studies, missing data are common, arising from causes such as instrument failure or human error. The approach used to handle such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanisms and imputation methods, apply some of them to impute missing values of PM10 and O3 in Tabriz, and compare their efficiency. Methods: The methods of mean imputation, the EM algorithm, regression, classification and regression trees, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10%, 20%, and 30% missing values. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Based on all indicators, interpolation, moving average, and KNN had the best performance, in that order. PMM did not perform well either with or without spatio-temporal information. Conclusion: Given that pollution data always depend on preceding and subsequent observations, methods whose computation draws on neighboring values in time performed better than the others, so for pollutant data these methods are recommended.
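The comparison reported here can be illustrated on a synthetic series; the sinusoid-plus-noise "pollutant" data and the two imputers below (linear interpolation, which uses neighboring observations in time, versus overall-mean filling, which does not) are assumptions for the sketch, not the study's Tabriz measurements:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for an hourly PM10 series: smooth cycle plus noise.
t = np.arange(500)
series = 40 + 10 * np.sin(2 * np.pi * t / 96) + rng.normal(0, 2, t.size)

# Simulate 10% missing completely at random.
mask = rng.random(t.size) < 0.10
obs = series.copy()
obs[mask] = np.nan

def impute_interpolate(y):
    """Linear interpolation between the nearest observed neighbors."""
    out = y.copy()
    miss = np.isnan(out)
    out[miss] = np.interp(t[miss], t[~miss], out[~miss])
    return out

def impute_mean(y):
    """Fill every gap with the overall mean of the observed values."""
    out = y.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

def rmse(truth, est, m):
    return np.sqrt(np.mean((truth[m] - est[m]) ** 2))

r_interp = rmse(series, impute_interpolate(obs), mask)
r_mean = rmse(series, impute_mean(obs), mask)
print(round(r_interp, 2), round(r_mean, 2))
```

Because the series is autocorrelated, interpolation exploits the before-and-after information the conclusion emphasizes, while mean filling ignores it and incurs a much larger error.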


2009 ◽  
Vol 07 (01) ◽  
pp. 157-173 ◽  
Author(s):  
SHIHONG MAO ◽  
CHARLES WANG ◽  
GUOZHU DONG

Microarray technology has great potential for improving our understanding of biological processes, medical conditions, and diseases. Often, microarray datasets are collected using different microarray platforms (provided by different companies) under different conditions in different laboratories. The cross-platform and cross-laboratory concordance of the microarray technology needs to be evaluated before it can be successfully and reliably applied in biological/clinical practice. New measures and techniques are proposed for comparing and evaluating the quality of microarray datasets generated from different platforms/laboratories. These measures and techniques are based on the following philosophy: the practical usefulness of the microarray technology may be confirmed if discriminating genes and classifiers, which are the focus of most, if not all, comparative investigations, discovered/trained from data collected in one lab/platform combination can be transferred to another lab/platform combination. The rationale is that the nondiscriminating genes might not be as strongly regulated as the discriminating genes by the biological process of the tissue cells under study, and hence they may behave more randomly than the discriminating genes. Our experimental results, on microarray datasets generated from different platforms/laboratories using the reference mRNA samples in the MicroArray Quality Control (MAQC) project, show that DNA microarrays can produce highly repeatable data in a cross-platform, cross-lab manner when one focuses on the discriminating genes and classifiers. In our comparative study, we compare samples of one type against samples of another type; the methodology can be applied to situations where one compares any class of data against another.
Other findings include: (1) using three discriminating-gene/classifier-based methods to test the concordance between microarray datasets gave consistent results; (2) when noisy (nondiscriminating) genes were removed, the microarray datasets from different laboratories using a common platform were found to be highly concordant, and the data generated using most of the commercial platforms studied here were also found to be concordant with each other; (3) several series of artificial datasets with a known degree of difference were created to establish a bridge between consistency rate and P-value, allowing us to estimate the P-value if the consistency rate between two datasets is known.


Author(s):  
Wisam A. Mahmood ◽  
Mohammed S. Rashid ◽  
Teaba Wala Aldeen

Missing values commonly occur in medical research and can introduce substantial bias if they are neglected or handled poorly. Standard statistical methods for dealing with such challenges have been developed and are available, yet no single method can be relied on to produce credible estimates in all settings. When a dataset contains missing values, the effective sample size is reduced and efficiency decreases. A number of imputation methods have addressed these challenges in earlier scholarly work. Common methods include the complete case method, mean imputation, Last Observation Carried Forward (LOCF), the Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), hot deck (HOT), regression imputation, K-nearest neighbor (KNN), K-means clustering, fuzzy K-means clustering, support vector machines, and Multiple Imputation (MI). In the present paper, a simulation study investigates the efficacy of the above-mentioned archetypal imputation methods in a longitudinal data setting under missingness completely at random (MCAR). Missingness was introduced in three cases within a block at a low level of 5% as well as at higher levels of 30% and 50%. From this simulation study, we conclude that the LOCF method is more biased than the other methods in most situations.
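LOCF, the method the simulation found most biased, is simple to state precisely: each gap is filled with the most recent observed value, which freezes the trajectory and therefore biases any series with a trend. A minimal sketch:

```python
import numpy as np

def locf(y):
    """Last Observation Carried Forward: fill each gap with the most
    recent observed value; leading gaps (no prior observation) stay NaN."""
    out = np.asarray(y, dtype=float).copy()
    last = np.nan
    for i in range(out.size):
        if np.isnan(out[i]):
            out[i] = last
        else:
            last = out[i]
    return out

print(locf([1.0, np.nan, np.nan, 4.0, np.nan]))  # [1. 1. 1. 4. 4.]
```

In a longitudinal series rising from 1.0 toward 4.0, the carried-forward values understate the interim trajectory, which is exactly the kind of bias the simulation study detects.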


Sensors ◽  
2020 ◽  
Vol 20 (4) ◽  
pp. 1224 ◽  
Author(s):  
Mengyue Han ◽  
Qian Wang ◽  
Yuanlan Wen ◽  
Min He ◽  
Xiufeng He

The tracking accuracy of a traditional Frequency Lock Loop (FLL) decreases significantly in a complex environment, reducing the overall performance of a satellite receiver. To ensure high tracking accuracy in a complex environment, this paper proposes a new tracking loop combining the vector FLL (VFLL) with a robust least squares method, which accurately matches the weights of received signals of different qualities to ensure high positioning accuracy. The weights of received signals are selected at the signal level, not at the observation level. In this paper, the ranges of strong and weak signals of the loop are determined according to the different expressions of the distribution function at different signal strengths, and the concept of loop segmentation is introduced. The segmentation results of the FLL are taken as the basis of the weight selection and are then combined with the Institute of Geodesy and Geophysics (IGGIII) weight function to obtain the equivalent weight matrix; experiments are conducted to demonstrate the advantages of the proposed method over traditional methods. The experimental results show that the proposed VFLL tracking method has strong denoising capability under both normal-signal and harsh application environment conditions. Accordingly, the proposed model has promising prospects for application.


1993 ◽  
Vol 03 (03) ◽  
pp. 797-802
Author(s):  
R. WAYLAND ◽  
D. PICKETT ◽  
D. BROMLEY ◽  
A. PASSAMANTE

The effect of the chosen forecasting method on the measured predictability of a noisy recurrent time series is investigated. Situations where the length of the time series is limited and where the level of corrupting noise is significant are emphasized. Two simple prediction methods based on explicit nearest-neighbor averages are compared to a more complicated, and computationally expensive, local linearization technique based on the method of total least squares. The comparison is made first for noise-free and then for noisy time series. It is shown that when working with short time series with high levels of additive noise, the simple prediction schemes perform just as well as the more sophisticated total least squares method.
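A nearest-neighbor-average predictor of the kind compared in this study can be sketched as follows; the embedding length m, neighbor count k, and the noisy sine test series are illustrative assumptions, and the total least squares competitor is not reproduced here:

```python
import numpy as np

def nn_forecast(x, m=3, k=5):
    """One-step forecast: average the successors of the k delay vectors
    (windows of length m) nearest to the most recent window."""
    n = len(x)
    cands = np.array([x[i:i + m] for i in range(n - m)])  # windows with a known successor
    succ = np.array([x[i + m] for i in range(n - m)])
    query = x[n - m:]
    order = np.argsort(np.linalg.norm(cands - query, axis=1))
    return succ[order[:k]].mean()

rng = np.random.default_rng(1)
t = np.arange(300)
x = np.sin(2 * np.pi * t / 25) + rng.normal(0, 0.05, t.size)  # noisy recurrent series

pred = nn_forecast(x[:-1], m=4, k=5)  # predict the held-out final point
print(round(abs(pred - x[-1]), 3))    # small one-step error
```

Averaging over neighbors suppresses additive noise without fitting any local model, which is why such schemes hold up well on short, noisy series.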


Author(s):  
Chisimkwuo John ◽  
Emmanuel J. Ekpenyong ◽  
Charles C. Nworu

This study assessed five approaches for imputing missing values. The evaluated methods include Singular Value Decomposition imputation (svdPCA), Bayesian imputation (bPCA), probabilistic imputation (pPCA), Non-linear Iterative Partial Least Squares imputation (nipalsPCA), and Local Least Squares imputation (llsPCA). Missing data at 5%, 10%, 15%, and 20% were created under a missing completely at random (MCAR) assumption using five variables (Net Foreign Assets (NFA), Credit to Core Private Sector (CCP), Reserve Money (RM), Narrow Money (M1), and Private Sector Demand Deposits (PSDD)) from the Nigerian quarterly monetary aggregate dataset from 1981 to 2019, using R software. The data were collected from the Central Bank of Nigeria statistical bulletin. The five imputation methods were used to estimate the artificially generated missing values. The performance of the PCA imputation approaches was evaluated using the Mean Forecast Error (MFE), Root Mean Squared Error (RMSE), and Normalized Root Mean Squared Error (NRMSE) criteria. The results suggest that the bPCA, llsPCA, and pPCA methods performed better than the other imputation methods, with bPCA an appropriate choice and llsPCA the best overall, as it appears more stable than the others across the proportions of missingness.
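The three evaluation criteria used here can be written out directly. Note that NRMSE normalization conventions vary (by the range, mean, or standard deviation of the true values); the range convention used below is an assumption of this sketch, as are the toy numbers:

```python
import numpy as np

def mfe(truth, est):
    """Mean forecast error: average signed deviation, i.e. bias."""
    return np.mean(est - truth)

def rmse(truth, est):
    """Root mean squared error."""
    return np.sqrt(np.mean((est - truth) ** 2))

def nrmse(truth, est):
    """RMSE normalized by the range of the true values."""
    return rmse(truth, est) / (truth.max() - truth.min())

truth = np.array([10.0, 20.0, 30.0, 40.0])
est = np.array([12.0, 18.0, 30.0, 44.0])
print(mfe(truth, est), round(rmse(truth, est), 3), round(nrmse(truth, est), 4))
# 1.0 2.449 0.0816
```

MFE exposes systematic over- or under-imputation that RMSE alone hides, which is why studies like this one report both.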


Author(s):  
Mehmet S. Aktaş ◽  
Sinan Kaplan ◽  
Hasan Abacı ◽  
Oya Kalipsiz ◽  
Utku Ketenci ◽  
...  

Missing data are a common problem affecting data clustering quality. Most real-life datasets have missing data, which in turn affects clustering tasks. This chapter investigates the appropriate data treatment methods for varying missing-data scarcity distributions, including gamma, Gaussian, and beta distributions. The analyzed data imputation methods include mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods for dealing with missing data, a data mining task, clustering, is utilized for evaluation. Through experimental studies, this chapter identifies the correlation between missing-data imputation methods and missing-data distributions for clustering tasks. The results of the experiments indicate that the expectation maximization and k-nearest neighbor methods provide the best results for varying missing-data scarcity distributions.
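Of the imputers listed, hot-deck is perhaps the least standardized. One common variant, random hot-deck within a column (each missing entry is filled by a value drawn from the observed entries of the same variable), can be sketched as follows; the data and missingness pattern are illustrative, not the chapter's experimental setup:

```python
import numpy as np

def hot_deck(X, rng):
    """Random hot-deck: fill each missing entry with a value drawn at
    random from the observed entries of the same column (the donor pool)."""
    out = X.copy()
    for j in range(X.shape[1]):
        col = out[:, j]
        miss = np.isnan(col)
        donors = col[~miss]
        col[miss] = rng.choice(donors, size=miss.sum())
    return out

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR missingness
filled = hot_deck(X, rng)
print(np.isnan(filled).any())  # False
```

Unlike mean imputation, hot-deck preserves the marginal distribution of each variable, which matters when the downstream task is clustering rather than point estimation.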


2003 ◽  
Vol 82 (2) ◽  
pp. 127-138 ◽  
Author(s):  
SHIZHONG XU ◽  
NENGJUN YI ◽  
DAVID BURKE ◽  
ANDRZEJ GALECKI ◽  
RICHARD A. MILLER

Many diseases show dichotomous phenotypic variation but do not follow a simple Mendelian pattern of inheritance. Variances of these binary diseases are presumably controlled by multiple loci and environmental variants. A least-squares method has been developed for mapping such complex disease loci by treating the binary phenotypes (0 and 1) as if they were continuous. However, the least-squares method is not recommended because of its ad hoc nature. Maximum Likelihood (ML) and Bayesian methods have also been developed for binary disease mapping by incorporating the discrete nature of the phenotypic distribution. In the ML analysis, the likelihood function is usually maximized using some complicated maximization algorithms (e.g. the Newton–Raphson or the simplex algorithm). Under the threshold model of binary disease, we develop an Expectation Maximization (EM) algorithm to solve for the maximum likelihood estimates (MLEs). The new EM algorithm is developed by treating both the unobserved genotype and the disease liability as missing values. As a result, the EM iteration equations have the same form as the normal equation system in linear regression. The EM algorithm is further modified to take into account sexual dimorphism in the linkage maps. Applying the EM-implemented ML method to a four-way-cross mouse family, we detected two regions on the fourth chromosome that have evidence of QTLs controlling the segregation of fibrosarcoma, a form of connective tissue cancer. The two QTLs explain 50–60% of the variance in the disease liability. We also applied a Bayesian method previously developed (modified to take into account sex-specific maps) to this data set and detected one additional QTL on chromosome 13 that explains another 26% of the variance of the disease liability. All the QTLs detected primarily show dominance effects.


1980 ◽  
Vol 59 (9) ◽  
pp. 8
Author(s):  
D.E. Turnbull
