Comparing Imputation Procedures for Affymetrix Gene Expression Datasets Using MAQC Datasets

2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Sreevidya Sadananda Sadasiva Rao ◽  
Lori A. Shepherd ◽  
Andrew E. Bruno ◽  
Song Liu ◽  
Jeffrey C. Miecznikowski

Introduction. The microarray datasets from the MicroArray Quality Control (MAQC) project have enabled the assessment of the precision and comparability of microarrays, as well as of various microarray analysis methods. However, to date no studies that we are aware of have reported the performance of missing value imputation schemes on the MAQC datasets. In this study, we use the MAQC Affymetrix datasets to evaluate several imputation procedures for Affymetrix microarrays. Results. We evaluated several cutting-edge imputation procedures and compared them using different error measures. We randomly deleted 5% and 10% of the data and imputed the missing values using each imputation method. We performed 1000 simulations and averaged the results. The results for 5% and 10% deletion are similar. Among the imputation methods, we observe that the local least squares method is the most accurate under the error measures considered. The k-nearest neighbor method has the highest error rate across imputation methods and error measures. Conclusions. We conclude that for imputing missing values in Affymetrix microarray datasets preprocessed with the MAS 5.0 scheme, the local least squares method has the best overall performance and the k-nearest neighbor method has the worst overall performance. These results hold for both 5% and 10% missing values.
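The evaluation protocol this abstract describes (random MCAR deletion, imputation, error measurement over the deleted entries only) can be sketched in a few lines of NumPy. The matrix size, deletion fraction, and the simple row-mean imputer below are illustrative stand-ins, not the study's actual data or its local least squares / k-nearest neighbor imputers:

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_mcar(X, frac, rng):
    """Mask a random fraction of entries (missing completely at random)."""
    M = X.copy()
    mask = rng.random(X.shape) < frac
    M[mask] = np.nan
    return M, mask

def impute_row_mean(M):
    """Fill each missing entry with the mean of its row's observed values."""
    out = M.copy()
    row_means = np.nanmean(M, axis=1)
    idx = np.where(np.isnan(out))
    out[idx] = row_means[idx[0]]
    return out

def nrmse(X, imputed, mask):
    """Normalized RMSE computed over the artificially deleted entries only."""
    err = X[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(X[mask])

X = rng.normal(size=(200, 20))       # stand-in for an expression matrix
M, mask = delete_mcar(X, 0.05, rng)  # 5% deletion, as in the study
imp = impute_row_mean(M)
print(round(nrmse(X, imp, mask), 3))
```

Averaging this error over many simulated deletions, as the study does with 1000 repetitions, gives a stable ranking of imputers.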

2021 ◽  
Vol 8 (3) ◽  
pp. 215-226
Author(s):  
Parisa Saeipourdizaj ◽  
Parvin Sarbakhsh ◽  
Akbar Gholampour

Background: In air quality studies, missing data are common, arising from causes such as instrument failure or human error. The approach used to handle such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanisms and imputation methods, apply some of them to impute missing values of PM10 and O3 in Tabriz, and compare their efficiency. Methods: The methods of mean imputation, the EM algorithm, regression, classification and regression trees, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10%, 20%, and 30% missing values. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Based on all indicators, interpolation, moving average, and KNN had the best performance, in that order. PMM did not perform well either with or without spatio-temporal information. Conclusion: Given that pollution data always depend on preceding and subsequent observations, methods whose computation draws on neighboring values in time performed better than the others, so for pollutant data these methods are recommended.
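The comparison reported here can be illustrated on a synthetic series; the sinusoid-plus-noise "pollutant" data and the two imputers below (linear interpolation, which uses neighboring observations in time, versus overall-mean filling, which does not) are assumptions for the sketch, not the study's Tabriz measurements:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for an hourly PM10 series: smooth cycle plus noise.
t = np.arange(500)
series = 40 + 10 * np.sin(2 * np.pi * t / 96) + rng.normal(0, 2, t.size)

# Simulate 10% missing completely at random.
mask = rng.random(t.size) < 0.10
obs = series.copy()
obs[mask] = np.nan

def impute_interpolate(y):
    """Linear interpolation between the nearest observed neighbors."""
    out = y.copy()
    miss = np.isnan(out)
    out[miss] = np.interp(t[miss], t[~miss], out[~miss])
    return out

def impute_mean(y):
    """Fill every gap with the overall mean of the observed values."""
    out = y.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

def rmse(truth, est, m):
    return np.sqrt(np.mean((truth[m] - est[m]) ** 2))

r_interp = rmse(series, impute_interpolate(obs), mask)
r_mean = rmse(series, impute_mean(obs), mask)
print(round(r_interp, 2), round(r_mean, 2))
```

Because the series is autocorrelated, interpolation exploits the before-and-after information the conclusion emphasizes, while mean filling ignores it and incurs a much larger error.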


2009 ◽  
Vol 07 (01) ◽  
pp. 157-173 ◽  
Author(s):  
SHIHONG MAO ◽  
CHARLES WANG ◽  
GUOZHU DONG

Microarray technology has great potential for improving our understanding of biological processes, medical conditions, and diseases. Often, microarray datasets are collected using different microarray platforms (provided by different companies) under different conditions in different laboratories. The cross-platform and cross-laboratory concordance of the microarray technology needs to be evaluated before it can be successfully and reliably applied in biological/clinical practice. New measures and techniques are proposed for comparing and evaluating the quality of microarray datasets generated from different platforms/laboratories. These measures and techniques are based on the following philosophy: the practical usefulness of the microarray technology may be confirmed if discriminating genes and classifiers, which are the focus of most, if not all, comparative investigations, discovered/trained from data collected in one lab/platform combination can be transferred to another lab/platform combination. The rationale is that the nondiscriminating genes might not be as strongly regulated as the discriminating genes by the biological process of the tissue cells under study, and hence they may behave more randomly than the discriminating genes. Our experimental results, on microarray datasets generated from different platforms/laboratories using the reference mRNA samples in the MicroArray Quality Control (MAQC) project, show that DNA microarrays can produce highly repeatable data in a cross-platform, cross-lab manner when one focuses on the discriminating genes and classifiers. In our comparative study, we compare samples of one type against samples of another type; the methodology can be applied to situations where one compares any class of data against another.
Other findings include: (1) using three discriminating-gene/classifier-based methods to test the concordance between microarray datasets gave consistent results; (2) when noisy (nondiscriminating) genes were removed, the microarray datasets from different laboratories using a common platform were found to be highly concordant, and the data generated using most of the commercial platforms studied here were also found to be concordant with each other; (3) several series of artificial datasets with a known degree of difference were created to establish a bridge between consistency rate and P-value, allowing us to estimate the P-value if the consistency rate between two datasets is known.


Author(s):  
Wisam A. Mahmood ◽  
Mohammed S. Rashid ◽  
Teaba Wala Aldeen

Missing values commonly occur in medical research and can introduce substantial bias if they are neglected or handled poorly. Standard statistical methods for dealing with such challenges have been developed and are available, yet no single method can be relied on to produce credible estimates in all settings. When a dataset contains missing values, the effective sample size is reduced and efficiency decreases. A number of imputation methods have addressed these challenges in earlier scholarly work. Common methods include the complete case method, mean imputation, Last Observation Carried Forward (LOCF), the Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), hot deck (HOT), regression imputation, K-nearest neighbor (KNN), K-means clustering, fuzzy K-means clustering, support vector machines, and Multiple Imputation (MI). In the present paper, a simulation study investigates the efficacy of the above-mentioned archetypal imputation methods in a longitudinal data setting under missingness completely at random (MCAR). Missingness was introduced in three cases within a block at a low level of 5% as well as at higher levels of 30% and 50%. From this simulation study, we conclude that the LOCF method is more biased than the other methods in most situations.
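LOCF, the method the simulation found most biased, is simple to state precisely: each gap is filled with the most recent observed value, which freezes the trajectory and therefore biases any series with a trend. A minimal sketch:

```python
import numpy as np

def locf(y):
    """Last Observation Carried Forward: fill each gap with the most
    recent observed value; leading gaps (no prior observation) stay NaN."""
    out = np.asarray(y, dtype=float).copy()
    last = np.nan
    for i in range(out.size):
        if np.isnan(out[i]):
            out[i] = last
        else:
            last = out[i]
    return out

print(locf([1.0, np.nan, np.nan, 4.0, np.nan]))  # [1. 1. 1. 4. 4.]
```

In a longitudinal series rising from 1.0 toward 4.0, the carried-forward values understate the interim trajectory, which is exactly the kind of bias the simulation study detects.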


Sensors ◽  
2020 ◽  
Vol 20 (4) ◽  
pp. 1224 ◽  
Author(s):  
Mengyue Han ◽  
Qian Wang ◽  
Yuanlan Wen ◽  
Min He ◽  
Xiufeng He

The tracking accuracy of a traditional Frequency Lock Loop (FLL) decreases significantly in a complex environment, reducing the overall performance of a satellite receiver. To ensure high tracking accuracy in a complex environment, this paper proposes a new tracking loop combining the vector FLL (VFLL) with a robust least squares method, which accurately matches the weights of received signals of different qualities to ensure high positioning accuracy. The weights of received signals are selected at the signal level, not at the observation level. In this paper, the ranges of strong and weak signals of the loop are determined according to the different expressions of the distribution function at different signal strengths, and the concept of loop segmentation is introduced. The segmentation results of the FLL are taken as the basis of the weight selection and are then combined with the Institute of Geodesy and Geophysics (IGGIII) weight function to obtain the equivalent weight matrix; experiments are conducted to demonstrate the advantages of the proposed method over traditional methods. The experimental results show that the proposed VFLL tracking method has strong denoising capability under both normal-signal and harsh application environment conditions. Accordingly, the proposed model has promising prospects for application.


1993 ◽  
Vol 03 (03) ◽  
pp. 797-802
Author(s):  
R. WAYLAND ◽  
D. PICKETT ◽  
D. BROMLEY ◽  
A. PASSAMANTE

The effect of the chosen forecasting method on the measured predictability of a noisy recurrent time series is investigated. Situations where the length of the time series is limited and where the level of corrupting noise is significant are emphasized. Two simple prediction methods based on explicit nearest-neighbor averages are compared to a more complicated, and computationally expensive, local linearization technique based on the method of total least squares. The comparison is made first for noise-free and then for noisy time series. It is shown that when working with short time series with high levels of additive noise, the simple prediction schemes perform just as well as the more sophisticated total least squares method.
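A nearest-neighbor-average predictor of the kind compared in this study can be sketched as follows; the embedding length m, neighbor count k, and the noisy sine test series are illustrative assumptions, and the total least squares competitor is not reproduced here:

```python
import numpy as np

def nn_forecast(x, m=3, k=5):
    """One-step forecast: average the successors of the k delay vectors
    (windows of length m) nearest to the most recent window."""
    n = len(x)
    cands = np.array([x[i:i + m] for i in range(n - m)])  # windows with a known successor
    succ = np.array([x[i + m] for i in range(n - m)])
    query = x[n - m:]
    order = np.argsort(np.linalg.norm(cands - query, axis=1))
    return succ[order[:k]].mean()

rng = np.random.default_rng(1)
t = np.arange(300)
x = np.sin(2 * np.pi * t / 25) + rng.normal(0, 0.05, t.size)  # noisy recurrent series

pred = nn_forecast(x[:-1], m=4, k=5)  # predict the held-out final point
print(round(abs(pred - x[-1]), 3))    # small one-step error
```

Averaging over neighbors suppresses additive noise without fitting any local model, which is why such schemes hold up well on short, noisy series.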


Author(s):  
Chisimkwuo John ◽  
Emmanuel J. Ekpenyong ◽  
Charles C. Nworu

This study assessed five approaches for imputing missing values. The evaluated methods include Singular Value Decomposition imputation (svdPCA), Bayesian imputation (bPCA), probabilistic imputation (pPCA), Non-linear Iterative Partial Least Squares imputation (nipalsPCA), and Local Least Squares imputation (llsPCA). Missing data at 5%, 10%, 15%, and 20% were created under a missing completely at random (MCAR) assumption using five variables (Net Foreign Assets (NFA), Credit to Core Private Sector (CCP), Reserve Money (RM), Narrow Money (M1), and Private Sector Demand Deposits (PSDD)) from the Nigerian quarterly monetary aggregate dataset from 1981 to 2019, using R software. The data were collected from the Central Bank of Nigeria statistical bulletin. The five imputation methods were used to estimate the artificially generated missing values. The performance of the PCA imputation approaches was evaluated using the Mean Forecast Error (MFE), Root Mean Squared Error (RMSE), and Normalized Root Mean Squared Error (NRMSE) criteria. The results suggest that the bPCA, llsPCA, and pPCA methods performed better than the other imputation methods, with bPCA an appropriate choice and llsPCA the best overall, as it appears more stable than the others across the proportions of missingness.
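The three evaluation criteria used here can be written out directly. Note that NRMSE normalization conventions vary (by the range, mean, or standard deviation of the true values); the range convention used below is an assumption of this sketch, as are the toy numbers:

```python
import numpy as np

def mfe(truth, est):
    """Mean forecast error: average signed deviation, i.e. bias."""
    return np.mean(est - truth)

def rmse(truth, est):
    """Root mean squared error."""
    return np.sqrt(np.mean((est - truth) ** 2))

def nrmse(truth, est):
    """RMSE normalized by the range of the true values."""
    return rmse(truth, est) / (truth.max() - truth.min())

truth = np.array([10.0, 20.0, 30.0, 40.0])
est = np.array([12.0, 18.0, 30.0, 44.0])
print(mfe(truth, est), round(rmse(truth, est), 3), round(nrmse(truth, est), 4))
# 1.0 2.449 0.0816
```

MFE exposes systematic over- or under-imputation that RMSE alone hides, which is why studies like this one report both.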


Author(s):  
Mehmet S. Aktaş ◽  
Sinan Kaplan ◽  
Hasan Abacı ◽  
Oya Kalipsiz ◽  
Utku Ketenci ◽  
...  

Missing data are a common problem affecting data clustering quality. Most real-life datasets have missing data, which in turn affects clustering tasks. This chapter investigates the appropriate data treatment methods for varying missing-data scarcity distributions, including gamma, Gaussian, and beta distributions. The analyzed data imputation methods include mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods for dealing with missing data, a data mining task, clustering, is utilized for evaluation. Through experimental studies, this chapter identifies the correlation between missing-data imputation methods and missing-data distributions for clustering tasks. The results of the experiments indicate that the expectation maximization and k-nearest neighbor methods provide the best results for varying missing-data scarcity distributions.
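Of the imputers listed, hot-deck is perhaps the least standardized. One common variant, random hot-deck within a column (each missing entry is filled by a value drawn from the observed entries of the same variable), can be sketched as follows; the data and missingness pattern are illustrative, not the chapter's experimental setup:

```python
import numpy as np

def hot_deck(X, rng):
    """Random hot-deck: fill each missing entry with a value drawn at
    random from the observed entries of the same column (the donor pool)."""
    out = X.copy()
    for j in range(X.shape[1]):
        col = out[:, j]
        miss = np.isnan(col)
        donors = col[~miss]
        col[miss] = rng.choice(donors, size=miss.sum())
    return out

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR missingness
filled = hot_deck(X, rng)
print(np.isnan(filled).any())  # False
```

Unlike mean imputation, hot-deck preserves the marginal distribution of each variable, which matters when the downstream task is clustering rather than point estimation.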


2003 ◽  
Vol 82 (2) ◽  
pp. 127-138 ◽  
Author(s):  
SHIZHONG XU ◽  
NENGJUN YI ◽  
DAVID BURKE ◽  
ANDRZEJ GALECKI ◽  
RICHARD A. MILLER

Many diseases show dichotomous phenotypic variation but do not follow a simple Mendelian pattern of inheritance. Variances of these binary diseases are presumably controlled by multiple loci and environmental variants. A least-squares method has been developed for mapping such complex disease loci by treating the binary phenotypes (0 and 1) as if they were continuous. However, the least-squares method is not recommended because of its ad hoc nature. Maximum Likelihood (ML) and Bayesian methods have also been developed for binary disease mapping by incorporating the discrete nature of the phenotypic distribution. In the ML analysis, the likelihood function is usually maximized using some complicated maximization algorithms (e.g. the Newton–Raphson or the simplex algorithm). Under the threshold model of binary disease, we develop an Expectation Maximization (EM) algorithm to solve for the maximum likelihood estimates (MLEs). The new EM algorithm is developed by treating both the unobserved genotype and the disease liability as missing values. As a result, the EM iteration equations have the same form as the normal equation system in linear regression. The EM algorithm is further modified to take into account sexual dimorphism in the linkage maps. Applying the EM-implemented ML method to a four-way-cross mouse family, we detected two regions on the fourth chromosome that have evidence of QTLs controlling the segregation of fibrosarcoma, a form of connective tissue cancer. The two QTLs explain 50–60% of the variance in the disease liability. We also applied a Bayesian method previously developed (modified to take into account sex-specific maps) to this data set and detected one additional QTL on chromosome 13 that explains another 26% of the variance of the disease liability. All the QTLs detected primarily show dominance effects.


1980 ◽  
Vol 59 (9) ◽  
pp. 8
Author(s):  
D.E. Turnbull
