scholarly journals -Plot for Testing Spherical Symmetry for High-Dimensional Data with a Small Sample Size

2012 ◽  
Vol 2012 ◽  
pp. 1-18
Author(s):  
Jiajuan Liang

High-dimensional data with a small sample size, such as microarray data and image data, are commonly encountered in some practical problems for which many variables have to be measured but it is too costly or time consuming to repeat the measurements for many times. Analysis of this kind of data poses a great challenge for statisticians. In this paper, we develop a new graphical method for testing spherical symmetry that is especially suitable for high-dimensional data with small sample size. The new graphical method associated with the local acceptance regions can provide a quick visual perception on the assumption of spherical symmetry. The performance of the new graphical method is demonstrated by a Monte Carlo study and illustrated by a real data set.

Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Jing Zhang ◽  
Guang Lu ◽  
Jiaquan Li ◽  
Chuanwen Li

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks due to high dimensionality and small sample size of microarray data. Feature selection is necessary in the process of constructing the model to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough set is proposed. Pathway knowledge is used to select feature subsets, and rough set based on intersection neighborhood is then used to select important feature in each subset, since it can select features without redundancy and deals with numerical features directly. In order to improve the diversity among base classifiers and the efficiency of classification, it is necessary to select part of base classifiers. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster will be selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.


2021 ◽  
Author(s):  
Xin Chen ◽  
Qingrun Zhang ◽  
Thierry Chekouo

Abstract Background: DNA methylations in critical regions are highly involved in cancer pathogenesis and drug response. However, to identify causal methylations out of a large number of potential polymorphic DNA methylation sites is challenging. This high-dimensional data brings two obstacles: first, many established statistical models are not scalable to so many features; second, multiple-test and overfitting become serious. To this end, a method to quickly filter candidate sites to narrow down targets for downstream analyses is urgently needed. Methods: BACkPAy is a pre-screening Bayesian approach to detect biological meaningful clusters of potential differential methylation levels with small sample size. BACkPAy prioritizes potentially important biomarkers by the Bayesian false discovery rate (FDR) approach. It filters non-informative sites (i.e. non-differential) with flat methylation pattern levels accross experimental conditions. In this work, we applied BACkPAy to a genome-wide methylation dataset with 3 tissue types and each type contains 3 gastric cancer samples. We also applied LIMMA (Linear Models for Microarray and RNA-Seq Data) to compare its results with what we achieved by BACkPAy. Then, Cox proportional hazards regression models were utilized to visualize prognostics significant markers with The Cancer Genome Atlas (TCGA) data for survival analysis. Results: Using BACkPAy, we identified 8 biological meaningful clusters/groups of differential probes from the DNA methylation dataset. Using TCGA data, we also identified five prognostic genes (i.e. predictive to the progression of gastric cancer) that contain some differential methylation probes, whereas no significant results was identified using the Benjamin-Hochberg FDR in LIMMA. Conclusions: We showed the importance of using BACkPAy for the analysis of DNA methylation data with extremely small sample size in gastric cancer. We revealed that RDH13, CLDN11, TMTC1, UCHL1 and FOXP2 can serve as predictive biomarkers for gastric cancer treatment and the promoter methylation level of these five genes in serum could have prognostic and diagnostic functions in gastric cancer patients.


Author(s):  
Carlos Eduardo Thomaz ◽  
Vagner do Amaral ◽  
Gilson Antonio Giraldi ◽  
Edson Caoru Kitani ◽  
João Ricardo Sato ◽  
...  

This chapter describes a multi-linear discriminant method of constructing and quantifying statistically significant changes on human identity photographs. The approach is based on a general multivariate two-stage linear framework that addresses the small sample size problem in high-dimensional spaces. Starting with a 2D data set of frontal face images, the authors determine a most characteristic direction of change by organizing the data according to the patterns of interest. These experiments on publicly available face image sets show that the multi-linear approach does produce visually plausible results for gender, facial expression and aging facial changes in a simple and efficient way. The authors believe that such approach could be widely applied for modeling and reconstruction in face recognition and possibly in identifying subjects after a lapse of time.


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Zhihua Wang ◽  
Yongbo Zhang ◽  
Huimin Fu

Reasonable prediction makes significant practical sense to stochastic and unstable time series analysis with small or limited sample size. Motivated by the rolling idea in grey theory and the practical relevance of very short-term forecasting or 1-step-ahead prediction, a novel autoregressive (AR) prediction approach with rolling mechanism is proposed. In the modeling procedure, a new developed AR equation, which can be used to model nonstationary time series, is constructed in each prediction step. Meanwhile, the data window, for the next step ahead forecasting, rolls on by adding the most recent derived prediction result while deleting the first value of the former used sample data set. This rolling mechanism is an efficient technique for its advantages of improved forecasting accuracy, applicability in the case of limited and unstable data situations, and requirement of little computational effort. The general performance, influence of sample size, nonlinearity dynamic mechanism, and significance of the observed trends, as well as innovation variance, are illustrated and verified with Monte Carlo simulations. The proposed methodology is then applied to several practical data sets, including multiple building settlement sequences and two economic series.


2016 ◽  
Vol 6 (3) ◽  
pp. 365-374 ◽  
Author(s):  
Yuanjie Zhi ◽  
Dongmei Fu ◽  
Hanling Wang

Purpose The purpose of this paper is to present a new model which combines the non-equidistant GM(1,1) model with GCHM_WBO (generalized contra-harmonic mean (GCHM); weakening buffer operator (WBO)). The authors use the model to solve the deadlock that for a large number of non-equidistant corrosion rate, it is difficult to establish a reasonable prediction model and improve the prediction accuracy. Design/methodology/approach This research consists of three parts: non-equidistant GM(1,1) model, GCHM_WBO operator, and the optimization of morphing parameter (contained in GCHM, control the intensity of the weakening operator). The methodology is explained as follows. First, the authors built a non-equidistant GM(1,1) model with GCHM_WBO weakened data, of which morphing parameter was randomly selected. Next, the authors calculated the error between prediction data of model and the real data, and adjusted the morphing parameter according to the error and property of GCHM. Then, the authors generated a new non-equidistant GM(1,1) based on new morphing parameter, and repeated the previous step until the termination condition was satisfied. Finally, the model with appropriate morphing parameter was used to implement the prediction of new data. Findings This paper finds a property of GCHM, which is a monotonic increasing function of morphing parameter in some specific conditions. Based on the property and the fixed point axiom of WBO, an algorithm was designed to search an appropriate morphing parameter. The appropriate morphing parameter was implemented for the purpose of improving the accuracy of the model. The model was applied to predict the corrosion rate of six steels at Guangzhou experimental station. The results showed that the proposed method can get more accuracy in prediction capability compared to the models with the original data and AWBO weakened data. The method is applicable to long-term forecasts in case of data scarcity. Practical implications Corrosion will cause huge economic loss to a country; therefore, it is important to judge the remaining useful life of a material or equipment; the foundation for judgement of which is the prediction of material corrosion rate. However, the prediction of corrosion rate is very difficult because of corrosion data’s features, such as small sample size, non-equidistant, etc. The proposed method can be used to implement long-term forecast of corrosion data with only one sample and non-equidistant samples. Originality/value This paper presented a model which combines the non-equidistant GM(1,1) model with GCHM_WBO to handle the problem of long-term forecasting of corrosion data. In the modelling process, the proposed morphing parameter searched through algorithm can improve the prediction accuracy of the model. Therefore, the model can provide effective and reliable result when data are of a small sample size and non-equidistant.


2021 ◽  
Vol 12 ◽  
Author(s):  
Xin Chen ◽  
Qingrun Zhang ◽  
Thierry Chekouo

DNA methylations in critical regions are highly involved in cancer pathogenesis and drug response. However, to identify causal methylations out of a large number of potential polymorphic DNA methylation sites is challenging. This high-dimensional data brings two obstacles: first, many established statistical models are not scalable to so many features; second, multiple-test and overfitting become serious. To this end, a method to quickly filter candidate sites to narrow down targets for downstream analyses is urgently needed. BACkPAy is a pre-screening Bayesian approach to detect biological meaningful patterns of potential differential methylation levels with small sample size. BACkPAy prioritizes potentially important biomarkers by the Bayesian false discovery rate (FDR) approach. It filters non-informative sites (i.e., non-differential) with flat methylation pattern levels across experimental conditions. In this work, we applied BACkPAy to a genome-wide methylation dataset with three tissue types and each type contains three gastric cancer samples. We also applied LIMMA (Linear Models for Microarray and RNA-Seq Data) to compare its results with what we achieved by BACkPAy. Then, Cox proportional hazards regression models were utilized to visualize prognostics significant markers with The Cancer Genome Atlas (TCGA) data for survival analysis. Using BACkPAy, we identified eight biological meaningful patterns/groups of differential probes from the DNA methylation dataset. Using TCGA data, we also identified five prognostic genes (i.e., predictive to the progression of gastric cancer) that contain some differential methylation probes, whereas no significant results was identified using the Benjamin-Hochberg FDR in LIMMA. We showed the importance of using BACkPAy for the analysis of DNA methylation data with extremely small sample size in gastric cancer. We revealed that RDH13, CLDN11, TMTC1, UCHL1, and FOXP2 can serve as predictive biomarkers for gastric cancer treatment and the promoter methylation level of these five genes in serum could have prognostic and diagnostic functions in gastric cancer patients.


Sign in / Sign up

Export Citation Format

Share Document