Selecting Genes by Test Statistics

Dechang Chen; Zhenqiu Liu; Xiaobin Ma; Dong Hua

doi:10.1155/jbb.2005.132

Journal of Biomedicine and Biotechnology ◽

10.1155/jbb.2005.132 ◽

2005 ◽

Vol 2005 (2) ◽

pp. 132-138 ◽

Cited By ~ 29

Author(s):

Dechang Chen ◽

Zhenqiu Liu ◽

Xiaobin Ma ◽

Dong Hua

Keyword(s):

Gene Selection ◽

Expression Data ◽

Test Statistics ◽

Test Statistic ◽

F Test ◽

Class Prediction ◽

Welch Test ◽

Equal Variance ◽

Alternative Test ◽

Microarray Datasets

Gene selection is an important issue in analyzing multiclass microarray data. Among the many proposed selection methods, the traditional ANOVA F test statistic has been employed to identify informative genes for both class prediction (classification) and discovery problems. However, the F test statistic assumes equal variances across classes, an assumption that may not be realistic for gene expression data. This paper explores alternative test statistics that can handle heterogeneity of the variances. We study five such test statistics, including the Brown-Forsythe and Welch test statistics. Their performance is evaluated and compared with that of the F statistic over different classification methods applied to publicly available microarray datasets.
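The contrast between the equal-variance F statistic and a heteroscedasticity-aware alternative can be illustrated with a minimal sketch. It assumes an expression matrix with genes in rows and samples in columns, and uses the textbook forms of the one-way ANOVA F statistic and Welch's statistic; the function name `f_and_welch_statistics` and the data layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def f_and_welch_statistics(X, labels):
    """Per-gene ANOVA F and Welch statistics.

    X: array of shape (n_genes, n_samples); labels: class label per sample.
    Assumes every class has at least two samples and nonzero variance per gene.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k = len(classes)
    n_total = X.shape[1]

    # Per-class sample sizes, means, and unbiased variances (genes x classes).
    n = np.array([np.sum(labels == c) for c in classes], dtype=float)
    means = np.stack([X[:, labels == c].mean(axis=1) for c in classes], axis=1)
    variances = np.stack(
        [X[:, labels == c].var(axis=1, ddof=1) for c in classes], axis=1
    )

    # Classical F: between-class mean square over pooled within-class mean square.
    grand_mean = X.mean(axis=1, keepdims=True)
    ss_between = (n * (means - grand_mean) ** 2).sum(axis=1)
    ss_within = ((n - 1) * variances).sum(axis=1)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))

    # Welch: each class is weighted by n_i / s_i^2, so no pooled variance is used.
    w = n / variances
    w_sum = w.sum(axis=1, keepdims=True)
    weighted_mean = (w * means).sum(axis=1, keepdims=True) / w_sum
    numerator = (w * (means - weighted_mean) ** 2).sum(axis=1) / (k - 1)
    lam = ((1 - w / w_sum) ** 2 / (n - 1)).sum(axis=1)
    welch_stat = numerator / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    return f_stat, welch_stat
```

Genes could then be ranked by either statistic, for example with `np.argsort(-welch_stat)`, and the top-ranked genes passed to a classifier.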

Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data

10.20944/preprints202009.0699.v1 ◽

2020 ◽

Author(s):

Samarendra Das ◽

Shesh N. Rai

Keyword(s):

Gene Expression ◽

Statistical Approach ◽

Gene Selection ◽

Statistical Significance ◽

High Dimensional ◽

Support Vector ◽

Expression Data ◽

Test Statistic ◽

Biologically Relevant ◽

Selection Of

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either a relevancy or a redundancy measure, and are usually judged through post-selection classification accuracy. In these methods, genes are ranked on a single high-dimensional expression dataset, which leads to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a Support Vector Machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes are selected through statistical significance values computed using a non-parametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competing existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.

Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data

Entropy ◽

10.3390/e22111205 ◽

2020 ◽

Vol 22 (11) ◽

pp. 1205

Author(s):

Samarendra Das ◽

Shesh N. Rai

Keyword(s):

Gene Expression ◽

Statistical Approach ◽

Gene Selection ◽

Statistical Significance ◽

High Dimensional ◽

Support Vector ◽

Expression Data ◽

Test Statistic ◽

Biologically Relevant ◽

Selection Of

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either a relevancy or a redundancy measure, and are usually judged through post-selection classification accuracy. In these methods, genes were ranked on a single high-dimensional expression dataset, which led to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes were selected through statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, biologically relevant criteria based on quantitative trait loci, and gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competing existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.

Effects of kinship correction on inflation of genetic interaction statistics in commonly used mouse populations

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab131 ◽

2021 ◽

Author(s):

Anna L Tyler ◽

Baha El Kassaby ◽

Georgi Kolishovski ◽

Jake Emerson ◽

Ann E Wells ◽

...

Keyword(s):

Mixed Model ◽

Linear Mixed Model ◽

Genetic Interaction ◽

Recombinant Inbred Lines ◽

Test Statistics ◽

Test Statistic ◽

Kinship Matrix ◽

Main Effect ◽

Main Effects ◽

Interaction Test

It is well understood that variation in relatedness among individuals, or kinship, can lead to false genetic associations. Multiple methods have been developed to adjust for kinship while maintaining power to detect true associations. However, the effects of kinship on genetic interaction test statistics are relatively unstudied. Here we performed a survey of kinship effects on studies of six commonly used mouse populations. We measured inflation of main effect test statistics, genetic interaction test statistics, and interaction test statistics reparametrized by the Combined Analysis of Pleiotropy and Epistasis (CAPE). We also performed linear mixed model (LMM) kinship corrections using two types of kinship matrix: an overall kinship matrix calculated from the full set of genotyped markers, and a reduced kinship matrix, which left out markers on the chromosome(s) being tested. We found that test statistic inflation varied across populations and was driven largely by linkage disequilibrium. In contrast, there was no observable inflation in the genetic interaction test statistics. CAPE statistics were inflated at a level in between that of the main effects and the interaction effects. The overall kinship matrix overcorrected the inflation of main effect statistics relative to the reduced kinship matrix. The two types of kinship matrices had similar effects on the interaction statistics and CAPE statistics, although the overall kinship matrix trended toward a more severe correction. In conclusion, we recommend using an LMM kinship correction for both main effects and genetic interactions and further recommend that the kinship matrix be calculated from a reduced set of markers in which the chromosomes being tested are omitted from the calculation. This is particularly important in populations with substantial population structure, such as recombinant inbred lines in which genomic replicates are used.

Gene Selection and Classification in Microarray Datasets using a Hybrid Approach of PCC-BPSO/GA with Multi Classifiers

Journal of Computer Science ◽

10.3844/jcssp.2018.868.880 ◽

2018 ◽

Vol 14 (6) ◽

pp. 868-880 ◽

Cited By ~ 3

Author(s):

Shilan S. Hameed ◽

Fahmi F. Muhammad ◽

Rohayanti Hassan ◽

Faisal Saeed

Keyword(s):

Gene Selection ◽

Hybrid Approach ◽

Microarray Datasets

A test for fuzzy exponentiality based on Kullback-Leibler information

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202555 ◽

2021 ◽

pp. 1-8

Author(s):

Lingtao Kong

Keyword(s):

Biological Sciences ◽

Monte Carlo ◽

Goodness Of Fit ◽

Experimental Studies ◽

Real Data ◽

Test Statistics ◽

Test Statistic ◽

Goodness Of Fit Test ◽

Higher Power ◽

Leibler Information

The exponential distribution has been widely used in engineering and in the social and biological sciences. In this paper, we propose a new goodness-of-fit test for fuzzy exponentiality using the α-pessimistic value. The test statistic is based on Kullback-Leibler information. Using the Monte Carlo method, we obtain the empirical critical points of the test statistic at four different significance levels. To evaluate the performance of the proposed test, we compare it with four commonly used tests through simulations. Experimental studies show that the proposed test has higher power than the other tests in most cases. In particular, for the uniform and linear failure rate alternatives, our method has the best performance. A real data example is investigated to show the application of our test.
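The Monte Carlo step for obtaining empirical critical points can be sketched as follows. This is an illustration for ordinary (crisp) data only: it substitutes a standard Kullback-Leibler statistic built on Vasicek's spacing-based entropy estimator for the paper's fuzzy, α-pessimistic statistic, which the abstract does not specify. The function names, the window size `m`, and the significance levels are assumptions chosen for the example.

```python
import numpy as np

def kl_exponentiality_statistic(x, m):
    """KL-type statistic for exponentiality on crisp data (stand-in statistic).

    Uses Vasicek's entropy estimator with window m; larger values indicate
    stronger departure from the exponential distribution.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(n)
    # Order statistics m positions away, clamped at the sample extremes.
    upper = x[np.minimum(i + m, n - 1)]
    lower = x[np.maximum(i - m, 0)]
    entropy = np.mean(np.log(n / (2.0 * m) * (upper - lower)))
    return -entropy + np.log(x.mean()) + 1.0

def empirical_critical_points(n, m, levels, n_sim=5000, seed=0):
    """Simulate the null distribution and return upper-tail critical points."""
    rng = np.random.default_rng(seed)
    # The statistic is scale invariant, so a unit-scale exponential suffices.
    samples = rng.exponential(scale=1.0, size=(n_sim, n))
    stats = np.array([kl_exponentiality_statistic(row, m) for row in samples])
    return {alpha: float(np.quantile(stats, 1.0 - alpha)) for alpha in levels}
```

A call such as `empirical_critical_points(n=20, m=3, levels=(0.10, 0.05, 0.01))` returns one critical point per level; an observed statistic above the critical point would lead to rejecting exponentiality at that level.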

Gene Selection in Cancer Classification Using Sparse Logistic Regression with L1/2 Regularization

Applied Sciences ◽

10.3390/app8091569 ◽

2018 ◽

Vol 8 (9) ◽

pp. 1569 ◽

Cited By ~ 3

Author(s):

Shengbing Wu ◽

Hongkun Jiang ◽

Haiwei Shen ◽

Ziyi Yang

Keyword(s):

Logistic Regression ◽

Gene Selection ◽

Classification Performance ◽

Cancer Classification ◽

Sparse Logistic Regression ◽

The Subject ◽

Selection For ◽

Microarray Datasets ◽

Sparse Methods

In recent years, gene selection for cancer classification based on the expression of a small number of gene biomarkers has been the subject of much research in genetics and molecular biology. The successful identification of gene biomarkers will help in the classification of different types of cancer and improve prediction accuracy. Recently, logistic regression with L1 regularization has been successfully applied in high-dimensional cancer classification to estimate gene coefficients and perform gene selection simultaneously. However, L1 regularization yields biased gene selection and does not have the oracle property. To address these problems, we investigate L1/2 regularized logistic regression for gene selection in cancer classification. Experimental results on three DNA microarray datasets demonstrate that our proposed method outperforms other commonly used sparse methods (L1 and L_EN) in terms of classification performance.
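The difference between the L1 and L1/2 penalties can be made concrete with a minimal sketch of the penalized objective. It assumes the usual form, the mean logistic negative log-likelihood plus lambda times the sum of |beta_j|^q, with q = 1 for L1 and q = 1/2 for L1/2; the function name, the omission of an intercept, and the absence of any solver are simplifications for illustration, not the authors' algorithm.

```python
import numpy as np

def penalized_logistic_loss(beta, X, y, lam, q=0.5):
    """Mean logistic negative log-likelihood plus an L_q penalty.

    X: (n_samples, n_genes), y: 0/1 labels, beta: (n_genes,) coefficients.
    q=1.0 gives the L1 (lasso) penalty; q=0.5 gives the L1/2 penalty.
    No intercept term is included in this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    z = X @ beta
    # np.logaddexp(0, z) is a numerically stable log(1 + exp(z)).
    nll = np.mean(np.logaddexp(0.0, z) - y * z)
    penalty = lam * np.sum(np.abs(beta) ** q)
    return nll + penalty
```

For a coefficient of magnitude 4, the L1 penalty contributes 4·lambda while the L1/2 penalty contributes only 2·lambda, which illustrates why large coefficients are shrunk less under L1/2 while small ones are still pushed toward zero.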

Gene selection for cancer clustering analysis based on expression data

2015 4th International Conference on Computer Science and Network Technology (ICCSNT) ◽

10.1109/iccsnt.2015.7490801 ◽

2015 ◽

Author(s):

Taosheng Xu ◽

Ning Su ◽

Rujing Wang ◽

Liangtu Song

Keyword(s):

Clustering Analysis ◽

Gene Selection ◽

Expression Data ◽

Selection For

An approach to gene-based testing accounting for dependence of tests among nearby genes

10.1101/2021.05.24.445494 ◽

2021 ◽

Author(s):

Ronald J Yurko ◽

Kathryn Roeder ◽

Bernie Devlin ◽

Max G'Sell

Keyword(s):

Multiple Testing ◽

Association Studies ◽

Autism Spectrum ◽

P Value ◽

Genome Wide Association Studies ◽

Strongly Correlated ◽

Test Statistics ◽

Test Statistic ◽

Genome Wide ◽

Insight Into

In genome-wide association studies (GWAS), it has become commonplace to test millions of SNPs for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for the linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene's test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into the SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of a weakly powered GWAS for autism spectrum disorder, which is contrasted with more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive p-value thresholding (AdaPT), guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably, our workflow is based on summary statistics.

Factors affecting the accuracy of a class prediction model in gene expression data

BMC Bioinformatics ◽

10.1186/s12859-015-0610-4 ◽

2015 ◽

Vol 16 (1) ◽

Cited By ~ 10

Author(s):

Putri W. Novianti ◽

Victor L. Jong ◽

Kit C. B. Roes ◽

Marinus J. C. Eijkemans

Keyword(s):

Gene Expression ◽

Prediction Model ◽

Gene Expression Data ◽

Expression Data ◽

Class Prediction ◽

Factors Affecting

Testing for equality of means with equal and unequal variances

Scientia Africana ◽

10.4314/sa.v20i2.5 ◽

2021 ◽

Vol 20 (2) ◽

pp. 51-60

Author(s):

A.O. Abidoye ◽

W.A. Lamidi ◽

M.O. Alabi ◽

J. Popoola

Keyword(s):

Agricultural Development ◽

Secondary Data ◽

Development Project ◽

T Test ◽

Harmonic Mean ◽

Test Statistics ◽

Test Statistic ◽

Unequal Variances ◽

Equality Of Means ◽

Pooled Sample

In this paper, we compare the conventional t-test with a proposed t-test for testing equality of means under equal and unequal variances. We propose the harmonic mean of the variances as an alternative to the pooled sample variance when the variances are heterogeneous. Two sets of secondary data were obtained from the Agricultural Development Project (KWADP) and the Ministry of Agriculture in Ilorin, Kwara State, to demonstrate the two test statistics. The results show that the proposed t-test statistic is more appropriate than the conventional t-test statistic when the variances are unequal, but the conventional t-test performs better when the variances are equal.
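The idea of swapping the pooled sample variance for a harmonic mean of the variances can be sketched as below. The abstract does not give the exact form of the proposed statistic, so `harmonic_variance_t` is one plausible reading (the harmonic mean of the two sample variances substituted directly for the pooled variance) and should be treated as an assumption rather than the authors' formula; `pooled_t` is the conventional two-sample t statistic.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Conventional two-sample t statistic with the pooled sample variance."""
    n1, n2 = len(a), len(b)
    s2_pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(s2_pooled * (1 / n1 + 1 / n2))

def harmonic_variance_t(a, b):
    """Hypothetical variant: harmonic mean of variances replaces the pooled one.

    Assumes both samples have nonzero variance.
    """
    n1, n2 = len(a), len(b)
    s2_harmonic = 2.0 / (1.0 / variance(a) + 1.0 / variance(b))
    return (mean(a) - mean(b)) / sqrt(s2_harmonic * (1 / n1 + 1 / n2))
```

When the two sample variances are equal, the harmonic mean equals the pooled variance and the two statistics coincide; they diverge as the variances become more unequal, because the harmonic mean is pulled toward the smaller variance.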
