ordinalgmifs: An R Package for Ordinal Regression in High-dimensional Data Settings

Cancer Informatics ◽

10.4137/cin.s20806 ◽

2014 ◽

Vol 13 ◽

pp. CIN.S20806 ◽

Cited By ~ 12

Author(s):

Kellie J. Archer ◽

Jiayi Hou ◽

Qing Zhou ◽

Kyle Ferber ◽

John G. Layne ◽

...

Keyword(s):

R Package ◽

High Dimensional ◽

Tissue Samples ◽

Response Models ◽

Modeling Methods ◽

Ordinal Response ◽

Molecular Features ◽

Genomic Assays ◽

Small Set ◽

Response Modeling

High-throughput genomic assays are performed using tissue samples with the goal of classifying the samples as normal < pre-malignant < malignant or by stage of cancer using a small set of molecular features. In such cases, molecular features monotonically associated with the ordinal response may be important to disease development; that is, an increase in the phenotypic level (stage of cancer) may be mechanistically linked through a monotonic association with gene expression or methylation levels. Though traditional ordinal response modeling methods exist, they assume independence among the predictor variables and require the number of samples (n) to exceed the number of covariates (P) included in the model. In this paper, we describe our ordinalgmifs R package, available from the Comprehensive R Archive Network, which can fit a variety of ordinal response models when the number of predictors (P) exceeds the sample size (n). R code illustrating usage is also provided.
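As a rough illustration of what an ordinal response model computes, the sketch below evaluates class probabilities under a cumulative logit (proportional odds) model, which is one common form of ordinal model. It is written in Python rather than R and does not use or reproduce the ordinalgmifs package; the function name `cumulative_logit_probs`, the thresholds, and the coefficients are hypothetical values chosen only for illustration.

```python
import math

def cumulative_logit_probs(x, beta, thresholds):
    """Class probabilities under a cumulative logit (proportional odds) model.

    Assumes P(Y <= j | x) = 1 / (1 + exp(-(alpha_j - x . beta))) with
    increasing thresholds alpha_1 < ... < alpha_{K-1} for K ordered classes.
    """
    # Linear predictor shared by all classes (the proportional odds assumption).
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # Cumulative probabilities for classes 1..K-1, then 1.0 for class K.
    cum = [1.0 / (1.0 + math.exp(-(a - eta))) for a in thresholds]
    cum.append(1.0)
    # Differences of consecutive cumulative probabilities give class probabilities.
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical example: two features, three ordered classes
# (e.g. normal < pre-malignant < malignant). A zero coefficient
# means the second feature does not contribute to the prediction.
probs = cumulative_logit_probs(x=[1.2, -0.4], beta=[0.8, 0.0], thresholds=[-0.5, 1.0])
print(probs)
```

With this sign convention, a larger linear predictor shifts probability toward the higher ordered classes. Penalized fitting for the P > n setting is not shown here.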


Bayesian variable selection for high-dimensional data with an ordinal response: identifying genes associated with prognostic risk group in acute myeloid leukemia

BMC Bioinformatics ◽

10.1186/s12859-021-04432-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yiran Zhang ◽

Kellie J. Archer

Keyword(s):

Myeloid Leukemia ◽

Risk Group ◽

Risk Groups ◽

High Dimensional ◽

Response Models ◽

Ordinal Response ◽

Expression Of Genes ◽

Double Exponential ◽

Selection For ◽

Acute Myeloid

Background: Acute myeloid leukemia (AML) is a heterogeneous cancer of the blood, though specific recurring cytogenetic abnormalities in AML are strongly associated with attaining complete response after induction chemotherapy, remission duration, and survival. Therefore recurring cytogenetic abnormalities have been used to segregate patients into favorable, intermediate, and adverse prognostic risk groups. However, it is unclear how expression of genes is associated with these prognostic risk groups. We postulate that expression of genes monotonically associated with these prognostic risk groups may yield important insights into leukemogenesis. Therefore, in this paper we propose penalized Bayesian ordinal response models to predict prognostic risk group using gene expression data. We consider a double exponential prior, a spike-and-slab normal prior, a spike-and-slab double exponential prior, and a regression-based approach with variable inclusion indicators for modeling our high-dimensional ordinal response, prognostic risk group, and identify genes through hypothesis tests using Bayes factor. Results: Gene expression was ascertained using Affymetrix HG-U133Plus2.0 GeneChips for 97 favorable, 259 intermediate, and 97 adverse risk AML patients. When applying our penalized Bayesian ordinal response models, genes identified for model inclusion were consistent among the four different models. Additionally, the genes included in the models were biologically plausible, as most have been previously associated with either AML or other types of cancer. Conclusion: These findings demonstrate that our proposed penalized Bayesian ordinal response models are useful for performing variable selection for high-dimensional genomic data and have the potential to identify genes relevantly associated with an ordinal phenotype.


Bayesian Variable Selection For High-Dimensional Data With An Ordinal Response: Application Predicting Prognostic Risk Group In Acute Myeloid Leukemia

10.21203/rs.3.rs-581629/v1 ◽

2021 ◽

Author(s):

Yiran Zhang ◽

Kellie J. Archer

Keyword(s):

Myeloid Leukemia ◽

Risk Group ◽

Risk Groups ◽

High Dimensional ◽

Response Models ◽

Ordinal Response ◽

Expression Of Genes ◽

Double Exponential ◽

Selection For ◽

Acute Myeloid

Background: Acute myeloid leukemia (AML) is a heterogeneous cancer of the blood, though specific recurring cytogenetic abnormalities in AML are strongly associated with attaining complete response after induction chemotherapy, remission duration, and survival. Therefore recurring cytogenetic abnormalities have been used to segregate patients into favorable, intermediate, and adverse prognostic risk groups. However, it is unclear how expression of genes is associated with these prognostic risk groups. We postulate that expression of genes monotonically associated with these prognostic risk groups may yield important insights into leukemogenesis. Therefore, in this paper we propose penalized Bayesian ordinal response models to predict prognostic risk group using gene expression data. We consider a double exponential prior, a spike-and-slab normal prior, a spike-and-slab double exponential prior, and a regression-based approach with variable inclusion indicators for modeling our high-dimensional ordinal response, prognostic risk group, and identify genes through hypothesis tests using Bayes factor. Results: Gene expression was ascertained using Affymetrix HG-U133Plus2.0 GeneChips for 97 favorable, 259 intermediate, and 97 adverse risk AML patients. When applying our penalized Bayesian ordinal response models, genes identified for model inclusion were consistent among the four different models. Additionally, the genes included in the models were biologically plausible, as most have been previously associated with either AML or other types of cancer. Conclusion: These findings demonstrate that our proposed penalized Bayesian ordinal response models are useful for performing variable selection for high-dimensional genomic data and have the potential to identify genes relevantly associated with an ordinal phenotype.


Identification of cell-type-specific marker genes from co-expression patterns in tissue samples

Bioinformatics ◽

10.1093/bioinformatics/btab257 ◽

2021 ◽

Author(s):

Yixuan Qiu ◽

Jiebiao Wang ◽

Jing Lei ◽

Kathryn Roeder

Keyword(s):

Single Cell ◽

Expression Patterns ◽

R Package ◽

Supplementary Information ◽

Marker Genes ◽

Specific Marker ◽

Cell Type ◽

Correlation Pattern ◽

Tissue Samples ◽

Bulk Data

Motivation: Marker genes, defined as genes that are expressed primarily in a single cell type, can be identified from the single cell transcriptome; however, such data are not always available for the many uses of marker genes, such as deconvolution of bulk tissue. Marker genes for a cell type, however, are highly correlated in bulk data, because their expression levels depend primarily on the proportion of that cell type in the samples. Therefore, when many tissue samples are analyzed, it is possible to identify these marker genes from the correlation pattern. Results: To capitalize on this pattern, we develop a new algorithm to detect marker genes by combining published information about likely marker genes with bulk transcriptome data in the form of a semi-supervised algorithm. The algorithm then exploits the correlation structure of the bulk data to refine the published marker genes by adding or removing genes from the list. Availability and implementation: We implement this method as an R package markerpen, hosted on CRAN (https://CRAN.R-project.org/package=markerpen). Supplementary information: Supplementary data are available at Bioinformatics online.


An R Package for Divergence Analysis of Omics Data

10.1101/720391 ◽

2019 ◽

Author(s):

Wikum Dinalankara ◽

Qian Ke ◽

Donald Geman ◽

Luigi Marchionni

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

The Cancer Genome Atlas ◽

High Dimensional ◽

Omics Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Ternary Code ◽

Cancer Genome Atlas ◽

Level Analysis

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high-throughput sequencing data from the Cancer Genome Atlas.


R/qtlcharts: interactive graphics for quantitative trait locus mapping

10.1101/011437 ◽

2014 ◽

Cited By ~ 1

Author(s):

Karl W Broman

Keyword(s):

Quantitative Trait Locus ◽

Quantitative Trait Locus Mapping ◽

Quantitative Trait ◽

Quantitative Traits ◽

R Package ◽

High Dimensional ◽

Interactive Graphics ◽

Phenotype Data ◽

Trait Locus ◽

Locus Mapping

Every data visualization can be improved with some level of interactivity. Interactive graphics hold particular promise for the exploration of high-dimensional data. R/qtlcharts is an R package to create interactive graphics for experiments to map quantitative trait loci (QTL; genetic loci that influence quantitative traits). R/qtlcharts serves as a companion to the R/qtl package, providing interactive versions of R/qtl's static graphs, as well as additional interactive graphs for the exploration of high-dimensional genotype and phenotype data.


An R package for divergence analysis of omics data

PLoS ONE ◽

10.1371/journal.pone.0249002 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0249002

Author(s):

Wikum Dinalankara ◽

Qian Ke ◽

Donald Geman ◽

Luigi Marchionni

Keyword(s):

R Package ◽

The Cancer Genome Atlas ◽

High Dimensional ◽

Omics Data ◽

Ternary Code ◽

Cancer Genome Atlas ◽

Level Analysis ◽

Data Analysis Methods ◽

Genome Atlas ◽

Omics Data Analysis

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.
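The digitization step described in this abstract can be sketched in a simplified, univariate form. The Python example below assumes the baseline range for each feature is a quantile interval of the baseline samples; that choice, the 5%/95% cut-offs, and the helper names `quantile`, `baseline_intervals`, and `ternary_code` are illustrative assumptions and are not taken from the divergence package, whose actual estimation procedure may differ.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    below = int(pos)
    above = min(below + 1, len(sorted_vals) - 1)
    frac = pos - below
    return sorted_vals[below] * (1 - frac) + sorted_vals[above] * frac

def baseline_intervals(baseline, lower_q=0.05, upper_q=0.95):
    """Per-feature (low, high) interval from baseline samples (rows = samples)."""
    n_features = len(baseline[0])
    intervals = []
    for j in range(n_features):
        col = sorted(row[j] for row in baseline)
        intervals.append((quantile(col, lower_q), quantile(col, upper_q)))
    return intervals

def ternary_code(profile, intervals):
    """Code each entry as -1 (below baseline), 0 (within), or +1 (above)."""
    return [
        -1 if value < low else (1 if value > high else 0)
        for value, (low, high) in zip(profile, intervals)
    ]

# Hypothetical baseline population: four samples, two features.
baseline = [[1.0, 5.0], [1.2, 5.5], [0.9, 4.8], [1.1, 5.2]]
intervals = baseline_intervals(baseline)

# A new profile whose first feature is elevated relative to baseline.
code = ternary_code([2.0, 5.1], intervals)
print(code)
```

A binary code would follow by collapsing -1 and +1 into a single "divergent" value.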


An Explained Variation Measure for Ordinal Response Models With Comparisons to Other Ordinal R² Measures

Sociological Methods & Research ◽

10.1177/0049124106286329 ◽

2006 ◽

Vol 34 (4) ◽

pp. 469-520 ◽

Cited By ~ 22

Author(s):

Michael G. Lacy

Keyword(s):

Response Models ◽

Ordinal Response ◽

Explained Variation


Modified Mixed Generalized Ordered Response Model to Handle Misclassification in Injury Severity Data

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198118796352 ◽

2018 ◽

Vol 2672 (30) ◽

pp. 53-63 ◽

Cited By ~ 4

Author(s):

Lacramioara Balan ◽

Rajesh Paleti

Keyword(s):

Injury Severity ◽

Random Variable ◽

Parameter Estimates ◽

Discrete Random Variable ◽

Response Models ◽

Ordinal Response ◽

Proposed Model ◽

Misclassification Errors ◽

Misclassification Rates ◽

Ordered Response

Traditional crash databases that record police-reported injury severity data are prone to misclassification errors. Ignoring these errors in discrete ordered response models used for analyzing injury severity can lead to biased and inconsistent parameter estimates. In this study, a mixed generalized ordered response (MGOR) model that quantifies misclassification rates in the injury severity variable and adjusts the bias in parameter estimates associated with misclassification was developed. The proposed model does this by considering the observed injury severity outcome as a realization from a discrete random variable that depends on true latent injury severity that is unobservable to the analyst. The model was used to analyze misclassification rates in police-reported injury severity in the 2014 General Estimates System (GES) data. The model found that only 68.23% of possible injuries and 62.75% of non-incapacitating injuries were correctly recorded in the GES data. Moreover, comparative analysis showed that the MGOR model that ignores misclassification has not only a lower data fit but also considerable bias in both the parameter and elasticity estimates. The model developed in this study can be used to analyze misclassification errors in ordinal response variables in other empirical contexts.
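The idea of treating the observed severity as a realization that depends on a latent true severity can be illustrated with a small calculation. The Python sketch below mixes a distribution over true severity levels through a misclassification matrix to obtain the distribution of recorded levels. It is not the MGOR model itself; the function name `observed_distribution` and all probabilities are hypothetical and unrelated to the rates reported in the abstract.

```python
def observed_distribution(true_probs, misclass):
    """P(observed = k) = sum_j P(true = j) * P(observed = k | true = j).

    misclass[j][k] is the probability that true level j is recorded as level k,
    so each row of misclass must sum to 1.
    """
    n_levels = len(true_probs)
    return [
        sum(true_probs[j] * misclass[j][k] for j in range(n_levels))
        for k in range(n_levels)
    ]

# Hypothetical probabilities for three ordered severity levels.
true_probs = [0.6, 0.3, 0.1]

# Hypothetical misclassification matrix; the diagonal entries are the
# probabilities that each true level is recorded correctly.
misclass = [
    [0.90, 0.10, 0.00],
    [0.20, 0.70, 0.10],
    [0.00, 0.15, 0.85],
]

observed = observed_distribution(true_probs, misclass)
print(observed)
```

In a full model the diagonal entries would be estimated jointly with the ordered response parameters rather than fixed in advance.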


A Tutorial on : R Package for the Linearized Bregman Algorithm in High-Dimensional Statistics

Handbook of Big Data Analytics - Springer Handbooks of Computational Statistics ◽

10.1007/978-3-319-18284-1_17 ◽

2018 ◽

pp. 425-453

Author(s):

Jiechao Xiong ◽

Feng Ruan ◽

Yuan Yao

Keyword(s):

R Package ◽

High Dimensional ◽

High Dimensional Statistics


An Optimized Semi-Supervised Learning Approach for High Dimensional Datasets

Advances in Bioinformatics and Biomedical Engineering - Applying Big Data Analytics in Bioinformatics and Medicine ◽

10.4018/978-1-5225-2607-0.ch012 ◽

2018 ◽

pp. 294-321

Author(s):

Nesma Settouti ◽

Mostafa El Habib Daho ◽

Mohammed El Amine Bechar ◽

Mohammed Amine Chikh

Keyword(s):

Supervised Learning ◽

High Dimensional ◽

Diagnostic Process ◽

Importance Measure ◽

Learning Approach ◽

Learning From Data ◽

Selection Strategies ◽

Medical Diagnostic ◽

Small Set ◽

High Dimensional Datasets

Semi-supervised learning is one of the most interesting fields for research developments in the machine learning domain beyond the scope of supervised learning from data. The medical diagnostic process works mostly in supervised mode, but in reality, we are in the presence of a large amount of unlabeled samples and a small set of labeled examples characterized by thousands of features. This problem is known under the term “the curse of dimensionality”. In this study, we propose, as a solution, a new approach in semi-supervised learning that we call Optim Co-forest. The Optim Co-forest algorithm combines the data re-sampling approach (Bagging; Breiman, 1996) with two selection strategies. The first one involves selecting a random subset of parameters to construct the ensemble of classifiers following the principle of Co-forest (Li & Zhou, 2007). The second strategy is an extension of the importance measure of Random Forest (RF; Breiman, 2001). Experiments on high dimensional datasets confirm the power of the adopted selection strategies in the scalability of our method.
