AdaReg: Data Adaptive Robust Estimation in Linear Regression with Application in GTEx Gene Expressions

2019 ◽  
Author(s):  
Meng Wang ◽  
Lihua Jiang ◽  
Michael P. Snyder

Abstract With the development of high-throughput RNA sequencing (RNA-seq) technology, the Genotype-Tissue Expression (GTEx) project (Consortium et al., 2015) generated a valuable resource of gene expression data from more than 11,000 samples. This large-scale data set is a powerful resource for understanding the human transcriptome. However, technical variation, sequencing background noise and unknown factors make the statistical analysis challenging. To prevent outliers from distorting the estimate of the population distribution, we need a more robust estimation method, one that adapts to heterogeneous genes and further optimizes the estimate for each gene. We followed the approach of robust estimation based on the γ-density-power weight (Fujisawa and Eguchi, 2008; Windham, 1995), where γ is the exponent of the density weight and controls the balance between bias and variance. To the best of our knowledge, our work is the first to propose a procedure for tuning the parameter γ to balance the bias-variance trade-off under mixture distributions. We constructed a robust likelihood criterion based on weighted densities in a mixture model of a Gaussian population distribution contaminated by an unknown outlier distribution, and developed a data-adaptive γ-selection procedure embedded in the robust estimation. We provide a heuristic analysis of the selection criterion and find, in simulation studies under a series of settings, that our practical selection trend across various γ values captures the minimizing γ about as well as the inestimable mean squared error (MSE) trend. Our data-adaptive robustifying procedure for the linear regression problem (AdaReg) shows a significant advantage in both simulation studies and a real data application to heart samples from the GTEx project, compared to the fixed-γ procedure and other robust methods. The paper also discusses some limitations of this method and future work.
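The γ-density-power-weight idea is easy to picture for a one-dimensional Gaussian population: each observation is weighted by its fitted density raised to the power γ, which shrinks the influence of outliers, and the weighted moments are bias-corrected. The fixed-point iteration below is a minimal sketch with a fixed γ (the paper's contribution is selecting γ adaptively, which is not shown); the function name, starting values and test data are ours.

```python
import math
import random

def gamma_weighted_gaussian(x, gamma=0.5, iters=50):
    """Windham-style fixed-point iteration with gamma-density-power weights.

    Each point is weighted by the fitted Gaussian density raised to gamma,
    down-weighting outliers; the (1 + gamma) factor undoes the shrinkage
    that the weighting induces in the variance estimate.
    """
    mu = sorted(x)[len(x) // 2]   # robust start: the median
    var = 1.0                     # simple start; a MAD-based start is safer in general
    for _ in range(iters):
        w = [math.exp(-gamma * (xi - mu) ** 2 / (2 * var)) for xi in x]
        sw = sum(w)
        mu = sum(wi * xi for wi, xi in zip(w, x)) / sw
        var = (1 + gamma) * sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, x)) / sw
    return mu, var

# 90% N(0, 1) population plus 10% gross outliers near 8.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(180)] + [random.gauss(8, 0.5) for _ in range(20)]
mu_hat, var_hat = gamma_weighted_gaussian(data, gamma=0.5)
```

Despite 10% contamination, the weighted fit recovers roughly the inlier mean and variance, whereas the plain sample mean would be pulled toward the outliers.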



Mathematics ◽  
2021 ◽  
Vol 9 (15) ◽  
pp. 1815
Author(s):  
Diego I. Gallardo ◽  
Mário de Castro ◽  
Héctor W. Gómez

A cure rate model under the competing risks setup is proposed. For the number of competing causes related to the occurrence of the event of interest, we posit the one-parameter Bell distribution, which accommodates overdispersed counts. The model is parameterized in the cure rate, which is linked to covariates. Parameter estimation is based on the maximum likelihood method. Estimates are computed via the EM algorithm. In order to compare different models, a selection criterion for non-nested models is implemented. Results from simulation studies indicate that the estimation method and the model selection criterion perform well. A dataset on melanoma is analyzed using the proposed model as well as some models from the literature.
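As a concrete aside, the one-parameter Bell distribution mentioned above has pmf p(n; θ) = B_n θⁿ e^{-(e^θ - 1)}/n!, where B_n are the Bell numbers, and its variance always exceeds its mean, which is what accommodates overdispersion. A minimal numerical check (helper names and the truncation at 60 terms are our own choices):

```python
import math

def bell_numbers(nmax):
    # Bell triangle: B[n] counts the set partitions of n elements.
    B = [1]
    row = [1]
    for _ in range(nmax):
        new = [row[-1]]
        for v in row:
            new.append(new[-1] + v)
        row = new
        B.append(row[0])
    return B  # B[0..nmax]

def bell_pmf(n, theta, B):
    # p(n; theta) = B_n * theta**n * exp(-(e**theta - 1)) / n!
    return B[n] * theta ** n * math.exp(-(math.exp(theta) - 1)) / math.factorial(n)

theta, nmax = 0.7, 60               # truncation point is an assumption; tail is negligible here
B = bell_numbers(nmax)
probs = [bell_pmf(n, theta, B) for n in range(nmax)]
total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
var = sum((n - mean) ** 2 * p for n, p in enumerate(probs))
```

The probabilities sum to one (the Bell-number exponential generating function is e^{e^x - 1}), and the computed variance exceeds the mean, confirming overdispersion.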


2019 ◽  
Vol 8 (2) ◽  
pp. 231-263 ◽  
Author(s):  
Richard Valliant

Abstract Three approaches to estimation from nonprobability samples are quasi-randomization, superpopulation modeling, and doubly robust estimation. In the first, the sample is treated as if it were obtained via a probability mechanism, but unlike in probability sampling, that mechanism is unknown. Pseudo selection probabilities of being in the sample are estimated by using the sample in combination with some external data set that covers the desired population. In the superpopulation approach, observed values of analysis variables are treated as if they had been generated by some model. The model is estimated from the sample and, along with external population control data, is used to project the sample to the population. The specific techniques are the same or similar to ones commonly employed for estimation from probability samples and include binary regression, regression trees, and calibration. When quasi-randomization and superpopulation modeling are combined, this is referred to as doubly robust estimation. This article reviews some of the estimation options and compares them in a series of simulation studies.
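A toy sketch of the quasi-randomization idea: pool the nonprobability sample with external population counts, estimate a pseudo-inclusion probability per stratum, and use the inverse probabilities as weights. All numbers below are illustrative, and a single categorical covariate stands in for the binary-regression step described above.

```python
# External population counts by stratum (illustrative).
population = {"young": 6000, "old": 4000}

# Nonprobability sample of (stratum, y) pairs; young units are overrepresented.
sample = [("young", 10)] * 80 + [("old", 20)] * 20

# Pseudo-inclusion probability per stratum: sample count / population count.
n_g = {}
for g, _ in sample:
    n_g[g] = n_g.get(g, 0) + 1
pseudo_prob = {g: n_g[g] / population[g] for g in population}

# Inverse pseudo-probabilities serve as weights for the estimated mean of y.
weights = [1.0 / pseudo_prob[g] for g, _ in sample]
est = sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)

# The unweighted mean is biased toward the overrepresented stratum.
naive = sum(y for _, y in sample) / len(sample)
```

Here the weighted estimate recovers the population mean (14.0), while the naive sample mean (12.0) is pulled toward the overrepresented young stratum.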


2019 ◽  
Author(s):  
Meng Wang ◽  
Lihua Jiang ◽  
Michael P. Snyder

Abstract
Motivation: Accurately detecting tissue specificity (TS) in genes helps researchers understand tissue functions at the molecular level, identify disease mechanisms and discover tissue-specific therapeutic targets. The Genotype-Tissue Expression (GTEx) project (Consortium, 2015) and the Human Protein Atlas (HPA) project (Uhlén et al., 2015) are two publicly available data resources providing large-scale gene expressions across multiple tissue types. Multiple tissue comparisons, technical background noise and unknown variation factors make it challenging to accurately identify tissue-specific gene expressions. Several methods measure overall TS in gene expressions and classify genes into tissue-enrichment categories, but a robust method that provides quantitative TS scores for each tissue is still lacking.
Methods: We recognized that the key to quantifying tissue-specific gene expressions is to properly define a concept of an expression population. We considered that inside the population, the sample expressions from various tissues are more or less balanced, while outlier expressions outside the population may indicate tissue specificity. We then formulated the problem as robustly estimating the population distribution. In the linear regression setting, we developed a novel data-adaptive robust estimation based on density-power weights under an unknown outlier distribution and a non-vanishing outlier proportion (Wang et al., 2019). For quantifying TS, we focused on the Gaussian-population mixture model, took gene heterogeneities into account, and applied the data-adaptive robust procedure to estimate the population. With the robustly estimated population parameters, we constructed the AdaTiSS algorithm to obtain data-adaptive quantitative TS scores.
Results: The TS scores from the AdaTiSS algorithm are comparable across tissues and across genes, standardizing gene expressions in terms of TS. Compared to categorical TS methods such as the HPA criterion, our method provides more information on the population fit and shows advantages in quantitatively analyzing tissue-specific functions, making biological functional analysis more precise. We also discuss some limitations and possible future work.
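The scoring step can be caricatured as robust standardization: fit the balanced expression population robustly, then score each tissue by its standardized deviation from it. The sketch below substitutes a simple median/MAD fit for the paper's data-adaptive robust estimator, so it illustrates the scoring idea only; all names and numbers are ours.

```python
import statistics

def ts_scores(expr):
    """Robust z-scores of per-tissue expressions against a population fit.

    expr: dict mapping tissue name -> expression value.
    Median/MAD stand in here for the data-adaptive population estimate.
    """
    vals = list(expr.values())
    center = statistics.median(vals)
    mad = statistics.median(abs(v - center) for v in vals)
    scale = 1.4826 * mad          # MAD -> Gaussian-consistent scale
    return {t: (v - center) / scale for t, v in expr.items()}

# Hypothetical log-expressions: balanced across tissues except heart.
expr = {"heart": 9.5, "liver": 2.1, "lung": 2.0, "kidney": 2.3,
        "brain": 1.9, "muscle": 2.2, "spleen": 2.05}
scores = ts_scores(expr)
```

The heart tissue, far outside the balanced population, receives a large positive TS score, while the balanced tissues score near zero.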


Author(s):  
Mostafa H. Tawfeek ◽  
Karim El-Basyouny

Safety Performance Functions (SPFs) are regression models used to predict the expected number of collisions as a function of various traffic and geometric characteristics. One of the integral components in developing SPFs is the availability of accurate exposure factors, that is, annual average daily traffic (AADT). However, AADTs are not often available for minor roads at rural intersections. This study aims to develop a robust AADT estimation model using a deep neural network. A total of 1,350 rural four-legged, stop-controlled intersections from the Province of Alberta, Canada, were used to train the neural network. The results of the deep neural network model were compared with the traditional estimation method, which uses linear regression. The results indicated that the deep neural network model improved the estimation of minor roads’ AADT by 35% when compared with the traditional method. Furthermore, SPFs developed using linear regression resulted in models with statistically insignificant AADTs on minor roads. Conversely, the SPF developed using the neural network provided a better fit to the data with both AADTs on minor and major roads being statistically significant variables. The findings indicated that the proposed model could enhance the predictive power of the SPF and therefore improve the decision-making process since SPFs are used in all parts of the safety management process.


2019 ◽  
Vol 17 (06) ◽  
pp. 947-975 ◽  
Author(s):  
Lei Shi

We investigate distributed learning with a coefficient-based regularization scheme under the framework of kernel regression methods. Compared with classical kernel ridge regression (KRR), the algorithm under consideration does not require the kernel function to be positive semi-definite and hence provides a simple paradigm for designing indefinite kernel methods. The distributed learning approach partitions a massive data set into several disjoint subsets, and then produces a global estimator by averaging the local estimators computed on each subset. Easily constructed partitions, with the algorithm run on each subset in parallel, lead to a substantial reduction in computation time compared with running the original algorithm on the entire sample. We establish the first minimax-optimal rates of convergence for the distributed coefficient-based regularization scheme with indefinite kernels. We thus demonstrate that, compared with distributed KRR, the algorithm concerned is more flexible and effective for regression on large-scale data sets.
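The divide-and-conquer step itself is easy to picture with plain least squares through the origin standing in for the kernel scheme (a deliberate simplification; the subset count and data generator are ours): each machine fits its own subset, and the global estimator is the average of the local ones.

```python
import random

def local_fit(xs, ys):
    # Least-squares slope through the origin on one data subset.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

random.seed(1)
n, m = 600, 6                      # n samples split across m machines
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]

# Partition into m disjoint subsets and average the local estimators.
size = n // m
locals_ = [local_fit(xs[i * size:(i + 1) * size],
                     ys[i * size:(i + 1) * size]) for i in range(m)]
distributed = sum(locals_) / m
global_fit = local_fit(xs, ys)
```

Each local fit touches only n/m points, so the subsets can be processed in parallel, yet the averaged estimator tracks both the true slope and the full-data fit closely.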


2016 ◽  
Vol 311 (3) ◽  
pp. F539-F547 ◽  
Author(s):  
Minhtri K. Nguyen ◽  
Dai-Scott Nguyen ◽  
Minh-Kevin Nguyen

Because changes in the plasma water sodium concentration ([Na+]pw) are clinically due to changes in the mass balance of Na+, K+, and H2O, the analysis and treatment of the dysnatremias are dependent on the validity of the Edelman equation in defining the quantitative interrelationship between the [Na+]pw and the total exchangeable sodium (Nae), total exchangeable potassium (Ke), and total body water (TBW) (Edelman IS, Leibman J, O'Meara MP, Birkenfeld LW. J Clin Invest 37: 1236–1256, 1958): [Na+]pw = 1.11(Nae + Ke)/TBW − 25.6. The interrelationship between [Na+]pw and Nae, Ke, and TBW in the Edelman equation is empirically determined by accounting for measurement errors in all of these variables. In contrast, linear regression analysis of the same data set using [Na+]pw as the dependent variable yields the following equation: [Na+]pw = 0.93(Nae + Ke)/TBW + 1.37. Moreover, based on the study by Boling et al. (Boling EA, Lipkind JB. 18: 943–949, 1963), the [Na+]pw is related to the Nae, Ke, and TBW by the following linear regression equation: [Na+]pw = 0.487(Nae + Ke)/TBW + 71.54. The disparities between the slope and y-intercept of these three equations are unknown. In this mathematical analysis, we demonstrate that the disparities between the slope and y-intercept in these three equations can be explained by how the osmotically inactive Na+ and K+ storage pool is quantitatively accounted for. Our analysis also indicates that the osmotically inactive Na+ and K+ storage pool is dynamically regulated and that changes in the [Na+]pw can be predicted based on changes in the Nae, Ke, and TBW despite dynamic changes in the osmotically inactive Na+ and K+ storage pool.
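The spread among the three fits is easy to see numerically. The snippet below simply evaluates each published line at the same exchangeable-cation ratio (the example ratio of 150 mmol/L is our choice):

```python
def na_pw(ratio, slope, intercept):
    # [Na+]pw predicted from (Nae + Ke)/TBW by a given linear fit.
    return slope * ratio + intercept

ratio = 150.0  # (Nae + Ke)/TBW in mmol/L, illustrative value
edelman = na_pw(ratio, 1.11, -25.6)    # Edelman et al. (1958)
regression = na_pw(ratio, 0.93, 1.37)  # same data, ordinary linear regression
boling = na_pw(ratio, 0.487, 71.54)    # Boling et al. (1963)
```

At this ratio the Edelman fit (140.9) and the regression fit (140.87) nearly coincide, while the Boling fit (144.59) diverges, illustrating the disparities the analysis attributes to the osmotically inactive Na+ and K+ storage pool.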


1995 ◽  
Vol 3 (3) ◽  
pp. 133-142 ◽  
Author(s):  
M. Hana ◽  
W.F. McClure ◽  
T.B. Whitaker ◽  
M. White ◽  
D.R. Bahler

Two artificial neural network models were used to estimate the nicotine in tobacco: (i) a back-propagation network and (ii) a linear network. The back-propagation network consisted of an input layer, an output layer and one hidden layer. The linear network consisted of an input layer and an output layer. Both networks used the generalised delta rule for learning. Performances of both networks were compared to the multiple linear regression (MLR) method of calibration. The nicotine content in tobacco samples was estimated for two different data sets. Data set A contained 110 near infrared (NIR) spectra, each consisting of reflected energy at eight wavelengths. Data set B consisted of 200 NIR spectra, with each spectrum having 840 spectral data points. The fast Fourier transform was applied to data set B in order to compress each spectrum into 13 Fourier coefficients. For data set A, the linear regression model gave the best results, followed by the back-propagation network and then the linear network. The true performance of the linear regression model was better than the back-propagation and linear networks by 14.0% and 18.1%, respectively. For data set B, the back-propagation network gave the best result, followed by MLR and the linear network. Both the linear network and MLR models gave almost the same results. The true performance of the back-propagation network model was better than the MLR and linear network by 35.14%.
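The Fourier-compression step for data set B can be sketched with a naive discrete transform: keep only the low-frequency coefficients (and their conjugate mirrors, so the reconstruction stays real-valued) and invert. The signal below is synthetic and the helper names are ours; retaining 13 coefficients follows the text.

```python
import cmath
import math

def dft(x):
    # Naive O(N^2) discrete Fourier transform (stdlib-only illustration).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def compress(x, keep=13):
    """Zero out all but the lowest `keep` frequency coefficients."""
    X = dft(x)
    N = len(X)
    Y = [0j] * N
    Y[0] = X[0]
    for k in range(1, keep):
        Y[k] = X[k]
        Y[N - k] = X[N - k]   # conjugate mirror keeps the inverse real-valued
    return idft(Y)

# A smooth synthetic "spectrum": energy only at low frequencies.
N = 64
signal = [math.sin(2 * math.pi * n / N) + 0.5 * math.cos(2 * math.pi * 3 * n / N)
          for n in range(N)]
recon = compress(signal, keep=13)
err = max(abs(a - b) for a, b in zip(signal, recon))
```

For a smooth spectrum whose energy sits at low frequencies, the 13 retained coefficients reconstruct the signal almost exactly, which is why such compression loses little calibration-relevant information.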


Forecasting ◽  
2021 ◽  
Vol 3 (1) ◽  
pp. 56-90
Author(s):  
Monica Defend ◽  
Aleksey Min ◽  
Lorenzo Portelli ◽  
Franz Ramsauer ◽  
Francesco Sandrini ◽  
...  

This article considers the estimation of Approximate Dynamic Factor Models with homoscedastic, cross-sectionally correlated errors for incomplete panel data. In contrast to existing estimation approaches, the presented estimation method comprises two expectation-maximization algorithms and uses conditional factor moments in closed form. To determine the unknown factor dimension and autoregressive order, we propose a two-step information-based model selection criterion. The performance of our estimation procedure and the model selection criterion is investigated within a Monte Carlo study. Finally, we apply the Approximate Dynamic Factor Model to real-economy vintage data to support investment decisions and risk management. For this purpose, an autoregressive model with the estimated factor span of the mixed-frequency data as exogenous variables maps the behavior of weekly S&P500 log-returns. We detect the main drivers of the index development and define two dynamic trading strategies resulting from prediction intervals for the subsequent returns.

