RNA-seq analyses of molecular abundance (RoMA) for detecting differential gene expression

Mapping Intimacies ◽

10.1101/410985 ◽

2018 ◽

Author(s):

Guoshuai Cai ◽

Jennifer M. Franks ◽

Michael L. Whitfield

Keyword(s):

Type I Error ◽

Simulated Data ◽

Mrna Abundance ◽

Type I ◽

Rna Seq ◽

Improved Method ◽

Efficient Control ◽

Differential Gene ◽

Abundance Modeling ◽

Accurate Quantification

AbstractMotivationVarious methods have been proposed, each with its own limitations. Some naive normal-based tests have low testing power with invalid normal distribution assumptions for RNA-seq read counts, whereas count-based methods lack a biologically meaningful interpretation and have limited capability for integration with other analysis packages for mRNA abundance. In this study, we propose an improved method, RoMA, to accurately detect differential expression and unlock the integration with upstream and downstream analyses on mRNA abundance in RNA-seq studies.ResultsRoMA incorporates information from both mRNA abundance and raw counts. Studies on simulated data and two real datasets showed that RoMA provides an accurate quantification of mRNA abundance and a data adjustment-tolerant DE analysis with high AUC, low FDR, and an efficient control of type I error rate. This study provides a valid strategy for mRNA abundance modeling and data analysis integration for RNA-seq studies, which will greatly facilitate the identification and interpretation of DE genes.Availability and implementationRoMA is available at https://github.com/GuoshuaiCai/[email protected] or [email protected]

Download Full-text

Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data

10.1101/073973 ◽

2016 ◽

Cited By ~ 3

Author(s):

Aaron T. L. Lun ◽

John C. Marioni

Keyword(s):

Single Cell ◽

Error Control ◽

Type I Error ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Simulated Data ◽

Real Data ◽

Type I ◽

Rna Seq ◽

Cell Counts

AbstractAn increasing number of studies are using single-cell RNA-sequencing (scRNA-seq) to characterize the gene expression profiles of individual cells. One common analysis applied to scRNA-seq data involves detecting differentially expressed (DE) genes between cells in different biological groups. However, many experiments are designed such that the cells to be compared are processed in separate plates or chips, meaning that the groupings are confounded with systematic plate effects. This confounding aspect is frequently ignored in DE analyses of scRNA-seq data. In this article, we demonstrate that failing to consider plate effects in the statistical model results in loss of type I error control. A solution is proposed whereby counts are summed from all cells in each plate and the count sums for all plates are used in the DE analysis. This restores type I error control in the presence of plate effects without compromising detection power in simulated data. Summation is also robust to varying numbers and library sizes of cells on each plate. Similar results are observed in DE analyses of real data where the use of count sums instead of single-cell counts improves specificity and the ranking of relevant genes. This suggests that summation can assist in maintaining statistical rigour in DE analyses of scRNA-seq data with plate effects.

Download Full-text

A Prototype for Brazilian Bankcheck Recognition

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001497000238 ◽

1997 ◽

Vol 11 (04) ◽

pp. 549-569 ◽

Cited By ~ 16

Author(s):

Luan L. Lee ◽

Miguel G. Lizarraga ◽

Natanael R. Gomes ◽

Alessandro L. Koerich

Keyword(s):

Information Extraction ◽

Character Recognition ◽

Type I Error ◽

Recognition Rate ◽

Simulated Data ◽

Recognition System ◽

Signature Verification ◽

Type I ◽

Verification Algorithm ◽

Commercial Applications

This paper describes a prototype for Brazilian bankcheck recognition. The description is divided into three topics: bankcheck information extraction, digit amount recognition and signature verification. In bankcheck information extraction, our algorithms provide signature and digit amount images free of background patterns and bankcheck printed information. In digit amount recognition, we dealt with the digit amount segmentation and implementation of a complete numeral character recognition system involving image processing, feature extraction and neural classification. In signature verification, we designed and implemented a static signature verification system suitable for banking and commercial applications. Our signature verification algorithm is capable of detecting both simple, random and skilled forgeries. The proposed automatic bankcheck recognition prototype was intensively tested by real bankcheck data as well as simulated data providing the following performance results: for skilled forgeries, 4.7% equal error rate; for random forgeries, zero Type I error and 7.3% Type II error; for bankcheck numerals, 92.7% correct recognition rate.

Download Full-text

DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz453 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5155-5162 ◽

Cited By ~ 10

Author(s):

Chengzhong Ye ◽

Terence P Speed ◽

Agus Salim

Keyword(s):

Single Cell ◽

Differential Expression ◽

Type I Error ◽

R Package ◽

Supplementary Information ◽

Type I ◽

Common Phenomenon ◽

Rna Seq ◽

Capture Process ◽

Technological Platforms

Abstract Motivation Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments. Results We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The gain in improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model. Availability and implementation The method is implemented as a publicly available R package available from https://github.com/cz-ye/DECENT. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0010 ◽

2017 ◽

Vol 16 (2) ◽

Cited By ~ 1

Author(s):

Aaron T. L. Lun ◽

Gordon K. Smyth

Keyword(s):

Software Package ◽

Error Control ◽

Degrees Of Freedom ◽

Linear Models ◽

Type I Error ◽

Real Data ◽

Type I ◽

Rna Seq ◽

Study Gene Expression ◽

Complex Models

AbstractRNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.

Download Full-text

Mapping Tumor-Specific Expression QTLs in Impure Tumor Samples

10.1101/136614 ◽

2017 ◽

Cited By ~ 3

Author(s):

Douglas R. Wilson ◽

Wei Sun ◽

Joseph G. Ibrahim

Keyword(s):

Gene Expression ◽

Type I Error ◽

The Cancer Genome Atlas ◽

Type I ◽

Eqtl Mapping ◽

Rna Seq ◽

Specific Expression ◽

Normal Cells ◽

Technology Application ◽

Tumor Tissues

AbstractThe study of gene expression quantitative trait loci (eQTL) is an effective approach to illuminate the functional roles of genetic variants. Computational methods have been developed for eQTL mapping using gene expression data from microarray or RNA-seq technology. Application of these methods for eQTL mapping in tumor tissues is problematic because tumor tissues are composed of both tumor and infiltrating normal cells (e.g. immune cells) and eQTL effects may vary between tumor and infiltrating normal cells. To address this challenge, we have developed a new method for eQTL mapping using RNA-seq data from tumor samples. Our method separately estimates the eQTL effects in tumor and infiltrating normal cells using both total expression and allele-specific expression (ASE). We demonstrate that our method controls type I error rate and has higher power than some alternative approaches. We applied our method to study RNA-seq data from The Cancer Genome Atlas and illustrated the similarities and differences of eQTL effects in tumor and normal cells.

Download Full-text

Super-delta2: An Enhanced Differential Expression Analysis Procedure for Multi-Group Comparisons of RNA-seq Data

10.1101/2021.01.30.428977 ◽

2021 ◽

Author(s):

Zihan Cui ◽

Yuhang Liu ◽

Jinfeng Zhang ◽

Xing Qiu

Keyword(s):

Breast Cancer ◽

Differential Expression ◽

Expression Analysis ◽

Bias Correction ◽

Type I Error ◽

Breast Cancer Dataset ◽

Type I ◽

Rna Seq ◽

Differential Expression Pattern ◽

Group Comparisons

AbstractBackgroundWe developed super-delta2, a differential gene expression analysis pipeline designed for multi-group comparisons for RNA-seq data. It includes a customized one-way ANOVA F-test and a post-hoc test for pairwise group comparisons; both are designed to work with a multivariate normalization procedure to reduce technical noise. It also includes a trimming procedure with bias-correction to obtain robust and approximately unbiased summary statistics used in these tests. We demonstrated the asymptotic applicability of super-delta2 to log-transformed read counts in RNA-seq data by large sample theory based on Negative Binomial Poisson (NBP) distribution.ResultsWe compared super-delta2 with three commonly used RNA-seq data analysis methods: limma/voom, edgeR, and DESeq2 using both simulated and real datasets. In all three simulation settings, super-delta2 not only achieved the best overall statistical power, but also was the only method that controlled type I error at the nominal level. When applied to a breast cancer dataset to identify differential expression pattern associated with multiple pathologic stages, super-delta2 selected more enriched pathways than other methods, which are directly linked to the underlying biological condition (breast cancer).ConclusionsBy incorporating trimming and bias-correction in the normalization step, super-delta2 was able to achieve tight control of type I error. Because the hypothesis tests are based on asymptotic normal approximation of the NBP distribution, super-delta2 does not require computationally expensive iterative optimization procedures used by methods such as edgeR and DESeq2, which occasionally have convergence issues.

Download Full-text

Use of two-point models in “Model choice in time-series studies of air pollution and mortality”

Air Quality Atmosphere & Health ◽

10.1007/s11869-019-00787-5 ◽

2020 ◽

Vol 13 (2) ◽

pp. 225-232 ◽

Cited By ~ 2

Author(s):

Mieczysław Szyszkowicz

Keyword(s):

Time Series ◽

Poisson Regression ◽

Type I Error ◽

Simulated Data ◽

Mortality Data ◽

Type I ◽

Case Crossover ◽

Short Term Exposure ◽

Crossover Method ◽

Conditional Poisson

AbstractIn this work, a new technique is proposed to study short-term exposure and adverse health effects. The presented approach uses hierarchical clusters with the following structure: each pair of two sequential days in 1 year is embedded in the year. We have 183 clusters per year with the embedded structure <year:2 days>. Time-series analysis is conducted using a conditional Poisson regression with the constructed clusters as a stratum. Unmeasured confounders such as seasonal and long-term trends are not modelled but are controlled by the structure of the clusters. The proposed technique is illustrated using four freely accessible databases, which contain complex simulated data. These data are available as the compressed R workspace files. Results based on the simulated data were very close to the truth based on the presented methodology. In addition, the case-crossover method with 1-month and 2-week window, and a conditional Poisson regression on 3-day clusters as a stratum, was also applied to the simulated data. Difficulties (high type I error rate) were observed for the case-crossover method in the presence of high concurvity in the simulated data. The proposed methods using various forms of a stratum were further applied to the Chicago mortality data. The considered methods have often different qualitative and quantitative estimations.

Download Full-text

Identifying the Informational/Signal Dimension in Principal Component Analysis

Mathematics ◽

10.3390/math6110269 ◽

2018 ◽

Vol 6 (11) ◽

pp. 269 ◽

Cited By ~ 2

Author(s):

Sergio Camiz ◽

Valério Pillar

Keyword(s):

Principal Component Analysis ◽

Type I Error ◽

Bootstrap Method ◽

Simulated Data ◽

Principal Component ◽

Component Analysis ◽

Stopping Rules ◽

Multidimensional Data ◽

Type I ◽

Zero Eigenvalue

The identification of a reduced dimensional representation of the data is among the main issues of exploratory multidimensional data analysis and several solutions had been proposed in the literature according to the method. Principal Component Analysis (PCA) is the method that has received the largest attention thus far and several identification methods—the so-called stopping rules—have been proposed, giving very different results in practice, and some comparative study has been carried out. Some inconsistencies in the previous studies led us to try to fix the distinction between signal from noise in PCA—and its limits—and propose a new testing method. This consists in the production of simulated data according to a predefined eigenvalues structure, including zero-eigenvalues. From random populations built according to several such structures, reduced-size samples were extracted and to them different levels of random normal noise were added. This controlled introduction of noise allows a clear distinction between expected signal and noise, the latter relegated to the non-zero eigenvalues in the samples corresponding to zero ones in the population. With this new method, we tested the performance of ten different stopping rules. Of every method, for every structure and every noise, both power (the ability to correctly identify the expected dimension) and type-I error (the detection of a dimension composed only by noise) have been measured, by counting the relative frequencies in which the smallest non-zero eigenvalue in the population was recognized as signal in the samples and that in which the largest zero-eigenvalue was recognized as noise, respectively. This way, the behaviour of the examined methods is clear and their comparison/evaluation is possible. The reported results show that both the generalization of the Bartlett’s test by Rencher and the Bootstrap method by Pillar result much better than all others: both are accounted for reasonable power, decreasing with noise, and very good type-I error. Thus, more than the others, these methods deserve being adopted.

Download Full-text

Using AB Designs With Nonoverlap Effect Size Measures to Support Clinical Decision-Making: A Monte Carlo Validation

Behavior Modification ◽

10.1177/0145445519860219 ◽

2019 ◽

pp. 014544551986021 ◽

Cited By ~ 1

Author(s):

Antonia R. Giannakakos ◽

Marc J. Lanovaz

Keyword(s):

Error Rate ◽

Effect Size ◽

Clinical Decision Making ◽

Type I Error ◽

Single Case ◽

Simulated Data ◽

Type I ◽

Type I Error Rate ◽

Sufficient Power ◽

Quasi Experimental

Single-case experimental designs often require extended baselines or the withdrawal of treatment, which may not be feasible or ethical in some practical settings. The quasi-experimental AB design is a potential alternative, but more research is needed on its validity. The purpose of our study was to examine the validity of using nonoverlap measures of effect size to detect changes in AB designs using simulated data. In our analyses, we determined thresholds for three effect size measures beyond which the type I error rate would remain below 0.05 and then examined whether using these thresholds would provide sufficient power. Overall, our analyses show that some effect size measures may provide adequate control over type I error rate and sufficient power when analyzing data from AB designs. In sum, our results suggest that practitioners may use quasi-experimental AB designs in combination with effect size to rigorously assess progress in practice.

Download Full-text

MARS: leveraging allelic heterogeneity to increase power of association testing

Genome Biology ◽

10.1186/s13059-021-02353-8 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 1

Author(s):

Farhad Hormozdiari ◽

Junghyun Jung ◽

Eleazar Eskin ◽

Jong Wha J. Joo

Keyword(s):

Type I Error ◽

Association Studies ◽

Simulated Data ◽

Real Data ◽

Association Test ◽

Type I ◽

Genome Wide Association Studies ◽

Association Testing ◽

Causal Status ◽

Causal Variants

AbstractIn standard genome-wide association studies (GWAS), the standard association test is underpowered to detect associations between loci with multiple causal variants with small effect sizes. We propose a statistical method, Model-based Association test Reflecting causal Status (MARS), that finds associations between variants in risk loci and a phenotype, considering the causal status of variants, only requiring the existing summary statistics to detect associated risk loci. Utilizing extensive simulated data and real data, we show that MARS increases the power of detecting true associated risk loci compared to previous approaches that consider multiple variants, while controlling the type I error.

Download Full-text