Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

2015
Author(s): David M Rocke, Luyao Ruan, Yilun Zhang, J. Jared Gossett, Blythe Durbin-Johnson, ...

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10⁻⁴, and if all the null hypotheses are true, then only about 1 gene should be declared significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations.

Results: Methods that rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and the simulated negative binomial case. The same occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, nor does a variance-stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in the estimation of the negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or dispersion is high, rather than for low-count genes.
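As a rough illustration of the calibration property described in the Motivation, the sketch below simulates a pure-null negative binomial experiment and compares the observed false positive rate with the nominal cutoff. All parameter values are hypothetical, and a t-test on log-transformed counts stands in for a variance-function analysis in the spirit of limma; it is not the paper's pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure-null experiment: every "gene" has the same negative binomial
# distribution in both groups (mean mu, dispersion alpha, so that
# var = mu + alpha * mu^2); any significant gene is a false positive.
n_genes, n_per_group = 10_000, 5
mu, alpha = 100.0, 0.2
# numpy's parameterization: n = 1/alpha, p = n / (n + mu)
n_param = 1.0 / alpha
p_param = n_param / (n_param + mu)

counts_a = rng.negative_binomial(n_param, p_param, size=(n_genes, n_per_group))
counts_b = rng.negative_binomial(n_param, p_param, size=(n_genes, n_per_group))

# Transform, then run a per-gene two-sample t-test on the transformed counts.
pvals = stats.ttest_ind(np.log1p(counts_a), np.log1p(counts_b), axis=1).pvalue

cutoff = 0.01
observed_fpr = float(np.mean(pvals < cutoff))
print(f"observed FPR at p<{cutoff}: {observed_fpr:.4f} (nominal {cutoff})")
```

A calibrated method should report an observed rate close to the nominal cutoff; the paper's point is that NB-model-based tests can substantially exceed it.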

2014
Vol 644-650, pp. 3338-3341
Author(s): Guang Feng Guo

During the 30-year development of intrusion detection systems, problems such as high false-positive rates have continually plagued users. Therefore, the ontology and context verification based intrusion detection model (OCVIDM) was put forward to connect the description of attack signatures with context information effectively. OCVIDM establishes a knowledge base of intrusion detection ontology that serves as the center of an efficient platform for filtering false alerts, enabling automatic validation of alarms and autonomous judgment of real attacks, so as to filter out irrelevant positive alerts and reduce false positives.


2020
Vol 30 (12), pp. 1851-1855
Author(s): Sruti Rao, M. B. Goens, Orrin B. Myers, Emilie A. Sebesta

Abstract

Aim: To determine the false-positive rate of pulse oximetry screening at moderate altitude, presumed to be elevated compared with sea-level values, and to assess the change in false-positive rate over time.

Methods: We retrospectively analysed 3548 infants in the newborn nursery in Albuquerque, New Mexico (elevation 5400 ft), from July 2012 to October 2013. Universal pulse oximetry screening guidelines were employed after 24 hours of life but before discharge. Newborns between 36 and 36 6/7 weeks of gestation weighing >2 kg, and babies >37 weeks weighing >1.7 kg, were included in the study. Log-binomial regression was used to assess change in the probability of false positives over time.

Results: Of the 3548 patients analysed, there was one true positive with a posteriorly malaligned ventricular septal defect and an interrupted aortic arch. Among the 93 false positives, the mean pre- and post-ductal saturations were lower, at 92% and 90%, respectively. The false-positive rate was 3.5% before April 2013 and decreased to 1.5% afterwards. There was a significant decrease in the false-positive rate (p = 0.003, slope coefficient = −0.082, standard error of coefficient = 0.023), with the relative risk of a false positive decreasing by a factor of 0.92 (95% CI 0.88–0.97) per month.

Conclusion: This is the first study in Albuquerque, New Mexico, reporting a high false-positive rate of 1.5% at moderate altitude at the end of the study, in comparison with the false-positive rate of 0.035% at sea level. Implementation of the nationally recommended universal pulse oximetry screening was associated with a high false-positive rate in the initial period, thought to reflect a combination of learning curve and altitude. After the initial decline, the rate remained steadily elevated above sea-level values, indicating the dominant effect of moderate altitude.


2019
Vol 9 (1)
Author(s): Ginette Lafit, Francis Tuerlinckx, Inez Myin-Germeys, Eva Ceulemans

Abstract

Gaussian Graphical Models (GGMs) are extensively used in many research areas, such as genomics, proteomics, neuroimaging, and psychology, to study the partial correlation structure of a set of variables. This structure is visualized by drawing an undirected network, in which the variables constitute the nodes and the partial correlations the edges. In many applications, it makes sense to impose sparsity (i.e., some of the partial correlations are forced to zero), because sparsity is theoretically meaningful and/or because it improves the predictive accuracy of the fitted model. However, as we show by means of extensive simulations, state-of-the-art estimation approaches for imposing sparsity on GGMs, such as the graphical lasso, ℓ1-regularized nodewise regression, and joint sparse regression, fall short because they often yield too many false positives (i.e., partial correlations that are not properly set to zero). In this paper we present a new estimation approach that allows one to control the false positive rate better. Our approach consists of two steps: first, we estimate an undirected network using one of the three state-of-the-art approaches; second, we try to detect the false positives by flagging the partial correlations that are smaller in absolute value than a given threshold, which is determined through cross-validation, and setting the flagged correlations to zero. Applying this new approach to the same simulated data shows that it indeed performs better. We also illustrate our approach by using it to estimate (1) a gene regulatory network for breast cancer data, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum, and (3) a symptom network of patients with PTSD.
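The two-step idea (sparse estimation, then thresholding of small partial correlations) can be sketched roughly as follows. Here scikit-learn's cross-validated graphical lasso does step one, and the threshold in step two is fixed rather than chosen by cross-validation, so this is only an approximation of the authors' procedure; the chain-graph ground truth is hypothetical.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(2)

# Hypothetical sparse ground truth: a chain graph where only neighboring
# variables are partially correlated.
p = 8
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.3
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=500)

# Step 1: sparse precision estimate via the graphical lasso (CV-tuned).
est = GraphicalLassoCV().fit(X)
P = est.precision_

# Convert the precision matrix to partial correlations:
# pcor_ij = -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(P))
pcor = -P / np.outer(d, d)
np.fill_diagonal(pcor, 1.0)

# Step 2 (sketched): flag partial correlations below a threshold and set
# them to zero; the paper picks this threshold by cross-validation.
threshold = 0.1
pcor_thresholded = np.where(np.abs(pcor) < threshold, 0.0, pcor)
kept = int((np.count_nonzero(pcor_thresholded) - p) / 2)
print("edges kept after thresholding:", kept)
```

In this setup the seven chain edges have true partial correlation magnitude 0.3, so a threshold of 0.1 should retain them while pruning small spurious edges the graphical lasso lets through.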


2012
Vol 11, pp. CIN.S9048
Author(s): Shuhei Kaneko, Akihiro Hirakawa, Chikuma Hamada

Mining gene expression data to identify genes associated with patient survival is an ongoing problem in cancer prognostic studies using microarrays, the aim being to use such genes to achieve more accurate prognoses. The least absolute shrinkage and selection operator (lasso) is often used for gene selection and parameter estimation in high-dimensional microarray data. The lasso shrinks some of the coefficients to zero, with the amount of shrinkage determined by a tuning parameter, often chosen by cross-validation. The model selected by this cross-validation contains many false positives whose true coefficients are actually zero. We propose a method for estimating the false positive rate (FPR) of lasso estimates in a high-dimensional Cox model. We performed a simulation study to examine the precision of the FPR estimate obtained by the proposed method, applied the method to real data, and illustrated the identification of false positive genes.
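The over-selection problem that motivates the paper can be reproduced in a simpler (linear, non-Cox) setting: a cross-validated lasso fit to data with a few true signals typically retains noise variables as well, and every retained noise variable is a false positive. All sizes and coefficients below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)

# High-dimensional design: 200 candidate "genes", of which only the first
# k = 5 truly affect the (continuous, for simplicity) outcome.
n, p, k = 100, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 1.0
y = X @ beta + rng.standard_normal(n)

# Cross-validated lasso: any selected feature with index >= k is a
# false positive, since its true coefficient is zero.
fit = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(fit.coef_)
tp = int(np.sum(selected < k))
fp = int(np.sum(selected >= k))
print(f"true positives: {tp}/{k}, false positives: {fp}")
```

The paper's contribution is to estimate this FPR for the Cox-model case, where the truth is of course unknown; the sketch above just shows why such an estimate is needed.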


2016
Vol 14 (06), pp. 1650034
Author(s): Naim Al Mahi, Munni Begum

One of the primary objectives of a ribonucleic acid (RNA) sequencing (RNA-Seq) experiment is to identify differentially expressed (DE) genes in two or more treatment conditions. It is common practice to assume that all read counts from RNA-Seq data follow an overdispersed (OD) Poisson or negative binomial (NB) distribution, which is sometimes misleading because, within each condition, some genes may have unvarying transcription levels with no overdispersion. In such a case, it is more appropriate and logical to consider two sets of genes: OD and non-overdispersed (NOD). We propose a new two-step integrated approach to distinguish DE genes in RNA-Seq data using standard Poisson and NB models for NOD and OD genes, respectively. This is an integrated approach because the method can be merged with any other NB-based method for detecting DE genes. We design a simulation study and analyze two real RNA-Seq data sets to evaluate the proposed strategy. We compare the performance of this new method combined with three R software packages, namely edgeR, DESeq2, and DSS, with their default settings. For both the simulated and real data sets, the integrated approaches perform better than, or at least as well as, the regular methods embedded in these R packages.
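A minimal sketch of the gene-classification step, assuming a standard Poisson dispersion (chi-square) test to split genes into OD and NOD sets; the authors' exact classification rule and the subsequent Poisson/NB DE tests are not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def is_overdispersed(x, alpha=0.05):
    """Classify one gene's counts as OD vs NOD using the classical Poisson
    dispersion test: under a Poisson model, sum((x - xbar)^2 / xbar) is
    approximately chi-square with n - 1 degrees of freedom."""
    xbar = x.mean()
    if xbar == 0:
        return False
    stat = np.sum((x - xbar) ** 2 / xbar)
    return stats.chi2.sf(stat, df=len(x) - 1) < alpha

# Hypothetical genes: one Poisson (NOD, variance == mean) and one negative
# binomial (OD, variance >> mean), both with mean 50.
poisson_gene = rng.poisson(50, size=20)
nb_gene = rng.negative_binomial(n=2, p=2 / (2 + 50), size=20)

print("Poisson gene OD?", bool(is_overdispersed(poisson_gene)))
print("NB gene OD?", bool(is_overdispersed(nb_gene)))
```

In the paper's workflow, NOD genes would then be tested with a standard Poisson model and OD genes handed to an NB-based method such as edgeR or DESeq2.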


2018
Author(s): Cox Lwaka Tamba, Yuan-Ming Zhang

Abstract

Background: Recent developments in technology have resulted in the generation of big data. In genome-wide association studies (GWAS), tens of millions of SNPs may need to be tested for association with a trait of interest, which poses a great computational challenge. There is a need for fast algorithms in GWAS methodologies that ensure high power in QTN detection, high accuracy in QTN estimation, and a low false positive rate.

Results: Here, we accelerated the mrMLM algorithm by using the GEMMA idea together with matrix transformations and identities. The target functions and derivatives in vector/matrix form for each marker scan are transformed into simple forms that are easy and efficient to evaluate during each optimization step. All potentially associated QTNs with P-values ≤ 0.01 are then evaluated in a multi-locus model by the LARS algorithm and/or EM-Empirical Bayes. We call the resulting algorithm FASTmrMLM. Numerical simulation studies and real data analyses validated FASTmrMLM, which reduces the running time of mrMLM by more than 50%. FASTmrMLM also shows high statistical power in QTN detection, high accuracy in QTN estimation, and a low false positive rate compared with GEMMA, FarmCPU, and mrMLM. Real data analysis shows that FASTmrMLM detected more previously reported genes than all the other methods: GEMMA/EMMA, FarmCPU, and mrMLM.

Conclusions: FASTmrMLM is a fast and reliable algorithm for multi-locus GWAS that ensures high statistical power, high accuracy of estimates, and a low false positive rate.

Author Summary: Current developments in technology result in the generation of vast amounts of data. In genome-wide association studies, tens of millions of markers may need to be tested for association with a trait of interest. Because of the computational challenge this poses, we developed a fast algorithm for genome-wide association studies. Our approach is a two-stage method. In the first stage, we use matrix transformations and identities to speed up the testing of each random marker effect; the target functions and derivatives in vector/matrix form for each marker scan are transformed into simple forms that are easy and efficient to evaluate during each optimization step. In the second stage, we select all potentially associated SNPs and evaluate them in a multi-locus model. In simulation studies, our algorithm significantly reduces the computing time. The new method also shows high statistical power in detecting significant markers, high accuracy in marker-effect estimation, and a low false positive rate. We also used the new method to identify relevant genes in real data analysis. We recommend our approach as a fast and reliable method for carrying out a multi-locus genome-wide association study.
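The two-stage structure (a fast single-marker scan, then a multi-locus model on the survivors with P ≤ 0.01) can be sketched as follows. This uses an ordinary correlation test and scikit-learn's LARS as stand-ins for the mixed-model scan and the LARS/EM-Empirical Bayes step, so it illustrates only the shape of the pipeline, not the FASTmrMLM implementation; the genotype data and effect sizes are simulated.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lars

rng = np.random.default_rng(5)

# Hypothetical genotype matrix (0/1/2 allele counts) with two causal markers.
n, m = 400, 2000
G = rng.integers(0, 3, size=(n, m)).astype(float)
y = 0.8 * G[:, 10] + 0.8 * G[:, 500] + rng.standard_normal(n)

# Stage 1: fast single-marker scan; keep all markers with P <= 0.01.
pvals = np.array([stats.pearsonr(G[:, j], y)[1] for j in range(m)])
candidates = np.flatnonzero(pvals <= 0.01)

# Stage 2: fit the surviving candidates jointly with LARS (multi-locus model).
lars = Lars(n_nonzero_coefs=10).fit(G[:, candidates], y)
selected = candidates[np.flatnonzero(lars.coef_)]
print("stage-1 candidates:", len(candidates), "| final markers:", selected)
```

The point of the two-stage design is that the expensive multi-locus fit only ever sees the handful of markers that survive the cheap genome-wide scan.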


2021
Author(s): Qimin Zhang, Qian Shi, Mingfu Shao

Abstract

Transcript assembly (i.e., reconstructing the full-length expressed transcripts from RNA-seq data) remains a critical but unsolved step in RNA-seq analysis. Modern RNA-seq protocols can produce paired-end/multiple-end reads, providing the information that two or more reads originate from the same transcript. The long-range constraints implied by these paired-/multiple-end reads can be highly beneficial for correctly phasing complicated spliced isoforms. However, there often exist gaps between individual ends, which may even span junctions, making the efficient use of such constraints algorithmically challenging. Here we introduce Scallop2, a new reference-based transcript assembler optimized for multiple-end (including paired-end) RNA-seq data. Scallop2 uses an algorithmic framework that first represents reads from the same molecule as so-called multiple-end phasing paths in the context of a splice graph, then "bridges" each multiple-end phasing path into a long, single-end phasing path, and finally decomposes the splice graph into paths (i.e., transcripts) guided by the bridged phasing paths. An efficient bridging algorithm is designed to infer the true path connecting two consecutive ends, following a novel formulation that is robust to sequencing errors and transcript noise. Observing that failure to bridge two ends is mainly due to incomplete splice graphs, we propose a new method to determine false starting/ending vertices of the splice graphs, which has been shown to be effective in reducing the false positive rate. Evaluations on both (multiple-end) single-cell RNA-seq datasets from the Smart-seq3 protocol and Illumina paired-end RNA-seq samples demonstrate that Scallop2 vastly outperforms recent assemblers including StringTie2, Scallop, and CLASS2 in assembly accuracy.


1981
Vol 74 (1), pp. 41-43
Author(s): I G Barrison, E R Littlewood, J Primavesi, A Sharpies, I T Gilmore, ...

Stools have been tested for occult gastrointestinal bleeding in 278 outpatients and 170 hospital inpatients using the Haemoccult and Haemastix methods. Seventeen outpatients (6.1%) and 42 inpatients (24.7%) were positive with the Haemoccult technique. Thirty-three outpatients (11.9%) and 93 inpatients (54.7%) were positive with the Haemastix test. Following investigation of the Haemoccult-positive patients, only 2 cases (3.4%) were considered false positives. However, the false positive rate with Haemastix was 22.9% which is unacceptable in a screening test. Haemoccult may be useful as a screening test for asymptomatic general practice patients, but a test of greater sensitivity is needed for hospital patients.


2018
pp. 1-10
Author(s): Luke T. Lavallée, Rodney H. Breau, Dean Fergusson, Cynthia Walsh, Carl van Walraven

Purpose: Administrative health data can be a valuable resource for health research. Because these data are not collected for research purposes, it is imperative that the accuracy of codes used to identify patients, exposures, and outcomes is measured.

Patients and Methods: Code sensitivity was determined by identifying a cohort of men with histologically confirmed prostate cancer in the Ontario Cancer Registry and linking them to the Ontario Health Insurance Plan (OHIP) to determine whether a prostate biopsy code had been claimed. Code specificity was estimated using a random sample of patients at The Ottawa Hospital for whom a prostate biopsy code was submitted to OHIP. A simulation model, which varied the code false-positive rate, true-negative rate, and proportion of code positives in the population, was created to determine specificity under a range of combinations of these parameters.

Results: Between 1991 and 2012, 97,369 of 148,669 men with histologically confirmed prostate cancer in the Ontario Cancer Registry had a prostate biopsy code in OHIP within 1 week of their diagnosis (code sensitivity, 86.0%). Sensitivity increased significantly over time (63.8% in 1991 to 87.9% in 2012). The false-positive rate of the code for index prostate biopsies was 1.9%. The simulation model found that code specificity exceeded 95% for first prostate biopsies but was lower for secondary biopsies because of more false positives, which were primarily related to placement of fiducial markers for patients who received radiotherapy.

Conclusion: Administrative data in Ontario can accurately identify men who receive a prostate biopsy. The code is less accurate for secondary biopsy procedures and their sequelae.
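A back-of-envelope version of the specificity relationship that such a simulation model explores, under the simplifying assumption (ours, not necessarily the authors') that all code negatives are true negatives: with code-positive proportion q and false-positive fraction f among code positives, specificity = (1 − q) / ((1 − q) + q·f).

```python
def specificity(q: float, f: float) -> float:
    """Specificity = TN / (TN + FP), assuming all code negatives are true
    negatives: TN = 1 - q (correctly uncoded), FP = q * f (coded, but no
    true biopsy)."""
    tn = 1.0 - q
    fp = q * f
    return tn / (tn + fp)

# With the abstract's 1.9% false-positive rate among coded biopsies,
# specificity stays high even as the coded proportion grows.
for q in (0.01, 0.05, 0.20):
    print(f"code-positive proportion {q:.2f}: specificity {specificity(q, 0.019):.4f}")
```

This illustrates why a low false-positive rate among code positives translates into specificity above 95% across plausible population mixes, as the simulation model found for first biopsies.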


2014
Vol 687-691, pp. 2611-2617
Author(s): Hong Hai Zhou, Pei Bin Liu, Zhi Hao Jin

In this paper, a new method for network troubleshooting named DRNFD is put forward, in which an "abnormal degree" is defined by a vector of probability and belief functions over privileged processes. A new formula based on Dempster's rule is presented to decrease false positives. DRNFD can effectively reduce both the false positive rate and the non-response rate, and can be applied to real-time fault diagnosis. An operational prototype system demonstrates its feasibility and the effectiveness of real-time fault diagnosis.
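For reference, the standard Dempster rule of combination (the paper presents a modified formula, which is not reproduced here) can be implemented as follows; the alert scenario and mass values are hypothetical.

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two belief mass functions (frozenset focal elements -> mass)
    with Dempster's rule: multiply masses over intersecting focal elements
    and renormalize by 1 minus the conflict mass."""
    combined: dict = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources are incompatible")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two evidence sources rating one alert over the frame {attack, benign}:
A, B = frozenset({"attack"}), frozenset({"benign"})
theta = A | B                       # full frame (ignorance)
m1 = {A: 0.6, theta: 0.4}           # e.g., signature-match evidence
m2 = {A: 0.7, B: 0.1, theta: 0.2}   # e.g., context-verification evidence
m = dempster_combine(m1, m2)
print({tuple(sorted(k)): round(v, 3) for k, v in m.items()})
```

Combining the two sources concentrates mass on "attack" (about 0.872) while discounting the conflicting "benign" evidence, which is the mechanism such fusion schemes use to suppress false positives.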

