Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

2015
Author(s): David M Rocke, Luyao Ruan, Yilun Zhang, J. Jared Gossett, Blythe Durbin-Johnson, ...

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10⁻⁴, and if all the null hypotheses are true, then only about 1 gene should be declared significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations.

Results: Methods that rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and the simulated negative binomial case. The same occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, nor does a variance-stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in the estimation of the negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or dispersion is high, rather than for low-count genes.
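As a rough illustration of the calibration property described in the Motivation, the sketch below simulates a pure-null negative binomial experiment and compares the observed false positive rate with the nominal cutoff. All parameter values are hypothetical, and a t-test on log-transformed counts stands in for a variance-function analysis in the spirit of limma; it is not the paper's pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure-null experiment: every "gene" has the same negative binomial
# distribution in both groups (mean mu, dispersion alpha, so that
# var = mu + alpha * mu^2); any significant gene is a false positive.
n_genes, n_per_group = 10_000, 5
mu, alpha = 100.0, 0.2
# numpy's parameterization: n = 1/alpha, p = n / (n + mu)
n_param = 1.0 / alpha
p_param = n_param / (n_param + mu)

counts_a = rng.negative_binomial(n_param, p_param, size=(n_genes, n_per_group))
counts_b = rng.negative_binomial(n_param, p_param, size=(n_genes, n_per_group))

# Transform, then run a per-gene two-sample t-test on the transformed counts.
pvals = stats.ttest_ind(np.log1p(counts_a), np.log1p(counts_b), axis=1).pvalue

cutoff = 0.01
observed_fpr = float(np.mean(pvals < cutoff))
print(f"observed FPR at p<{cutoff}: {observed_fpr:.4f} (nominal {cutoff})")
```

A calibrated method should report an observed rate close to the nominal cutoff; the paper's point is that NB-model-based tests can substantially exceed it.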

2014
Vol 644-650, pp. 3338-3341
Author(s): Guang Feng Guo

During the 30-year development of intrusion detection systems, problems such as high false-positive rates have continually plagued users. Therefore, the ontology and context verification based intrusion detection model (OCVIDM) was put forward to connect the description of attack signatures with context information effectively. OCVIDM establishes a knowledge base of intrusion detection ontology that serves as the center of an efficient platform for filtering false alerts, enabling automatic validation of alarms and autonomous judgment of real attacks, so as to filter out irrelevant positive alerts and reduce false positives.


2020
Vol 30 (12), pp. 1851-1855
Author(s): Sruti Rao, M. B. Goens, Orrin B. Myers, Emilie A. Sebesta

Abstract

Aim: To determine the false-positive rate of pulse oximetry screening at moderate altitude, presumed to be elevated compared with sea-level values, and to assess the change in false-positive rate over time.

Methods: We retrospectively analysed 3548 infants in the newborn nursery in Albuquerque, New Mexico (elevation 5400 ft), from July 2012 to October 2013. Universal pulse oximetry screening guidelines were employed after 24 hours of life but before discharge. Newborns between 36 and 36 6/7 weeks of gestation weighing >2 kg, and babies >37 weeks weighing >1.7 kg, were included in the study. Log-binomial regression was used to assess change in the probability of false positives over time.

Results: Of the 3548 patients analysed, there was one true positive with a posteriorly malaligned ventricular septal defect and an interrupted aortic arch. Among the 93 false positives, the mean pre- and post-ductal saturations were lower, at 92% and 90%, respectively. The false-positive rate was 3.5% before April 2013 and decreased to 1.5% afterwards. There was a significant decrease in the false-positive rate (p = 0.003, slope coefficient = −0.082, standard error of coefficient = 0.023), with the relative risk of a false positive decreasing by a factor of 0.92 (95% CI 0.88–0.97) per month.

Conclusion: This is the first study in Albuquerque, New Mexico, reporting a high false-positive rate of 1.5% at moderate altitude at the end of the study, in comparison with the false-positive rate of 0.035% at sea level. Implementation of the nationally recommended universal pulse oximetry screening was associated with a high false-positive rate in the initial period, thought to reflect a combination of learning curve and altitude. After the initial decline, the rate remained steadily elevated above sea-level values, indicating the dominant effect of moderate altitude.


2019
Vol 9 (1)
Author(s): Ginette Lafit, Francis Tuerlinckx, Inez Myin-Germeys, Eva Ceulemans

Abstract

Gaussian Graphical Models (GGMs) are extensively used in many research areas, such as genomics, proteomics, neuroimaging, and psychology, to study the partial correlation structure of a set of variables. This structure is visualized by drawing an undirected network, in which the variables constitute the nodes and the partial correlations the edges. In many applications, it makes sense to impose sparsity (i.e., some of the partial correlations are forced to zero), because sparsity is theoretically meaningful and/or because it improves the predictive accuracy of the fitted model. However, as we show by means of extensive simulations, state-of-the-art estimation approaches for imposing sparsity on GGMs, such as the graphical lasso, ℓ1-regularized nodewise regression, and joint sparse regression, fall short because they often yield too many false positives (i.e., partial correlations that are not properly set to zero). In this paper we present a new estimation approach that allows one to control the false positive rate better. Our approach consists of two steps: first, we estimate an undirected network using one of the three state-of-the-art approaches; second, we try to detect the false positives by flagging the partial correlations that are smaller in absolute value than a given threshold, which is determined through cross-validation, and setting the flagged correlations to zero. Applying this new approach to the same simulated data shows that it indeed performs better. We also illustrate our approach by using it to estimate (1) a gene regulatory network for breast cancer data, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum, and (3) a symptom network of patients with PTSD.
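The two-step idea (sparse estimation, then thresholding of small partial correlations) can be sketched roughly as follows. Here scikit-learn's cross-validated graphical lasso does step one, and the threshold in step two is fixed rather than chosen by cross-validation, so this is only an approximation of the authors' procedure; the chain-graph ground truth is hypothetical.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(2)

# Hypothetical sparse ground truth: a chain graph where only neighboring
# variables are partially correlated.
p = 8
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.3
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=500)

# Step 1: sparse precision estimate via the graphical lasso (CV-tuned).
est = GraphicalLassoCV().fit(X)
P = est.precision_

# Convert the precision matrix to partial correlations:
# pcor_ij = -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(P))
pcor = -P / np.outer(d, d)
np.fill_diagonal(pcor, 1.0)

# Step 2 (sketched): flag partial correlations below a threshold and set
# them to zero; the paper picks this threshold by cross-validation.
threshold = 0.1
pcor_thresholded = np.where(np.abs(pcor) < threshold, 0.0, pcor)
kept = int((np.count_nonzero(pcor_thresholded) - p) / 2)
print("edges kept after thresholding:", kept)
```

In this setup the seven chain edges have true partial correlation magnitude 0.3, so a threshold of 0.1 should retain them while pruning small spurious edges the graphical lasso lets through.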


2012
Vol 11, pp. CIN.S9048
Author(s): Shuhei Kaneko, Akihiro Hirakawa, Chikuma Hamada

Mining gene expression data to identify genes associated with patient survival is an ongoing problem in cancer prognostic studies using microarrays, the aim being to use such genes to achieve more accurate prognoses. The least absolute shrinkage and selection operator (lasso) is often used for gene selection and parameter estimation in high-dimensional microarray data. The lasso shrinks some of the coefficients to zero, with the amount of shrinkage determined by a tuning parameter, often chosen by cross-validation. The model selected by this cross-validation contains many false positives whose true coefficients are actually zero. We propose a method for estimating the false positive rate (FPR) of lasso estimates in a high-dimensional Cox model. We performed a simulation study to examine the precision of the FPR estimate obtained by the proposed method, applied the method to real data, and illustrated the identification of false positive genes.
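The over-selection problem that motivates the paper can be reproduced in a simpler (linear, non-Cox) setting: a cross-validated lasso fit to data with a few true signals typically retains noise variables as well, and every retained noise variable is a false positive. All sizes and coefficients below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)

# High-dimensional design: 200 candidate "genes", of which only the first
# k = 5 truly affect the (continuous, for simplicity) outcome.
n, p, k = 100, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 1.0
y = X @ beta + rng.standard_normal(n)

# Cross-validated lasso: any selected feature with index >= k is a
# false positive, since its true coefficient is zero.
fit = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(fit.coef_)
tp = int(np.sum(selected < k))
fp = int(np.sum(selected >= k))
print(f"true positives: {tp}/{k}, false positives: {fp}")
```

The paper's contribution is to estimate this FPR for the Cox-model case, where the truth is of course unknown; the sketch above just shows why such an estimate is needed.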


2016
Vol 14 (06), pp. 1650034
Author(s): Naim Al Mahi, Munni Begum

One of the primary objectives of a ribonucleic acid (RNA) sequencing (RNA-Seq) experiment is to identify differentially expressed (DE) genes in two or more treatment conditions. It is common practice to assume that all read counts from RNA-Seq data follow an overdispersed (OD) Poisson or negative binomial (NB) distribution, which is sometimes misleading because, within each condition, some genes may have unvarying transcription levels with no overdispersion. In such a case, it is more appropriate and logical to consider two sets of genes: OD and non-overdispersed (NOD). We propose a new two-step integrated approach to distinguish DE genes in RNA-Seq data using standard Poisson and NB models for NOD and OD genes, respectively. This is an integrated approach because the method can be merged with any other NB-based method for detecting DE genes. We design a simulation study and analyze two real RNA-Seq data sets to evaluate the proposed strategy. We compare the performance of this new method combined with three R software packages, namely edgeR, DESeq2, and DSS, with their default settings. For both the simulated and real data sets, the integrated approaches perform better than, or at least as well as, the regular methods embedded in these R packages.
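A minimal sketch of the gene-classification step, assuming a standard Poisson dispersion (chi-square) test to split genes into OD and NOD sets; the authors' exact classification rule and the subsequent Poisson/NB DE tests are not reproduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def is_overdispersed(x, alpha=0.05):
    """Classify one gene's counts as OD vs NOD using the classical Poisson
    dispersion test: under a Poisson model, sum((x - xbar)^2 / xbar) is
    approximately chi-square with n - 1 degrees of freedom."""
    xbar = x.mean()
    if xbar == 0:
        return False
    stat = np.sum((x - xbar) ** 2 / xbar)
    return stats.chi2.sf(stat, df=len(x) - 1) < alpha

# Hypothetical genes: one Poisson (NOD, variance == mean) and one negative
# binomial (OD, variance >> mean), both with mean 50.
poisson_gene = rng.poisson(50, size=20)
nb_gene = rng.negative_binomial(n=2, p=2 / (2 + 50), size=20)

print("Poisson gene OD?", bool(is_overdispersed(poisson_gene)))
print("NB gene OD?", bool(is_overdispersed(nb_gene)))
```

In the paper's workflow, NOD genes would then be tested with a standard Poisson model and OD genes handed to an NB-based method such as edgeR or DESeq2.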


2018
Author(s): Cox Lwaka Tamba, Yuan-Ming Zhang

Abstract

Background: Recent developments in technology have resulted in the generation of big data. In genome-wide association studies (GWAS), tens of millions of SNPs may need to be tested for association with a trait of interest, which poses a great computational challenge. There is a need for fast algorithms in GWAS methodologies that ensure high power in QTN detection, high accuracy in QTN estimation, and a low false positive rate.

Results: Here, we accelerated the mrMLM algorithm by using the GEMMA idea together with matrix transformations and identities. The target functions and derivatives in vector/matrix form for each marker scan are transformed into simple forms that are easy and efficient to evaluate during each optimization step. All potentially associated QTNs with P-values ≤ 0.01 are then evaluated in a multi-locus model by the LARS algorithm and/or EM-Empirical Bayes. We call the resulting algorithm FASTmrMLM. Numerical simulation studies and real data analyses validated FASTmrMLM, which reduces the running time of mrMLM by more than 50%. FASTmrMLM also shows high statistical power in QTN detection, high accuracy in QTN estimation, and a low false positive rate compared with GEMMA, FarmCPU, and mrMLM. Real data analysis shows that FASTmrMLM detected more previously reported genes than all the other methods: GEMMA/EMMA, FarmCPU, and mrMLM.

Conclusions: FASTmrMLM is a fast and reliable algorithm for multi-locus GWAS that ensures high statistical power, high accuracy of estimates, and a low false positive rate.

Author Summary: Current developments in technology result in the generation of vast amounts of data. In genome-wide association studies, tens of millions of markers may need to be tested for association with a trait of interest. Because of the computational challenge this poses, we developed a fast algorithm for genome-wide association studies. Our approach is a two-stage method. In the first stage, we use matrix transformations and identities to speed up the testing of each random marker effect; the target functions and derivatives in vector/matrix form for each marker scan are transformed into simple forms that are easy and efficient to evaluate during each optimization step. In the second stage, we select all potentially associated SNPs and evaluate them in a multi-locus model. In simulation studies, our algorithm significantly reduces the computing time. The new method also shows high statistical power in detecting significant markers, high accuracy in marker-effect estimation, and a low false positive rate. We also used the new method to identify relevant genes in real data analysis. We recommend our approach as a fast and reliable method for carrying out a multi-locus genome-wide association study.
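The two-stage structure (a fast single-marker scan, then a multi-locus model on the survivors with P ≤ 0.01) can be sketched as follows. This uses an ordinary correlation test and scikit-learn's LARS as stand-ins for the mixed-model scan and the LARS/EM-Empirical Bayes step, so it illustrates only the shape of the pipeline, not the FASTmrMLM implementation; the genotype data and effect sizes are simulated.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lars

rng = np.random.default_rng(5)

# Hypothetical genotype matrix (0/1/2 allele counts) with two causal markers.
n, m = 400, 2000
G = rng.integers(0, 3, size=(n, m)).astype(float)
y = 0.8 * G[:, 10] + 0.8 * G[:, 500] + rng.standard_normal(n)

# Stage 1: fast single-marker scan; keep all markers with P <= 0.01.
pvals = np.array([stats.pearsonr(G[:, j], y)[1] for j in range(m)])
candidates = np.flatnonzero(pvals <= 0.01)

# Stage 2: fit the surviving candidates jointly with LARS (multi-locus model).
lars = Lars(n_nonzero_coefs=10).fit(G[:, candidates], y)
selected = candidates[np.flatnonzero(lars.coef_)]
print("stage-1 candidates:", len(candidates), "| final markers:", selected)
```

The point of the two-stage design is that the expensive multi-locus fit only ever sees the handful of markers that survive the cheap genome-wide scan.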


2021
Author(s): Qimin Zhang, Qian Shi, Mingfu Shao

Abstract

Transcript assembly (i.e., reconstructing the full-length expressed transcripts from RNA-seq data) remains a critical but unsolved step in RNA-seq analysis. Modern RNA-seq protocols can produce paired-end/multiple-end reads, providing the information that two or more reads originate from the same transcript. The long-range constraints implied by these paired-/multiple-end reads can be highly beneficial for correctly phasing complicated spliced isoforms. However, there often exist gaps between individual ends, which may even span junctions, making the efficient use of such constraints algorithmically challenging. Here we introduce Scallop2, a new reference-based transcript assembler optimized for multiple-end (including paired-end) RNA-seq data. Scallop2 uses an algorithmic framework that first represents reads from the same molecule as so-called multiple-end phasing paths in the context of a splice graph, then "bridges" each multiple-end phasing path into a long, single-end phasing path, and finally decomposes the splice graph into paths (i.e., transcripts) guided by the bridged phasing paths. An efficient bridging algorithm is designed to infer the true path connecting two consecutive ends, following a novel formulation that is robust to sequencing errors and transcript noise. Observing that failure to bridge two ends is mainly due to incomplete splice graphs, we propose a new method to determine false starting/ending vertices of the splice graphs, which has been shown to be effective in reducing the false positive rate. Evaluations on both (multiple-end) single-cell RNA-seq datasets from the Smart-seq3 protocol and Illumina paired-end RNA-seq samples demonstrate that Scallop2 vastly outperforms recent assemblers including StringTie2, Scallop, and CLASS2 in assembly accuracy.


1981
Vol 74 (1), pp. 41-43
Author(s): I G Barrison, E R Littlewood, J Primavesi, A Sharpies, I T Gilmore, ...

Stools have been tested for occult gastrointestinal bleeding in 278 outpatients and 170 hospital inpatients using the Haemoccult and Haemastix methods. Seventeen outpatients (6.1%) and 42 inpatients (24.7%) were positive with the Haemoccult technique. Thirty-three outpatients (11.9%) and 93 inpatients (54.7%) were positive with the Haemastix test. Following investigation of the Haemoccult-positive patients, only 2 cases (3.4%) were considered false positives. However, the false positive rate with Haemastix was 22.9% which is unacceptable in a screening test. Haemoccult may be useful as a screening test for asymptomatic general practice patients, but a test of greater sensitivity is needed for hospital patients.


2018
pp. 1-10
Author(s): Luke T. Lavallée, Rodney H. Breau, Dean Fergusson, Cynthia Walsh, Carl van Walraven

Purpose: Administrative health data can be a valuable resource for health research. Because these data are not collected for research purposes, it is imperative that the accuracy of codes used to identify patients, exposures, and outcomes is measured.

Patients and Methods: Code sensitivity was determined by identifying a cohort of men with histologically confirmed prostate cancer in the Ontario Cancer Registry and linking them to the Ontario Health Insurance Plan (OHIP) to determine whether a prostate biopsy code had been claimed. Code specificity was estimated using a random sample of patients at The Ottawa Hospital for whom a prostate biopsy code was submitted to OHIP. A simulation model, which varied the code false-positive rate, true-negative rate, and proportion of code positives in the population, was created to determine specificity under a range of combinations of these parameters.

Results: Between 1991 and 2012, 97,369 of 148,669 men with histologically confirmed prostate cancer in the Ontario Cancer Registry had a prostate biopsy code in OHIP within 1 week of their diagnosis (code sensitivity, 86.0%). Sensitivity increased significantly over time (63.8% in 1991 to 87.9% in 2012). The false-positive rate of the code for index prostate biopsies was 1.9%. The simulation model found that code specificity exceeded 95% for first prostate biopsies but was lower for secondary biopsies because of more false positives, which were primarily related to placement of fiducial markers for patients who received radiotherapy.

Conclusion: Administrative data in Ontario can accurately identify men who receive a prostate biopsy. The code is less accurate for secondary biopsy procedures and their sequelae.
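A back-of-envelope version of the specificity relationship that such a simulation model explores, under the simplifying assumption (ours, not necessarily the authors') that all code negatives are true negatives: with code-positive proportion q and false-positive fraction f among code positives, specificity = (1 − q) / ((1 − q) + q·f).

```python
def specificity(q: float, f: float) -> float:
    """Specificity = TN / (TN + FP), assuming all code negatives are true
    negatives: TN = 1 - q (correctly uncoded), FP = q * f (coded, but no
    true biopsy)."""
    tn = 1.0 - q
    fp = q * f
    return tn / (tn + fp)

# With the abstract's 1.9% false-positive rate among coded biopsies,
# specificity stays high even as the coded proportion grows.
for q in (0.01, 0.05, 0.20):
    print(f"code-positive proportion {q:.2f}: specificity {specificity(q, 0.019):.4f}")
```

This illustrates why a low false-positive rate among code positives translates into specificity above 95% across plausible population mixes, as the simulation model found for first biopsies.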


2014
Vol 687-691, pp. 2611-2617
Author(s): Hong Hai Zhou, Pei Bin Liu, Zhi Hao Jin

In this paper, a new method for network troubleshooting named DRNFD is put forward, in which an "abnormal degree" is defined by a vector of probability and belief functions over privileged processes. A new formula based on Dempster's rule is presented to decrease false positives. DRNFD can effectively reduce both the false positive rate and the non-response rate, and can be applied to real-time fault diagnosis. An operational prototype system demonstrates its feasibility and the effectiveness of real-time fault diagnosis.
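For reference, the standard Dempster rule of combination (the paper presents a modified formula, which is not reproduced here) can be implemented as follows; the alert scenario and mass values are hypothetical.

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two belief mass functions (frozenset focal elements -> mass)
    with Dempster's rule: multiply masses over intersecting focal elements
    and renormalize by 1 minus the conflict mass."""
    combined: dict = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources are incompatible")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two evidence sources rating one alert over the frame {attack, benign}:
A, B = frozenset({"attack"}), frozenset({"benign"})
theta = A | B                       # full frame (ignorance)
m1 = {A: 0.6, theta: 0.4}           # e.g., signature-match evidence
m2 = {A: 0.7, B: 0.1, theta: 0.2}   # e.g., context-verification evidence
m = dempster_combine(m1, m2)
print({tuple(sorted(k)): round(v, 3) for k, v in m.items()})
```

Combining the two sources concentrates mass on "attack" (about 0.872) while discounting the conflicting "benign" evidence, which is the mechanism such fusion schemes use to suppress false positives.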

