Smooth Quantile Normalization

2016 ◽  
Author(s):  
Stephanie C Hicks ◽  
Kwame Okrah ◽  
Joseph N Paulson ◽  
John Quackenbush ◽  
Rafael A Irizarry ◽  
...  

Abstract
Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and is unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same, assuming that global differences in the distributions are induced only by technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example if there are global differences in the statistical distributions between biological conditions or groups and external information, such as negative or control features, is not available. Here we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, while allowing that it may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.
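The idea can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' implementation: it blends the overall reference quantiles with group-specific ones using a single fixed weight `w`, whereas qsmooth estimates the weight from the data at each quantile.

```python
import numpy as np

def qsmooth_sketch(X, groups, w=0.5):
    """Illustrative smooth quantile normalization (not the qsmooth code).

    X: features x samples matrix; groups: per-sample group labels.
    w=1 recovers classic quantile normalization (one shared reference);
    w=0 quantile-normalizes within each group only.
    """
    order = np.argsort(X, axis=0)            # sort order of each sample
    Xs = np.sort(X, axis=0)                  # per-sample sorted values
    ref_all = Xs.mean(axis=1)                # overall reference quantiles
    out = np.empty_like(X, dtype=float)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        ref_g = Xs[:, idx].mean(axis=1)      # group-specific reference
        blend = w * ref_all + (1 - w) * ref_g
        for j in idx:
            out[order[:, j], j] = blend      # map back to original order
    return out
```

With `w=1.0` every sample ends up with identical sorted values; with `w=0.0` each group keeps its own distributional shape, which is the behavior motivated by the abstract.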

2014 ◽  
Author(s):  
Stephanie C. Hicks ◽  
Rafael A. Irizarry

Normalization and preprocessing are essential steps in the analysis of high-throughput data, including next-generation sequencing and microarrays. Multi-sample global normalization methods, such as quantile normalization, have been used successfully to remove technical variation from noisy data. These methods rely on the assumption that observed global changes across samples are due to unwanted technical variability. Transforming the data to remove these differences has the potential to remove interesting biologically driven global variation and therefore may not be appropriate, depending on the type and source of variation. Currently, it is up to subject matter experts, for example biologists, to determine whether the stated assumptions are appropriate. Here, we propose a data-driven method to test the assumptions of global normalization methods. We demonstrate the utility of our method (quantro) by applying it to multiple gene expression and DNA methylation datasets and show examples of when global normalization methods are not appropriate. We also perform a Monte Carlo simulation study to illustrate how our method generally outperforms the current approach. An R package implementing our method is available on Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/quantro.html).
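The testing idea can be illustrated with an ANOVA-style statistic on per-sample summaries. This is a hypothetical sketch in the spirit of quantro, not the Bioconductor implementation, which works on the full set of sample quantiles and assesses significance by permutation.

```python
import numpy as np

def quantro_like_stat(X, groups):
    """Illustrative F-like statistic (not the quantro package code):
    compares between-group to within-group variability of per-sample
    medians. A large value suggests global distributional differences
    between biological groups, i.e. global normalization may be unsafe."""
    med = np.median(X, axis=0)               # one summary per sample
    labels = np.unique(groups)
    grand = med.mean()
    between = sum((med[groups == g].mean() - grand) ** 2 * (groups == g).sum()
                  for g in labels) / (len(labels) - 1)
    within = sum(((med[groups == g] - med[groups == g].mean()) ** 2).sum()
                 for g in labels) / (len(med) - len(labels))
    return between / within
```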


2005 ◽  
Vol 44 (03) ◽  
pp. 414-417 ◽  
Author(s):  
M. Neuhäuser ◽  
T. Boes

Summary
Objectives: The high-density oligonucleotide microarrays from Affymetrix (Affymetrix GeneChips) are very popular in biomedical research. They make it possible to study the expression of thousands of genes simultaneously. In experiments with multiple arrays, normalization techniques are used to reduce the so-called obscuring variation, i.e. the technical variation that is of non-biological origin. Several different normalization methods have been proposed in recent years. Methods: We review published results on the comparison of normalization methods proposed for Affymetrix GeneChips. Results: Quantile normalization seems to perform favorably regarding precision (low variance), accuracy (low bias), and practicability (low computing time). However, according to very recent results [1], this normalization method can affect the biological variability and therefore appears less than optimal from this point of view. Conclusion: Although quantile normalization may be recommendable, more investigations based on more data sets are needed so that the different normalization methods can be evaluated on widely differing data.
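For reference, classic quantile normalization (the method this review favors) can be written compactly in NumPy. This sketch ignores tie handling, which production implementations such as Bioconductor's preprocessCore treat more carefully.

```python
import numpy as np

def quantile_normalize(X):
    """Classic quantile normalization: force every sample (column) to
    share the same distribution, namely the mean of the per-sample
    sorted values. Ties get arbitrary distinct ranks in this sketch."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank within column
    ref = np.sort(X, axis=0).mean(axis=1)              # reference quantiles
    return ref[ranks]                                  # value at each rank
```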


2015 ◽  
Author(s):  
Rahul Reddy

As RNA-Seq and other high-throughput sequencing technologies grow in use and remain critical for gene expression studies, technical variability in count data impedes differential expression analyses, comparison of data across samples and experiments, and reproduction of results. Studies such as Dillies et al. (2013) compare several between-lane normalization methods involving scaling factors, while Hansen et al. (2012) and Risso et al. (2014) propose methods that correct for sample-specific bias or use sets of control genes to isolate and remove technical variability. This paper evaluates four normalization methods in terms of reducing intra-group, technical variability and facilitating differential expression analysis or other research where the biological, inter-group variability is of interest. To this end, the four methods were evaluated in differential expression analyses between data from Pickrell et al. (2010) and Montgomery et al. (2010) and between simulated data modeled on these two datasets. Though the between-lane scaling-factor methods perform worse on real data sets, they are much stronger for simulated data. We cannot reject the recommendation of Dillies et al. to use TMM and DESeq normalization, but further study of power to detect effects of different sizes under each normalization method is merited.
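One of the scaling-factor methods compared here, DESeq's median-of-ratios normalization, can be sketched as follows. This is an illustrative reimplementation of the published idea, not the DESeq code itself.

```python
import numpy as np

def size_factors(counts):
    """DESeq-style median-of-ratios size factors (illustrative sketch):
    each sample's factor is the median ratio of its counts to a
    pseudo-reference sample (the per-gene geometric mean across samples),
    taken over genes whose geometric mean is nonzero."""
    log_counts = np.log(counts.astype(float))
    log_geo = log_counts.mean(axis=1)        # log geometric mean per gene
    finite = np.isfinite(log_geo)            # drop genes with any zero count
    return np.exp(np.median(log_counts[finite] - log_geo[finite, None],
                            axis=0))
```

Dividing each sample's counts by its size factor puts the samples on a common scale before differential expression testing.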


BMC Genomics ◽  
2019 ◽  
Vol 20 (S12) ◽  
Author(s):  
Tong Liu ◽  
Zheng Wang

Abstract
Background: The genome architecture mapping (GAM) technique can capture genome-wide chromatin interactions. However, besides the known systematic biases in raw GAM data, we have found a new type of systematic bias. It is necessary to develop and evaluate effective normalization methods to remove all systematic biases in raw GAM data. Results: We have detected a new type of systematic bias, the fragment length bias, in GAM data; it is significantly different from the window detection frequency bias mentioned in the paper introducing the GAM method, but similar to the bias of distances between restriction sites in raw Hi-C data. We have found that the normalization method used in the GAM paper (a normalized variant of linkage disequilibrium) cannot effectively eliminate the new fragment length bias at 1 Mb resolution (it does slightly better at 30 kb resolution). We have developed an R package named normGAM for eliminating the new fragment length bias together with the three other biases in raw GAM data, namely those related to window detection frequency, mappability, and GC content. Five normalization methods are implemented in the package: Knight-Ruiz 2-norm (KR2, newly designed by us), normalized linkage disequilibrium (NLD), vanilla coverage (VC), sequential component normalization (SCN), and iterative correction and eigenvector decomposition (ICE). Conclusions: Based on our evaluations, the five normalization methods can eliminate the four biases in raw GAM data, with VC and KR2 performing better than the others. We have observed that KR2-normalized GAM data have a higher correlation with KR-normalized Hi-C data from the same cell samples, indicating that the KR-related methods are best for keeping the GAM and Hi-C experiments consistent. Compared with the raw GAM data, the normalized GAM data are also more consistent with the normalized distances from fluorescence in situ hybridization (FISH) experiments. The source code of normGAM can be freely downloaded from http://dna.cs.miami.edu/normGAM/.
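Among the five methods the abstract lists, vanilla coverage (VC) is the simplest to illustrate. The sketch below is a generic VC normalization of a symmetric contact matrix, not code from the normGAM package.

```python
import numpy as np

def vanilla_coverage(M):
    """Vanilla coverage (VC) normalization of a symmetric contact matrix:
    divide each entry by the product of its row and column coverages
    (sums), damping bins that were detected more often overall.
    Bins with zero coverage are set to zero rather than NaN/inf."""
    cov = M.sum(axis=1).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = M / np.outer(cov, cov)
    out[~np.isfinite(out)] = 0.0
    return out
```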


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Mi Zichuan ◽  
Saddam Hussain ◽  
Anum Iftikhar ◽  
Muhammad Ilyas ◽  
Zubair Ahmad ◽  
...  

During the past couple of years, statistical distributions have been widely used in applied areas such as reliability engineering and the medical and financial sciences. In this context, we come across a diverse range of statistical distributions for modeling heavy-tailed data sets. Well-known distributions are the log-normal, log-t, various versions of the Pareto, log-logistic, Weibull, gamma, exponential, Rayleigh and its variants, and the generalized beta of the second kind, among others. In this paper, we try to supplement the distribution theory literature by incorporating a new model, called the new extended Weibull distribution. The proposed distribution is very flexible and exhibits desirable properties. Maximum likelihood estimators of the model parameters are obtained, and a Monte Carlo simulation study is conducted to assess the behavior of these estimators. Finally, we provide a comparative study of the newly proposed and some other existing methods by analyzing three real data sets from different disciplines, namely reliability engineering and the medical and financial sciences. It has been observed that the proposed method outperforms well-known distributions on the basis of model selection criteria.
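The maximum likelihood step mentioned in the abstract can be illustrated for the ordinary two-parameter Weibull; the paper's extended model adds parameters not shown here. The fixed-point iteration for the shape is the standard one derived from the profile likelihood.

```python
import numpy as np

def weibull_mle(x, iters=200):
    """Illustrative MLE for a two-parameter Weibull (shape k, scale lam),
    not the paper's extended model. Iterates the standard fixed-point
    equation 1/k = sum(x^k ln x)/sum(x^k) - mean(ln x), then solves for
    the scale given the shape."""
    x = np.asarray(x, dtype=float)
    logx = np.log(x)
    k = 1.0                                   # initial shape guess
    for _ in range(iters):
        xk = x ** k
        k = 1.0 / ((xk * logx).sum() / xk.sum() - logx.mean())
    lam = (x ** k).mean() ** (1.0 / k)        # scale given shape
    return k, lam
```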


2017 ◽  
Author(s):  
M. Senthil Kumar ◽  
Eric V. Slud ◽  
Kwame Okrah ◽  
Stephanie C. Hicks ◽  
Sridhar Hannenhalli ◽  
...  

Abstract
Count data derived from high-throughput DNA sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of the assayed features. This compositional bias confounds inference of absolute abundances. We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it.
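The compositional bias described here is easy to demonstrate numerically. In the toy example below (hypothetical numbers), only one feature changes in absolute abundance, yet at fixed sequencing depth every observed count shifts.

```python
import numpy as np

# Sequencing returns a roughly fixed number of reads per sample, so the
# observed counts encode proportions of that depth, not absolute copies.
absolute_a = np.array([100., 100., 100., 100.])   # true copies, condition A
absolute_b = absolute_a.copy()
absolute_b[0] *= 8                                # only feature 0 changes

depth = 1000                                      # reads per sample
observed_a = depth * absolute_a / absolute_a.sum()
observed_b = depth * absolute_b / absolute_b.sum()
# Features 1-3 are biologically unchanged, yet their observed counts
# drop in condition B because feature 0 now claims most of the reads.
```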


2013 ◽  
Vol 19 (5) ◽  
pp. 661-671 ◽  
Author(s):  
C. Murie ◽  
C. Barette ◽  
L. Lafanechère ◽  
R. Nadon

Systematic error is present in all high-throughput screens, lowering measurement accuracy. Because screening occurs at the early stages of research projects, measurement inaccuracy leads to following up inactive features and failing to follow up active features. Current normalization methods take advantage of the fact that most primary-screen features (e.g., compounds) within each plate are inactive, which permits robust estimates of row and column systematic-error effects. Screens that contain a majority of potentially active features pose a more difficult challenge because even the most robust normalization methods will remove at least some of the biological signal. Control plates that contain the same feature in all wells can provide a solution to this problem by providing well-by-well estimates of systematic error, which can then be removed from the treatment plates. We introduce the robust control-plate regression (CPR) method, which uses this approach. CPR’s performance is compared to a high-performing primary-screen normalization method in four experiments. These data were also perturbed to simulate screens with large numbers of active features to further assess CPR’s performance. CPR performs almost as well as the best performing normalization methods with primary screens and outperforms the Z-score and equivalent methods with screens containing a large proportion of active features.
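The well-by-well correction underlying CPR can be sketched simply. This is an illustrative reduction of the idea (deviation of each control well from the control-plate median), not the authors' robust regression.

```python
import numpy as np

def cpr_sketch(treatment, control):
    """Sketch of the control-plate idea behind CPR (not the published
    method's exact regression): a control plate carrying the same feature
    in every well estimates well-by-well systematic error as deviations
    from the control plate's median, which are then subtracted from the
    treatment plate, leaving the biological signal intact."""
    well_error = control - np.median(control)
    return treatment - well_error
```

Because the correction comes from the control plate alone, it does not assume that most treatment wells are inactive, which is the scenario the paper targets.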


2012 ◽  
Vol 14 (6) ◽  
pp. 671-683 ◽  
Author(s):  
M.-A. Dillies ◽  
A. Rau ◽  
J. Aubert ◽  
C. Hennequet-Antier ◽  
M. Jeanmougin ◽  
...  

2011 ◽  
Vol 39 (15) ◽  
pp. e103-e103 ◽  
Author(s):  
Ming-Sin Cheung ◽  
Thomas A. Down ◽  
Isabel Latorre ◽  
Julie Ahringer
