THiCweed: fast, sensitive detection of sequence features by clustering big data sets

2017 ◽  
Author(s):  
Ankit Agrawal ◽  
Snehal V. Sambare ◽  
Leelavati Narlikar ◽  
Rahul Siddharthan

Abstract: We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin-immunoprecipitation sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using divisive hierarchical clustering based on sequence similarity within sliding windows, exploring both strands. THiCweed is specially geared towards data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30,000 peaks in 1-2 hours on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large “window” sizes (≥ 50 bp), much longer than typical binding sites (7-15 base pairs). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs, and secondary motifs even when they occur in < 5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq data sets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif-finding to give new insights into genomic TF binding complexity.
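
As a rough illustration of the clustering scheme described above (not the authors' implementation), the sketch below performs divisive 2-way splits of fixed-width sequence windows represented as k-mer count vectors, choosing for each window the strand orientation that better matches the current cluster profile. The window representation, the 2-means split, and the stopping size are all assumptions made for clarity.

```python
import numpy as np
from itertools import product
from sklearn.cluster import KMeans

K = 4  # k-mer size for the window representation (assumed)
KMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_vector(seq):
    """Represent a window by its k-mer counts (ACGT-only sequences assumed)."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        v[KMERS[seq[i:i + K]]] += 1
    return v

def orient(seq, profile):
    """Keep the strand whose k-mer profile better matches the cluster profile."""
    f, r = kmer_vector(seq), kmer_vector(revcomp(seq))
    return f if f @ profile >= r @ profile else r

def divisive_cluster(windows, min_size=50):
    """Recursively 2-split the windows until clusters fall below min_size."""
    if len(windows) < 2 * min_size:
        return [windows]
    profile = np.mean([kmer_vector(w) for w in windows], axis=0)
    X = np.array([orient(w, profile) for w in windows])  # strand-aware vectors
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    halves = [[w for w, l in zip(windows, labels) if l == c] for c in (0, 1)]
    if not halves[0] or not halves[1]:
        return [windows]
    return divisive_cluster(halves[0], min_size) + divisive_cluster(halves[1], min_size)
```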

2013 ◽  
Vol 748 ◽  
pp. 590-594
Author(s):  
Li Liao ◽  
Yong Gang Lu ◽  
Xu Rong Chen

We propose a novel density estimation method using both the k-nearest neighbor (KNN) graph and the potential field of the data points to capture local and global data distribution information, respectively. Clustering is performed based on the computed density values: a forest of trees is built with each data point as a node, and the clusters are formed from the trees in the forest. The new clustering method is evaluated by comparison with three popular clustering methods: K-means++, Mean Shift, and DBSCAN. Experiments on two synthetic data sets and one real data set show that our approach can effectively improve clustering results.
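
A minimal sketch of how the two density components and the forest might fit together, as we read the abstract: local density from the k-th-nearest-neighbor distance, global density from a Gaussian potential field, and each point attached to its nearest higher-density neighbor so that the resulting trees are the clusters. The kernel, the blending weight, and the attachment rule are our assumptions, not the paper's.

```python
import numpy as np
from scipy.spatial.distance import cdist

def density(X, k=10, sigma=1.0, alpha=0.5):
    """Blend local (KNN) and global (potential field) density information."""
    D = cdist(X, X)
    knn = np.sort(D, axis=1)[:, k]             # distance to the k-th neighbor
    local = 1.0 / (knn + 1e-12)                # local density ~ inverse KNN radius
    potential = np.exp(-(D / sigma) ** 2).sum(axis=1)  # global potential field
    return alpha * local / local.max() + (1 - alpha) * potential / potential.max()

def build_forest(X, rho):
    """Each point becomes a node whose parent is its nearest higher-density
    neighbor; points with no such neighbor are roots, and each tree is a cluster."""
    D = cdist(X, X)
    parent = np.full(len(X), -1)
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        if len(higher):
            parent[i] = higher[np.argmin(D[i, higher])]
    return parent  # follow parents to a root to read off cluster membership
```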


2005 ◽  
Vol 01 (01) ◽  
pp. 173-193
Author(s):  
HIROSHI MAMITSUKA

We consider the problem of mining from noisy unsupervised data sets. The data points we call noise are outliers in the context of data mining, generally defined as points that lie in low-probability regions of the input space. The purpose of our approach is to detect outliers and to perform efficient mining from noisy unsupervised data. We propose a new iterative sampling approach that uses both model-based clustering and the likelihood assigned to each example by a trained probabilistic model to find data points in such low-probability regions. Our method uses an arbitrary probabilistic model as a component model and alternates between two steps: sampling non-outliers with high likelihoods (computed from previously obtained models) and training the model on the selected examples. In our experiments, we focused on two-mode and co-occurrence data and empirically evaluated the effectiveness of the proposed method against two other methods, using both synthetic and real data sets. From the experiments with synthetic data sets, we found that the performance advantage of our method over the two other methods became more pronounced at higher noise ratios, for both medium- and large-sized data sets. From experiments with a real noisy data set of protein–protein interactions, a typical co-occurrence data set, we further confirmed the ability of our method to detect outliers in a given data set. Extended abstracts of parts of the work presented in this paper have appeared in Refs. 1 and 2.
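
The alternating scheme translates naturally into code: fit a probabilistic model, rank examples by likelihood, keep the top fraction, and refit. The sketch below uses a Gaussian mixture as the component model purely for illustration; the paper's component models for two-mode and co-occurrence data differ, and the retention fraction and iteration count here are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def iterative_sample(X, keep_frac=0.9, n_components=3, n_iter=5, seed=0):
    """Alternate between fitting a probabilistic model and resampling the
    non-outliers (highest-likelihood points) under the current model."""
    idx = np.arange(len(X))
    for _ in range(n_iter):
        gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X[idx])
        ll = gmm.score_samples(X[idx])                    # per-example log-likelihood
        keep = np.argsort(ll)[-int(keep_frac * len(idx)):]
        idx = idx[keep]                                   # discard low-likelihood tail
    outliers = np.setdiff1d(np.arange(len(X)), idx)
    return idx, outliers  # retained (clean) indices and flagged outliers
```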


Geophysics ◽  
2006 ◽  
Vol 71 (5) ◽  
pp. O65-O76 ◽  
Author(s):  
Luiz G. Loures ◽  
Fernando S. Moraes

We develop a Bayesian formulation for joint inference of porosity and clay volume, incorporating multiple data sets, prior information, and rock physics models. The derivation is carried out considering the full uncertainty involved in calculations from unknown hyperparameters required by either rock physics equations (model coefficients) or statistical models (data variances). Eventually, data variances are marginalized in closed form, and the model coefficients are fixed using a calibration procedure. To avoid working with a high-dimension probability density function in the parameter space, our formulation is derived and implemented using a moving window along the data domain. In this way, we compute a collection of 2D posterior distributions for interval porosity and clay volume, corresponding to each position along the window’s path. We test the methodology on both synthetic and real well logs consisting of gamma-ray, neutron, compressional and shear sonic velocity, and density. Tests demonstrate that integrating the relevant pieces of information about porosity and clay volume reduces the uncertainties associated with the estimates. Error analysis of a synthetic data example shows that neutron and density logs provide more information about porosity, whereas gamma-ray logs and velocities provide more information about clay volume. Additionally, we investigate a change in fluid saturation as a source of systematic error in porosity prediction. A real data example, incorporating porosity measurements on core samples, further demonstrates the consistency of our methodology in reducing the uncertainties associated with our final estimates.
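
One way to picture the moving-window computation: with the data variances marginalized under a Jeffreys prior, each log's Gaussian likelihood collapses to a power of its summed squared residuals, and the 2D posterior over porosity and clay volume can be evaluated on a grid for each window position. The linear rock-physics coefficients below are hypothetical placeholders standing in for the calibration step.

```python
import numpy as np

# Hypothetical linear rock-physics predictors: tool response = a + b*phi + c*vclay.
# In practice the coefficients come from the calibration procedure in the text.
COEFF = {"density": (2.65, -1.65, -0.05), "neutron": (0.02, 1.00, 0.30)}

def window_posterior(logs, phi_grid, vclay_grid):
    """Evaluate the 2D posterior of (porosity, clay volume) for one window of
    log samples.  Marginalizing an unknown data variance with a Jeffreys prior
    turns each Gaussian likelihood into (sum of squared residuals)**(-N/2)."""
    P, V = np.meshgrid(phi_grid, vclay_grid, indexing="ij")
    log_post = np.zeros_like(P)
    for tool, d in logs.items():                 # d: samples inside the window
        a, b, c = COEFF[tool]
        pred = a + b * P + c * V                 # model response on the grid
        ssr = sum((di - pred) ** 2 for di in d)  # residuals over the window
        log_post += -0.5 * len(d) * np.log(ssr)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()                     # normalized 2D posterior

# Usage: slide the window along depth, calling window_posterior at each position.
```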


2011 ◽  
Vol 87 ◽  
pp. 101-105
Author(s):  
Wei Li Zhao ◽  
Zhi Guo Zhang ◽  
Zhi Jun Zhang

Ant-based clustering is a heuristic clustering method inspired by the behavior of ants in nature. We revisit these methods in the context of a concrete application and introduce modifications that yield significant improvements in both quality and efficiency. In this paper, we propose a New Information Entropy-based Ant Clustering (NIEAC) algorithm. First, we apply a new information entropy to model agent behaviors such as picking up and dropping objects; the new entropy function yields better-quality clusters than non-entropy functions. Second, we introduce a number of modifications that improve the quality of the clustering solutions generated by the algorithm. Experiments on real and synthetic data sets demonstrate that our algorithm outperforms the classical algorithm in both misclassification error rate and runtime.
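
The abstract does not give the NIEAC entropy function itself, but the pick-up/drop logic it modifies can be sketched generically: an ant measures the Shannon entropy of its local neighborhood and is more likely to pick up objects from mixed neighborhoods and drop them into pure ones. The functional forms below are illustrative assumptions.

```python
import numpy as np

def neighborhood_entropy(labels):
    """Shannon entropy of category labels in an ant's local neighborhood;
    low entropy means the neighborhood is homogeneous."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def p_pick(labels, h_max):
    """An ant is more likely to pick up an object sitting in a mixed
    (high-entropy) neighborhood."""
    return neighborhood_entropy(labels) / h_max

def p_drop(labels, h_max):
    """...and more likely to drop a carried object into a pure one."""
    return 1.0 - neighborhood_entropy(labels) / h_max
```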


2012 ◽  
Vol 241-244 ◽  
pp. 3165-3170 ◽  
Author(s):  
Kyung Mi Lee ◽  
Keon Myung Lee

This paper introduces a new type of problem, the frequent common family subtree mining problem, for a collection of leaf-labeled trees, and presents some characteristics of the problem. It proposes an algorithm to find frequent common families in trees. To demonstrate its applicability, the proposed method has been applied to several synthetic data sets and a real data set.


Geophysics ◽  
2009 ◽  
Vol 74 (1) ◽  
pp. E75-E91 ◽  
Author(s):  
Gong Li Wang ◽  
Carlos Torres-Verdín ◽  
Jesús M. Salazar ◽  
Benjamin Voss

In addition to reliability and stability, the efficiency and expediency of inversion methods have long been a strong concern for their routine application by well-log interpreters. We have developed and successfully validated a new inversion method to estimate 2D parametric spatial distributions of electrical resistivity from array-induction measurements acquired in a vertical well. The central component of the method is an efficient approximation to Fréchet derivatives in which both the incident and adjoint fields are precomputed and kept unchanged during inversion. To further enhance the overall efficiency of the inversion, we combined the new approximation with both the improved numerical mode-matching method and domain decomposition. Examples of application with synthetic data sets show that the new method is computationally efficient and capable of retrieving original model resistivities even in the presence of noise, performing equally well in both high and low contrasts of formation resistivity. In thin resistive beds, the new inversion method estimates more accurate resistivities than standard commercial deconvolution software. We also considered examples of application with field data sets that confirm the new method can successfully process a large data set that includes 200 beds in approximately [Formula: see text] of CPU time on a desktop computer. In addition to 2D parametric spatial distributions of electrical resistivity, the new inversion method provides a qualitative indicator of the uncertainty of estimated parameters based on the estimator’s covariance matrix. The uncertainty estimator provides a qualitative measure of the nonuniqueness of estimated resistivity parameters when the data misfit lies within the measurement error (noise).
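
In a generic least-squares setting, the efficiency device highlighted here (precomputing the incident and adjoint fields so that the Fréchet-derivative matrix stays fixed) looks like Gauss-Newton iteration with a frozen Jacobian. The sketch below assumes a user-supplied forward model and a precomputed Jacobian J; the regularization and iteration count are placeholders, not the paper's choices.

```python
import numpy as np

def invert_fixed_jacobian(forward, J, d_obs, m0, lam=1e-2, n_iter=20):
    """Gauss-Newton iterations in which the Jacobian J (Frechet derivatives)
    is precomputed once and reused, instead of being rebuilt every step."""
    m = m0.copy()
    JtJ = J.T @ J + lam * np.eye(len(m0))      # normal-equations matrix, also fixed
    for _ in range(n_iter):
        r = d_obs - forward(m)                 # residual with the full forward model
        m = m + np.linalg.solve(JtJ, J.T @ r)  # update using the frozen derivatives
    return m
```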


Geophysics ◽  
1999 ◽  
Vol 64 (4) ◽  
pp. 1108-1115 ◽  
Author(s):  
Warren T. Wood

Estimates of the source wavelet and band‐limited earth reflectivity are obtained simultaneously from an optimization of deconvolution outputs, similar to minimum‐entropy deconvolution (MED). The only inputs required beyond the observed seismogram are wavelet length and an inversion parameter (cooling rate). The objective function to be minimized is a measure of the spikiness of the deconvolved seismogram. I assume that the wavelet whose deconvolution from the data results in the most spike‐like trace is the best wavelet estimate. Because this is a highly nonlinear problem, simulated annealing is used to solve it. The procedure yields excellent results on synthetic data and disparate field data sets, is robust in the presence of noise, and is fast enough to operate in a desktop computer environment.
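
A toy version of the procedure can be written in a few lines: deconvolve the trace with a candidate wavelet via water-level spectral division, score the output with a varimax-style spikiness norm, and let simulated annealing perturb the wavelet, with the cooling rate as the single inversion parameter. The perturbation size and schedule below are arbitrary choices, not those of the paper.

```python
import numpy as np

def deconvolve(trace, wavelet, water=1e-3):
    """Frequency-domain deconvolution with a water-level stabilizer."""
    n = len(trace)
    W = np.fft.rfft(wavelet, n)
    D = np.fft.rfft(trace, n)
    denom = np.abs(W) ** 2 + water * np.max(np.abs(W)) ** 2
    return np.fft.irfft(D * np.conj(W) / denom, n)

def spikiness(x):
    """Varimax norm: large when energy concentrates in a few spikes."""
    return np.sum(x ** 4) / np.sum(x ** 2) ** 2

def anneal_wavelet(trace, nw=50, cooling=0.99, n_iter=5000, seed=0):
    """Simulated annealing over wavelet samples, maximizing the spikiness of
    the deconvolved trace (minimizing its negative)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(nw)
    e = -spikiness(deconvolve(trace, w))
    T = 1.0
    for _ in range(n_iter):
        w_new = w + 0.1 * rng.standard_normal(nw)   # random perturbation
        e_new = -spikiness(deconvolve(trace, w_new))
        if e_new < e or rng.random() < np.exp((e - e_new) / T):
            w, e = w_new, e_new                     # Metropolis acceptance
        T *= cooling                                # the "cooling rate" input
    return w
```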


2021 ◽  
Author(s):  
Kerry A Ramsbottom ◽  
Ananth A Prakash ◽  
Yasset Perez-Riverol ◽  
Oscar Martin Camacho ◽  
Maria Martin ◽  
...  

Phosphoproteomics methods are commonly employed in labs to identify and quantify the sites of phosphorylation on proteins. In recent years, various software tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified, or to estimate the global false localisation rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic data sets, and their statistical reliability on real data sets is largely unknown. As a result, there is a considerable problem in the field with the reporting of incorrectly localised phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring and ranking modifications on a decoy amino acid, i.e. one that cannot be modified, to allow independent estimation of the global FLR. We test a variety of amino acids as the decoy, on both synthetic and real data sets, demonstrating that the choice of amino acid can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys, although we have a preference for alanine due to the risk of confusion between leucine and isoleucine. We propose that the phosphoproteomics field should adopt the use of a decoy amino acid, so that there is better control of false reporting in the literature and in public databases that re-distribute the data.
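
Our reading of the decoy idea as a counting estimator: any site reported on a residue that cannot carry the modification is a known false localisation, and scaling the decoy count by the relative frequency of target and decoy residues gives a running estimate of the global FLR down a ranked site list. The formula and frequencies below are assumptions for illustration, not quoted from the paper.

```python
def global_flr(sites, decoy_aa="A", target_aa="STY", aa_freq=None):
    """Estimate the global false localisation rate from decoy-residue hits.
    `sites` is a list of (residue, score) pairs sorted by descending score;
    aa_freq maps residues to their frequency in the searched sequences."""
    aa_freq = aa_freq or {"S": 0.083, "T": 0.053, "Y": 0.029, "A": 0.082}
    # How many target residues does one decoy residue stand in for?
    ratio = sum(aa_freq[a] for a in target_aa) / aa_freq[decoy_aa]
    flr = []
    n_decoy = n_target = 0
    for residue, _ in sites:
        if residue == decoy_aa:
            n_decoy += 1
        else:
            n_target += 1
        flr.append(min(1.0, ratio * n_decoy / max(n_target, 1)))
    return flr  # running global FLR down the ranked list
```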


2017 ◽  
Vol 43 (3) ◽  
pp. 567-592 ◽  
Author(s):  
Dong Nguyen ◽  
Jacob Eisenstein

Quantifying the degree of spatial dependence for linguistic variables is a key task in analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: some approaches apply only to frequencies, others only to Boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation that solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics and takes the form of a test statistic computed from pairs of individual geotagged observations, without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
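
The flavor of the RKHS test, though not its exact form, can be conveyed with an HSIC-style permutation test between a linguistic variable and raw geotagged coordinates, with no geographic binning. The kernel choices and bandwidths below are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_gram(X, sigma):
    return np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))

def hsic(K, L):
    """Hilbert-Schmidt Independence Criterion from two Gram matrices."""
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def dependence_test(ling, coords, sigma_l=1.0, sigma_g=1.0, n_perm=1000, seed=0):
    """Permutation p-value for spatial dependence of a linguistic variable
    (numpy array), computed on raw geotagged observations, no binning."""
    rng = np.random.default_rng(seed)
    K = rbf_gram(np.atleast_2d(ling).T if ling.ndim == 1 else ling, sigma_l)
    L = rbf_gram(coords, sigma_g)
    stat = hsic(K, L)
    perms = [hsic(K[np.ix_(p, p)], L)
             for p in (rng.permutation(len(L)) for _ in range(n_perm))]
    return stat, np.mean([s >= stat for s in perms])  # statistic and p-value
```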


Author(s):  
Bindi Kindermann ◽  
James Chipperfield ◽  
Noel Hansen ◽  
Peter Rossiter ◽  
Jeffrey Wright

Abstract
Objectives: Various organisations are increasingly linking administrative, survey, and census data to enhance dimensions such as time and breadth or depth of detail. Because a unique person identifier is often not available, records belonging to two different people may be incorrectly linked. Estimating the proportion of links that are correct, called precision, is difficult because, even after clerical review, some uncertainty remains about whether a link is in fact correct or incorrect. This presentation proposes methods for estimating precision when using either deterministic (rules-based) or probabilistic linkage. These methods are model-based and do not require clerical review. Their main uses are to estimate: 1. Precision during the linking process, which is useful for refining how linkage is carried out, such as the choice of linking variables and weight thresholds. 2. Precision after the files are linked, which provides a useful "quality indicator" of the linked data.
Approach: Two methods of estimating precision are described: 1. Simulation: the linking process, whether probabilistic or deterministic, is simulated many times; the key step is the simulation of the agreement pattern between data sets, based on underlying probabilities. 2. An algebraic estimator: this is applicable to deterministic linking only and provides a quicker way of estimating precision. Both methods are investigated using two studies: (i) synthetic data and (ii) real data (death registrations linked to census data).
Results: The estimators perform very well on both the synthetic and real data, even when assumptions about the independence of linking variables are violated. This suggests that the estimators are robust against moderate violations of these assumptions.
Conclusion: The proposed estimators of precision are a very useful addition to the record linkage tool kit, providing methodical, faster, and cheaper alternatives to many present strategies that rely on clerical review. Estimates of precision are useful in the planning, process, and analysis of record linkage activities.
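
The simulation method admits a compact sketch: draw agreement patterns for true matches and non-matches from assumed m- and u-probabilities, apply the deterministic linking rule to each, and read precision off as the fraction of accepted links that are true matches. All probabilities and the example rule below are illustrative.

```python
import numpy as np

# Illustrative m/u probabilities: chance each linking variable agrees for a
# true match (M) versus a non-match (U).
M = np.array([0.95, 0.90, 0.85])   # e.g. surname, birth date, postcode
U = np.array([0.01, 0.003, 0.05])

def simulate_precision(rule, n_match=10_000, n_nonmatch=1_000_000, seed=0):
    """Monte Carlo estimate of precision for a deterministic linking rule:
    simulate agreement patterns, count accepted links that are true matches."""
    rng = np.random.default_rng(seed)
    matches = rng.random((n_match, len(M))) < M          # agreement patterns
    nonmatches = rng.random((n_nonmatch, len(U))) < U
    tp = rule(matches).sum()       # true matches the rule links
    fp = rule(nonmatches).sum()    # non-matches it links by coincidence
    return tp / (tp + fp)

# Example rule: link when all three variables agree.
print(simulate_precision(lambda a: a.all(axis=1)))
```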

