Modeling the Process of Event Sequence Data Generated for Working Condition Diagnosis

A likelihood ratio test for species membership based on DNA sequence data

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2005.1728 ◽

2005 ◽

Vol 360 (1462) ◽

pp. 1969-1974 ◽

Cited By ~ 68

Author(s):

Mikhail V Matz ◽

Rasmus Nielsen

Keyword(s):

Dna Barcoding ◽

Likelihood Ratio ◽

Likelihood Ratio Test ◽

Sequence Data ◽

A Priori ◽

Real Data ◽

Ratio Test ◽

Sequence Variability ◽

Dna Sequence Data ◽

Measure Of Uncertainty

DNA barcoding as an approach for species identification is rapidly increasing in popularity. However, it remains unclear which statistical procedures should accompany the technique to provide a measure of uncertainty. Here we describe a likelihood ratio test which can be used to test if a sampled sequence is a member of an a priori specified species. We investigate the performance of the test using coalescence simulations, as well as using the real data from butterflies and frogs representing two kinds of challenge for DNA barcoding: extremely low and extremely high levels of sequence variability.

The APP procedure for estimating the Cohen's effect size

Asian Journal of Economics and Banking ◽

10.1108/ajeb-08-2021-0095 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Xiangfei Chen ◽

David Trafimow ◽

Tonghui Wang ◽

Tingting Tong ◽

Cong Wang

Keyword(s):

Computer Simulations ◽

Effect Size ◽

A Priori ◽

Real Data ◽

Computer Programs ◽

Data Sets ◽

Population Parameter ◽

Sample Sizes ◽

Content Type ◽

User Friendly

Purpose: The authors derive the necessary mathematics, provide computer simulations, provide links to free and user-friendly computer programs, and analyze real data sets. Design/methodology/approach: Cohen's d, which indexes the difference in means in standard deviation units, is the most popular effect size measure in the social sciences and economics. Not surprisingly, researchers have developed statistical procedures for estimating sample sizes needed to have a desirable probability of rejecting the null hypothesis given assumed values for Cohen's d, or for estimating sample sizes needed to have a desirable probability of obtaining a confidence interval of a specified width. However, for researchers interested in using the sample Cohen's d to estimate the population value, these are insufficient. Therefore, it would be useful to have a procedure for obtaining the sample sizes needed to be confident that the sample Cohen's d to be obtained is close to the population parameter the researcher wishes to estimate, an expansion of the a priori procedure (APP). The authors derive the necessary mathematics, provide computer simulations and links to free and user-friendly computer programs, and analyze real data sets to illustrate the main results. Findings: In this paper, the authors answer the following two questions. The precision question: How close do I want my sample Cohen's d to be to the population value? The confidence question: What probability do I want to have of being within the specified distance? Originality/value: To the best of the authors' knowledge, this is the first paper to estimate Cohen's effect size using the APP method. It is convenient for researchers and practitioners to use the online computing packages.
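
For reference, the sample Cohen's d mentioned in this abstract (a difference in means expressed in standard deviation units) can be computed as in the following minimal Python sketch. It assumes two independent groups and the pooled standard deviation, which is one common convention and not necessarily the exact estimator used by the authors; the function name `cohens_d` and the sample values are illustrative only. It does not implement the APP sample-size calculation itself.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Sample Cohen's d with a pooled standard deviation (assumed convention)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the unbiased sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical data: both groups have variance 2.5 and the means differ by 2.
treated = [5.0, 6.0, 7.0, 8.0, 9.0]
control = [3.0, 4.0, 5.0, 6.0, 7.0]
print(round(cohens_d(treated, control), 3))  # about 1.265
```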

No-U-Turn sampling for phylogenetic trees

10.1101/2021.03.16.435623 ◽

2021 ◽

Author(s):

Johannes Wahle

Keyword(s):

Monte Carlo ◽

Phylogenetic Trees ◽

Sequence Data ◽

Real Data ◽

Correlated Data ◽

Data Sets ◽

Acceptance Probability ◽

New States ◽

Gradient Based ◽

Efficient Exploration

The inference of phylogenetic trees from sequence data has become a staple in evolutionary research. Bayesian inference of such trees is predominantly based on the Metropolis-Hastings algorithm. For high-dimensional and correlated data this algorithm is known to be inefficient. There are gradient-based algorithms to speed up such inference. Building on recent research which uses gradient-based approaches for the inference of phylogenetic trees in a Bayesian framework, I present an algorithm which is capable of performing No-U-Turn sampling for phylogenetic trees. As an extension to Hamiltonian Monte Carlo methods, No-U-Turn sampling comes with the same benefits, such as proposing distant new states with a high acceptance probability, but eliminates the need to manually tune hyperparameters. Evaluated on real data sets, the new sampler converges faster to the target distribution. The results also indicate that a higher number of topologies are traversed during sampling by the new algorithm in comparison to traditional Markov Chain Monte Carlo approaches. This new algorithm leads to a more efficient exploration of the posterior distribution of phylogenetic tree topologies.
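
To make the Hamiltonian Monte Carlo idea behind No-U-Turn sampling concrete, the sketch below runs a leapfrog trajectory on a one-dimensional standard normal target and computes the Metropolis acceptance probability. This is a generic textbook illustration, not the tree sampler presented in the preprint: the target, the function names, the step size and the fixed number of steps are all assumptions. No-U-Turn sampling replaces the fixed `n_steps` with an adaptive stopping rule, which is not shown here.

```python
import math
import random


def grad_potential(q):
    # Assumed target: standard normal, so U(q) = q^2 / 2 and dU/dq = q.
    return q


def hamiltonian(q, p):
    # Potential energy plus kinetic energy with unit mass.
    return 0.5 * q * q + 0.5 * p * p


def leapfrog(q, p, step, n_steps):
    """Kick-drift-kick leapfrog integration of Hamilton's equations."""
    p -= 0.5 * step * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q += step * p                        # full step for position
        if i < n_steps - 1:
            p -= step * grad_potential(q)    # full step for momentum
    p -= 0.5 * step * grad_potential(q)      # final half step for momentum
    return q, p


def hmc_step(q, step, n_steps, rng):
    """One HMC transition; returns the new state and the acceptance probability."""
    p = rng.gauss(0.0, 1.0)                  # resample the momentum
    q_new, p_new = leapfrog(q, p, step, n_steps)
    # Accept with probability min(1, exp(H_old - H_new)).
    accept_prob = math.exp(min(0.0, hamiltonian(q, p) - hamiltonian(q_new, p_new)))
    if rng.random() < accept_prob:
        return q_new, accept_prob
    return q, accept_prob


rng = random.Random(0)
state = 0.0
for _ in range(5):
    state, prob = hmc_step(state, step=0.1, n_steps=10, rng=rng)
    print(state, prob)
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the acceptance probability stays close to one even though each proposal moves far from the current state.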

A Learning-Based EM Clustering for Circular Data with Unknown Number of Clusters

Proceedings of Engineering and Technology Innovation ◽

10.46604/peti.2020.5241 ◽

2020 ◽

Vol 15 ◽

pp. 42-51

Author(s):

Shou-Jen Chang-Chien ◽

Wajid Ali ◽

Miin-Shen Yang

Keyword(s):

Em Algorithm ◽

A Priori ◽

Real Data ◽

Circular Data ◽

Data Sets ◽

Number Of Clusters ◽

The Em Algorithm ◽

Von Mises ◽

Migrating Birds ◽

Em Clustering

Clustering is a method for analyzing grouped data. Circular data arise in various applications, such as wind directions and the departure directions of migrating birds or animals. The expectation-maximization (EM) algorithm on mixtures of von Mises distributions is popularly used for clustering circular data. In general, the EM algorithm is sensitive to initial values and not robust to outliers, and it also requires the number of clusters to be given a priori. In this paper, we consider a learning-based schema for EM, and then propose a learning-based EM algorithm on mixtures of von Mises distributions for clustering grouped circular data. The proposed clustering method requires no initialization, is robust to outliers, and finds the number of clusters automatically. Some numerical and real data sets are used to compare the proposed algorithm with existing methods. Experimental results and comparisons demonstrate the effectiveness and superiority of the proposed learning-based EM algorithm.
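
As a point of comparison, the standard EM algorithm on a mixture of von Mises distributions (the baseline this abstract contrasts with, not the proposed learning-based variant) might be sketched as follows. The sketch assumes a fixed number of clusters and user-supplied starting values, which are exactly the requirements the paper aims to remove. The concentration update uses a common closed-form approximation rather than an exact solve, and the cap on the mean resultant length is an added numerical safeguard; all names are illustrative.

```python
import math


def bessel_i0(kappa, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (kappa / (2.0 * k)) ** 2
        total += term
    return total


def von_mises_pdf(theta, mu, kappa):
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def em_von_mises(angles, mus, kappas, weights, iterations=50):
    """Standard EM for a von Mises mixture with a fixed number of components."""
    n_comp = len(mus)
    for _ in range(iterations):
        # E-step: responsibility of each component for each angle.
        resp = []
        for theta in angles:
            dens = [weights[j] * von_mises_pdf(theta, mus[j], kappas[j])
                    for j in range(n_comp)]
            norm = sum(dens)
            resp.append([d / norm for d in dens])
        # M-step: weighted circular mean, concentration and mixing weight.
        new_mus, new_kappas, new_weights = [], [], []
        for j in range(n_comp):
            r_sum = max(sum(r[j] for r in resp), 1e-12)
            c = sum(r[j] * math.cos(t) for r, t in zip(resp, angles))
            s = sum(r[j] * math.sin(t) for r, t in zip(resp, angles))
            # Cap the mean resultant length so kappa stays finite (assumption).
            r_bar = min(math.hypot(c, s) / r_sum, 0.99)
            new_mus.append(math.atan2(s, c))
            # Closed-form approximation to the concentration parameter.
            new_kappas.append(r_bar * (2.0 - r_bar ** 2) / (1.0 - r_bar ** 2))
            new_weights.append(r_sum / len(angles))
        mus, kappas, weights = new_mus, new_kappas, new_weights
    return mus, kappas, weights


# Hypothetical angles (radians) forming two groups near 0.5 and 3.0.
data = [0.3, 0.4, 0.5, 0.6, 0.7, 2.8, 2.9, 3.0, 3.1, 3.2]
print(em_von_mises(data, mus=[0.0, 2.5], kappas=[1.0, 1.0], weights=[0.5, 0.5]))
```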

Pointwise Mutual Information based Graph Laplacian Regularized Sparse Unmixing

10.36227/techrxiv.16831330 ◽

2021 ◽

Author(s):

Sefa Kucuk ◽

Seniha Esen Yuksel

Keyword(s):

Mutual Information ◽

Contextual Information ◽

A Priori ◽

Real Data ◽

Graph Laplacian ◽

Image Features ◽

Data Sets ◽

Statistical Framework ◽

Laplacian Regularization ◽

Pointwise Mutual Information

Sparse unmixing (SU) aims to express the observed image signatures as a linear combination of pure spectra known a priori and has become a very popular technique with promising results in analyzing hyperspectral images (HSI) over the past ten years. In SU, utilizing the spatial-contextual information allows for more realistic abundance estimation. To make full use of the spatial-spectral information, in this letter, we propose a pointwise mutual information (PMI) based graph Laplacian regularization for SU. Specifically, we construct the affinity matrices via PMI by modeling the association between neighboring image features through a statistical framework, and then we use them in the graph Laplacian regularizer. We also adopt a double reweighted $\ell_{1}$ norm minimization scheme to promote the sparsity of fractional abundances. Experimental results on simulated and real data sets prove the effectiveness of the proposed method and its superiority over competing algorithms in the literature.
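
The pointwise mutual information underlying the proposed regularizer has a simple definition, PMI(x, y) = log(p(x, y) / (p(x) p(y))), which the sketch below estimates from co-occurrence counts. How the authors turn PMI values into affinity matrices is not described in this abstract, so the code stops at the PMI table. The input format (a list of label pairs for neighboring pixels) and the function name `pmi_table` are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Estimate PMI(x, y) = log(p(x, y) / (p(x) * p(y))) for every observed pair."""
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)
    right_counts = Counter(y for _, y in pairs)
    total = len(pairs)
    table = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total
        p_x = left_counts[x] / total
        p_y = right_counts[y] / total
        table[(x, y)] = math.log(p_xy / (p_x * p_y))
    return table


# Hypothetical feature labels of neighboring pixel pairs.
neighbors = [("a", "a"), ("a", "a"), ("b", "b"), ("b", "b")]
print(pmi_table(neighbors))
```

A positive value means the two features co-occur more often than they would under independence, and a value of zero means they co-occur exactly as often as independence predicts.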

Transforming variables to central normality

Machine Learning ◽

10.1007/s10994-021-05960-5 ◽

2021 ◽

Author(s):

Jakob Raymaekers ◽

Peter J. Rousseeuw

Keyword(s):

Maximum Likelihood ◽

Maximum Likelihood Estimator ◽

Simulation Study ◽

Real Data ◽

Data Sets ◽

Transformation Parameter ◽

Likelihood Estimator ◽

Extensive Simulation ◽

Highly Sensitive

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
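
The two classical transformations named in this abstract can be written in a few lines. The sketch below gives the standard Box–Cox and Yeo–Johnson definitions for a single value and a given transformation parameter `lam`. It does not include the robust modification or the robust estimator proposed in the paper, and the function names are illustrative.

```python
import math


def box_cox(x, lam):
    """Classical Box-Cox transform, defined only for strictly positive x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Classical Yeo-Johnson transform, defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative branch: mirrors the positive branch with parameter 2 - lam.
    if lam == 2:
        return -math.log1p(-x)
    return -(((1.0 - x) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# A right-skewed hypothetical sample; lam = 0 is the log transform for Box-Cox.
sample = [1.0, 2.0, 4.0, 8.0, 64.0]
print([round(box_cox(v, 0.0), 3) for v in sample])
print([round(yeo_johnson(v, 0.0), 3) for v in sample])
```

With `lam = 1` Box–Cox only shifts the data by one and Yeo–Johnson leaves it unchanged, so values of `lam` away from one are what actually reshape the distribution.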

A New Extension of Thinning-Based Integer-Valued Autoregressive Models for Count Data

Entropy ◽

10.3390/e23010062 ◽

2020 ◽

Vol 23 (1) ◽

pp. 62

Author(s):

Zhengwei Liu ◽

Fukang Zhu

Keyword(s):

Likelihood Estimation ◽

Real Data ◽

Autoregressive Models ◽

Superior Performance ◽

Data Sets ◽

Binomial Thinning ◽

Free Case ◽

Two Parameters ◽

Conditional Maximum ◽

Thinning Operator

Thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, a new thinning operator named extended binomial is introduced, which is a general case of binomial thinning. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersed features of counting time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case and the conditional maximum likelihood estimation is also discussed. We have also obtained the asymptotic property of the two-step CLS estimator. Finally, three overdispersed or underdispersed real data sets are considered to illustrate the superior performance of the proposed model.
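
For orientation, the classical binomial thinning operator and the first-order integer-valued autoregressive model built on it, X_t = alpha ∘ X_{t-1} + e_t, can be simulated as in the sketch below. This illustrates only the baseline operator; the extended binomial operator is not specified in this abstract and is not attempted here. The Poisson innovations, the parameter values and all function names are assumptions.

```python
import math
import random


def binomial_thinning(x, alpha, rng):
    """alpha ∘ x: keep each of the x units independently with probability alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson_draw(lam, rng):
    """Poisson(lam) draw by multiplying uniforms (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_inar1(n, alpha, lam, x0=0, seed=0):
    """Simulate n steps of X_t = alpha ∘ X_{t-1} + e_t with e_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    series, x = [], x0
    for _ in range(n):
        x = binomial_thinning(x, alpha, rng) + poisson_draw(lam, rng)
        series.append(x)
    return series


counts = simulate_inar1(n=20, alpha=0.5, lam=2.0)
print(counts)
```

Thinning keeps the series integer-valued, which is why it stands in for the scalar multiplication used in ordinary autoregressive models.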

Goodness-of-Fit Tests for Bivariate Time Series of Counts

Econometrics ◽

10.3390/econometrics9010010 ◽

2021 ◽

Vol 9 (1) ◽

pp. 10

Author(s):

Šárka Hudecová ◽

Marie Hušková ◽

Simos G. Meintanis

Keyword(s):

Goodness Of Fit ◽

Probability Generating Function ◽

Parametric Bootstrap ◽

Real Data ◽

Data Sets ◽

Test Statistics ◽

Finite Sample ◽

Generalized Poisson ◽

Goodness Of Fit Tests ◽

Monte Carlo Experiments

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one being entirely nonparametric and the second one being semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics under the null hypotheses as well as under alternatives is derived, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and extension of the methods to dimension higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications on real data sets and discussion.
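
The idea of comparing a nonparametric and a model-based estimate of the probability generating function (PGF) can be illustrated in a much simpler setting than the one treated in the article. The sketch below assumes a univariate, independent sample and a Poisson null hypothesis, uses an unweighted integral over [0, 1], and approximates it with a midpoint rule. These are all simplifying assumptions: the article's bivariate time-series statistics and any weight function they use are not reproduced, and the function names are illustrative.

```python
import math


def empirical_pgf(sample, u):
    """Nonparametric PGF estimate: the sample average of u ** X_i."""
    return sum(u ** x for x in sample) / len(sample)


def poisson_pgf(lam, u):
    """PGF of a Poisson(lam) variable."""
    return math.exp(lam * (u - 1.0))


def l2_pgf_distance(sample, grid_size=1000):
    """Midpoint-rule approximation of the integral over [0, 1] of the squared
    difference between the empirical PGF and the fitted Poisson PGF."""
    lam_hat = sum(sample) / len(sample)   # Poisson rate fitted by the sample mean
    step = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        u = (i + 0.5) * step
        diff = empirical_pgf(sample, u) - poisson_pgf(lam_hat, u)
        total += diff * diff * step
    return total


# Hypothetical count data.
counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]
print(l2_pgf_distance(counts))
```

In practice the distance would be calibrated against its null distribution, for example with the parametric bootstrap mentioned in the abstract, which this sketch leaves out.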

mtDNAcombine: tools to combine sequences from multiple studies

BMC Bioinformatics ◽

10.1186/s12859-021-04048-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Eleanor F. Miller ◽

Andrea Manica

Keyword(s):

Sequence Data ◽

Data Extraction ◽

Bayesian Skyline Plot ◽

Model Organisms ◽

Data Sets ◽

Data Handling ◽

Online Database ◽

Genetic Studies ◽

Wide Range ◽

Existing Data

Background: Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling. Results: Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions: There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
