A nearest neighbour approach by genic distance to the assignment of individuals to geographic origin

Mapping Intimacies ◽

10.1101/087833 ◽

2016 ◽

Author(s):

Bernd Degen ◽

Céline Blanc-Jolivet ◽

Katrin Stierand ◽

Elizabeth Gillet

Keyword(s):

Isolation By Distance ◽

Geographic Origin ◽

Simulated Data ◽

Data Sets ◽

Nearest Neighbour ◽

Bayesian Approaches ◽

Data Set ◽

Assignment Method ◽

Assignment Analysis ◽

Reference Samples

Abstract: During the past decade, the use of DNA for forensic applications has been extensively implemented for plant and animal species, as well as in humans. Tracing back the geographical origin of an individual usually requires genetic assignment analysis. These approaches are based on reference samples that are grouped into populations or other aggregates and intend to identify the most likely group of origin. Often this grouping does not have a biological but rather a historical or political justification, such as “country of origin”. In this paper, we present a new nearest neighbour approach to individual assignment or classification within a given but potentially imperfect grouping of reference samples. This method, which is based on the genic distance between individuals, functions better in many cases than commonly used methods. We demonstrate the operation of our assignment method using two data sets. One set is simulated for a large number of trees distributed in a 120 km by 120 km landscape with individual genotypes at 150 SNPs, and the other set comprises experimental data of 1221 individuals of the African tropical tree species Entandrophragma cylindricum (Sapelli) genotyped at 61 SNPs. Judging by the level of correct self-assignment, our approach outperformed the commonly used frequency and Bayesian approaches by 15% for the simulated data set and by 5 to 7% for the Sapelli data set. Our new approach is less sensitive to overlapping sources of genetic differentiation, such as genic differences among closely-related species, phylogeographic lineages and isolation by distance, and thus operates better even for suboptimal grouping of individuals.
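
As a rough illustration of the nearest-neighbour idea in the abstract above, the Python sketch below assigns a query individual to the group that is most common among its closest reference individuals. Everything specific here is an assumption made for illustration and is not taken from the paper: genotypes are coded as alternate-allele counts (0, 1 or 2) per SNP, the distance is the mean per-locus allele-count difference, the vote is a simple majority over `k` neighbours, and the names `genic_distance`, `assign` and the toy reference set are hypothetical. The authors' own genic distance and assignment rule may differ.

```python
from collections import Counter


def genic_distance(a, b):
    """Mean per-locus allele-count difference between two diploid genotypes.

    Assumed coding: each locus holds the count (0, 1 or 2) of one allele,
    so the result lies between 0 (identical) and 1 (no shared alleles).
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / (2 * len(a))


def assign(query, references, k=3):
    """Return the majority group among the k reference individuals nearest to query."""
    ranked = sorted(references, key=lambda ref: genic_distance(query, ref[0]))
    votes = Counter(group for _, group in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy reference samples: (genotype at 4 SNPs, group label). Purely illustrative.
references = [
    ((0, 0, 1, 2), "north"),
    ((0, 1, 1, 2), "north"),
    ((0, 0, 2, 2), "north"),
    ((2, 2, 0, 0), "south"),
    ((2, 1, 0, 0), "south"),
    ((1, 2, 0, 1), "south"),
]

print(assign((0, 1, 2, 2), references))
```

Because the vote depends only on individual-to-individual distances, a query can still land in a sensible group when the predefined groups are imperfect, which is the property the abstract emphasises.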


Variational inference using approximate likelihood under the coalescent with recombination

Genome Research ◽

10.1101/gr.273631.120 ◽

2021 ◽

pp. gr.273631.120

Author(s):

Xinhao Liu ◽

Huw A Ogilvie ◽

Luay Nakhleh

Keyword(s):

Simulated Data ◽

Variational Inference ◽

Divide And Conquer ◽

Data Sets ◽

Transition Rates ◽

Data Set ◽

Population Sizes ◽

Novel Method ◽

Approximate Likelihood ◽

Promising Avenue

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is the use of coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that the accuracy of our method is comparable to or better than that of previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation it is flexible enough to enable future implementations of all kinds of population models.


Uncertainty-Aware Deep Classifiers Using Generative Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6015 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5620-5627 ◽

Cited By ~ 1

Author(s):

Murat Sensoy ◽

Lance Kaplan ◽

Federico Cerutti ◽

Maryam Saleki

Keyword(s):

Neural Networks ◽

Epistemic Uncertainty ◽

Feature Space ◽

Generative Models ◽

Detection Methods ◽

Generative Adversarial Networks ◽

Data Sets ◽

Bayesian Approaches ◽

Data Set ◽

Auxiliary Data

Deep neural networks are often ignorant about what they do not know and overconfident when they make uninformed predictions. Some recent approaches quantify classification uncertainty directly by training the model to output high uncertainty for the data samples close to class boundaries or from the outside of the training distribution. These approaches use an auxiliary data set during training to represent out-of-distribution samples. However, selection or creation of such an auxiliary data set is non-trivial, especially for high dimensional data such as images. In this work we develop a novel neural network model that is able to express both aleatoric and epistemic uncertainty to distinguish decision boundary and out-of-distribution regions of the feature space. To this end, variational autoencoders and generative adversarial networks are incorporated to automatically generate out-of-distribution exemplars for training. Through extensive analysis, we demonstrate that the proposed approach provides better estimates of uncertainty for in- and out-of-distribution samples, and adversarial examples on well-known data sets against state-of-the-art approaches including recent Bayesian approaches for neural networks and anomaly detection methods.


Feature Scaling via Second-Order Cone Programming

Mathematical Problems in Engineering ◽

10.1155/2016/7347986 ◽

2016 ◽

Vol 2016 ◽

pp. 1-7

Author(s):

Zhizheng Liang

Keyword(s):

Simulated Data ◽

Performance Measure ◽

Second Order ◽

Data Sets ◽

Scaling Factors ◽

Data Set ◽

Cone Programming ◽

Second Order Cone Programming ◽

Second Order Cone ◽

Feature Scaling

Feature scaling has attracted considerable attention during the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature of data and then adopts a generalized performance measure to learn the optimal scaling factors. It is of interest to note that the proposed model can be transformed into a convex optimization problem: second-order cone programming (SOCP). Thus the scaling factors of features in our method are globally optimal in some sense. Several experiments on simulated data, UCI data sets, and the gene data set are conducted to demonstrate that the proposed method is more effective than previous methods.


A Growth Model for Multilevel Ordinal Data

Journal of Educational and Behavioral Statistics ◽

10.3102/10769986030004369 ◽

2005 ◽

Vol 30 (4) ◽

pp. 369-396 ◽

Cited By ~ 8

Author(s):

Eisuke Segawa

Keyword(s):

Latent Variable ◽

Ordinal Data ◽

Linear Models ◽

Growth Models ◽

Simulated Data ◽

Real Data ◽

Analytic Structure ◽

Data Sets ◽

Data Set ◽

Time Points

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within a time-point, and time-points are nested within subject. These models are special because they include factor analytic structure. They can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses, namely an “autoregressive error degree one” structure for the trait residuals and estimated time-scores, were included. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.


MSIGNET: a Metropolis sampling-based method for global optimal significant network identification

10.1101/260844 ◽

2018 ◽

Cited By ~ 2

Author(s):

Xi Chen ◽

Jianhua Xuan

Keyword(s):

Cancer Recurrence ◽

Simulated Data ◽

Superior Performance ◽

Biological Knowledge ◽

Specific Gene ◽

Data Sets ◽

Data Set ◽

Novel Approach ◽

Network Identification ◽

Global Optimal

Abstract: In this paper, we propose a novel approach, MSIGNET, to identify subnetworks with significantly expressed genes by integrating context-specific gene expression and protein-protein interaction (PPI) data. Specifically, we integrate differential expression of each gene and mutual information of gene pairs in a Bayesian framework and use Metropolis sampling to identify functional interactions. During the sampling process, a conditional probability is calculated given a randomly selected gene to control the network state transition. Our method provides global statistics of all genes and their interactions, and finally achieves a globally optimal subnetwork. We apply MSIGNET to simulated data and demonstrate its superior performance over comparable network identification tools. Using a validated Parkinson data set, we show that the network identified using MSIGNET is consistent with previously reported results but provides a more biologically meaningful interpretation of Parkinson’s disease. Finally, to study networks related to ovarian cancer recurrence, we investigate two patient data sets. Networks identified from independent data sets show functional consistency, and the common genes and interactions are well supported by current biological knowledge.


Comparison of silhouette-based reallocation methods for vegetation classification

10.1101/630384 ◽

2019 ◽

Cited By ~ 1

Author(s):

Attila Lengyel ◽

David W. Roberts ◽

Zoltán Botta-Dukát

Keyword(s):

Simulated Data ◽

Primary Objective ◽

Vegetation Classification ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Silhouette Width ◽

Diagnostic Species ◽

Order Of Magnitude ◽

Initial Classification

Abstract

Aims: To introduce REMOS, a new iterative reallocation method (with two variants) for vegetation classification, and to compare its performance with OPTSIL. We test (1) how effectively REMOS and OPTSIL maximize mean silhouette width and minimize the number of negative silhouette widths when run on classifications with different structure; (2) how these three methods differ in runtime with different sample sizes; and (3) if classifications by the three reallocation methods differ in the number of diagnostic species, a surrogate for interpretability.

Study area: Simulation; example data sets from grasslands in Hungary and forests in Wyoming and Utah, USA.

Methods: We classified random subsets of simulated data with the flexible-beta algorithm for different values of beta. These classifications were subsequently optimized by REMOS and OPTSIL and compared for mean silhouette widths and proportion of negative silhouette widths. Then, we classified three vegetation data sets of different sizes from two to ten clusters, optimized them with the reallocation methods, and compared their runtimes, mean silhouette widths, numbers of negative silhouette widths, and the number of diagnostic species.

Results: In terms of mean silhouette width, OPTSIL performed the best when the initial classifications already had high mean silhouette width. REMOS algorithms had slightly lower mean silhouette width than what was maximally achievable with OPTSIL but their efficiency was consistent across different initial classifications; thus REMOS was significantly superior to OPTSIL when the initial classification had low mean silhouette width. REMOS resulted in zero or a negligible number of negative silhouette widths across all classifications. OPTSIL performed similarly when the initial classification was effective but could not reach as low a proportion of misclassified objects when the initial classification was inefficient. REMOS algorithms were typically more than an order of magnitude faster to calculate than OPTSIL. There was no clear difference between REMOS and OPTSIL in the number of diagnostic species.

Conclusions: REMOS algorithms may be preferable to OPTSIL when (1) the primary objective is to reduce or eliminate negative silhouette widths in a classification, (2) the initial classification has low mean silhouette width, or (3) when the time efficiency of the algorithm is important because of the size of the data set or the high number of clusters.
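
The two quantities compared in this abstract, mean silhouette width and the number of negative silhouette widths, can be computed from a distance matrix and a vector of cluster labels. The Python sketch below uses the standard silhouette definition, s(i) = (b - a) / max(a, b), where a is the mean distance from object i to the other members of its own cluster and b is the smallest mean distance to any other cluster. It is not an implementation of REMOS or OPTSIL; the function name `silhouette_widths`, the convention of giving singleton clusters a width of 0, and the one-dimensional toy data are assumptions for illustration.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each object, given a distance matrix and cluster labels.

    Assumes at least two clusters; objects in singleton clusters get width 0.
    """
    n = len(labels)
    clusters = sorted(set(labels))

    def mean_dist(i, cluster):
        # Mean distance from object i to the members of `cluster`, excluding i itself.
        members = [j for j in range(n) if labels[j] == cluster and j != i]
        if not members:
            return None
        return sum(dist[i][j] for j in members) / len(members)

    widths = []
    for i in range(n):
        a = mean_dist(i, labels[i])
        if a is None:  # i is alone in its cluster
            widths.append(0.0)
            continue
        b = min(mean_dist(i, c) for c in clusters if c != labels[i])
        scale = max(a, b)
        widths.append((b - a) / scale if scale > 0 else 0.0)
    return widths


# Toy data: six points on a line; the last one (3) is labelled "B" but sits near "A".
points = [0, 1, 2, 10, 11, 3]
labels = ["A", "A", "A", "B", "B", "B"]
dist = [[abs(p - q) for q in points] for p in points]

widths = silhouette_widths(dist, labels)
mean_width = sum(widths) / len(widths)
negatives = [i for i, w in enumerate(widths) if w < 0]
print(mean_width, negatives)
```

An object with a negative width lies, on average, closer to another cluster than to its own, which is why reallocation methods treat such objects as candidates for reassignment.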


Model-free detection of unique events in time series

Scientific Reports ◽

10.1038/s41598-021-03526-y ◽

2022 ◽

Vol 12 (1) ◽

Author(s):

Zsigmond Benkő ◽

Tamás Bábel ◽

Zoltán Somogyvári

Keyword(s):

Simulated Data ◽

Normal Activity ◽

Detection Algorithm ◽

Superior Performance ◽

Data Series ◽

Data Sets ◽

Data Set ◽

Detection Algorithms ◽

Model Free ◽

Unique Event

Abstract: Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called “unicorn” or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF) to measure the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall out from the distribution of normal activity. The performance of our algorithm was examined in recognizing unique events on different types of simulated data sets with anomalies and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord detection algorithms even in recognizing traditional outliers and it also detected unique events that those did not. The benefits of the unicorn concept and the new detection method were illustrated by example data sets from very different scientific fields. Our algorithm successfully retrieved unique events in those cases where they were already known such as the gravitational waves of a binary black hole merger on LIGO detector data and the signs of respiratory failure on ECG data series. Furthermore, unique events were found on the LIBOR data set of the last 30 years.


Personalized Dynamic Pricing with Machine Learning: High-Dimensional Features and Heterogeneous Elasticity

Management Science ◽

10.1287/mnsc.2020.3680 ◽

2021 ◽

Author(s):

Gah-Yi Ban ◽

N. Bora Keskin

Keyword(s):

Dynamic Pricing ◽

Simulated Data ◽

The United States ◽

Model Parameters ◽

Data Sets ◽

Demand Model ◽

Pricing Policy ◽

Data Set ◽

Customized Pricing ◽

The Individual

We consider a seller who can dynamically adjust the price of a product at the individual customer level, by utilizing information about customers’ characteristics encoded as a d-dimensional feature vector. We assume a personalized demand model, parameters of which depend on s out of the d features. The seller initially does not know the relationship between the customer features and the product demand but learns this through sales observations over a selling horizon of T periods. We prove that the seller’s expected regret, that is, the revenue loss against a clairvoyant who knows the underlying demand relationship, is at least of order [Formula: see text] under any admissible policy. We then design a near-optimal pricing policy for a semiclairvoyant seller (who knows which s of the d features are in the demand model) who achieves an expected regret of order [Formula: see text]. We extend this policy to a more realistic setting, where the seller does not know the true demand predictors, and show that this policy has an expected regret of order [Formula: see text], which is also near-optimal. Finally, we test our theory on simulated data and on a data set from an online auto loan company in the United States. On both data sets, our experimentation-based pricing policy is superior to intuitive and/or widely-practiced customized pricing methods, such as myopic pricing and segment-then-optimize policies. Furthermore, our policy improves upon the loan company’s historical pricing decisions by 47% in expected revenue over a six-month period. This paper was accepted by Noah Gans, stochastic models and simulation.


The Impact of Site Sample Size on the Reconstruction of Culture Histories

American Antiquity ◽

10.7183/0002-7316.76.3.547 ◽

2011 ◽

Vol 76 (3) ◽

pp. 547-572 ◽

Cited By ~ 3

Author(s):

Charles Perreault

Keyword(s):

Simulated Data ◽

Archaeological Sites ◽

Cultural Tradition ◽

Data Sets ◽

Data Set ◽

Rate Of Spread ◽

Accuracy And Precision ◽

Historical Processes ◽

The Impact ◽

Simulated Data Sets

I examine how our capacity to produce accurate culture-historical reconstructions changes as more archaeological sites are discovered, dated, and added to a data set. More precisely, I describe, using simulated data sets, how increases in the number of known sites impact the accuracy and precision of our estimations of (1) the earliest and (2) latest date of a cultural tradition, (3) the date and (4) magnitude of its peak popularity, as well as (5) its rate of spread and (6) disappearance in a population. I show that the accuracy and precision of inferences about these six historical processes are not affected in the same fashion by changes in the number of known sites. I also consider the impact of two simple taphonomic site destruction scenarios on the results. Overall, the results presented in this paper indicate that unless we are in possession of near-total samples of sites, and can be certain that there are no taphonomic biases in the universe of sites to be sampled, we will make inferences of varying precision and accuracy depending on the aspect of a cultural trait’s history in question.


Characterizing and Comparing Phylogenetic Trait Data from Their Normalized Laplacian Spectrum

Systematic Biology ◽

10.1093/sysbio/syz061 ◽

2019 ◽

Vol 69 (2) ◽

pp. 234-248

Author(s):

Eric Lewitus ◽

Leandro Aristide ◽

Hélène Morlon

Keyword(s):

Resource Use ◽

Simulated Data ◽

Graph Laplacian ◽

Molecular Data ◽

Nonparametric Analysis ◽

Data Sets ◽

New World Monkeys ◽

Phenotypic Evolution ◽

Laplacian Spectrum ◽

Data Set

Abstract The dissection of the mode and tempo of phenotypic evolution is integral to our understanding of global biodiversity. Our ability to infer patterns of phenotypes across phylogenetic clades is essential to how we infer the macroevolutionary processes governing those patterns. Many methods are already available for fitting models of phenotypic evolution to data. However, there is currently no comprehensive nonparametric framework for characterizing and comparing patterns of phenotypic evolution. Here, we build on a recently introduced approach for using the phylogenetic spectral density profile (SDP) to compare and characterize patterns of phylogenetic diversification, in order to provide a framework for nonparametric analysis of phylogenetic trait data. We show how to construct the SDP of trait data on a phylogenetic tree from the normalized graph Laplacian. We demonstrate on simulated data the utility of the SDP to successfully cluster phylogenetic trait data into meaningful groups and to characterize the phenotypic patterning within those groups. We furthermore demonstrate how the SDP is a powerful tool for visualizing phenotypic space across traits and for assessing whether distinct trait evolution models are distinguishable on a given empirical phylogeny. We illustrate the approach in two empirical data sets: a comprehensive data set of traits involved in song, plumage, and resource-use in tanagers, and a high-dimensional data set of endocranial landmarks in New World monkeys. Considering the proliferation of morphometric and molecular data collected across the tree of life, we expect this approach will benefit big data analyses requiring a comprehensive and intuitive framework.
