Hierarchical Clustering Approach for Selecting Representative Skylines

Information ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 96
Author(s):  
Lkhagvadorj Battulga ◽  
Aziz Nasridinov

Recently, the skyline query has attracted interest in a wide range of applications, from recommendation systems to computer networks. The skyline query is useful for obtaining the dominant data points from a given dataset. In a low-dimensional dataset, the skyline query may return a small number of skyline points. However, as the dimensionality of the dataset increases, the number of skyline points also increases. In other words, depending on the data distribution and dimensionality, most of the data points may become skyline points. With the emergence of big data applications, where data distribution and dimensionality are significant problems, obtaining representative skyline points among the resulting skyline points is necessary. Several methods have focused on extracting representative skyline points, with varying success. However, existing methods suffer from re-computation when the global threshold changes. Moreover, in certain cases, the resulting representative skyline points may not satisfy a user with multiple preferences. Thus, in this paper, we propose a new representative skyline query processing method, called representative skyline cluster (RSC), which solves these problems. Our method utilizes hierarchical agglomerative clustering to find the exact representative skyline points, which enables us to reduce the re-computation time significantly. We show the superiority of our proposed method over existing state-of-the-art methods through various types of experiments.
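A minimal sketch of the two ingredients the abstract combines: computing a skyline by pairwise dominance checks, then grouping the skyline points with a simple single-linkage agglomerative clustering and keeping one representative per cluster. The RSC algorithm itself is not reproduced here; the data, linkage choice, and representative rule are illustrative assumptions.

```python
def dominates(a, b):
    """a dominates b if a is >= b in every dimension and > in at least one (maximisation)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def agglomerate(points, k):
    """Single-linkage agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sum((x - y) ** 2 for x, y in zip(p, q))
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

data = [(1, 9), (2, 8), (5, 6), (8, 3), (9, 1), (3, 3), (4, 4)]
sky = skyline(data)                                   # (3,3) and (4,4) are dominated by (5,6)
reps = [max(c) for c in agglomerate(sky, 2)]          # one representative per cluster
```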

Author(s):  
Tianhang Zheng ◽  
Changyou Chen ◽  
Kui Ren

Recent work on adversarial attacks has shown that the Projected Gradient Descent (PGD) adversary is a universal first-order adversary, and that a classifier adversarially trained with PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution p(x), typically in the form of risk maximization/minimization, e.g., max/min E_p(x) L(x), with p(x) some unknown data distribution and L(·) a loss function. However, since PGD generates attack samples independently for each data sample based on L(·), the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we address this by proposing the distributionally adversarial attack (DAA), a framework that solves for an optimal adversarial-data distribution: a perturbed distribution that satisfies the L∞ constraint but deviates from the original data distribution so as to maximize the generalization risk. Algorithmically, DAA performs optimization over the space of potential data distributions, which introduces direct dependencies between all data points when generating adversarial samples. DAA is evaluated by attacking state-of-the-art defense models, including the adversarially trained models provided by MIT MadryLab. Notably, DAA ranks first on MadryLab's white-box leaderboards, reducing the accuracy of their secret MNIST model to 88.56% (with l∞ perturbations of ε = 0.3) and the accuracy of their secret CIFAR model to 44.71% (with l∞ perturbations of ε = 8.0). Code for the experiments is released at https://github.com/tianzheng4/Distributionally-Adversarial-Attack.
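For context, a minimal sketch of the baseline that DAA improves on: standard PGD, which performs signed gradient ascent on L(·) per sample, projecting back into the L∞ ball of radius ε after each step. The toy loss, step size, and step count are illustrative, not taken from the paper.

```python
def pgd_attack(x0, grad, eps=0.3, alpha=0.05, steps=40):
    """Signed gradient ascent on the loss, with per-coordinate projection
    back into [x0 - eps, x0 + eps] (the L-infinity constraint)."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi + alpha * (1 if gi > 0 else -1) for xi, gi in zip(x, g)]      # ascent step
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]    # projection
    return x

# Toy differentiable loss L(x) = sum(x_i^2), with gradient 2x; PGD pushes each
# coordinate to the boundary of the eps-ball that increases the loss.
x_adv = pgd_attack([0.5, -0.2], grad=lambda x: [2 * xi for xi in x])
```

DAA differs precisely in that it does not run this loop independently per sample, but optimizes over a perturbed data distribution, coupling all samples.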


2020 ◽  
Vol 8 (1) ◽  
Author(s):  
Martin Keller-Ressel ◽  
Stephanie Nargang

Abstract We introduce hydra (hyperbolic distance recovery and approximation), a new method for embedding network- or distance-based data into hyperbolic space. We show mathematically that hydra satisfies a certain optimality guarantee: it minimizes the 'hyperbolic strain' between original and embedded data points. Moreover, it is able to recover points exactly when they are contained in a low-dimensional hyperbolic subspace of the feature space. Testing on real network data, we show that the embedding quality of hydra is competitive with existing hyperbolic embedding methods, but achieved at substantially shorter computation time. An extended method, termed hydra+, typically outperforms existing methods in both computation time and embedding quality.
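A small sketch of the geometry such embeddings rely on: the distance between two points in the hyperboloid (Lorentz) model of hyperbolic space is d(x, y) = arcosh(−⟨x, y⟩_L), where ⟨·,·⟩_L is the Lorentzian inner product. This is standard hyperbolic geometry, not hydra's algorithm itself; the sample points are illustrative.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(v):
    """Lift a Euclidean point v onto the hyperboloid x0^2 - |v|^2 = 1, x0 > 0."""
    return [math.sqrt(1 + sum(c * c for c in v))] + list(v)

def hyperbolic_distance(u, v):
    x, y = lift(u), lift(v)
    # clamp guards against floating-point values slightly below 1
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

d = hyperbolic_distance([0.0, 0.0], [1.0, 0.0])
```

An embedding method like hydra chooses the Euclidean coordinates u, v so that these pairwise hyperbolic distances approximate a given distance matrix.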


Author(s):  
Ying Liu ◽  
Han Tong Loh ◽  
Wen Feng Lu

This chapter introduces an approach for deriving a taxonomy from documents using a novel document profile model that enables document representations with semantic information systematically generated at the sentence level. A frequent word sequence method is proposed to search for salient semantic information and has been integrated into the document profile model. An experimental study of taxonomy generation using hierarchical agglomerative clustering has shown a significant improvement in F-score based on the document profile model. A close examination reveals that the integration of semantic information contributes clearly over the classic bag-of-words approach. This study encourages us to further investigate the possibility of applying the document profile model to a wide range of text-based mining tasks.
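A hypothetical sketch of the core idea behind frequent word sequences: find contiguous word n-grams that occur in at least a minimum number of sentences, as candidate semantic features to enrich a bag-of-words representation. The support threshold, sequence lengths, and sample sentences are illustrative, not from the chapter.

```python
from collections import Counter

def frequent_sequences(sentences, min_support=2, max_len=3):
    """Return word sequences (2..max_len words) appearing in >= min_support sentences."""
    counts = Counter()
    for s in sentences:
        words = s.lower().split()
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        counts.update(seen)  # count each sequence at most once per sentence
    return {seq for seq, c in counts.items() if c >= min_support}

docs = ["data mining finds patterns",
        "text data mining uses patterns",
        "clustering groups documents"]
feats = frequent_sequences(docs)  # only "data mining" recurs across sentences
```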


2021 ◽  
Author(s):  
Moritz Hanke ◽  
Louis Dijkstra ◽  
Ronja Foraita ◽  
Vanessa Didelez

Abstract Background: Variable selection in linear regression settings is a much discussed problem. Best subset selection (BSS) is often considered an intuitively appealing 'gold standard', with its use restricted mainly by its NP-hard nature. Instead, alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become the methods of choice in high-dimensional settings. A recent proposal represents BSS as a mixed integer optimization problem, so that much larger problems have become feasible in reasonable computation time. This has been exploited to study the prediction performance of BSS and its competitors. Here, we present an extensive simulation study assessing, instead, the variable selection performance of BSS compared to forward stepwise selection (FSS), the Lasso and the Enet. The analysis considers a wide range of settings that are challenging with regard to dimensionality, signal-to-noise ratio and correlations between relevant and irrelevant direct predictors. As a measure of performance, we used the best possible F1 score for each method so as to ensure a fair comparison irrespective of any criterion for choosing the tuning parameters. Results: Somewhat surprisingly, it was only in settings where the signal-to-noise ratio was high and the variables were (nearly) uncorrelated that BSS reliably outperformed the other methods. This was the case even in low-dimensional settings where the number of observations exceeded the number of variables by a factor of ten. Further, the FSS approach performed nearly identically to BSS. Conclusion: Our results shed new light on the usual presumption of BSS being, in principle, the best choice for variable selection. More attention needs to be paid to the data-generating process when considering variable selection methods. Especially for correlated variables, convex alternatives like the Enet are not only faster but also appear to be more accurate in practical settings.
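The performance measure described above is straightforward to state in code: the F1 score of a selected variable set against the truly relevant set, i.e., precision and recall over selected indices. The example indices below are illustrative.

```python
def f1_selection(selected, relevant):
    """F1 score of a variable-selection result: selected vs. truly relevant indices."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)          # true positives: correctly selected variables
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Selecting {0, 2, 5} when {0, 1, 2} are truly relevant: precision = recall = 2/3.
score = f1_selection(selected={0, 2, 5}, relevant={0, 1, 2})
```

The study takes, for each method, the best F1 achievable along its regularization or selection path, so no tuning-parameter criterion biases the comparison.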


Author(s):  
Rachel N. Rubin ◽  
J. Patrick Spicer ◽  
Reuven R. Katz

Surface porosity inspection is important for quality assurance of critical mating surfaces on machined components. An important metric for assessing the performance of an automated surface porosity inspection system is repeatability. Traditional gage repeatability analysis is well defined for dimensional measurements of machined part features. However, the analysis becomes more difficult for surface porosity inspection, because surface porosity appears in random sizes and at random locations. Repeatability analysis requires painstaking effort in tracking individual pores through repeated measurements. Therefore, this paper presents an automated approach for tracking porosity for the purpose of repeatability analysis. Two different algorithms are proposed and evaluated. The first is a tolerance-based method that uses pre-specified tolerances to determine whether pores should be grouped together. The second algorithm is similar to hierarchical agglomerative clustering, using a similarity matrix to store differences between cluster centroids; however, it uses a training period to determine when to stop clustering instead of continuing until all pores are in one cluster. Experimental results describe differences in the accuracy of both approaches and the effort required to obtain a solution. The computation time required for the first method is much shorter than that of the second method. However, the first algorithm requires a priori information to specify the tolerances, whereas the second algorithm requires no prior knowledge.
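A hypothetical sketch of the first (tolerance-based) idea: two pore detections from repeated scans are treated as the same physical pore when their centroids fall within a pre-specified positional tolerance. The greedy grouping rule, tolerance value, and sample detections below are illustrative assumptions, not the paper's exact algorithm.

```python
def group_pores(detections, tol=0.5):
    """Greedily assign each detection (x, y) to the first group whose running
    centroid is within `tol` in both coordinates; otherwise start a new group."""
    groups = []  # each group is a list of (x, y) detections of one physical pore
    for x, y in detections:
        for g in groups:
            cx = sum(p[0] for p in g) / len(g)
            cy = sum(p[1] for p in g) / len(g)
            if abs(x - cx) <= tol and abs(y - cy) <= tol:
                g.append((x, y))
                break
        else:
            groups.append([(x, y)])
    return groups

# Detections from repeated scans of a surface with two physical pores:
scans = [(1.0, 1.0), (5.0, 2.0), (1.1, 0.9), (5.2, 2.1), (0.9, 1.1)]
groups = group_pores(scans)
```

With matched pores grouped this way, per-pore repeatability statistics can be computed across the repeated measurements.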


These volumes contain the proceedings of the conference held at Aarhus, Oxford and Madrid in September 2016 to mark the seventieth birthday of Nigel Hitchin, one of the world’s foremost geometers and Savilian Professor of Geometry at Oxford. The proceedings contain twenty-nine articles, including three by Fields medallists (Donaldson, Mori and Yau). The articles cover a wide range of topics in geometry and mathematical physics, including the following: Riemannian geometry, geometric analysis, special holonomy, integrable systems, dynamical systems, generalized complex structures, symplectic and Poisson geometry, low-dimensional topology, algebraic geometry, moduli spaces, Higgs bundles, geometric Langlands programme, mirror symmetry and string theory. These volumes will be of interest to researchers and graduate students both in geometry and mathematical physics.


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1028
Author(s):  
Silvia Corigliano ◽  
Federico Rosato ◽  
Carla Ortiz Dominguez ◽  
Marco Merlo

The scientific community is active in developing new models and methods to help reach the ambitious target set by UN SDG 7: universal access to electricity by 2030. Efficient planning of distribution networks is a complex and multivariate task, which is usually split into multiple subproblems to reduce the number of variables. The present work addresses the problem of optimal secondary substation siting by means of different clustering techniques. In contrast with the majority of approaches found in the literature, which are devoted to the planning of MV grids in already electrified urban areas, this work focuses on greenfield planning in rural areas. The K-means algorithm, hierarchical agglomerative clustering, and a method based on optimal weighted tree partitioning are adapted to the problem and run on two real case studies with different population densities. The algorithms are compared in terms of different indicators useful for assessing the feasibility of the solutions found. The algorithms have proven effective in addressing some of the crucial aspects of substation siting and constitute relevant improvements over the classic K-means approach found in the literature. However, it is found to be very challenging to reconcile an acceptable geographical span of the area served by a single substation with a substation power high enough to justify the installation when the load density is very low. In other words, well-known standards adopted in industrialized countries do not fit developing countries' requirements.
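A minimal, pure-Python sketch of the classic K-means baseline the paper improves on: cluster consumer coordinates and place one substation at each cluster centroid. The coordinates, k, and naive initialisation are illustrative; the paper's variants add feasibility constraints on top of this basic loop.

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = points[:k]  # naive initialisation: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two villages of three households each; each centroid is a candidate substation site.
homes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sites, served = kmeans(homes, k=2)
```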


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Chang Su ◽  
Zhenxing Xu ◽  
Katherine Hoffman ◽  
Parag Goyal ◽  
Monika M. Safford ◽  
...  

Abstract COVID-19-associated respiratory failure offers an unprecedented opportunity to evaluate the differential host response to a uniform pathogenic insult. Understanding whether there are distinct subphenotypes of severe COVID-19 may offer insight into its pathophysiology. The Sequential Organ Failure Assessment (SOFA) score is an objective and comprehensive measure of the dysfunction severity of six organ systems: cardiovascular, central nervous system, coagulation, liver, renal, and respiratory. Our aim was to identify and characterize distinct subphenotypes of COVID-19 critical illness defined by the post-intubation trajectory of the SOFA score. Intubated COVID-19 patients at two hospitals in New York City were leveraged as development and validation cohorts. Patients were grouped into mild, intermediate, and severe strata by their baseline post-intubation SOFA. Hierarchical agglomerative clustering was performed within each stratum to detect subphenotypes based on similarities among SOFA score trajectories evaluated by Dynamic Time Warping. Distinct worsening and recovering subphenotypes were identified within each stratum, with distinct 7-day post-intubation SOFA progression trends. Patients in the worsening subphenotypes had higher mortality than those in the recovering subphenotypes within each stratum (mild stratum, 29.7% vs. 10.3%, p = 0.033; intermediate stratum, 29.3% vs. 8.0%, p = 0.002; severe stratum, 53.7% vs. 22.2%, p < 0.001). Pathophysiologic biomarkers associated with progression were distinct at each stratum, including findings suggestive of inflammation at low baseline severity of illness versus hemophagocytic lymphohistiocytosis at higher baseline severity of illness. The findings suggest that there are clear worsening and recovering subphenotypes of COVID-19 respiratory failure after intubation, which are more predictive of outcomes than baseline severity of illness. Distinct progression biomarkers at different baseline severities of illness suggest a heterogeneous pathobiology in the progression of COVID-19 respiratory failure.
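A minimal sketch of Dynamic Time Warping (DTW), the trajectory-similarity measure used before clustering: it aligns two time series by allowing elastic shifts along the time axis and sums the matched absolute differences. The two example SOFA-like trajectories are illustrative, not patient data.

```python
def dtw(a, b):
    """Classic O(n*m) dynamic-programming DTW distance between two series."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignments
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

worsening = [6, 7, 8, 9, 10, 11, 12]   # SOFA rising over 7 days post-intubation
recovering = [6, 6, 5, 5, 4, 3, 3]     # SOFA falling
d = dtw(worsening, recovering)          # large distance separates the two shapes
```

Pairwise DTW distances like `d` feed the hierarchical agglomerative clustering within each baseline stratum.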


2020 ◽  
Vol 8 (1) ◽  
pp. 45-69
Author(s):  
Eckhard Liebscher ◽  
Wolf-Dieter Richter

Abstract We prove and describe in great detail a general method for constructing a wide range of multivariate probability density functions. We introduce probabilistic models for a large variety of clouds of multivariate data points. In the present paper, the focus is on star-shaped distributions of arbitrary dimension, where, in the case of spherical distributions, dependence is modeled by a non-Gaussian density-generating function.


2021 ◽  
Author(s):  
Tim Brandes ◽  
Stefano Scarso ◽  
Christian Koch ◽  
Stephan Staudacher

Abstract A numerical experiment of intentionally reduced complexity is used to demonstrate a method to classify flight missions in terms of the operational severity experienced by the engines. In this proof of concept, the general term severity is limited to the erosion of the core flow compressor blade and vane leading edges. A Monte Carlo simulation of varying operational conditions generates a required database of 10,000 flight missions. Each flight is sampled at a rate of 1 Hz. Eleven measurable or synthesizable physical parameters are deemed relevant for the problem. They are reduced to seven universal non-dimensional groups, which are averaged for each flight. The application of principal component analysis allows a further reduction to three principal components. These are used to run a support-vector machine model in order to classify the flights. A linear kernel function is chosen for the support-vector machine due to its low computation time compared to other functions. The robustness of the classification approach against measurement precision error is evaluated. In addition, a minimum number of flights required for training and a sensible number of severity classes are documented. Furthermore, the importance of training the algorithms on a sufficiently wide range of operations is shown.
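An illustrative sketch of the final classification step: a linear-kernel SVM reduces at prediction time to a single dot product plus bias over the three principal-component scores of a flight. The weights, bias, and feature values below are hypothetical stand-ins for a trained model, not values from the paper.

```python
def linear_svm_predict(x, w, b):
    """Linear-kernel SVM decision rule: class +1 (high severity) if w.x + b > 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w, b = [0.8, -0.3, 0.5], -0.1   # hypothetical trained weights and bias
flight_pcs = [0.6, 0.2, -0.1]   # a flight's averaged, PCA-reduced features
label = linear_svm_predict(flight_pcs, w, b)
```

This cheapness of the linear kernel at both training and prediction time is the stated reason for preferring it over nonlinear kernels here.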

