Cluster Analysis Based on Bipartite Network

2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Dawei Zhang ◽  
Fuding Xie ◽  
Dapeng Wang ◽  
Yong Zhang ◽  
Yan Sun

Clustering data has a wide range of applications and has attracted considerable attention in data mining and artificial intelligence. However, it is difficult to find a set of clusters that best fits natural partitions without any class information. In this paper, a method for detecting the optimal cluster number is proposed. The optimal cluster number can be obtained by the proposed method while partitioning the data into clusters with the FCM (fuzzy c-means) algorithm. This overcomes the drawback of the FCM algorithm, which requires the cluster number c to be defined in advance. The method works by converting the fuzzy clustering result into a weighted bipartite network; the optimal cluster number can then be detected by the improved bipartite modularity. Experimental results on artificial and real data sets show the validity of the proposed method.
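The conversion step can be sketched in a few lines: given a fuzzy membership matrix, treat memberships as edge weights of an object–cluster bipartite graph and score the hard assignment with a Barber-style bipartite modularity. This is only an illustration; the paper's improved modularity differs in detail.

```python
def bipartite_modularity(U):
    """Barber-style modularity of the weighted bipartite graph defined by a
    fuzzy membership matrix U (rows: objects, columns: clusters). Each
    object is assigned to its highest-membership cluster."""
    n, c = len(U), len(U[0])
    m = sum(sum(row) for row in U)                 # total edge weight
    obj_deg = [sum(row) for row in U]              # object strengths
    clu_deg = [sum(U[i][j] for i in range(n)) for j in range(c)]
    assign = [max(range(c), key=lambda j: U[i][j]) for i in range(n)]
    q = 0.0
    for i in range(n):
        j = assign[i]                              # object i shares a module with cluster j
        q += U[i][j] - obj_deg[i] * clu_deg[j] / m
    return q / m
```

Running the number of clusters c from 2 upward and picking the c that maximizes this score is the spirit of the selection rule.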

2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, using several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM
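The de novo community design can be illustrated with a log-normal relative-abundance draw, the kind of abundance profile such simulators commonly use. Function and parameter names below are illustrative, not CAMISIM's actual API:

```python
import random

def abundance_profile(n_genomes, mu=0.0, sigma=1.0, seed=42):
    """Draw a log-normal relative-abundance profile for n_genomes taxa
    and normalize it to sum to one (hypothetical sketch, not CAMISIM code)."""
    rng = random.Random(seed)
    raw = [rng.lognormvariate(mu, sigma) for _ in range(n_genomes)]
    total = sum(raw)
    return [x / total for x in raw]
```

Reads would then be sampled from each genome in proportion to its entry in the profile, which is what makes the taxonomic profile itself a usable gold standard.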


2017 ◽  
Author(s):  
João C. Marques ◽  
Michael B. Orger

Abstract
How to partition a data set into a set of distinct clusters is a ubiquitous and challenging problem. The fact that data varies widely in features such as cluster shape, cluster number, density distribution, background noise, outliers and degree of overlap makes it difficult to find a single algorithm that can be broadly applied. One recent method, clusterdp, based on search of density peaks, can be applied successfully to cluster many kinds of data, but it is not fully automatic and fails on some simple data distributions. We propose an alternative approach, clusterdv, which estimates density dips between points and allows robust determination of cluster number and distribution across a wide range of data, without any manual parameter adjustment. We show that this method is able to solve a range of synthetic and experimental data sets, where the underlying structure is known, and identifies consistent and meaningful clusters in new behavioral data.
Author summary
It is common that natural phenomena produce groupings, or clusters, in data that can reveal the underlying processes. However, the form of these clusters can vary arbitrarily, making it challenging to find a single algorithm that identifies their structure correctly without prior knowledge of the number of groupings or their distribution. We describe a simple clustering algorithm that is fully automatic and able to correctly identify the number and shape of groupings in data of many types. We expect this algorithm to be useful in finding unknown natural phenomena present in data from a wide range of scientific fields.
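A one-dimensional sketch of the density-dip idea, assuming a Gaussian kernel density estimate: two points belong to the same cluster when the density between them never drops much below the density at the points themselves. The paper's estimator is multivariate and considerably more refined.

```python
import math

def kde(x, data, h=0.5):
    """Simple Gaussian kernel density estimate at x."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-((x - d) / h) ** 2 / 2) for d in data) / norm

def density_dip(a, b, data, steps=50):
    """Depth of the density valley between points a and b: the density at
    the shallower endpoint minus the minimum density along the segment.
    A large dip suggests a and b sit in different clusters."""
    line = [a + (b - a) * t / steps for t in range(steps + 1)]
    valley = min(kde(x, data) for x in line)
    return min(kde(a, data), kde(b, data)) - valley
```

For two well-separated groups the dip between groups is large, while within a group it is near zero, which is what allows the cluster number to be read off without manual tuning.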


Author(s):  
Т.В. Речкалов ◽  
М.Л. Цымблер

PAM (Partitioning Around Medoids) is a partitioning clustering algorithm in which each cluster is represented by an object from the input data set (called a medoid). Medoid-based clustering is used in a wide range of applications: the segmentation of medical and satellite images, the analysis of DNA microarrays and texts, etc. Currently, there are parallel implementations of PAM for GPU and FPGA systems, but none for Intel Many Integrated Core (MIC) accelerators. In this paper, we propose PhiPAM, a novel parallel clustering algorithm for Intel MIC systems. Computations are parallelized with OpenMP. The algorithm exploits a specialized memory data layout and a loop tiling technique, which allow computations to be vectorized efficiently on Intel MIC systems. Experiments performed on real data sets show good scalability of the algorithm.
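The swap logic that PhiPAM parallelizes can be sketched in plain Python. This sketch has no OpenMP, tiling, or MIC-specific layout, and the build phase is simplified to seeding with the first k points:

```python
def pam(points, k, dist, max_iter=100):
    """Plain PAM: repeatedly try swapping each medoid with each non-medoid
    and keep any swap that lowers the total distance-to-nearest-medoid cost."""
    medoids = list(range(k))  # simplified build phase: first k points

    def cost(meds):
        return sum(min(dist(p, points[m]) for m in meds) for p in points)

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                trial = medoids[:]
                trial[mi] = h
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:
            break
    return sorted(medoids), best
```

The cost evaluation over all object/medoid pairs is the hot loop; it is exactly this distance computation that the paper's data layout and tiling make vectorizable on MIC hardware.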


Geophysics ◽  
2003 ◽  
Vol 68 (1) ◽  
pp. 168-180 ◽  
Author(s):  
Valentine Mikhailov ◽  
Armand Galdeano ◽  
Michel Diament ◽  
Alexei Gvishiani ◽  
Sergei Agayan ◽  
...  

Results of Euler deconvolution strongly depend on the selection of viable solutions. Synthetic calculations using multiple causative sources show that Euler solutions cluster in the vicinity of causative bodies even when they do not group densely about the perimeter of the bodies. We have developed a clustering technique to serve as a tool for selecting appropriate solutions. The clustering technique uses a methodology based on artificial intelligence, and it was originally designed to classify large data sets. It is based on a geometrical approach to study object concentration in a finite metric space of any dimension. The method uses a formal definition of cluster and includes free parameters that search for clusters of given properties. Tests on synthetic and real data showed that the clustering technique successfully outlines causative bodies more accurately than other methods used to discriminate Euler solutions. In complex field cases, such as the magnetic field in the Gulf of Saint Malo region (Brittany, France), the method provides dense clusters, which more clearly outline possible causative sources. In particular, it allows one to trace offshore the main inland tectonic structures and to study their interrelationships in the Gulf of Saint Malo. The clusters provide solutions associated with particular bodies, or parts of bodies, allowing the analysis of different clusters of Euler solutions separately. This may allow computation of average parameters for individual causative bodies. Those measurements of the anomalous field that yield clusters also form dense clusters themselves. Application of this clustering technique thus outlines areas where the influence of different causative sources is more prominent. This allows one to focus on these areas for more detailed study, using different window sizes, structural indices, etc.
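A crude stand-in for the formal cluster definition is a density filter that keeps only Euler solutions with enough close neighbours; `eps` and `min_pts` here are illustrative free parameters, not the paper's formal cluster-property parameters:

```python
def dense_solutions(points, eps, min_pts, dist):
    """Return indices of solutions having at least min_pts neighbours
    within distance eps, discarding scattered spurious solutions."""
    keep = []
    for i, p in enumerate(points):
        neighbours = sum(1 for j, q in enumerate(points)
                         if j != i and dist(p, q) <= eps)
        if neighbours >= min_pts:
            keep.append(i)
    return keep
```

Solutions surviving the filter group around causative bodies, and the surviving groups can then be analysed separately, e.g. to average source parameters per body.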


2015 ◽  
Vol 26 (4) ◽  
pp. 1867-1880
Author(s):  
Ilmari Ahonen ◽  
Denis Larocque ◽  
Jaakko Nevalainen

Outlier detection covers a wide range of methods aiming to identify observations that are considered unusual. Novelty detection, on the other hand, seeks observations among newly generated test data that are exceptional compared with previously observed training data. In many applications, the general existence of novelty is of more interest than identifying the individual novel observations. For instance, in high-throughput cancer treatment screening experiments, it is meaningful to test whether any new treatment effects are seen compared with existing compounds. Here, we present hypothesis tests for such global-level novelty. The problem is approached through a set of very general assumptions, making it innovative in relation to the current literature. We introduce test statistics capable of detecting novelty. They operate on local neighborhoods, and their null distribution is obtained by the permutation principle. We show that they are valid and able to find different types of novelty, e.g. location and scale alternatives. The performance of the methods is assessed with simulations and with applications to real data sets.
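A minimal one-dimensional sketch of the permutation principle with a nearest-neighbour (local-neighbourhood) statistic; the paper's statistics and assumptions are more general than this:

```python
import random

def nn_dist(x, ref):
    """Distance from x to its nearest neighbour in ref."""
    return min(abs(x - r) for r in ref)

def novelty_pvalue(train, test, n_perm=500, seed=0):
    """Global novelty test: the statistic is the mean nearest-neighbour
    distance from test points to training points; the null distribution
    is obtained by permuting the train/test labels."""
    rng = random.Random(seed)

    def stat(tr, te):
        return sum(nn_dist(x, tr) for x in te) / len(te)

    observed = stat(train, test)
    pooled = train + test
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        tr, te = pooled[:len(train)], pooled[len(train):]
        if stat(tr, te) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

If the test data contain genuinely novel observations, almost no relabelling reproduces the observed statistic and the p-value is small; under the null, the observed statistic is just one more draw from the permutation distribution.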


Atmosphere ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 644
Author(s):  
Jingyan Huang ◽  
Michael Kwok Po Ng ◽  
Pak Wai Chan

The main aim of this paper is to propose a statistical indicator for wind shear prediction from Light Detection and Ranging (LIDAR) observational data. An accurate wind shear warning signal is particularly important for aviation safety. The main challenges are that wind shear may result from a sustained change of the headwind and that the velocity of wind shear may span a wide range. Traditionally, aviation models based on terrain-induced settings are used to detect wind shear phenomena. Unlike traditional methods, we study a statistical indicator that measures the variation of headwinds across multiple headwind profiles. Because the indicator value is nonnegative, a decision rule based on a one-sided normal distribution is employed to distinguish wind shear cases from non-wind shear cases. Experimental results based on real data sets obtained at Hong Kong International Airport runways are presented to demonstrate that the proposed indicator is quite effective. The prediction performance of the proposed method is better than that of supervised learning methods (LDA, KNN, SVM, and logistic regression). This model would also provide more accurate wind shear warnings for pilots and improve the performance of the Wind shear and Turbulence Warning System.
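The decision rule can be sketched as follows, with the pooled standard deviation standing in for the paper's headwind-variation indicator; the null-distribution parameters below are hypothetical, not fitted values from the paper:

```python
from statistics import NormalDist, pstdev

def headwind_variation(profiles):
    """Variation indicator: standard deviation of headwind readings pooled
    across several LIDAR headwind profiles (simplified stand-in)."""
    pooled = [v for profile in profiles for v in profile]
    return pstdev(pooled)

def wind_shear_alert(indicator, null_mean, null_sd, alpha=0.05):
    """One-sided decision rule: flag wind shear when the nonnegative
    indicator exceeds the upper-alpha quantile of a normal distribution
    fitted to non-wind-shear cases."""
    threshold = NormalDist(null_mean, null_sd).inv_cdf(1 - alpha)
    return indicator > threshold
```

The one-sided form matches the fact that the indicator is nonnegative: only unusually large variation, never unusually small, is evidence of wind shear.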


2021 ◽  
Vol 9 (1) ◽  
pp. 62-81
Author(s):  
Kjersti Aas ◽  
Thomas Nagler ◽  
Martin Jullum ◽  
Anders Løland

Abstract
In this paper the goal is to explain predictions from complex machine learning models. One method that has become very popular during the last few years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features in reality are dependent, this may lead to incorrect explanations. Hence, there have recently been attempts to appropriately model/estimate the dependence between the features. Although the previously proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions, able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
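For context, here is the traditional baseline the vine-copula approaches improve on: exact Shapley values where out-of-coalition features are imputed from background samples under the independence assumption. Replacing that imputation with samples from a conditional (e.g. vine-copula) model is precisely the paper's contribution.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for the prediction f(x). Features outside a
    coalition S are drawn from background rows, i.e. the independence
    assumption (no conditioning on the coalition's feature values)."""
    n = len(x)

    def value(S):
        # Average prediction with features in S fixed to x, rest from background.
        total = 0.0
        for b in background:
            z = [x[i] if i in S else b[i] for i in range(n)]
            total += f(z)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        for r in range(n):
            for S in combinations([j for j in range(n) if j != i], r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi
```

For a linear model the values decompose feature by feature, which makes a handy sanity check.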


2020 ◽  
Vol 36 (1) ◽  
pp. 25-48
Author(s):  
Kiranmoy Chatterjee ◽  
Diganta Mukherjee

Abstract
With the possibility of dependence between the sources in a capture-recapture type experiment, identifying the direction of such dependence in a dual system of data collection is vital. This has a wide range of applications, including in the domains of public health, official statistics and social sciences. Owing to the insufficiency of data for analyzing a behavioral dependence model in a dual system, our contribution lies in the construction of several strategies that can identify the direction of the underlying dependence between the two lists in the dual system, that is, whether the two lists are positively or negatively dependent. Our proposed classification strategies would be quite appealing for improving the inference, as evident from recent literature. Simulation studies are carried out to explore the comparative performance of the proposed strategies. Finally, applications on three real data sets from various fields are illustrated.
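For intuition only: if the full 2x2 capture table were observed, the direction of dependence would follow directly from the odds ratio. In a real dual system the count of units missed by both lists (x00) is unobserved, which is exactly why the paper needs indirect classification strategies.

```python
def dependence_direction(x11, x10, x01, x00):
    """Direction of list dependence from a complete 2x2 capture table:
    x11 caught by both lists, x10 only by list 1, x01 only by list 2,
    x00 by neither (unobservable in practice)."""
    odds_ratio = (x11 * x00) / (x10 * x01)
    if odds_ratio > 1:
        return "positive"
    if odds_ratio < 1:
        return "negative"
    return "independent"
```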


2017 ◽  
Author(s):  
Murillo G. Carneiro ◽  
Liang Zhao

Most data classification techniques rely only on the physical features of the data (e.g., similarity, distance or distribution), which makes it difficult for them to detect intrinsic and semantic relations among data items, such as pattern formation. In this thesis, classification methods based on complex networks are proposed in order to consider not only physical features but also to capture structural and dynamical properties of the data through the network representation. The proposed methods comprise concepts of pattern conformation, data importance and network structural optimization, which are related to complex network theory, learning systems, and bio-inspired optimization. Extensive experiments demonstrate the good performance of our methods compared against representative state-of-the-art methods over a wide range of artificial and real data sets, including applications in domains such as heart disease diagnosis and semantic role labeling.
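The basic lifting step behind such methods, building a k-nearest-neighbour network from the data, can be sketched as follows; the thesis builds structural and dynamical measures on top of this kind of representation:

```python
def knn_graph(points, k, dist):
    """Undirected k-nearest-neighbour graph: connect each point to its
    k closest points, returning edges as sorted index pairs."""
    edges = set()
    for i, p in enumerate(points):
        nbrs = sorted((j for j in range(len(points)) if j != i),
                      key=lambda j: dist(p, points[j]))[:k]
        for j in nbrs:
            edges.add((min(i, j), max(i, j)))
    return edges
```

Well-separated groups produce disconnected components, so network measures (components, communities, dynamics on the graph) expose structure that raw distances alone do not.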


2019 ◽  
pp. 58-66
Author(s):  
Máté Nagy ◽  
János Tapolcai ◽  
Gábor Rétvári

Opportunistic data structures are used extensively in big data practice to break down the massive storage space requirements of processing large volumes of information. A data structure is called (singly) opportunistic if it takes advantage of the redundancy in the input in order to store it in information-theoretically minimum space. Yet, efficient data processing requires a separate index alongside the data, whose size often substantially exceeds that of the compressed information. In this paper, we introduce doubly opportunistic data structures to attain the best possible compression not only on the input data but also on the index. We present R3D3, which encodes a bitvector of length n and Shannon entropy H0 to nH0 bits and the accompanying index to nH0(1/2 + O(log C/C)) bits, thus attaining provably minimum space (up to small error terms) on both the data and the index, and supports a rich set of queries to arbitrary positions in the compressed bitvector in O(C) time when C = o(log n). Our R3D3 prototype attains several-fold space reduction beyond known compression techniques on a wide range of synthetic and real data sets, while supporting operations on the compressed data at comparable speed.
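The data/index split can be illustrated with an uncompressed toy: a blockwise popcount index supporting rank queries, plus the zeroth-order entropy H0 that bounds R3D3's space. This sketch ignores compression entirely and only shows why an auxiliary index exists at all.

```python
import math

def h0(bits):
    """Zeroth-order empirical entropy (bits per symbol) of a bitvector:
    the quantity in R3D3's nH0 space bound."""
    p = sum(bits) / len(bits)
    term = lambda q: 0.0 if q in (0, 1) else -q * math.log2(q)
    return term(p) + term(1 - p)

def build_rank_index(bits, block=4):
    """Cumulative popcounts at block boundaries: the kind of auxiliary
    index stored alongside the data to speed up rank queries."""
    idx, count = [0], 0
    for i, b in enumerate(bits, 1):
        count += b
        if i % block == 0:
            idx.append(count)
    return idx

def rank1(bits, idx, pos, block=4):
    """Number of 1s in bits[:pos]: one index lookup plus a scan of at
    most one block, instead of a scan of the whole prefix."""
    return idx[pos // block] + sum(bits[pos // block * block:pos])
```

In a succinct structure both `bits` and `idx` would be stored compressed; R3D3's contribution is getting the index, not just the data, down to entropy-proportional size.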

