LCSS-Based Algorithm for Computing Multivariate Data Set Similarity: A Case Study of Real-Time WSN Data

Sensors ◽  
2019 ◽  
Vol 19 (1) ◽  
pp. 166 ◽  
Author(s):  
Rahim Khan ◽  
Ihsan Ali ◽  
Saleh M. Altowaijri ◽  
Muhammad Zakarya ◽  
Atiq Ur Rahman ◽  
...  

Multivariate data sets are common in various application areas, such as wireless sensor networks (WSNs) and DNA analysis. A robust mechanism is required to compute their similarity indexes regardless of the environment and problem domain. This study describes the usefulness of a non-metric-based approach (i.e., the longest common subsequence) in computing similarity indexes. Several non-metric-based algorithms are available in the literature; the most robust and reliable among them is the dynamic programming-based technique. However, dynamic programming-based techniques are considered inefficient, particularly in the context of multivariate data sets. Furthermore, the classical approaches are not powerful enough for multivariate data sets and sensor data, or when the similarity indexes are extremely high or low. To address this issue, we propose an efficient algorithm to measure the similarity indexes of multivariate data sets using a non-metric-based methodology. The proposed algorithm performs exceptionally well on numerous multivariate data sets compared with the classical dynamic programming-based algorithms. The performance of the algorithms is evaluated on several benchmark data sets and on a dynamic multivariate data set obtained from a WSN deployed at the Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology. Our evaluation suggests that the proposed algorithm can be approximately 39.9% more efficient than its counterparts in terms of computational time across various data sets.
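For orientation, here is a minimal sketch of the classical dynamic-programming LCSS similarity that serves as the baseline above; the Euclidean matching threshold eps and the normalization by the shorter sequence length are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lcss_similarity(a: np.ndarray, b: np.ndarray, eps: float = 0.5) -> float:
    """Classical O(n*m) dynamic-programming LCSS for two multivariate
    sequences a (n x d) and b (m x d): points match when their Euclidean
    distance is at most eps, and the similarity index is the match count
    normalised by the shorter sequence length."""
    n, m = len(a), len(b)
    dp = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if np.linalg.norm(a[i - 1] - b[j - 1]) <= eps:
                dp[i, j] = dp[i - 1, j - 1] + 1
            else:
                dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
    return dp[n, m] / min(n, m)

# Two noisy copies of the same 3-variable sensor trace score close to 1.0.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3)).cumsum(axis=0)
print(lcss_similarity(base, base + rng.normal(scale=0.1, size=base.shape)))
```

The quadratic table fill is exactly the cost the proposed algorithm aims to reduce.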

Author(s):  
Kyungkoo Jun

Background & Objective: This paper proposes a Fourier-transform-inspired method to classify human activities from time-series sensor data. Methods: Our method begins by decomposing a 1D input signal into 2D patterns, a step motivated by the Fourier conversion. The decomposition is aided by a Long Short-Term Memory (LSTM) network, which captures the temporal dependency in the signal and produces encoded sequences. The sequences, once arranged into a 2D array, can represent the fingerprints of the signals. The benefit of such a transformation is that we can exploit recent advances in deep learning models for image classification, such as the Convolutional Neural Network (CNN). Results: The proposed model is therefore a combination of LSTM and CNN. We evaluate the model on two data sets. On the first data set, which is more standardized than the other, our model outperforms or at least matches previous works. For the second data set, we devise schemes to generate training and testing data by varying the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy exceeds 95% in some cases. We also analyze the effect of the parameters on performance.
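A plausible reading of this LSTM-to-2D-to-CNN pipeline, sketched in Keras; the window length, encoding dimension, and layer sizes here are assumptions for illustration, not the paper's reported architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_cnn(num_classes: int, steps: int = 128, enc_dim: int = 64):
    inp = layers.Input(shape=(steps, 1))          # 1D sensor window
    # The LSTM captures temporal dependency and emits one code per step.
    seq = layers.LSTM(enc_dim, return_sequences=True)(inp)
    # Arranging the encoded sequence as a steps x enc_dim single-channel
    # image yields the 2D "fingerprint" consumed by the CNN.
    img = layers.Reshape((steps, enc_dim, 1))(seq)
    x = layers.Conv2D(16, 3, activation="relu")(img)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_lstm_cnn(num_classes=6)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```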


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract Dealing with the sheer size and complexity of today's massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy), and straggler toleration in this framework.
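A toy numpy sketch of the encoded-optimization idea for a least-squares objective: the data is multiplied by a random encoding matrix, and any sufficiently large subset of encoded rows (those returned by the fastest workers) still determines the solution. The Gaussian encoding and the 60% arrival rate are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, redundancy = 200, 10, 2.0           # samples, features, data redundancy
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true

# Encode: multiply the data by a random tall matrix S, creating redundant
# rows to spread over workers.
m = int(redundancy * n)
S = rng.normal(size=(m, n)) / np.sqrt(m)
A_enc, b_enc = S @ A, S @ b

# Simulate stragglers: only 60% of the encoded rows arrive in time, yet the
# surviving rows still determine the solution.
arrived = rng.choice(m, size=int(0.6 * m), replace=False)
x_hat = np.linalg.lstsq(A_enc[arrived], b_enc[arrived], rcond=None)[0]
print(np.linalg.norm(x_hat - x_true))     # near zero: stragglers tolerated
```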


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition increases, it is of ever-increasing importance to have benchmark systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.


2020 ◽  
Vol 12 (23) ◽  
pp. 4007
Author(s):  
Kasra Rafiezadeh Shahi ◽  
Pedram Ghamisi ◽  
Behnood Rasti ◽  
Robert Jackisch ◽  
Paul Scheunders ◽  
...  

The increasing amount of information acquired by imaging sensors in the Earth sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring the Earth's surface. Many studies have been devoted to investigating the use of multi-sensor data sets to improve the performance of supervised learning-based approaches on various tasks (i.e., classification and regression), while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high-spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets, a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, which includes hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm provides an accurate clustering map compared to state-of-the-art sparse subspace-based clustering algorithms.
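As background, here is a generic lasso-based sparse subspace clustering baseline illustrating the self-expressiveness idea that Multi-SSC builds on: each sample is expressed as a sparse combination of the others, and the coefficient magnitudes define an affinity for spectral clustering. This is a minimal single-sensor sketch, not the authors' hierarchical fusion; the alpha value and toy data are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(X, n_clusters, alpha=0.01):
    """Express each sample as a sparse combination of the others (lasso),
    then spectrally cluster the symmetrised coefficient magnitudes."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i                  # exclude self-representation
        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(X[mask].T, X[i])                # columns are other samples
        C[i, mask] = lasso.coef_
    W = np.abs(C) + np.abs(C).T                   # non-negative affinity
    return SpectralClustering(n_clusters, affinity="precomputed").fit_predict(W)

# Toy data: two one-dimensional subspaces in five dimensions.
rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
X = np.vstack([np.outer(rng.normal(size=30), u),
               np.outer(rng.normal(size=30), v)])
print(sparse_subspace_clustering(X, n_clusters=2))
```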


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 825 ◽  
Author(s):  
Fadi Al Machot ◽  
Mohammed R. Elkobaisi ◽  
Kyandoghere Kyamakya

Due to significant advances in sensor technology, studies towards activity recognition have gained interest and maturity in the last few years. Existing machine learning algorithms have demonstrated promising results by classifying activities whose instances have already been seen during training. Activity recognition methods based on real-life settings should cover a growing number of activities in various domains, whereby a significant share of instances will not be present in the training data set. However, covering all possible activities in advance is a complex and expensive task. Concretely, we need a method that can extend the learning model to detect unseen activities without prior knowledge regarding sensor readings about those previously unseen activities. In this paper, we introduce an approach to leverage sensor data in discovering new unseen activities which were not present in the training set. We show that sensor readings can lead to promising results for zero-shot learning, whereby the necessary knowledge can be transferred from seen to unseen activities by using semantic similarity. The evaluation conducted on two data sets extracted from the well-known CASAS data sets shows that the proposed zero-shot learning approach achieves high performance in recognizing unseen (i.e., not present in the training data set) new activities.
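A minimal sketch of the semantic-similarity transfer that underlies zero-shot activity recognition: a regressor maps sensor features into a semantic embedding space using seen activities only, and a reading is then labeled by nearest cosine similarity among candidate label embeddings, seen or unseen. All names, dimensions, and the ridge-regression mapping are assumptions for illustration.

```python
import numpy as np
from numpy.linalg import norm
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Hypothetical semantic embedding per activity label (e.g., word vectors).
sem = {a: rng.normal(size=50)
       for a in ["cooking", "sleeping", "eating", "cleaning"]}
seen, unseen = ["cooking", "sleeping", "eating"], ["cleaning"]

# Sensor-window features for the seen activities only (synthetic here).
X_train = rng.normal(size=(300, 20))
y_train = rng.choice(seen, size=300)
S_train = np.stack([sem[a] for a in y_train])

# Learn a map from sensor features into the semantic space.
reg = Ridge(alpha=1.0).fit(X_train, S_train)

def predict_zero_shot(x, candidates):
    """Project a sensor reading into semantic space and return the candidate
    activity (seen or unseen) with the highest cosine similarity."""
    z = reg.predict(x.reshape(1, -1)).ravel()
    return max(candidates,
               key=lambda a: sem[a] @ z / (norm(sem[a]) * norm(z)))

print(predict_zero_shot(rng.normal(size=20), seen + unseen))
```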


Author(s):  
Christopher MacDonald ◽  
Michael Yang ◽  
Shawn Learn ◽  
Ron Hugo ◽  
Simon Park

Abstract There are several challenges associated with existing rupture detection systems, such as their inability to detect accurately during transient conditions (such as pump dynamics), delayed responses, and the difficulty of transferring models to different pipeline configurations. To address these challenges, we employ multiple Artificial Intelligence (AI) classifiers that rely on pattern recognition instead of traditional operator-set thresholds. The AI techniques, consisting of two-dimensional (2D) Convolutional Neural Networks (CNN) and Adaptive Neuro-Fuzzy Inference Systems (ANFIS), mimic the processes performed by operators during a rupture event. This includes both visualization (using CNN) and rule-based decision making (using ANFIS). The system provides a level of reasoning to an operator through the use of the rule-based AI system. Pump station sensor data is non-dimensionalized prior to AI processing, enabling application to pipeline configurations outside of the training data set. The AI algorithms undergo testing and training using two data sets: laboratory-collected data that mimics transient pump-station operations, and real operator data that includes Real Time Transient Model (RTTM) simulated ruptures. The use of non-dimensional sensor data enables the system to detect ruptures from pipeline data not used in the training process.
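A hedged sketch of the non-dimensionalization step that enables transfer across pipeline configurations: sensor traces are scaled by configuration-specific reference values so the classifiers see inputs of comparable magnitude. The reference quantities chosen here are assumptions, not the paper's exact scheme.

```python
import numpy as np

def nondimensionalize(signal: np.ndarray, ref_value: float, ref_span: float):
    """Scale a sensor trace by configuration-specific reference values so
    that models trained on one pipeline can be applied to another."""
    return (signal - ref_value) / ref_span

# A pressure drop (rupture-like event at index 2) becomes an O(1) feature
# regardless of the pipeline's absolute operating pressure.
p = np.array([5000.0, 5050.0, 4300.0, 4310.0])        # kPa
print(nondimensionalize(p, ref_value=5000.0, ref_span=500.0))
```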


1999 ◽  
Vol 09 (03) ◽  
pp. 195-202 ◽  
Author(s):  
JOSÉ ALFREDO FERREIRA COSTA ◽  
MÁRCIO LUIZ DE ANDRADE NETTO

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics, and biology. Partitioning a set of n patterns in a p-dimensional feature space must be done such that those in a given cluster are more similar to each other than to the rest. As there are approximately K^n/K! possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space grows further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, the automation of knowledge discovery by SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by SOM. Mathematical morphology is applied to identify regions of neurons that are similar. The number of regions and their labels are automatically found, and they are related to the number of clusters in a multivariate data set. New data can be classified by labeling it according to the best-matching neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented, as well as related advantages and drawbacks of the method.
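A compact sketch of the SOM-then-morphology pipeline using the minisom library: train a SOM, take its U-matrix, and label the connected low-distance regions, whose count estimates the number of clusters. The SOM size, the mean threshold, and scipy's connected-component labeling are simplifying assumptions; the paper's morphological post-processing is more elaborate.

```python
import numpy as np
from minisom import MiniSom
from scipy import ndimage

# Three well-separated 4-variate Gaussian components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 4)) for c in (0.0, 2.0, 4.0)])

som = MiniSom(15, 15, 4, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X, 5000)

u = som.distance_map()                   # U-matrix: distance to neighbours
# High U-matrix values form ridges between clusters; label the connected
# low-distance regions to count clusters automatically.
regions, n_regions = ndimage.label(u < u.mean())
print("clusters found:", n_regions)

# New data is classified by the region of its best-matching unit.
print("sample 0 -> cluster", regions[som.winner(X[0])])
```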


2021 ◽  
Author(s):  
Nazrina Aziz

This thesis investigates three research problems which arise in multivariate data and censored regression: the identification of outliers in multivariate data, a dissimilarity measure for clustering purposes, and diagnostics analysis for the Buckley-James method in censored regression.

Outliers can be defined simply as observations that are isolated from the other observations in the data set. There are two main reasons to look for outliers. The first is the researcher's intention; the second is the effect of outliers on analyses: their presence affects means, variances, and regression coefficients, biases or distorts estimates, and inflates sums of squares, so false conclusions are likely. Sometimes the identification of outliers is the main objective of the analysis, whether the outliers are then to be removed or down-weighted prior to fitting a non-robust model. This thesis does not differentiate between the various justifications for outlier detection; the aim is to advise the analyst of observations that are considerably different from the majority. Note that the techniques for the identification of outliers introduced in this thesis are applicable to a wide variety of settings and are demonstrated on both large and small data sets. In this thesis, observations that are located far away from the remaining data are considered to be outliers.

Some techniques for the identification of outliers are also applicable to finding clusters. There are two major challenges in clustering. The first is that identifying clusters in high-dimensional data sets is difficult because of the curse of dimensionality. The second is that a new dissimilarity measure is needed, as some traditional distance functions cannot capture the pattern dissimilarity among objects. This thesis deals with the latter challenge: it introduces the Influence Angle Cluster Approach (iaca), which may be used as a dissimilarity matrix, and shows that iaca successfully develops clusters when used in partitioning clustering, even when the data set has mixed variables, i.e., interval and categorical variables. The iaca is developed based on the influence eigenstructure.

The first two problems deal with complete data sets; it is also interesting to study incomplete, i.e., censored, data sets. The term 'censored' is mostly used in the biological sciences, such as survival analysis. Researchers are often interested in comparing the survival distributions of two samples. Although this can be done with the logrank test, that method cannot examine the effects of more than one variable at a time, a difficulty easily overcome by a survival regression model. Examples of survival regression models are the Cox model, Miller's model, the Buckley-James model, and the Koul-Susarla-Van Ryzin model. The Buckley-James model's performance is comparable with the Cox model, and it performs best when compared both to the Miller model and to the Koul-Susarla-Van Ryzin model. Previous comparison studies proved that the Buckley-James estimator is more stable and easier to explain to non-statisticians than the Cox model. Nevertheless, researchers today tend to use the Cox model instead of the Buckley-James model because of the scarcity of implementations of the Buckley-James model in statistical software and the limited choice of diagnostics analyses. Currently, only a few diagnostics analyses for the Buckley-James model exist. Therefore, this thesis proposes two new diagnostics analyses for the Buckley-James model. The first is called the renovated Cook's distance. This method produces results comparable with previous findings; however, it can only detect influential observations from the uncensored group, not from the censored group. This issue needs further investigation because censored points can become influential cases in censored regression. Secondly, a local influence approach for the Buckley-James model is proposed, consisting of variance perturbation, response variable perturbation, censoring status perturbation, and independent variables perturbation. The proposed diagnostics improve on and also challenge the findings of previous ones by taking into account both censored and uncensored data as potential influential observations.
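For reference, here is the classical Cook's distance for ordinary least squares, the quantity the thesis 'renovates' for the Buckley-James censored setting; this sketch covers only the uncensored OLS case and is not the thesis's renovated statistic.

```python
import numpy as np

def cooks_distance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Classical Cook's distance for OLS: combines each observation's
    squared residual with its leverage h_ii."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)
    return resid**2 / (p * s2) * h / (1 - h) ** 2

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
y[0] += 10.0                                      # plant an influential point
print(np.argmax(cooks_distance(X, y)))            # flags observation 0
```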


Author(s):  
M. Sheshikala ◽  
D. Rajeswara Rao ◽  
Md. Ali Kadampur

In this paper, we investigate the colocation mining problem in the context of uncertain data. Uncertain data is partially complete data, and much real-world data is uncertain, for example, demographic data, sensor network data, and GIS data. Handling such data is a challenge for knowledge discovery, particularly in colocation mining. One straightforward method is to find the probabilistic prevalent colocations (PPCs), i.e., all colocations that are likely to be generated from a random world. For this, we first apply an approximation error to find all the PPCs, which reduces the computations. Next, we find all the possible worlds, split them into two different worlds, and compute the prevalence probability. These worlds are compared against a minimum probability threshold to decide whether a colocation is a probabilistic prevalent colocation (PPC) or not. The experimental results on the selected data set show a significant improvement in computational time in comparison to some of the existing methods used in colocation mining.
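A toy possible-worlds computation of a prevalence probability: each uncertain instance exists independently with some probability, each world's probability is the product of its instance outcomes, and a pattern is a PPC when the total probability of the worlds in which it is prevalent reaches the threshold. The at-least-min_count prevalence criterion is a simplified stand-in for the participation index, and the exponential enumeration illustrates why the approximation discussed above matters.

```python
from itertools import product

def prevalence_probability(probs, min_count):
    """Sum the probabilities of all possible worlds in which at least
    min_count of the uncertain instances exist."""
    total = 0.0
    for world in product([0, 1], repeat=len(probs)):
        p_world = 1.0
        for exists, p in zip(world, probs):
            p_world *= p if exists else 1.0 - p
        if sum(world) >= min_count:
            total += p_world
    return total

probs = [0.9, 0.7, 0.6, 0.4]        # instance existence probabilities
prob = prevalence_probability(probs, min_count=2)
print(prob, prob >= 0.5)            # a PPC if it clears the threshold
```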


2006 ◽  
Vol 5 (2) ◽  
pp. 125-136 ◽  
Author(s):  
Jimmy Johansson ◽  
Patric Ljung ◽  
Mikael Jern ◽  
Matthew Cooper

Parallel coordinates is a well-known technique used for the visualization of multivariate data. When the size of the data set increases, the parallel coordinates display results in an image far too cluttered to perceive any structure. We tackle this problem by constructing high-precision textures to represent the data. By using transfer functions that operate on the high-precision textures, it is possible to highlight different aspects of the entire data set or of clusters of the data. Our methods are implemented in both standard 2D parallel coordinates and 3D multi-relational parallel coordinates. Furthermore, when visualizing a larger number of clusters, a technique called 'feature animation' may be used as guidance by presenting various cluster statistics. A case study is also performed to illustrate the analysis process when analysing large multivariate data sets using our proposed techniques.
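For contrast with the texture-based approach, here is the standard line-drawing parallel-coordinates plot that becomes cluttered as data grows, sketched with pandas; the iris data set is an arbitrary stand-in.

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.assign(species=iris.target_names[iris.target])

# One polyline per observation; overplotting is what makes this approach
# break down on large data sets.
parallel_coordinates(df.drop(columns="target"), "species", alpha=0.4)
plt.show()
```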

