A Fast Logdet Divergence Based Metric Learning Algorithm for Large Data Sets Classification

Abstract and Applied Analysis ◽

10.1155/2014/463981 ◽

2014 ◽

Vol 2014 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Jiangyuan Mei ◽

Jian Hou ◽

Jicheng Chen ◽

Hamid Reza Karimi

Keyword(s):

Learning Algorithm ◽

Metric Learning ◽

Large Data ◽

Feature Space ◽

Industrial Applications ◽

Large Data Sets ◽

Training Data ◽

High Dimensional ◽

Data Sets ◽

Benchmark Data

Classification of large data sets is widely used in many industrial applications. Classifying large data sets efficiently, accurately, and robustly is challenging, as such data sets contain numerous instances in a high-dimensional feature space. To address this problem, we present an online Logdet divergence based metric learning (LDML) model that exploits the power of metric learning. We first generate a Mahalanobis matrix by learning from the training data with the LDML model. We also propose a compressed representation of the high-dimensional Mahalanobis matrix to reduce the computational complexity of each iteration. The final Mahalanobis matrix obtained this way measures the distances between instances accurately and serves as the basis of classifiers, for example, the k-nearest neighbors classifier. Experiments on benchmark data sets demonstrate that the proposed algorithm compares favorably with state-of-the-art methods.
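As a rough illustration of how a Mahalanobis matrix can serve as the basis of a k-nearest neighbors classifier, the sketch below computes the squared distance (x - y)^T M (x - y) and votes among the k closest training points. It is not the authors' LDML algorithm: the matrix `M` is a hand-picked stand-in for a learned matrix, and the toy data, function names, and `k` are assumptions made for this example.

```python
from collections import Counter

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return sum(d[i] * Md[i] for i in range(len(d)))

def knn_predict(query, points, labels, M, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(points)),
                   key=lambda i: mahalanobis_sq(query, points[i], M))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the class depends on the first feature only,
# while the second feature is large-scale noise.
points = [(0.0, 0.0), (0.2, 5.0), (0.1, -4.0),
          (1.0, 0.5), (1.1, 4.0), (0.9, -5.0)]
labels = ["a", "a", "a", "b", "b", "b"]

# Hand-picked stand-in for a learned Mahalanobis matrix (assumption):
# it down-weights the noisy second feature.
M = [[1.0, 0.0], [0.0, 0.0001]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # plain squared Euclidean distance

query = (0.15, 4.5)
print(knn_predict(query, points, labels, identity))
print(knn_predict(query, points, labels, M))
```

With the identity matrix the noisy feature dominates the neighbor search, whereas the down-weighting matrix lets the informative feature decide the vote. A learned matrix would play the role of `M` here.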


Feature Selection Using Neighborhood Positive Approximation Rough Set

Feature Dimension Reduction for Content-Based Image Identification - Advances in Multimedia and Interactive Technologies ◽

10.4018/978-1-5225-5775-3.ch005 ◽

2018 ◽

pp. 74-99

Author(s):

Mohammad Atique ◽

Leena Homraj Patil

Keyword(s):

Feature Selection ◽

Rough Set ◽

Attribute Reduction ◽

Large Data ◽

Feature Space ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Overall Performance ◽

Feature Selection Approach

Attribute reduction and feature selection are the main issues in rough set theory. Researchers have proposed several attribute reduction methods based on rough sets. However, these methods are time consuming for large data sets. Since the key lies in reducing the attributes and selecting the relevant features, the main aim is to reduce the dimensionality of a huge amount of data to obtain a smaller subset that still provides the useful information. A feature selection approach reduces the dimensionality of the feature space and improves the overall performance. The challenge in feature selection is to deal with high-dimensional data. To overcome these issues and challenges, this chapter describes feature selection and attribute reduction for data sets based on the proposed neighborhood positive approximation approach. The proposed system performs attribute reduction and finds the relevant features. Evaluation shows that the proposed neighborhood positive approximation algorithm is effective and feasible for large data sets and also reduces the feature space.


Understanding High Dimensional and Large Data Sets: Some Mathematical Challenges and Opportunities

Data Mining for Scientific and Engineering Applications - Massive Computing ◽

10.1007/978-1-4615-1733-7_2 ◽

2001 ◽

pp. 23-34

Author(s):

Jagdish Chandra

Keyword(s):

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Challenges And Opportunities


Data segmentation based on the local intrinsic dimension

Scientific Reports ◽

10.1038/s41598-020-72222-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Michele Allegra ◽

Elena Facco ◽

Francesco Denti ◽

Alessandro Laio ◽

Antonietta Mira

Keyword(s):

High Dimensional Data ◽

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Imaging Data ◽

Unsupervised Segmentation ◽

Real World Data ◽

Data Set ◽

Intrinsic Dimension

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.


An Improved Fellegi-Sunter Framework for Probabilistic Record Linkage Between Large Data Sets

Journal of Official Statistics ◽

10.2478/jos-2020-0039 ◽

2020 ◽

Vol 36 (4) ◽

pp. 803-825

Author(s):

Marco Fortini

Keyword(s):

Record Linkage ◽

Large Data ◽

Real Data ◽

Large Data Sets ◽

Training Data ◽

Data Sets ◽

Estimated Parameters ◽

Unbiased Estimates ◽

Structural Zeros ◽

Different Sources

Record linkage addresses the problem of identifying pairs of records coming from different sources and referring to the same unit of interest. Fellegi and Sunter propose an optimal statistical test in order to assign the match status to the candidate pairs, in which the needed parameters are obtained through the EM algorithm directly applied to the set of candidate pairs, without recourse to training data. However, this procedure has a quadratic complexity as the two lists to be matched grow. In addition, a large bias of EM-estimated parameters is also produced in this case, so that the problem is tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, the probability that excluded pairs would actually be true matches cannot be assessed through such methods. The present work proposes an efficient approach in which the comparisons of records between lists are minimised while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased estimates of the parameters. The improvement achieved by the suggested method is shown by means of simulations and an application based on real data.
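To make the Fellegi-Sunter decision rule more concrete, here is a minimal sketch of the classical match weight: for each comparison field, the log-likelihood ratio of agreement (or disagreement) under the match and non-match hypotheses is summed, and the total is compared with two thresholds. The `m` and `u` probabilities and the thresholds below are hypothetical values chosen for illustration. In the setting the abstract describes they would be estimated with the EM algorithm, which this sketch does not implement, and it does not include the structural-zero modelling proposed in the paper.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log2 likelihood ratios over comparison fields.

    agreements[i] is True when field i agrees for the record pair,
    m[i] = P(field i agrees | pair is a match),
    u[i] = P(field i agrees | pair is a non-match).
    """
    total = 0.0
    for agrees, mi, ui in zip(agreements, m, u):
        if agrees:
            total += math.log2(mi / ui)
        else:
            total += math.log2((1 - mi) / (1 - ui))
    return total

def classify(weight, lower, upper):
    """Three-way decision: match, non-match, or undecided in between."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"

# Hypothetical parameters for three fields (for example name, birth year, postcode).
m = [0.9, 0.8, 0.95]
u = [0.1, 0.2, 0.05]
lower, upper = -3.0, 3.0  # hypothetical thresholds

for pattern in ([True, True, True], [True, True, False], [False, False, False]):
    w = match_weight(pattern, m, u)
    print(pattern, round(w, 3), classify(w, lower, upper))
```

Because every candidate pair must be scored, the number of weight computations grows with the product of the two list sizes, which is the quadratic cost the abstract refers to.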


Learning algorithm developed to mine large data sets

Physics Today ◽

10.1063/pt.5.027815 ◽

2014 ◽

Keyword(s):

Learning Algorithm ◽

Large Data ◽

Large Data Sets ◽

Data Sets


Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces

Big Data ◽

10.1201/b18050-8 ◽

2015 ◽

pp. 76-107

Keyword(s):

Singular Value Decomposition ◽

Similarity Search ◽

Large Data ◽

Singular Value ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Value Decomposition


Position Regularized Core Vector Machines

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.574.728 ◽

2014 ◽

Vol 574 ◽

pp. 728-733

Author(s):

Shu Xia Lu ◽

Cai Hong Jiao ◽

Le Tong ◽

Yang Fan Zhou

Keyword(s):

Large Data ◽

Experimental Results ◽

Large Data Sets ◽

Data Sets ◽

Benchmark Data ◽

Vector Machines ◽

Data Points ◽

Minimum Enclosing Ball ◽

Better Than

The Core Vector Machine (CVM) can be used to deal with large data sets by finding the minimum enclosing ball (MEB), but one drawback is that CVM is very sensitive to outliers. To tackle this problem, we propose a novel Position Regularized Core Vector Machine (PCVM). In the proposed PCVM, the data points are regularized by assigning a position-based weighting. Experimental results on several benchmark data sets show that the performance of PCVM is much better than that of CVM.
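The sensitivity of a minimum enclosing ball to outliers can be seen in a small sketch. The code below approximates the MEB with the simple iterative scheme of Bădoiu and Clarkson (repeatedly step toward the current farthest point with a shrinking step size). This is a swapped-in illustration of the MEB idea only: it is neither the CVM solver nor the position-based weighting of PCVM, and the toy points and iteration count are assumptions.

```python
import math

def approx_meb(points, iterations=500):
    """Approximate the minimum enclosing ball of `points`.

    Starts at an arbitrary point and, at step i, moves the center a
    fraction 1 / (i + 1) of the way toward the farthest point.
    Returns (center, radius), where radius covers all points.
    """
    c = list(points[0])
    for i in range(1, iterations + 1):
        far = max(points, key=lambda p: math.dist(p, c))
        c = [ci + (fi - ci) / (i + 1) for ci, fi in zip(c, far)]
    radius = max(math.dist(p, c) for p in points)
    return c, radius

# Hypothetical toy data: four corners of a square, then one outlier added.
square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
center, radius = approx_meb(square)
center_out, radius_out = approx_meb(square + [(10.0, 10.0)])

print(center, radius)          # center near the origin, radius near sqrt(2)
print(center_out, radius_out)  # a single outlier inflates the ball
```

A single distant point drags the center away from the bulk of the data and inflates the radius several times over, which is the kind of behavior a weighting of the data points is meant to dampen.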


Car-Following Described by Blending Data-Driven and Analytical Models: A Gaussian Process Regression Approach

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/03611981211032648 ◽

2021 ◽

pp. 036119812110326

Author(s):

Ignasi Echaniz Soldevila ◽

Victor L. Knoop ◽

Serge Hoogendoorn

Keyword(s):

Gaussian Process Regression ◽

Large Data ◽

Driving Behavior ◽

Large Data Sets ◽

Training Data ◽

Data Driven ◽

Data Sets ◽

Data Set ◽

Car Following ◽

New Variables

Traffic engineers rely on microscopic traffic models to design, plan, and operate a wide range of traffic applications. Recently, large data sets, albeit incomplete and covering small spatial regions, have become available thanks to technology improvements and governmental efforts. With this study we aim to gain new empirical insights into longitudinal driving behavior and to formulate a model which can benefit from these new challenging data sources. This paper proposes an application of an existing formulation, Gaussian process regression (GPR), to describe the individual longitudinal driving behavior of drivers. The method integrates a parametric and a non-parametric mathematical formulation. The model predicts an individual driver’s acceleration given a set of variables. It uses the GPR to make predictions when there is correlation between a new input and the training data set. The data-driven model benefits from a large training data set to capture all driver longitudinal behavior, which would be difficult to fit in fixed parametric equation(s). The methodology allows us to train models with new variables without altering the model formulation. Importantly, the model also uses existing traditional parametric car-following models to predict acceleration when no similar situations are found in the training data set. A case study using radar data in an urban environment shows that a hybrid model performs better than a parametric model alone and suggests that traffic light status over time influences drivers’ acceleration. This methodology can help engineers to use large data sets and to find new variables to describe traffic behavior.
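One standard way to blend a parametric and a non-parametric formulation is to use the parametric model as the prior mean of a Gaussian process, so the prediction is the parametric output plus a data-driven correction that vanishes far from the training data. The sketch below shows this under stated assumptions: the abstract does not give the authors' exact formulation, `parametric` is a hypothetical stand-in rather than a real car-following model, and the single input (gap), the RBF kernel, and all numeric values are illustrative.

```python
import math

def rbf(a, b, length=2.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def parametric(gap):
    """Hypothetical parametric model: acceleration as a function of gap."""
    return 0.1 * (gap - 20.0)

def gpr_predict(x_new, xs, ys, noise=1e-2, length=2.0):
    """GP posterior mean with the parametric model as prior mean."""
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    residuals = [y - parametric(x) for x, y in zip(xs, ys)]
    alpha = solve(K, residuals)
    correction = sum(rbf(x_new, x, length) * a for x, a in zip(xs, alpha))
    return parametric(x_new) + correction

# Hypothetical observations: drivers accelerate 0.5 more than the parametric model says.
xs = [10.0, 12.0, 14.0, 16.0]
ys = [parametric(x) + 0.5 for x in xs]

print(gpr_predict(12.0, xs, ys))  # close to the observed value at gap 12
print(gpr_predict(60.0, xs, ys))  # far from the data: falls back to parametric(60)
```

Near the training inputs the kernel correlation is high and the prediction follows the data, while at an input with no similar training situation the correction term is negligible and the parametric model alone determines the output.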


A convolutional neural network-based screening tool for X-ray serial crystallography

Journal of Synchrotron Radiation ◽

10.1107/s1600577518004873 ◽

2018 ◽

Vol 25 (3) ◽

pp. 655-670 ◽

Cited By ~ 10

Author(s):

Tsung-Wei Ke ◽

Aaron S. Brewster ◽

Stella X. Yu ◽

Daniela Ushizima ◽

Chao Yang ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Large Data ◽

Large Data Sets ◽

Training Data ◽

Data Sets ◽

X Ray ◽

X Ray Crystallography ◽

Automatic Image Processing

A new tool is introduced for screening macromolecular X-ray crystallography diffraction images produced at an X-ray free-electron laser light source. Based on a data-driven deep learning approach, the proposed tool executes a convolutional neural network to detect Bragg spots. The automatic image processing algorithms described can enable the classification of large data sets acquired under realistic conditions, consisting of noisy data with experimental artifacts. Outcomes are compared for different data regimes, including samples from multiple instruments and differing amounts of training data for neural network optimization.


Visualization of High-Dimensional Data with Polar Coordinates

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch315 ◽

2011 ◽

pp. 2062-2067

Author(s):

Frank Rehm ◽

Frank Klawonn ◽

Rudolf Kruse

Keyword(s):

High Dimensional Data ◽

Computation Time ◽

Large Data ◽

Feature Space ◽

Three Dimensions ◽

Polar Coordinates ◽

High Dimensional ◽

Data Sets ◽

Memory Space ◽

Resource Requirements

Many applications in science and business, such as signal analysis or customer segmentation, deal with large amounts of data which are usually high dimensional in the feature space. As a part of preprocessing and exploratory data analysis, visualization of the data helps to decide which kind of data mining method is likely to lead to good results or whether outliers or noisy data need to be treated beforehand (Barnett & Lewis, 1994; Hawkins, 1980). Since the visual assessment of a feature space that has more than three dimensions is not possible, it becomes necessary to find an appropriate visualization scheme for such data sets. Multidimensional scaling (MDS) is a family of methods that seek to present the important structure of the data in a reduced number of dimensions. Due to the approach of distance preservation that is followed by conventional MDS techniques, resource requirements regarding memory space and computation time are fairly high and prevent their application to large data sets. In this work we present two methods that visualize high-dimensional data on the plane using a new approach. An algorithm is presented that allows our method to be applied to larger data sets. We also present some results on a benchmark data set.
