Support Vector classification for large data sets by reducing training data with change of classes

Abstract

Record linkage addresses the problem of identifying pairs of records that come from different sources and refer to the same unit of interest. Fellegi and Sunter propose an optimal statistical test for assigning the match status to candidate pairs, in which the required parameters are obtained by applying the EM algorithm directly to the set of candidate pairs, without recourse to training data. However, this procedure has quadratic complexity as the two lists to be matched grow. In addition, the EM-estimated parameters become strongly biased in this case, so the problem is usually tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, such methods cannot assess the probability that excluded pairs are actually true matches. The present work proposes an efficient approach in which the comparisons of records between lists are minimised, while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased estimates of the parameters. The improvement achieved by the suggested method is shown by means of simulations and an application based on real data.
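As a rough illustration of the baseline this abstract starts from, the sketch below runs EM on binary agreement patterns to estimate the Fellegi-Sunter parameters without labelled pairs. It covers only the standard conditional-independence model, not the structural-zero modification the abstract proposes. The function names, starting values, clamping constant and toy data are all assumptions made for this example.

```python
import math

def fellegi_sunter_em(patterns, n_iter=50, eps=1e-6):
    """EM for a two-class mixture over binary agreement patterns.

    patterns: list of 0/1 tuples, one per candidate pair, one entry per field.
    Returns (p, m, u): match proportion, P(agree | match), P(agree | non-match).
    Assumes fields agree independently within each class.
    """
    k = len(patterns[0])
    n = len(patterns)
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # hypothetical starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        g = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(gamma):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate parameters, clamped away from 0 and 1
        total = sum(g)
        p = min(max(total / n, eps), 1.0 - eps)
        for j in range(k):
            agree_m = sum(gi for gi, gamma in zip(g, patterns) if gamma[j])
            agree_u = sum(1.0 - gi for gi, gamma in zip(g, patterns) if gamma[j])
            m[j] = min(max(agree_m / total, eps), 1.0 - eps)
            u[j] = min(max(agree_u / (n - total), eps), 1.0 - eps)
    return p, m, u

def match_weight(gamma, m, u):
    """Log2 likelihood ratio used to rank a pair as match vs non-match."""
    return sum(
        math.log2(m[j] / u[j]) if agree else math.log2((1.0 - m[j]) / (1.0 - u[j]))
        for j, agree in enumerate(gamma)
    )

# Toy agreement patterns on three fields (invented for illustration)
patterns = (
    [(1, 1, 1)] * 20 + [(1, 1, 0)] * 2 + [(1, 0, 0)] * 10
    + [(0, 1, 0)] * 10 + [(0, 0, 1)] * 5 + [(0, 0, 0)] * 80
)
p, m, u = fellegi_sunter_em(patterns)
print(p, m, u, match_weight((1, 1, 1), m, u))
```

Because every candidate pair enters the E-step, the cost grows with the product of the two list sizes, which is the quadratic complexity the abstract refers to.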


Support Vector Machine Classification Based on Fuzzy Clustering for Large Data Sets

Lecture Notes in Computer Science - MICAI 2006: Advances in Artificial Intelligence ◽

10.1007/11925231_54 ◽

2006 ◽

pp. 572-582 ◽

Cited By ~ 18

Author(s):

Jair Cervantes ◽

Xiaoou Li ◽

Wen Yu

Keyword(s):

Support Vector Machine ◽

Fuzzy Clustering ◽

Large Data ◽

Large Data Sets ◽

Support Vector ◽

Data Sets ◽

Support Vector Machine Classification


Nonlinear clustering-based support vector machine for large data sets

Optimization Methods and Software ◽

10.1080/10556780802102453 ◽

2008 ◽

Vol 23 (4) ◽

pp. 533-549 ◽

Cited By ~ 1

Author(s):

Yongqiao Wang ◽

Xun Zhang ◽

Shouyang Wang ◽

K.K. Lai

Keyword(s):

Support Vector Machine ◽

Large Data ◽

Large Data Sets ◽

Support Vector ◽

Data Sets


Multi-Class Support Vector Machines for Large Data Sets via Minimum Enclosing Ball Clustering

2007 4th International Conference on Electrical and Electronics Engineering ◽

10.1109/iceee.2007.4344994 ◽

2007 ◽

Cited By ~ 2

Author(s):

Jair Cervantes ◽

Xiaoou Li ◽

Wen Yu ◽

Javier Bejarano

Keyword(s):

Support Vector Machines ◽

Large Data ◽

Large Data Sets ◽

Support Vector ◽

Data Sets ◽

Vector Machines ◽

Minimum Enclosing Ball


Car-Following Described by Blending Data-Driven and Analytical Models: A Gaussian Process Regression Approach

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/03611981211032648 ◽

2021 ◽

pp. 036119812110326

Author(s):

Ignasi Echaniz Soldevila ◽

Victor L. Knoop ◽

Serge Hoogendoorn

Keyword(s):

Gaussian Process Regression ◽

Large Data ◽

Driving Behavior ◽

Large Data Sets ◽

Training Data ◽

Data Driven ◽

Data Sets ◽

Data Set ◽

Car Following ◽

New Variables

Traffic engineers rely on microscopic traffic models to design, plan, and operate a wide range of traffic applications. Recently, large data sets, albeit incomplete and covering only small spatial regions, have become available thanks to technology improvements and governmental efforts. With this study we aim to gain new empirical insights into longitudinal driving behavior and to formulate a model that can benefit from these new, challenging data sources. This paper proposes an application of an existing formulation, Gaussian process regression (GPR), to describe the individual longitudinal driving behavior of drivers. The method integrates a parametric and a non-parametric mathematical formulation. The model predicts an individual driver’s acceleration given a set of variables. It uses the GPR to make predictions when the new input is correlated with the training data set. The data-driven model benefits from a large training data set to capture all longitudinal driver behavior, which would be difficult to fit with fixed parametric equation(s). The methodology allows us to train models with new variables without altering the model formulation. Importantly, the model also uses existing traditional parametric car-following models to predict acceleration when no similar situations are found in the training data set. A case study using radar data in an urban environment shows that a hybrid model performs better than a parametric model alone and suggests that traffic light status over time influences drivers’ acceleration. This methodology can help engineers to use large data sets and to find new variables to describe traffic behavior.
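One standard way to obtain the behaviour described here (data-driven near the training data, parametric elsewhere) is a Gaussian process whose prior mean is the parametric model. The abstract does not say that the paper uses exactly this construction, so the sketch below is only an assumption-laden illustration. The one-input linear `parametric_accel` model, the kernel length scale, the noise level and the toy observations are all hypothetical.

```python
import math

def rbf(a, b, length=2.0):
    """Squared-exponential kernel on scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def parametric_accel(gap):
    """Hypothetical stand-in for a car-following model: acceleration from gap (m)."""
    return 0.1 * (gap - 20.0)

def fit(gaps, accels, noise=1e-2):
    """Fit the GP to the residuals left over by the parametric model."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(gaps)]
         for i, a in enumerate(gaps)]
    resid = [y - parametric_accel(x) for x, y in zip(gaps, accels)]
    return solve(K, resid)

def predict(gap, gaps, alpha):
    """Parametric prediction plus a GP correction that fades away from the data."""
    correction = sum(rbf(gap, g) * a for g, a in zip(gaps, alpha))
    return parametric_accel(gap) + correction

# Toy observations (invented): gap in metres, acceleration in m/s^2
gaps = [10.0, 14.0, 18.0, 22.0]
accels = [-0.5, -0.9, 0.1, 0.6]
alpha = fit(gaps, accels)
print(predict(14.0, gaps, alpha))   # close to the observed value at a training point
print(predict(100.0, gaps, alpha))  # far from the data: falls back to parametric_accel
```

Far from every training input the kernel values vanish, so the correction term drops out and the prediction is the parametric model alone, which mirrors the fallback described in the abstract.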


A convolutional neural network-based screening tool for X-ray serial crystallography

Journal of Synchrotron Radiation ◽

10.1107/s1600577518004873 ◽

2018 ◽

Vol 25 (3) ◽

pp. 655-670 ◽

Cited By ~ 10

Author(s):

Tsung-Wei Ke ◽

Aaron S. Brewster ◽

Stella X. Yu ◽

Daniela Ushizima ◽

Chao Yang ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Large Data ◽

Large Data Sets ◽

Training Data ◽

Data Sets ◽

X Ray ◽

X Ray Crystallography ◽

Automatic Image Processing

A new tool is introduced for screening macromolecular X-ray crystallography diffraction images produced at an X-ray free-electron laser light source. Based on a data-driven deep learning approach, the proposed tool uses a convolutional neural network to detect Bragg spots. The automatic image processing algorithms described enable the classification of large data sets acquired under realistic conditions, consisting of noisy data with experimental artifacts. Outcomes are compared for different data regimes, including samples from multiple instruments and differing amounts of training data for neural network optimization.


Fast Support Vector Machine Classification for Large Data Sets

International Journal of Computational Intelligence Systems ◽

10.1080/18756891.2013.868148 ◽

2013 ◽

Vol 7 (2) ◽

pp. 197-212 ◽

Cited By ~ 4

Author(s):

Xiaoou Li ◽

Wen Yu

Keyword(s):

Support Vector Machine ◽

Large Data ◽

Large Data Sets ◽

Support Vector ◽

Data Sets ◽

Support Vector Machine Classification


A Practical Robust and Efficient RBF Metamodel Method for Typical Engineering Problems

Volume 1: 34th Design Automation Conference, Parts A and B ◽

10.1115/detc2008-49994 ◽

2008 ◽

Cited By ~ 1

Author(s):

Xingjie Fang ◽

Liping Wang ◽

Don Beeson ◽

Gene Wiggs

Keyword(s):

Principal Component ◽

Large Data ◽

Large Data Sets ◽

Training Data ◽

Data Sets ◽

Dimensional Model ◽

Data Set ◽

Engineering Problems ◽

Processing Techniques ◽

Generalization Accuracy

Radial Basis Function (RBF) metamodels have recently attracted increased interest due to their significant advantages over other types of non-parametric metamodels. However, because RBF models interpolate the training data, the accuracy of the model may deteriorate dramatically if the training data set contains duplicate information, noise or outliers. Also, constructing the metamodel may be time-consuming whenever the training data sets are large or a high-dimensional model is required. In this paper, we propose a robust and efficient RBF metamodeling approach based on data pre-processing techniques that alleviate the accuracy and efficiency issues commonly encountered when RBF models are used in typical real engineering situations. These techniques include 1) the removal of duplicate training data information, 2) the generation of smaller uniformly distributed subsets of training data from large data sets and 3) the quantification and identification of outliers by principal component analysis (PCA) and Hotelling statistics. Simulation results are used to validate the generalization accuracy and efficiency of the proposed approach.
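A minimal sketch of pre-processing steps 1) and 3), assuming two-dimensional training points and exact duplicates. It scores each point with Hotelling's T² computed from the sample covariance, which equals the T² obtained when all principal components are retained. The abstract does not give the paper's actual procedure, so the function names, the threshold of 6.0 and the toy data are hypothetical; step 2) is not shown.

```python
def drop_duplicates(points):
    """Keep the first occurrence of each exactly repeated point."""
    seen, unique = set(), []
    for pt in points:
        if pt not in seen:
            seen.add(pt)
            unique.append(pt)
    return unique

def hotelling_t2(points):
    """Hotelling T^2 score of each 2-D point against the sample mean and covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # closed-form inverse of the 2x2 covariance
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        scores.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return scores

def flag_outliers(points, threshold=6.0):
    """Return points whose T^2 exceeds a (hypothetical) threshold."""
    return [pt for pt, t2 in zip(points, hotelling_t2(points)) if t2 > threshold]

# Toy training inputs (invented): a 3x3 grid, two repeated points, one far outlier
raw = [
    (0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
    (0.0, 0.5), (0.5, 0.5), (1.0, 0.5),
    (0.0, 1.0), (0.5, 1.0), (1.0, 1.0),
    (0.5, 0.5), (1.0, 1.0),          # duplicates
    (10.0, 10.0),                    # outlier
]
unique = drop_duplicates(raw)
print(len(unique), flag_outliers(unique))
```

With n points in the sample, a single T² score cannot exceed (n - 1)²/n, so a fixed threshold only makes sense once the data set is large enough for it to be reachable.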
