Globally Approximate Gaussian Processes for Big Data With Application to Data-Driven Metamaterials Design

2019 ◽  
Vol 141 (11) ◽  
Author(s):  
Ramin Bostanabad ◽  
Yu-Chin Chan ◽  
Liwei Wang ◽  
Ping Zhu ◽  
Wei Chen

Abstract We introduce a novel method for Gaussian process (GP) modeling of massive datasets called globally approximate Gaussian process (GAGP). Unlike most large-scale supervised learners such as neural networks and trees, GAGP is easy to fit and its behavior is interpretable, making it particularly useful in engineering design with big data. The key idea of GAGP is to build a collection of independent GPs that share the same hyperparameters but randomly distribute the entire training dataset among themselves. This is based on our observation that the estimated GP hyperparameters change negligibly once the size of the training data exceeds a certain level, which can be estimated systematically. For inference, the predictions from all GPs in the collection are pooled, allowing the entire training dataset to be exploited efficiently. Through analytical examples, we demonstrate that GAGP achieves very high predictive power, matching (and in some cases exceeding) that of state-of-the-art supervised learning methods. We illustrate the application of GAGP in engineering design with a problem on data-driven metamaterials, using it to link reduced-dimension geometrical descriptors of unit cells to their properties. Searching for new unit cell designs with desired properties is then achieved by employing GAGP in inverse optimization.
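As a rough illustration of the GAGP idea (not the authors' implementation), the sketch below estimates hyperparameters once on a single random subset, freezes them, distributes the full dataset among several independent GPs, and pools their predictions. The use of scikit-learn's `GaussianProcessRegressor`, an RBF kernel, three partitions, and a simple mean pool are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 1))
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(600)

# Step 1: estimate hyperparameters once, on a single random subset.
idx = rng.permutation(len(X))
gp0 = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp0.fit(X[idx[:200]], y[idx[:200]])
shared_kernel = gp0.kernel_  # fitted hyperparameters, reused by every GP below

# Step 2: distribute ALL training points among independent GPs that share
# the frozen hyperparameters (optimizer=None skips re-estimating them).
gps = []
for part in np.array_split(idx, 3):
    gp = GaussianProcessRegressor(kernel=shared_kernel, alpha=1e-2, optimizer=None)
    gps.append(gp.fit(X[part], y[part]))

# Step 3: pool the predictions (a plain average here) so that the entire
# training dataset contributes to inference.
X_test = np.linspace(0, 10, 50).reshape(-1, 1)
y_pred = np.mean([gp.predict(X_test) for gp in gps], axis=0)
```

Each small GP costs only O(m³) to fit for partition size m, so the collection scales to datasets where a single exact GP (O(n³)) would be infeasible.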



2020 ◽  
Vol 27 ◽  
Author(s):  
Zaheer Ullah Khan ◽  
Dechang Pi

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid formation) is a special kind of post-translational protein modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. To complement existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine (SC) sites. However, the performance of these models has been unsatisfactory due to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: Our motivation in this study is to establish a strong, novel computational predictor that discriminates sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics feature encoding tool, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature scheme; the synthetic minority oversampling technique (SMOTE) is then employed to cope with the severe imbalance between SC sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network (2D-CNN) was trained and validated under rigorous 10-fold cross-validation. Results: The proposed framework, with its strong discrete representation of the feature space, its machine learning engine, and its unbiased presentation of the underlying training data, yields an excellent model that outperforms all existing published studies. The proposed approach is 6% higher in MCC than the previous best method, which did not provide sufficient details on an independent dataset.
Compared with the second-best method, the model obtained increases of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superior performance of the proposed model on both the training and independent datasets in comparison with existing studies. Conclusion: In this research, we have developed a novel sequence-based automated predictor for SC sites, called DeepSSPred. Empirical results on a training dataset and an independent validation dataset reveal the efficacy of the proposed model. The good performance of DeepSSPred is due to several factors, such as novel discriminative feature encoding schemes, the SMOTE technique, and careful construction of the prediction model through a tuned 2D-CNN classifier. We believe this work provides insight into the further prediction of S-sulfenylation characteristics and functionalities, and we hope that the developed predictor will be of significant help for large-scale discrimination of unknown SC sites in particular and for designing new pharmaceutical drugs in general.
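The SMOTE step mentioned above can be illustrated with a minimal from-scratch rendition (this is a generic sketch of the technique, not code from DeepSSPred): each synthetic sample is created by interpolating a minority-class sample toward one of its k nearest minority neighbours.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority neighbours."""
    rng = rng or np.random.default_rng(0)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # never pick a point itself
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per sample
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                       # a minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]    # one neighbour
        gap = rng.random()                                 # factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote(X_min, n_new=8, k=3)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the original minority region rather than being duplicated verbatim.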


2021 ◽  
Author(s):  
Farah Jemili ◽  
Hajer Bouras

In today’s world, an Intrusion Detection System (IDS) is one of the most significant tools for improving network security by detecting attacks or abnormal data accesses. Most existing IDSs have disadvantages such as high false alarm rates and low detection rates. Dealing with distributed and massive data constitutes one challenge for an IDS; dealing with imprecise data is another. This paper proposes an Intrusion Detection System based on big data fuzzy analytics: the Fuzzy C-Means (FCM) method is used to cluster and classify the pre-processed training dataset. The CTU-13 and UNSW-NB15 datasets are used as distributed and massive datasets to prove the feasibility of the method. The proposed system shows high performance in terms of accuracy, precision, detection rates, and false alarms.
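The core FCM iteration is standard and can be sketched as follows (a generic textbook implementation on toy 2-D data, not the paper's distributed pipeline over CTU-13/UNSW-NB15): alternate between updating cluster centres from the fuzzified memberships and updating memberships from the distances.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: returns soft memberships U (n x c) and centres."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # each row sums to 1
    for _ in range(iters):
        Um = U ** m                                # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=-1) + 1e-12
        # membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
    return U, centers

# toy stand-in for pre-processed traffic features: two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
U, centers = fuzzy_c_means(X)
```

The soft memberships in `U` are what let FCM express imprecise data: a borderline connection record receives partial membership in both the normal and attack clusters instead of a hard label.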


Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, or parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle this training data bottleneck, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data, and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect ones. We then combine these two tasks to jointly pre-train a neural network, which is subsequently fine-tuned on human-annotated disfluency detection data. The self-supervised method captures task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated by simple heuristics and cannot fully cover all disfluency patterns, there is still a performance gap relative to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label, and can address the weakness of self-supervised learning on a small annotated dataset.
We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
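The pseudo-data construction described above (random word additions and deletions, with tags marking the inserted words for the tagging task) can be sketched as follows; the filler vocabulary and the corruption probabilities are hypothetical choices for illustration, not values from the paper.

```python
import random

def make_pseudo_example(words, p_add=0.1, p_del=0.1, vocab=None, rng=None):
    """Randomly add or delete words; tag 1 marks an inserted "noisy" word.
    Returns (corrupted_words, tags) for the self-supervised tagging task."""
    rng = rng or random.Random(0)
    vocab = vocab or ["uh", "um", "well", "the", "you", "know"]
    out, tags = [], []
    for w in words:
        if rng.random() < p_del:        # deletion: simply drop the word
            continue
        if rng.random() < p_add:        # insertion: add a noisy word before w
            out.append(rng.choice(vocab))
            tags.append(1)
        out.append(w)
        tags.append(0)
    return out, tags

sent = "i want to book a flight to boston".split()
corrupted, tags = make_pseudo_example(sent, rng=random.Random(42))
```

The tag sequence gives token-level supervision for free, while the corrupted/original pair supplies labels for the sentence-classification pre-training task.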


Author(s):  
Cheng Meng ◽  
Ye Wang ◽  
Xinlian Zhang ◽  
Abhyuday Mandal ◽  
Wenxuan Zhong ◽  
...  

With advances in technology over the past decade, the amount of data generated and recorded has grown enormously in virtually all fields of industry and science. This extraordinary amount of data provides unprecedented opportunities for data-driven decision-making and knowledge discovery. However, analyzing such large-scale datasets poses significant challenges and calls for innovative statistical methods specifically designed for faster speed and higher efficiency. In this chapter, we review currently available methods for big data, focusing on subsampling methods based on statistical leveraging and on divide-and-conquer methods.
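For the least-squares setting, the statistical leveraging idea can be sketched in a few lines (a generic illustration of algorithmic leveraging, not the chapter's code): sample rows with probability proportional to their leverage scores, reweight to keep the estimator approximately unbiased, and solve on the small subsample.

```python
import numpy as np

def leverage_subsample(X, y, r, rng=None):
    """Algorithmic leveraging for least squares: subsample r rows with
    probabilities proportional to leverage scores, then solve a weighted
    least-squares problem on the subsample."""
    rng = rng or np.random.default_rng(0)
    Q, _ = np.linalg.qr(X)                      # thin QR of the design matrix
    lev = np.sum(Q ** 2, axis=1)                # leverage scores h_ii
    p = lev / lev.sum()
    idx = rng.choice(len(X), size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])               # reweight to stay unbiased
    beta, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((10000, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(10000)
beta_hat = leverage_subsample(X, y, r=500, rng=rng)
```

Here a 500-row subsample recovers the coefficients of a 10,000-row regression, which is the speed/accuracy trade-off that leveraging-based subsampling targets.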


Text classification and clustering approaches are essential in big data environments. Many classification algorithms have been proposed for supervised learning applications, and in the era of big data a large volume of training data is available for many machine learning tasks. However, some of that data may be mislabeled or not labeled properly; incorrect labels introduce label noise, which in turn degrades the learning performance of a classifier. A general approach to addressing label noise is to apply noise filtering techniques that identify and remove noise before learning, and a range of such filtering approaches have been developed to improve classifier performance. This paper proposes a noise filtering approach for text data applied during the training phase. Many supervised learning algorithms produce high error rates due to noise in the training dataset; our work eliminates such noise and provides an accurate classification system.
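A common form of the "filter before learning" step is classification filtering: flag training samples whose cross-validated prediction disagrees with their given label, then drop them. The sketch below illustrates this generic technique on toy numeric features (the paper does not specify its filter, and logistic regression here is an illustrative stand-in for whatever base learner is used).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def filter_label_noise(X, y):
    """Flag samples whose out-of-fold prediction disagrees with their label."""
    preds = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
    keep = preds == y
    return X[keep], y[keep], ~keep

# toy data: two well-separated classes with a few deliberately flipped labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y_noisy = np.array([0] * 50 + [1] * 50)
y_noisy[:5] = 1                    # inject label noise into class 0
X_clean, y_clean, noisy_mask = filter_label_noise(X, y_noisy)
```

Using out-of-fold predictions (rather than refitting on the full set) matters: a model trained on a sample's own noisy label would tend to reproduce that label and miss the noise.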


Author(s):  
Mohammad Amin Nabian ◽  
Hadi Meidani

Abstract In this paper, we introduce a physics-driven regularization method for training of deep neural networks (DNNs) for use in engineering design and analysis problems. In particular, we focus on the prediction of a physical system, for which in addition to training data, partial or complete information on a set of governing laws is also available. These laws often appear in the form of differential equations, derived from first principles, empirically validated laws, or domain expertise, and are usually neglected in a data-driven prediction of engineering systems. We propose a training approach that utilizes the known governing laws and regularizes data-driven DNN models by penalizing divergence from those laws. The first two numerical examples are synthetic examples, where we show that in constructing a DNN model that best fits the measurements from a physical system, the use of our proposed regularization results in DNNs that are more interpretable with smaller generalization errors, compared with other common regularization methods. The last two examples concern metamodeling for a random Burgers’ system and for aerodynamic analysis of passenger vehicles, where we demonstrate that the proposed regularization provides superior generalization accuracy compared with other common alternatives.
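The regularization idea — add a penalty for divergence from a known governing law to the data-fitting loss — can be shown with a deliberately tiny stand-in model (a quadratic fit instead of a DNN; the falling-body example, the penalty weight, and all constants are illustrative assumptions, not the paper's experiments).

```python
import numpy as np
from scipy.optimize import minimize

# Noisy measurements of a falling object: y(t) = y0 + v0*t - 0.5*g*t^2
g = 9.81
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 40)
y_obs = 10.0 + 2.0 * t - 0.5 * g * t**2 + 0.5 * rng.standard_normal(40)

def loss(theta, lam=10.0):
    y0, v0, c = theta
    y_hat = y0 + v0 * t + c * t**2
    data_term = np.mean((y_hat - y_obs) ** 2)          # fit the measurements
    # physics term: the governing law y'' = -g requires 2c = -g
    physics_term = (2 * c + g) ** 2
    return data_term + lam * physics_term

theta = minimize(loss, x0=np.zeros(3)).x
```

The penalty pulls the curvature parameter toward the value the physics dictates, so the fitted model respects the governing law even where measurements are noisy or sparse; in the paper the same principle is applied with differential-equation residuals inside a DNN training loss.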


2021 ◽  
Vol 2021 ◽  
pp. 1-11 ◽  
Author(s):  
Yue Wu

In the field of computer science, data mining is a hot topic: a family of mathematical methods for identifying patterns in enormous amounts of data. Image mining is an important data mining technique involving a variety of fields, and within it, the organization of art images is an interesting research area worthy of attention. Art image categorization refers to the classification of art images into several predetermined sets. It involves image preprocessing, feature extraction, object identification, object segmentation, object classification, and a variety of other techniques. The purpose of this paper is to propose an improved boosting algorithm that combines traditional, simple, yet weak classifiers into a complex, accurate, and strong classifier for distinguishing artistic images from realistic images. This paper investigates the characteristics of cartoon images, realistic images, painting images, and photographic images, constructs color variance histogram features, and uses them for classification. For the classification experiments, this paper uses an image database of 10471 images, randomly split into a training portion and a test portion: the training dataset contains 6971 images, while the test dataset contains 3478 images. The experimental results show that the proposed algorithm achieves a classification accuracy of approximately 97%. The method proposed in this paper can serve as the basis of automatic large-scale image classification and has strong practicability.
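The boosting-over-histogram-features pipeline can be sketched generically as follows. Note the assumptions: plain per-channel color histograms stand in for the paper's color variance histogram features, AdaBoost with decision stumps stands in for its improved boosting algorithm, and the "images" are tiny synthetic arrays (flat-colored for cartoon-like, uniformly random for photo-like).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def color_histogram(img, bins=8):
    """Per-channel color histogram feature vector for an RGB image array."""
    return np.concatenate([
        np.histogram(img[..., c], bins=bins, range=(0, 256), density=True)[0]
        for c in range(3)
    ])

rng = np.random.default_rng(0)
# toy stand-ins: "cartoon-like" images use a few flat colors,
# "photo-like" images use many
cartoons = [rng.choice([30, 120, 210], size=(16, 16, 3)) for _ in range(40)]
photos = [rng.integers(0, 256, size=(16, 16, 3)) for _ in range(40)]
X = np.array([color_histogram(im) for im in cartoons + photos])
y = np.array([0] * 40 + [1] * 40)

# boosting combines many weak learners (decision stumps, the default base
# estimator) into one strong classifier
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
```

Color-distribution features like these are cheap to compute, which is what makes boosted histogram classifiers attractive for large-scale image collections.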


Author(s):  
Liwei Wang ◽  
Siyu Tao ◽  
Ping Zhu ◽  
Wei Chen

Abstract The data-driven approach is emerging as a promising method for the topological design of multiscale structures with greater efficiency. However, existing data-driven methods mostly focus on a single class of unit cells without considering multiple classes that could accommodate spatially varying desired properties. The key challenge is the lack of an inherent ordering or “distance” measure between different classes of unit cells in meeting a range of properties. To overcome this hurdle, we extend the newly developed latent-variable Gaussian process (LVGP) to create a multi-response LVGP (MRLVGP) for the unit cell libraries of metamaterials, taking both qualitative unit cell concepts and quantitative unit cell design variables as mixed-variable inputs. The MRLVGP embeds the mixed variables into a continuous design space based on their collective effect on the responses, providing substantial insight into the interplay between different geometrical classes and unit cell materials. With this model, we can easily obtain a continuous and differentiable transition between different unit cell concepts that renders gradient information for multiscale topology optimization. While the proposed approach has a broader impact on the concurrent topological and material design of engineered systems, we demonstrate its benefits through multiscale topology optimization with aperiodic unit cells. Design examples reveal that considering multiple unit cell types can lead to improved performance due to consistent load-transfer paths between the micro- and macrostructures.


2017 ◽  
Vol 28 (06) ◽  
pp. 683-703 ◽  
Author(s):  
Youwen Zhu ◽  
Xingxin Li ◽  
Jian Wang ◽  
Yining Liu ◽  
Zhiguo Qu

The cloud can provide great convenience for big data storage and analysis. To enjoy the advantages of cloud services with privacy preservation, huge datasets are increasingly outsourced to the cloud in encrypted form. Unfortunately, encryption may impede analysis and computation over the outsourced dataset. Naïve Bayesian classification is an effective algorithm for predicting the class labels of unlabeled samples. In this paper, we investigate naïve Bayesian classification over encrypted large-scale datasets in the cloud and propose a practical and secure scheme for this challenging problem. In our scheme, all the computation tasks of naïve Bayesian classification are completed by the cloud, which can dramatically reduce the burden on the data owner and users. We give a formal security proof for our scheme; based on this proof, we can strictly guarantee the privacy of both the input dataset and the output classification results, i.e., the cloud learns nothing useful about the training data of the data owner or the test samples of the users throughout the computation. Additionally, we not only theoretically analyze the computation complexity and communication overhead of our scheme, but also evaluate its implementation cost through extensive experiments over a real dataset, which show that our scheme achieves practical efficiency.
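The underlying (unencrypted) computation can be sketched as follows: a categorical naïve Bayes model stores log-probabilities, so classification reduces to sums of table lookups. This additive structure is the kind that secure schemes typically exploit; the sketch below is the plaintext algorithm only, not the paper's cryptographic protocol.

```python
import numpy as np

def train_nb(X, y, n_values):
    """Categorical naive Bayes with Laplace smoothing; everything is stored
    as log-probabilities, so classification is a pure sum of lookups."""
    classes = np.unique(y)
    log_prior = {c: np.log(np.mean(y == c)) for c in classes}
    log_cond = {}
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            counts = np.bincount(Xc[:, j], minlength=n_values[j]) + 1  # Laplace
            log_cond[c, j] = np.log(counts / counts.sum())
    return classes, log_prior, log_cond

def predict_nb(x, classes, log_prior, log_cond):
    # log P(c | x) ∝ log P(c) + sum_j log P(x_j | c): an additive score
    scores = {c: log_prior[c] + sum(log_cond[c, j][v] for j, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 0, 1])
classes, lp, lc = train_nb(X, y, n_values=[2, 2])
```

Because the per-feature terms are simply added, a scheme that supports addition over encrypted values can, in principle, evaluate the classifier without ever seeing the model tables or the test sample in the clear.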

