Application of Improved Boosting Algorithm for Art Image Classification

Scientific Programming ◽

10.1155/2021/3480414 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Yue Wu

Keyword(s):

Data Mining ◽

Image Classification ◽

Large Scale ◽

Object Identification ◽

Training Data ◽

Training Dataset ◽

Specific Method ◽

Image Mining ◽

Data Mining Technique ◽

Boosting Algorithm

Data mining, the use of mathematical methods to identify patterns in enormous amounts of data, is an active topic in computer science. Image mining is an important data mining technique that draws on a variety of fields, and within it art image organization is a research area worthy of attention. Art image categorization refers to the classification of art images into several predetermined sets. It involves image preprocessing, feature extraction, object identification, object categorization, object segmentation, object classification, and a variety of other approaches. This paper proposes an improved boosting algorithm that combines traditional, simple, yet weak classifiers in a specific way to create a complex, accurate, and strong classifier. The paper investigates the characteristics of cartoon images, realistic images, painting images, and photo images, constructs color variance histogram features, and uses them for classification. The classification experiments use an image database of 10471 images, randomly divided into two portions that serve as training data and test data, respectively. The training dataset contains 6971 images, while the test dataset contains 3478 images. The experimental results show that the proposed algorithm achieves a classification accuracy of approximately 97%. The method can serve as the basis for automatic large-scale image classification and is highly practical.
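The general idea of boosting mentioned in this abstract, combining weak classifiers into a strong one, can be illustrated with a minimal sketch of standard AdaBoost over threshold stumps. This is not the paper's improved algorithm and it does not use its color variance histogram features; the toy one-dimensional data, the function names, and the choice of three rounds are assumptions made purely for illustration.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity

def adaboost(X, y, rounds=3):
    """Standard AdaBoost with labels in {-1, +1}; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # vote of this weak learner
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: the positive class sits on both sides of the negatives,
# so no single threshold separates the two classes.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
print([predict(adaboost(X, y, rounds=1), row) for row in X])
print([predict(adaboost(X, y, rounds=3), row) for row in X])
```

On this toy data a single stump misclassifies two of the six points, whereas the weighted vote of three stumps reproduces all six labels.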


DeepSSPred: A Deep Learning Based Sulfenylation site predictor via a novel n-segmented optimize federated feature encoder

Protein and Peptide Letters ◽

10.2174/0929866527666201202103411 ◽

2020 ◽

Vol 27 ◽

Author(s):

Zaheer Ullah Khan ◽

Dechang Pi

Keyword(s):

Large Scale ◽

Computational Models ◽

Research Work ◽

Training Data ◽

Training Dataset ◽

Validation Dataset ◽

Cytokine Signaling ◽

Minority Class ◽

Independent Dataset ◽

Feature Encoding

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid) of proteins is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Given this significance, and to complement existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine sites. However, the performance of these models has not been satisfactory due to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: In this study, our motivation is to establish a strong and novel computational predictor for discriminating sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics feature encoding tool, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature, and the resampling technique called synthetic minority oversampling is then employed to cope with the severe imbalance between SC-sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network was employed, with a rigorous 10-fold jackknife cross-validation technique for model validation and authentication. Results: Following the proposed framework, a strong discrete representation of the feature space, the machine learning engine, and an unbiased representation of the underlying training data yielded an excellent model that outperforms all existing established studies. The proposed approach is 6% higher in terms of MCC than the first best. On an independent dataset, the existing first best study failed to provide sufficient details. The model obtained an increase of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset in comparison with the 2nd best method. These empirical analyses show the superior performance of the proposed model on both the training and independent datasets in comparison with existing studies in the literature. Conclusion: In this research, we have developed a novel sequence-based automated predictor for SC-sites, called DeepSSPred. The empirical simulation outcomes with a training dataset and an independent validation dataset have revealed the efficacy of the proposed theoretical model. The good performance of DeepSSPred is due to several reasons, such as the novel discriminative feature encoding schemes, the SMOTE technique, and careful construction of the prediction model through the tuned 2D-CNN classifier. We believe that our research work will provide potential insight into further prediction of S-sulfenylation characteristics and functionalities. Thus, we hope that our developed predictor will be significantly helpful for large-scale discrimination of unknown SC-sites in particular and for designing new pharmaceutical drugs in general.
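The comparisons in this abstract are reported as ACC, Sn, Sp, and MCC. For reference, the sketch below shows how these four measures are conventionally computed from a binary confusion matrix. The function name and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Conventional definitions of sensitivity, specificity, accuracy and MCC."""
    sn = tp / (tp + fn)                      # sensitivity (recall of positives)
    sp = tn / (tn + fp)                      # specificity (recall of negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}

# Hypothetical imbalanced case: 100 positive sites and 1000 negative sites.
m = binary_metrics(tp=80, fp=100, tn=900, fn=20)
print({name: round(value, 3) for name, value in m.items()})
```

Because MCC uses all four cells of the confusion matrix, it is often preferred over accuracy when the classes are as imbalanced as the minority and majority classes described here.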


Combining Self-supervised Learning and Active Learning for Disfluency Detection

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3487290 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-25

Author(s):

Shaolei Wang ◽

Zhongyuan Wang ◽

Wanxiang Che ◽

Sendong Zhao ◽

Ting Liu

Keyword(s):

Neural Network ◽

Active Learning ◽

Supervised Learning ◽

Large Scale ◽

Training Data ◽

Fine Tuning ◽

Training Dataset ◽

Performance Gap ◽

Annotation Costs ◽

Trained Neural Network

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, that is, parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network. The pre-trained neural network is then fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance when fine-tuned on a small annotated dataset than other supervised methods. However, because the pseudo training data are generated with simple heuristics and cannot fully cover all disfluency patterns, there is still a performance gap compared to supervised models trained on the full training dataset. We further explore how to bridge the performance gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label and can address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
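The pseudo-data construction described in this abstract (randomly adding or deleting words, then tagging the added words and labelling the sentence) can be sketched roughly as follows. The probabilities, the vocabulary, the function name, and the exact labelling rule are assumptions for illustration; the abstract does not specify them.

```python
import random

def make_pseudo_example(tokens, vocab, p_add=0.15, p_del=0.15, rng=None):
    """Corrupt a sentence and return (noisy_tokens, word_tags, sentence_label).

    word_tags[i] is 1 if noisy_tokens[i] was inserted as noise, else 0.
    sentence_label is 1 if the sentence was changed at all, else 0.
    """
    if rng is None:
        rng = random.Random()
    noisy, tags = [], []
    for tok in tokens:
        if rng.random() < p_add:          # insert a random word before this token
            noisy.append(rng.choice(vocab))
            tags.append(1)
        if rng.random() < p_del:          # drop the original token
            continue
        noisy.append(tok)
        tags.append(0)
    return noisy, tags, int(noisy != tokens)

# Hypothetical unlabeled sentence and noise vocabulary.
tokens = "i want a flight to boston".split()
vocab = ["the", "uh", "flight", "to", "want"]
noisy, tags, label = make_pseudo_example(tokens, vocab, rng=random.Random(0))
print(noisy, tags, label)
```

The word-level tags would feed the tagging task and the sentence-level label the sentence classification task, so a single corrupted sentence can supply a training signal for both pre-training objectives.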


Scalable 2-Pass Data Mining Technique for Large Scale Spatio-temporal Datasets

Lecture Notes in Computer Science - Knowledge-Based Intelligent Information and Engineering Systems ◽

10.1007/978-3-540-74827-4_99 ◽

2007 ◽

pp. 785-792

Author(s):

Tahar Kechadi ◽

Michela Bertolotto

Keyword(s):

Data Mining ◽

Large Scale ◽

Data Mining Technique ◽

Mining Technique ◽

Spatio Temporal


Combining Data Mining Technique and Group Method of Data Handling (GMDH) Method to Assess Flexible Pavement Conditions

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.255-260.4242 ◽

2011 ◽

Vol 255-260 ◽

pp. 4242-4246 ◽

Cited By ~ 1

Author(s):

Hui Mi Hsu ◽

Sao Jeng Chao ◽

Jia Ruey Chang

Keyword(s):

Data Mining ◽

Model Development ◽

Assessment Model ◽

Training Dataset ◽

Group Method ◽

Data Handling ◽

Data Mining Technique ◽

Pavement Condition ◽

Mining Technique ◽

Combining Data

The pavement condition index (PCI), a numerical rating from 0 to 100, gives a good indication of pavement condition. However, the pavement distress survey is a labor-intensive procedure that is performed quite subjectively by experienced pavement engineers, and a highly complicated calculation is then required to determine the PCI of a road network. It is therefore advantageous to determine the PCI from relevant pavement parameters. This study demonstrates how to develop a PCI assessment model based on pavement parameters by combining a data mining technique with the group method of data handling (GMDH). Records from provincial and county roads in Taiwan with asphalt surfaces and a wide variety of pavement structures were employed. After applying the find dependencies (FD) algorithm, a data mining technique, 120 dependent records were extracted from 253 raw records. For the PCI model development, 100 records were randomly selected as the training dataset. GMDH was successfully applied to develop a PCI assessment model that uses 7 critical pavement parameters as inputs and PCI as the output. The R² for the training dataset is 0.849. The remaining 20 records were used as the testing dataset, for which the PCI assessment model gives an R² of 0.851. This study confirms that combining a data mining technique with the GMDH method has the potential to provide significant assistance in pavement condition assessment. The model proposed in this study provides a good foundation for further refinement when additional data become available.
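The R² values quoted for the training and testing datasets measure how much of the variance in PCI the model explains. The abstract does not say which definition was used, so the sketch below assumes the common coefficient of determination, 1 - SS_res / SS_tot. The PCI values in the usage lines are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)                # total sum of squares
    return 1 - ss_res / ss_tot

# Hypothetical surveyed PCI values and model outputs for four road sections.
pci_true = [60, 70, 80, 90]
pci_pred = [62, 68, 83, 88]
print(round(r_squared(pci_true, pci_pred), 3))
```

Under this definition a value of 1 means perfect prediction, and a model that always outputs the mean PCI scores 0.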


Various viewpoints analysis of the actual and large-scale data by using the data mining technique

Proceedings 39th Annual 2005 International Carnahan Conference on Security Technology ◽

10.1109/ccst.2005.1594821 ◽

2005 ◽

Author(s):

K. Tamura ◽

K. Matsuura ◽

H. Imai

Keyword(s):

Data Mining ◽

Large Scale ◽

Data Mining Technique ◽

Mining Technique ◽

Large Scale Data ◽

Scale Data


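Penerapan Algoritma K-Means Clustering untuk Melihat Hubungan Kegiatan Tahfiz dengan Hasil Belajar [Application of the K-Means Clustering Algorithm to Examine the Relationship between Tahfiz Activities and Learning Outcomes]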

Jurnal Sistim Informasi dan Teknologi ◽

10.37034/jsisfotek.v2i2.20 ◽

2020 ◽

pp. 41-47

Author(s):

Asri Hidayad ◽

Sarjon Defit ◽

S Sumijan

Keyword(s):

Data Mining ◽

Learning Outcomes ◽

Clustering Algorithm ◽

Student Learning Outcomes ◽

Training Data ◽

Data Mining Algorithm ◽

Data Mining Technique ◽

Grouping Method ◽

Student Grades ◽

And Training

The purpose of this study is to evaluate whether Tahfiz activities have an effect on learning outcomes. The data processed in this study are 42 records on tahfiz activities and learning outcomes of class XI (eleven) students, drawn from the amount of tahfiz memorized, tahfiz grades, and student grades at Madrasah Aliyah Negeri 1 Bukittinggi. The analysis groups the data using one of the data mining algorithms, K-Means Clustering, which works on the basis of a grouping method. The technique uses testing data and training data, with the amount of tahfiz memorized, the tahfiz grade, and the learning outcomes as inputs. With the results of this study, the school can determine how much influence tahfiz activities have on student grades.
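As a rough illustration of the grouping method named in this abstract, here is a minimal sketch of K-Means (Lloyd's algorithm) with two clusters. The six records, the two features (amount memorized and grade), and the choice of initial centroids are hypothetical; the study's actual data and settings are not given in the abstract.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid in place
        if new_centroids == centroids:      # converged
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical records: (parts memorized, student grade).
points = [(1, 60), (2, 65), (1, 62), (9, 88), (10, 92), (8, 85)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

In practice the features would usually be rescaled first, since a feature with a much larger numeric range (here the grade) dominates the distance.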


Erratum: Scalable 2-Pass Data Mining Technique for Large Scale Spatio-temporal Datasets

Lecture Notes in Computer Science - Knowledge-Based Intelligent Information and Engineering Systems ◽

10.1007/978-3-540-74827-4_171 ◽

2007 ◽

pp. E1-E1

Author(s):

Tahar Kechadi ◽

Michela Bertolotto ◽

Sergio Di Martino ◽

Filomena Ferrucci

Keyword(s):

Data Mining ◽

Large Scale ◽

Data Mining Technique ◽

Mining Technique ◽

Spatio Temporal


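Penerapan Algoritma K-Means Clustering untuk Melihat Hubungan Kegiatan Tahfiz dengan Hasil Belajar (Studi Kasus Madrasah Aliyah Negeri 1 Bukiktinggi) [Application of the K-Means Clustering Algorithm to Examine the Relationship between Tahfiz Activities and Learning Outcomes (Case Study: Madrasah Aliyah Negeri 1 Bukittinggi)]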

Jurnal Sistim Informasi dan Teknologi ◽

10.37034/jsisfotek.v2i2.34 ◽

2020 ◽

Vol 2 (2) ◽

pp. 41-47

Author(s):

Asri Hidayad ◽

Sarjon Defit ◽

Sumijan Sumijan

Keyword(s):

Data Mining ◽

Learning Outcomes ◽

Clustering Algorithm ◽

Student Learning Outcomes ◽

Training Data ◽

Data Mining Algorithm ◽

Data Mining Technique ◽

Grouping Method ◽

Student Grades ◽

And Training


Energy Prediction of Wheat Production Using Data Mining Technique in Iran

Basrah Journal of Agricultural Sciences ◽

10.37077/25200860.2021.34.1.02 ◽

2021 ◽

Vol 34 (1) ◽

pp. 14-27

Author(s):

Nasim Monjezi

Keyword(s):

Neural Network ◽

Data Mining ◽

Energy Consumption ◽

Training Data ◽

Data Mining Technique ◽

Essential Information ◽

Wheat Production ◽

Energy Prediction ◽

Mining Technique ◽

Practical Algorithms

Wheat is considered one of the most important products in Iran. Given the large area under wheat cultivation in Khuzestan, a tool is needed to process stored data and deliver the resulting information to managers in the agricultural sector. Data mining techniques can provide wheat producers with essential information and models for modelling energy consumption. One of the most practical algorithms is the artificial neural network. The main aim of this research is to predict the output energy of wheat farms using a multilayer perceptron neural network. This is an analytical study whose database consists of 1240 records. The data were obtained from wheat farms during 2014-2018. Data analysis was carried out with IBM SPSS Modeler 14.2 following the CRISP standard. In the model used, the variables chemical fertilizers, machinery, and diesel fuel, with coefficients of 0.2987, 0.2064, and 0.1527 respectively, had the greatest effect on the output variable (productive energy). The prediction precision of the neural network algorithm, meaning the ratio of correctly predicted records to total records, was 93.08%. In addition, the linear correlation between actual and predicted values was 0.92 for the training data and 0.88 for the testing data, suggesting a strong correlation. The results can help wheat farmers evaluate and optimize energy consumption in wheat production and reduce the consumption of energy inputs.


Time Series GIS Map Dataset of Demolished Buildings in Mashiki Town after the 2016 Kumamoto, Japan Earthquake

Remote Sensing ◽

10.3390/rs11192190 ◽

2019 ◽

Vol 11 (19) ◽

pp. 2190

Author(s):

Kushiyama ◽

Matsuoka

Keyword(s):

Time Series ◽

Waste Disposal ◽

Large Scale ◽

Training Data ◽

Training Dataset ◽

Building Structures ◽

Visual Interpretation ◽

Imbalanced Classification ◽

Map Data ◽

F Measure

After a large-scale disaster, many damaged buildings are demolished and treated as disaster waste. Though the weight of disaster waste was estimated two months after the 2016 earthquake in Kumamoto, Japan, the estimated weight was significantly different from the result when the disaster waste disposal was completed in March 2018. The amount of disaster waste generated can be estimated by multiplying the total number of severely damaged and partially damaged buildings by the coefficient of generated weight per building. We suppose that the amount of disaster waste would be affected by the conditions of demolished buildings, namely, the areas and typologies of building structures, but this has not yet been clarified. Therefore, in this study, we aimed to use geographic information system (GIS) map data to create a time series GIS map dataset with labels of demolished and remaining buildings in Mashiki town for the two-year period prior to the completion of the disaster waste disposal. We used OpenStreetMap (OSM) data as the base data and time series SPOT images observed in the two years following the Kumamoto earthquake to label all demolished and remaining buildings in the GIS map dataset. To effectively label the approximately 16,000 buildings in Mashiki town, we calculated an indicator, based on the change of brightness in SPOT images, that shows how likely a building is to be classified as remaining or demolished. We classified 5701 of 16,106 buildings as demolished, as of March 2018, by visual interpretation of the SPOT and Pleiades images with reference to this indicator. We verified that the number of demolished buildings was almost the same as the number reported by Mashiki municipality. Moreover, we assessed the accuracy of our proposed method: the F-measure was higher than 0.9 on the training dataset, which was verified by a field survey and visual interpretation and included the labels of 55 demolished and 55 remaining buildings. We also assessed the accuracy of the proposed method by applying it to all the labels in the OSM dataset, but the F-measure was 0.579. When we applied it to test data with balanced labels for another 100 demolished and 100 remaining buildings not included in the training data, the F-measure was 0.790, calculated from the SPOT image of 25 March 2018. Our proposed method performed well for balanced classification but not for imbalanced classification. We studied examples of the image characteristics of correct and incorrect estimation by thresholding the indicator.
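The gap between the balanced and imbalanced F-measures reported here is a general property of the metric, which the sketch below illustrates. It assumes a hypothetical detector with a fixed recall and false-positive rate and computes the F-measure for a balanced and an imbalanced label set. The rates, the counts, and the helper names are assumptions and do not reproduce the paper's figures.

```python
def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def confusion_counts(n_pos, n_neg, recall, false_positive_rate):
    """Expected TP, FP, FN for a detector with the given rates (hypothetical helper)."""
    tp = round(n_pos * recall)
    fn = n_pos - tp
    fp = round(n_neg * false_positive_rate)
    return tp, fp, fn

# Same hypothetical detector (recall 0.8, false-positive rate 0.2) on two label sets.
balanced = f_measure(*confusion_counts(100, 100, 0.8, 0.2))
imbalanced = f_measure(*confusion_counts(100, 1000, 0.8, 0.2))
print(round(balanced, 3), round(imbalanced, 3))
```

With the detector unchanged, the larger negative class produces more false positives, which lowers precision and therefore the F-measure. This is one plausible reason a method can score well on balanced labels and worse on imbalanced ones.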
