An Ant Colony Optimization Based Feature Selection for Web Page Classification

2014 ◽  
Vol 2014 ◽  
pp. 1-16 ◽  
Author(s):  
Esra Saraç ◽  
Selma Ayşe Özel

The increased popularity of the web has caused the inclusion of a huge amount of information on the web, and as a result of this explosive information growth, automated web page classification systems are needed to improve search engines’ performance. Web pages have a large number of features such as HTML/XML tags, URLs, hyperlinks, and text contents that should be considered during an automated classification process. The aim of this study is to reduce the number of features used, to improve the runtime and accuracy of web page classification. In this study, we used an ant colony optimization (ACO) algorithm to select the best features, and then we applied the well-known C4.5, naive Bayes, and k-nearest neighbor classifiers to assign class labels to web pages. We used the WebKB and Conference datasets in our experiments, and we showed that using ACO for feature selection improves both the accuracy and the runtime performance of classification. We also showed that the proposed ACO-based algorithm can select better features than the well-known information gain and chi-square feature selection methods.
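The wrapper loop this abstract describes can be sketched in miniature. Below is a deliberately simplified, pure-Python ACO sketch: each ant samples a feature subset by roulette-wheel selection over pheromone levels, and the best subset found so far reinforces its trail. The toy `score` function stands in for the paper's actual wrapper evaluation (accuracy of a C4.5, naive Bayes, or k-NN classifier); all names and parameter values here are illustrative assumptions, not the authors' implementation.

```python
import random

def aco_feature_selection(n_features, score, n_ants=10, n_iters=20,
                          evaporation=0.2, subset_size=5, seed=0):
    """Simplified ACO wrapper: each ant picks a feature subset by
    roulette-wheel selection over pheromone levels; the best subset
    found so far reinforces its trail once per iteration."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_score = set(), float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            subset = set()
            while len(subset) < subset_size:
                r = rng.uniform(0, sum(pheromone))
                acc = 0.0
                for f, p in enumerate(pheromone):
                    acc += p
                    if acc >= r:
                        subset.add(f)
                        break
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
        pheromone = [(1 - evaporation) * p for p in pheromone]
        for f in best_subset:
            pheromone[f] += best_score  # deposit on the best trail

    return sorted(best_subset), best_score

# Toy fitness: pretend features 0-4 are the informative ones; in the paper
# this would be a classifier's accuracy on the candidate subset.
informative = {0, 1, 2, 3, 4}
best, quality = aco_feature_selection(
    20, lambda subset: len(subset & informative) / len(subset))
print(best, quality)
```

A full ACO implementation would also use per-feature heuristic information (e.g. an information-gain prior) alongside the pheromone; this sketch omits that to stay short.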

Webology ◽  
2021 ◽  
Vol 18 (2) ◽  
pp. 225-242
Author(s):  
Chaithra ◽  
Dr.G.M. Lingaraju ◽  
Dr.S. Jagannatha

Nowadays, the Internet contains a wide variety of online documents, making it difficult to find useful information about a given subject without also retrieving irrelevant pages. Web document and page recognition software is useful in a variety of fields, including news, medicine, fitness, research, and information technology. To enhance search capability, a large number of web page classification methods have been proposed, especially for news web pages. Furthermore, existing classification approaches seek to distinguish news web pages while reducing the high dimensionality of features derived from these pages. Given the lack of automated classification methods, this paper focuses on the classification of news web pages based on their scarcity and importance. This work establishes different models for the identification and classification of web pages. The datasets used in this paper were collected from popular news websites; we used the BBC dataset, which has five predefined categories. First, the input source is preprocessed and errors are eliminated. Then features are extracted from the web page content using term frequency-inverse document frequency (TF-IDF) vectorization. The 2,225 documents are represented with 15,286 features, the TF-IDF scores of different unigrams and bigrams. This representation is useful not only for the classification task but also for analysing the dataset. Feature selection is performed with the chi-squared test, which finds the terms most correlated with each category; the highest-scoring features are then selected. Finally, the web page is classified by the chosen classifier. The results showed that list obtained the highest percentage, which reflects its effectiveness for the classification of web pages.
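The chi-squared selection step described above can be illustrated directly: for a single (term, category) pair, the statistic is computed from a 2x2 contingency table of term presence against category membership. The tiny document collection below is invented for illustration and is not the BBC data.

```python
def chi2_term(docs, labels, term, category):
    """Chi-squared statistic for one (term, category) pair, computed
    from the 2x2 contingency table of term presence vs. category."""
    A = B = C = D = 0
    for doc, lab in zip(docs, labels):
        if term in doc and lab == category:
            A += 1          # term present, in category
        elif term in doc:
            B += 1          # term present, other category
        elif lab == category:
            C += 1          # term absent, in category
        else:
            D += 1          # term absent, other category
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

# Invented toy collection: "vote" is perfectly correlated with politics,
# while "news" occurs equally in both categories.
docs = [{"election", "vote", "news"}, {"match", "goal", "news"},
        {"vote", "poll"}, {"goal", "league"}]
labels = ["politics", "sport", "politics", "sport"]
print(chi2_term(docs, labels, "vote", "politics"))  # 4.0
print(chi2_term(docs, labels, "news", "politics"))  # 0.0
```

Ranking every term by this score per category and keeping the top-scoring ones is exactly the selection the abstract describes.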


2019 ◽  
Vol 16 (2) ◽  
pp. 384-388 ◽  
Author(s):  
K. S. Ramanujam ◽  
K. David

Web page classification is one of the significant research areas in the web mining domain. The enormous quantity of data on the web demands the development of effective and robust techniques for web mining tasks, which involve categorizing web pages based on data labels. Web mining also includes tasks such as web crawling, analysis of web links, and contextual advertising. Existing machine learning and data mining techniques are used efficiently for various web mining processes, including the classification of web pages. Multiple-classifier techniques are among the most promising research areas in machine learning; they work by merging several classifiers that differ in base classifier and/or dataset distribution, and in this way highly robust classification models are constructed. In this review paper, a comparison has been made between FA, PSO, ACO, GA, and IWT to evaluate the best-fit algorithm for classifying web pages.
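The simplest instance of the multiple-classifier idea described above is majority voting over the base classifiers' label predictions. The classifier outputs below are invented for illustration; the review's actual algorithms (FA, PSO, ACO, GA, IWT) would each produce such a prediction list.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the label predictions of several base classifiers
    (one list per classifier) by per-document majority vote."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

# Invented outputs of three base classifiers over the same three pages.
clf_a = ["news", "sport", "news"]
clf_b = ["news", "news", "sport"]
clf_c = ["sport", "sport", "news"]
print(majority_vote([clf_a, clf_b, clf_c]))  # ['news', 'sport', 'news']
```

The ensemble is robust in the sense that any single classifier's error at a given position is outvoted when the other two agree.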


2021 ◽  
Vol 9 (4) ◽  
pp. 963-973
Author(s):  
Suleyman Suleymanzade ◽  
Fargana Abdullayeva

The quality of the web page classification process has a huge impact on information retrieval systems. In this paper, we propose to combine the results of text and image data classifiers to get an accurate representation of web pages. To obtain and analyse the data, we created a composite classifier system consisting of a data miner, a text classifier, and an aggregator. Image and text data classification is carried out by deep learning models. To represent a common view of the web pages, we propose three aggregation techniques that combine the outputs of the classifiers.
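The abstract does not spell out the three aggregation techniques, so the sketch below shows one generic option: late fusion by a weighted average of the per-class probabilities from the text and image classifiers. The weights and probabilities are illustrative assumptions, not the paper's values.

```python
def fuse(text_probs, image_probs, w_text=0.6):
    """Late fusion: weighted average of the per-class probabilities
    produced by the text and image classifiers."""
    return {c: w_text * text_probs[c] + (1 - w_text) * image_probs[c]
            for c in text_probs}

# Invented classifier outputs for one page; the text model is weighted
# slightly higher (w_text = 0.6 is an arbitrary choice).
text_probs = {"shop": 0.8, "blog": 0.2}
image_probs = {"shop": 0.3, "blog": 0.7}
fused = fuse(text_probs, image_probs)
print(max(fused, key=fused.get))  # shop (0.6*0.8 + 0.4*0.3 = 0.60)
```

Other common aggregators in the same spirit are the element-wise maximum and the product of the two probability vectors.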


At present, the world revolves around web technology. Web information grows exponentially every year, and this information is huge and complex. Web users find it difficult to classify and extract useful information from the web, because web information is noisy, redundant, irrelevant, and often misclassified. Many researchers lack strong knowledge of the web page classification process and of the techniques and methods previously used. The objective of this survey is to convey an outline of the modern techniques of web page classification. In this survey, recent papers in this area are selected and explored. Thus this study will help researchers obtain the required knowledge about current trends in web page classification.


2011 ◽  
pp. 1462-1477 ◽  
Author(s):  
K. Selvakuberan ◽  
M. Indra Devi ◽  
R. Rajaram

The World Wide Web serves as a huge, widely distributed, global information service center for news, advertisements, customer information, financial management, education, government, e-commerce, and many other purposes. The Web contains a rich and dynamic collection of hyperlink information, and Web page access and usage information provide rich sources for data mining. Web pages are classified based on the content and/or contextual information embedded in them. As Web pages contain many irrelevant, infrequent, and stop words that reduce the performance of the classifier, selecting relevant representative features from the Web page is the essential preprocessing step. This provides secure access to the required information. Web access and usage information can be mined to predict the authentication of the user accessing the Web page. This information may be used to personalize the information needed by users and to preserve their privacy by hiding personal details. The issue lies in selecting the features that represent the Web pages and in processing the details the user needs. In this article we focus on feature selection, issues in feature selection, and the most important feature selection techniques described and used by researchers.
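The preprocessing the article calls essential, dropping stop words and infrequent terms before classification, can be sketched as follows. The stop-word list and the `min_df` threshold below are illustrative assumptions.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative subset

def build_vocabulary(docs, min_df=2):
    """Drop stop words and terms appearing in fewer than `min_df`
    documents, the pruning step described as essential preprocessing."""
    df = Counter()
    for doc in docs:
        for term in set(doc.split()):  # document frequency, not raw counts
            df[term] += 1
    return sorted(t for t, n in df.items()
                  if n >= min_df and t not in STOP_WORDS)

docs = ["the web page of the university",
        "the web site of the department",
        "admission page of the university"]
print(build_vocabulary(docs))  # ['page', 'university', 'web']
```

Terms like "site" and "admission" are pruned here because they occur in only one document, and "the"/"of" fall to the stop list; only the representative features survive.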


Author(s):  
ALI SELAMAT ◽  
ZHI SAM LEE ◽  
MOHD AIZAINI MAAROF ◽  
SITI MARIYAM SHAMSUDDIN

In this paper, an improved web page classification method (IWPCM) using neural networks to identify the illicit contents of web pages is proposed. The proposed IWPCM approach is based on improving the feature selection of web pages using class-based feature vectors (CPBF). The CPBF feature selection approach is computed by emphasizing the weights of important terms in illicit web documents and reducing the dependency on the weights of less important terms in normal web documents. The IWPCM approach has been examined using the modified term-weighting scheme, comparing it with several traditional term-weighting schemes on non-illicit and illicit web contents available from the web. Precision, recall, and F1 measures have been used to evaluate the effectiveness of the proposed IWPCM approach. The experimental results show that the proposed improved term-weighting scheme is able to identify the non-illicit and illicit web contents in the experimental datasets.
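The abstract does not give the CPBF formula, so the sketch below only illustrates the general idea of class-based term weighting: boost terms frequent in the illicit class and discount terms that also appear widely in normal documents. The ratio used here, and the toy collection, are assumptions, not the authors' scheme.

```python
import math

def class_based_weight(term, docs, labels, target="illicit"):
    """Illustrative class-based weight: document frequency of the term in
    the target class, scaled by how concentrated the term is there. The
    ratio inside the log is an assumed stand-in, not the CPBF formula."""
    in_df = sum(1 for d, l in zip(docs, labels) if l == target and term in d)
    out_df = sum(1 for d, l in zip(docs, labels) if l != target and term in d)
    return in_df * math.log(1 + (1 + in_df) / (1 + out_df))

# Invented collection: "casino" occurs only in illicit pages, "page" in all.
docs = [{"casino", "win", "page"}, {"casino", "bet", "page"},
        {"news", "page"}, {"sport", "page"}]
labels = ["illicit", "illicit", "normal", "normal"]
print(class_based_weight("casino", docs, labels))
print(class_based_weight("page", docs, labels))
```

Any weighting with this shape ranks class-discriminative terms like "casino" above terms such as "page" that are common to both classes, which is the behaviour the abstract attributes to CPBF.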


Author(s):  
Soner Kiziloluk ◽  
Ahmet Bedri Ozer

In recent years, data on the Internet has grown exponentially, attaining enormous dimensions. This situation makes it difficult to obtain useful information from such data. Web mining is the process of using data mining techniques such as association rules, classification, clustering, and statistics to discover and extract information from Web documents. Optimization algorithms play an important role in such techniques. In this work, the parliamentary optimization algorithm (POA), which is one of the latest social-based metaheuristic algorithms, has been adopted for Web page classification. Two different data sets (Course and Student) were selected for experimental evaluation, and HTML tags were used as features. The data sets were tested using different classification algorithms implemented in WEKA, and the results were compared with those of the POA. The POA was found to yield promising results compared to the other algorithms. This study is the first to propose the POA for effective Web page classification.

