Wrapper Maintenance: A Machine Learning Approach

2003 ◽  
Vol 18 ◽  
pp. 149-181 ◽  
Author(s):  
K. Lerman ◽  
S. N. Minton ◽  
C. A. Knoblock

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
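
The central idea — learning structural patterns of data fields from positive examples alone, then checking fresh extractions against those patterns — can be illustrated with a minimal sketch. The crude four-class token typing, the two-token signatures, and the thresholds below are illustrative assumptions, not the paper's actual algorithm.

```python
import re

# Illustrative token-type abstraction; the paper's learner uses a richer
# token hierarchy, so these four classes are a simplifying assumption.
def token_type(tok):
    if tok.isdigit():
        return "NUM"
    if tok.isupper():
        return "CAPS"
    if tok[0].isupper():
        return "CAPITALIZED"
    return "LOWER"

def signature(value, k=2):
    """Type sequence of the first k tokens of an extracted value."""
    toks = re.findall(r"\w+|\S", value)[:k]
    return tuple(token_type(t) for t in toks)

def learn_patterns(examples, min_support=0.2):
    """Learn significant start patterns from positive examples alone."""
    counts = {}
    for v in examples:
        sig = signature(v)
        counts[sig] = counts.get(sig, 0) + 1
    n = len(examples)
    return {sig for sig, c in counts.items() if c / n >= min_support}

def verify(patterns, new_values, threshold=0.5):
    """Flag a wrapper as broken if too few new extractions match."""
    matched = sum(1 for v in new_values if signature(v) in patterns)
    return matched / max(len(new_values), 1) >= threshold

# Usage: train on values the wrapper extracted while it was known to work.
patterns = learn_patterns(["John Smith", "Jane Doe", "Alan Turing"])
print(verify(patterns, ["Grace Hopper", "Ada Lovelace"]))   # True: looks OK
print(verify(patterns, ["404", "page moved", "error"]))     # False: source changed
```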

2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the Web grows significantly. Web information comes in several forms: structured, semi-structured, and unstructured. The majority of Web information is presented in Web pages, and that information is semi-structured. However, the information required for a given context is often scattered across different Web documents, and it is difficult to analyze large volumes of semi-structured information in Web pages and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in Web pages. The proposed framework integrates Web crawling, information extraction, and data mining technologies for better information analysis and thus more effective decision making. It enables people and organizations to extract information from various Web sources and to perform an effective analysis of the extracted data. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few example applications. The framework has been implemented and tested, and the results are promising.
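
The framework is described at the architecture level only; a minimal sketch of such a crawl → extract → mine → report pipeline might look like the following (the stdlib-only components, the word-count stand-in for the data-mining stage, and the example URL are all assumptions made for illustration):

```python
import urllib.request
from html.parser import HTMLParser

# A toy text extractor; a real framework would use a proper
# information-extraction component at this stage.
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def crawl(urls):
    """Stage 1: fetch raw pages (a real crawler would also follow links)."""
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            yield url, resp.read().decode("utf-8", errors="replace")

def extract(pages):
    """Stage 2: pull semi-structured text out of each page."""
    for url, html in pages:
        parser = TextCollector()
        parser.feed(html)
        yield url, parser.chunks

def analyze(records):
    """Stage 3: a stand-in for data mining, here just term counting."""
    counts = {}
    for _, chunks in records:
        for chunk in chunks:
            for word in chunk.lower().split():
                counts[word] = counts.get(word, 0) + 1
    return counts

# Stage 4: report for decision making.
report = analyze(extract(crawl(["https://example.com"])))
print(sorted(report.items(), key=lambda kv: -kv[1])[:10])
```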


Author(s):  
Shalin Hai-Jew

Understanding Web network structures may offer insights into various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent from a surface perusal of publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) into a data extraction pipeline to enable visualization of various aspects of HTTP network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new domains (like .ventures, .guru, .education, .company, and others) coming online, the ability to map widely will offer a broad competitive advantage to those who exploit this approach to enhance knowledge.


Author(s):  
Xiaoying Gao ◽  
Leon Sterling

The World Wide Web is known as the “universe of network-accessible information, the embodiment of human knowledge” (W3C, 1999). Internet-based knowledge management aims to use the Internet as the world wide environment for knowledge publishing, searching, sharing, reusing, and integration, and to support collaboration and decision making. However, knowledge on the Internet is buried in documents. Most of the documents are written in languages for human readers. The knowledge contained therein cannot be easily accessed by computer programs such as knowledge management systems. In order to make the Internet “machine readable,” information extraction from Web pages becomes a crucial research problem.


Author(s):  
Ily Amalina Ahmad Sabri ◽  
Mustafa Man

Web data extraction is the process of extracting user-required information from Web pages. The information consists of semi-structured data rather than data in a structured format, and the extraction targets Web documents in HTML format. Nowadays, most people use Web data extractors because the information involved is so large that manual extraction is time-consuming and complicated. In this paper we present WEIDJ, an approach for extracting images from the Web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using the DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies the DOM to model page structure and uses JSON as its programming environment. The extraction process takes as input both a Web address and the structure of the extraction. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each Web page to find images. Our approach targets three levels of extraction: a single Web page, multiple Web pages, and the whole website. Extensive experiments on several biodiversity Web pages have been conducted to compare the time performance of image extraction using DOM, JSON, and WEIDJ on a single Web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
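
As a rough illustration of the DOM-plus-JSON idea, the sketch below walks an HTML document's element tree, records each image together with the DOM path of the subtree containing it, and emits the result as JSON. It is a simplified stand-in, not the WEIDJ implementation: the block-wise subtree splitting and visual-block search of the paper are omitted.

```python
import json
from html.parser import HTMLParser

class ImageHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.path = []      # current position in the DOM tree
        self.images = []    # harvested image records

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)
        if tag == "img":
            a = dict(attrs)
            self.images.append({
                "src": a.get("src"),
                "alt": a.get("alt", ""),
                "dom_path": "/".join(self.path),  # locates the subtree
            })
            self.path.pop()  # img is a void element: it has no children

    def handle_endtag(self, tag):
        if self.path and self.path[-1] == tag:
            self.path.pop()

harvester = ImageHarvester()
harvester.feed("<html><body><div><img src='a.jpg' alt='specimen'></div>"
               "</body></html>")
print(json.dumps(harvester.images, indent=2))
```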


Author(s):  
Amelia Badica ◽  
Costin Badica ◽  
Elvira Popescu

The Web is designed as a major information provider for the human consumer. However, information published on the Web is difficult for a machine to understand and reuse. In this chapter, we show how well-established intelligent techniques based on logic programming and inductive learning, combined with more recent XML technologies, can help improve the efficiency of data extraction from Web pages. Our work can be seen as a necessary step toward the more general problem of Web data management and integration.
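
As a small illustration of the XML-technology side of such an approach, an XPath expression can serve as a declarative extraction rule over a page's tree structure, much as a learned wrapper clause would in a logic-programming setting. The lxml library and the toy document here are assumptions made for the example:

```python
from lxml import html  # third-party: pip install lxml

page = html.fromstring("""
<table>
  <tr><td class="title">Logic Programming</td><td class="price">30</td></tr>
  <tr><td class="title">Inductive Learning</td><td class="price">45</td></tr>
</table>""")

# Each XPath acts as a declarative extraction rule over the DOM tree.
for row in page.xpath("//tr"):
    title = row.xpath("string(td[@class='title'])")
    price = row.xpath("string(td[@class='price'])")
    print(title, price)
```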


Author(s):  
Song-il Cha ◽  
Z. M. Ma

Web-tables are ubiquitous in Web pages. Since tables are organized both structurally and semantically, they are good resources from which ontology can easily be extracted. However, most Web-tables are designed for intuitive human perception, so there are limits to interpreting table content using only the table's structural information. This paper therefore focuses on a method for interpreting table content based on the semantic characteristics of the table. In order to obtain the many property elements used for ontology inference, the authors discuss how to extract ontology properties from Web-tables. The extracted properties include the following elements: is-a relationships, class-instance relationships, triples, property domains, property ranges, symmetric properties, transitive properties, functional properties, inverse functional properties, and properties for defining super-sub relationships. Through experiments, the authors show that their method can effectively extract property elements from Web-tables.
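
As a toy illustration of the kind of property elements involved (not the authors' method), an attribute-value Web-table can be read as ontology triples, with the header row supplying property names and each body row describing one instance:

```python
# Toy table: header row gives the concept and property names,
# body rows give instances and their property values.
header = ["Country", "Capital", "Population"]
rows = [
    ["France", "Paris", "68000000"],
    ["Japan", "Tokyo", "125000000"],
]

triples = []
for row in rows:
    instance = row[0]
    # Class-instance relationship: each row is an instance of the
    # concept named by the first header cell.
    triples.append((instance, "rdf:type", header[0]))
    for prop, value in zip(header[1:], row[1:]):
        # Property triples; a property range could be inferred from value
        # syntax (e.g. all-digit values suggest a numeric range).
        triples.append((instance, prop, value))

for t in triples:
    print(t)
```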


2012 ◽  
pp. 50-65 ◽  
Author(s):  
K. Selvakuberan ◽  
M. Indra Devi ◽  
R. Rajaram

The explosive growth of the Web makes it a very useful information resource for all types of users. Today, everyone accesses the Internet for various purposes, and retrieving the required information within a stipulated time is a major user demand. Moreover, the Internet provides millions of Web pages for each and every search term, so obtaining interesting and relevant results from the Web becomes very difficult; classifying Web pages into relevant categories is therefore an active research topic. Web page classification focuses on assigning documents to different categories, which search engines use to produce their results. In this chapter we focus on different machine learning techniques and on how Web pages can be classified using them. The automatic classification of Web pages using machine learning techniques is the most efficient way for search engines to provide accurate results to users. Machine learning classifiers may also be trained to protect personal details from unauthenticated users and to support privacy-preserving data mining.
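
The chapter surveys techniques rather than prescribing one; as a minimal concrete example, a TF-IDF representation feeding a naive Bayes classifier is one common way such a Web-page classifier is built (scikit-learn and the toy page texts below are assumptions for the sketch):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for page texts and their category labels.
pages = [
    "latest football scores and match highlights",
    "stock market rallies as tech shares climb",
    "new smartphone review: camera and battery life",
    "election results and parliamentary debate coverage",
]
labels = ["sports", "finance", "technology", "politics"]

# TF-IDF features feeding a naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(pages, labels)

print(model.predict(["quarterly earnings beat analyst forecasts"]))
```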


2020 ◽  
pp. 144-151
Author(s):  
Ngo Le Huy Hien ◽  
Thai Quang Tien ◽  
Nguyen Van Hieu

The World Wide Web is a large, rich, and accessible information system whose user base is growing rapidly. To retrieve information from the Web as per users' requests, search engines are built to access Web pages. As search engine systems play a significant role in cybernetics, telecommunication, and physics, many efforts have been made to enhance their capacity. However, most of the data contained on the Web are unmanaged, making it impossible for current search engine mechanisms to access the entire network at once. The Web crawler is therefore a critical part of a search engine, navigating the Web and downloading the full texts of Web pages. Web crawlers may also be applied to detect missing links and to perform community detection in complex networks and cybernetic systems. However, template-based crawling techniques cannot handle the layout diversity of objects on Web pages. In this paper, a Web crawler module was designed and implemented that attempts to extract article-like contents from 495 websites. It uses a machine learning approach with visual cues, trivial HTML, and text-based features to filter out clutter. The outcomes are promising for extracting article-like contents from websites, contributing to the development of search engine systems, with future research geared towards higher-performance systems.
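
The paper's trained model and visual cues are not reproduced here, but the text-based side of clutter filtering can be sketched with two simple features: a block's text length and its link density (anchor text as a fraction of all text). The BeautifulSoup dependency and both thresholds are illustrative assumptions:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def article_blocks(html, min_chars=80, max_link_density=0.3):
    """Keep <div>/<p> blocks whose text is long and mostly non-anchor."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for block in soup.find_all(["div", "p"]):
        text = block.get_text(" ", strip=True)
        if len(text) < min_chars:
            continue  # too short to be article body
        link_text = sum(len(a.get_text()) for a in block.find_all("a"))
        if link_text / max(len(text), 1) > max_link_density:
            continue  # navigation or ad clutter: mostly anchor text
        kept.append(text)
    return kept

demo = ("<div><a href='/'>Home</a> <a href='/news'>News</a></div>"
        "<p>" + "Article body text. " * 10 + "</p>")
print(article_blocks(demo))  # keeps the paragraph, drops the nav bar
```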


2014 ◽  
Vol 11 (1) ◽  
pp. 111-131
Author(s):  
Tomas Grigalis ◽  
Antanas Cenys

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems, which extract structured data using wrappers that must be matched only to pages of a particular template. Selecting a single type of template from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally clustering Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world websites in a few minutes, achieving >90% accuracy.
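
A minimal sketch of the underlying idea, under the assumption (ours, for illustration) that positional indices can simply be stripped from XPath addresses: links emitted from the same template slot share one XPath shape, so the pages they point to can be clustered without even fetching them.

```python
import re
from collections import defaultdict

from lxml import html  # third-party: pip install lxml

def link_xpaths(page_html):
    """Yield (xpath_shape, target_url) for every link on a page."""
    tree = html.fromstring(page_html)
    root = tree.getroottree()
    for a in tree.iter("a"):
        href = a.get("href")
        if href:
            # Drop positional indices so links from the same template
            # slot share one XPath shape (a simplifying assumption).
            yield re.sub(r"\[\d+\]", "", root.getpath(a)), href

# Pages reached from the same XPath shape were very likely generated by
# the same template, so group their URLs together.
clusters = defaultdict(set)
listing_page = ("<ul><li><a href='/item/1'>A</a></li>"
                "<li><a href='/item/2'>B</a></li></ul>"
                "<a href='/about'>About</a>")
for shape, url in link_xpaths(listing_page):
    clusters[shape].add(url)
print(dict(clusters))  # item pages cluster together, /about stands alone
```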

