CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

Lámpsakos ◽  
2015 ◽  
pp. 39
Author(s):  
Fernando Pech-May ◽  
Alicia Martínez-Rebollar ◽  
Hugo Estrada-Esquivel ◽  
Eduardo Pedroza-Landa

The web is the most widely used information source in academic, scientific, and industrial forums. Its explosive growth has generated billions of pages, whose information may be categorized as the surface web, composed of static pages indexed by search engines, and the hidden web, accessible only through search templates. This paper presents the development of a crawler that allows searching, querying, and analyzing information in both the surface and hidden web within specific domains.
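
The abstract includes no code; as a rough sketch of the two crawling modes it describes, the snippet below follows hyperlinks for surface pages and fills in a GET search template to reach hidden content. The seed URL, endpoint, and form field name are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the two crawling modes described above: link-following
# for the surface web and search-template submission for the hidden web.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags (surface-web traversal)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl_surface(seed, max_pages=10):
    """Breadth-first traversal over static, directly linked pages."""
    seen, frontier, pages = set(), [seed], {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # A real crawler would also filter out mailto:, fragments, etc.
        frontier.extend(urllib.parse.urljoin(url, l) for l in parser.links)
    return pages

def query_hidden(template_url, field, term):
    """Reaches hidden-web content by filling a search template (GET form)."""
    query = urllib.parse.urlencode({field: term})
    return fetch(f"{template_url}?{query}")

# Usage (illustrative endpoints, not from the paper):
# surface_pages = crawl_surface("https://example.org/")
# result_page = query_hidden("https://example.org/search", "q", "multimedia")
```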

Author(s):  
Manuel Álvarez Díaz ◽  
Víctor Manuel Prieto Álvarez ◽  
Fidel Cacheda Seijo
Keyword(s):  

This paper presents an analysis of the most important features of the Web, its evolution, and the implications for the tools that traverse it to index its content for later searching. It is important to remark that some of these features cause a rather large subset of the Web to remain "hidden". The analysis focuses on snapshots of the global Web for six different years, 2009 to 2014. The results for each year are analyzed both independently and together, to facilitate the analysis of the features at any given time and of the changes between the analyzed years. The objective of the analysis is twofold: to characterize the Web and, more importantly, its evolution over time.
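
As a hedged illustration of the kind of year-over-year comparison the paper describes, the sketch below aggregates per-snapshot page features and reports their evolution. The record schema (year, size, links) is an assumption for illustration, not the paper's actual dataset format.

```python
# Aggregate page features per yearly snapshot and compare across years.
from collections import defaultdict
from statistics import mean

def summarize_snapshots(pages):
    """pages: iterable of dicts like {"year": 2009, "size": 14200, "links": 31}."""
    by_year = defaultdict(list)
    for page in pages:
        by_year[page["year"]].append(page)
    summary = {}
    for year in sorted(by_year):
        group = by_year[year]
        summary[year] = {
            "pages": len(group),
            "avg_size": mean(p["size"] for p in group),
            "avg_links": mean(p["links"] for p in group),
        }
    return summary

# Usage with toy records spanning two of the analyzed years:
sample = [
    {"year": 2009, "size": 12000, "links": 25},
    {"year": 2009, "size": 16000, "links": 40},
    {"year": 2014, "size": 48000, "links": 90},
]
for year, stats in summarize_snapshots(sample).items():
    print(year, stats)
```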


2004 ◽  
pp. 227-267
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim

With the explosion of information on the Web, traditional ways of browsing and keyword searching over web pages no longer satisfy the demanding needs of web surfers. Web information extraction has emerged as an important research area that aims to automatically extract information from target web pages and convert it into a structured format for further processing. The main issues involved in the extraction process include: (1) the definition of a suitable extraction language; (2) the definition of a data model representing the web information source; (3) the generation of the data model, given a target source; and (4) the extraction and presentation of information according to a given data model. In this chapter, we discuss the challenges of these issues and the approaches that current research activities have taken to resolve them. We propose several classification schemes to classify existing approaches to information extraction from different perspectives. Among the existing works, we focus on the Wiccap system, a software system that enables ordinary end-users to obtain information of interest in a simple and efficient manner by constructing personalized web views of information sources.
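
A minimal sketch of issues (1) through (4) above: a tiny extraction rule "language" (regular expressions), a data model binding field names to rules, and an extractor that turns a target page into structured records. This is not Wiccap's actual mechanism or extraction language, only a hedged illustration of the general idea; the field names and HTML classes are invented.

```python
# Rule-based wrapper sketch: a data model maps fields to extraction rules,
# and the extractor applies the model to emit structured records.
import re

# (1)+(2): the data model names the fields and binds each to a rule.
BOOK_MODEL = {
    "title": re.compile(r'<h2 class="title">(.*?)</h2>', re.S),
    "price": re.compile(r'<span class="price">(.*?)</span>', re.S),
}

def extract(page_html, model):
    """(3)+(4): apply the model to a source page and emit structured records."""
    columns = {field: rule.findall(page_html) for field, rule in model.items()}
    n = min(len(values) for values in columns.values()) if columns else 0
    return [{field: columns[field][i] for field in columns} for i in range(n)]

sample = """
<h2 class="title">Web Data Management</h2><span class="price">$35</span>
<h2 class="title">Information Extraction</h2><span class="price">$42</span>
"""
print(extract(sample, BOOK_MODEL))
# [{'title': 'Web Data Management', 'price': '$35'},
#  {'title': 'Information Extraction', 'price': '$42'}]
```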


2021 ◽  
pp. 50-71
Author(s):  
Shakeel Ahmed ◽  
Shubham Sharma ◽  
Saneh Lata Yadav

Information retrieval is the task of finding material of an unstructured nature within large collections stored on computers. The surface web consists of indexed content accessible through traditional browsers, whereas deep or hidden web content cannot be found with traditional search engines and requires a password or network permissions. Within the deep web, the dark web is also growing, as new tools make it easier to navigate hidden content accessible only through special software such as Tor. According to a study published in Nature, Google indexes no more than 16% of the surface web and misses all of the deep web; any given search turns up just 0.03% of the information that exists online. The key part of the hidden web therefore remains inaccessible to users. This chapter poses some questions about this research, explains detailed definitions and analogies, discusses related work, and puts forward the advantages and limitations of the existing work proposed by researchers. The chapter identifies the need for a system that will process both surface and hidden web data and return integrated results to the users.
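
The chapter motivates a system that returns integrated results from surface and hidden web sources. The sketch below shows one plausible form of that integration step, a round-robin merge of two ranked result lists with deduplication; the fetchers that would produce these lists (a search API call and a search-template submission) are left out, and the URLs are illustrative.

```python
# Hedged sketch of result integration: interleave a surface-web ranking
# and a hidden-web ranking into one deduplicated list.
def merge_results(surface, hidden):
    """Round-robin interleave two ranked URL lists, dropping duplicates."""
    merged, seen = [], set()
    for i in range(max(len(surface), len(hidden))):
        for ranking in (surface, hidden):
            if i < len(ranking) and ranking[i] not in seen:
                seen.add(ranking[i])
                merged.append(ranking[i])
    return merged

# Usage with illustrative URLs:
surface_hits = ["http://a.example/1", "http://b.example/2"]
hidden_hits = ["http://db.example/record?id=7", "http://a.example/1"]
print(merge_results(surface_hits, hidden_hits))
# ['http://a.example/1', 'http://db.example/record?id=7', 'http://b.example/2']
```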


Author(s):  
Ji-Rong Wen

A Web query log is a file that keeps track of the activities of users of a search engine. Compared to the traditional information retrieval setting, in which documents are the only available information source, query logs are an additional information source in the Web search setting. Based on query logs, a set of Web mining techniques, such as log-based query clustering, log-based query expansion, collaborative filtering, and personalized search, can be employed to improve the performance of Web search.
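
As a hedged sketch of one technique named above, log-based query clustering, the snippet below groups queries that share clicked URLs, a common click-through heuristic. The log format of (query, clicked_url) pairs is an assumption for illustration, not the author's actual method.

```python
# Cluster queries by shared clicked URLs using a small union-find structure.
from collections import defaultdict

def cluster_queries(log):
    """log: iterable of (query, clicked_url) pairs; returns sets of queries."""
    by_url = defaultdict(set)
    for query, url in log:
        by_url[url].add(query)
    parent = {}
    def find(q):
        parent.setdefault(q, q)
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q
    def union(a, b):
        parent[find(a)] = find(b)
    # Queries clicked on the same URL end up in the same cluster.
    for queries in by_url.values():
        queries = list(queries)
        for other in queries[1:]:
            union(queries[0], other)
    clusters = defaultdict(set)
    for query, _ in log:
        clusters[find(query)].add(query)
    return list(clusters.values())

log = [("cheap flights", "http://air.example"),
       ("low cost airfare", "http://air.example"),
       ("python tutorial", "http://py.example")]
print(cluster_queries(log))
# [{'cheap flights', 'low cost airfare'}, {'python tutorial'}]
```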


2012 ◽  
Vol 532-533 ◽  
pp. 767-771 ◽  
Author(s):  
Shu Ming Hsieh ◽  
Ssu An Lo ◽  
Chiun Chieh Hsu ◽  
Da Ren Chen

The management of university web sites is becoming more critical than before due to the rapid growth of the population that depends on the World Wide Web as its most important (if not only) information source. A university can spread its research outcomes and educational achievements through its web site, and consequently gain visibility and influence among the web population. The Webometrics Ranking of World Universities (WR), proposed by the Centre for Scientific Information and Documentation (CINDOC-CSIC), which ranks university web sites, has attracted much attention recently. The WR rankings are well recognized as an important index for universities seeking to promote themselves through internet technology. In this paper, we propose WRES, an early warning system for Webometrics Rankings. WRES gathers the WR indices from the WWW automatically at flexible intervals, and provides useful information in real time to the managers of university web sites. If the WR ranking of an institution is below the position expected from its academic performance, university authorities should reconsider their web policy by promoting substantial increases in the volume and quality of their electronic publications. In addition, web site managers may adopt effective approaches to improve their WR rankings according to the hints given by WRES.
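
A minimal sketch of the early-warning loop WRES embodies: poll ranking indicators on a schedule and alert when an institution falls below its expected position. The fetcher is a stub with a placeholder value; the real system gathers the WR indices from the Web, and the institution name and threshold here are invented for illustration.

```python
# Early-warning watcher: periodically compare observed rank to expectation.
import time

def fetch_current_rank(institution):
    """Stub: a real fetcher would gather the published WR indices."""
    return 480  # illustrative placeholder, not real data

def watch(institution, expected_rank, interval_seconds=86400, cycles=1):
    """Warn whenever the observed rank is worse (numerically larger) than expected."""
    for cycle in range(cycles):
        rank = fetch_current_rank(institution)
        if rank > expected_rank:
            print(f"WARNING: {institution} ranked {rank}; "
                  f"expected {expected_rank} or better")
        if cycle < cycles - 1:
            time.sleep(interval_seconds)

watch("Example University", expected_rank=450)
```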


2001 ◽  
Vol 30 (2) ◽  
pp. 125-135 ◽  
Author(s):  
Randal D. Carlson ◽  
Judi Repman

The future of the information landscape is being shaped by new technologies that store and retrieve information. For many computer users, the Web is their first and last stop in information searching. Searches produce an overwhelming number of returns, but may include few that are "on target." Finding general information may be easy, but depth of information is frequently lacking. This article focuses on describing resources for researchers, collectively called the Invisible Web, that are hidden from the usual search tools, and on contrasting them with the resources available on the surface Web. It then identifies search tools and strategies that can be used to dig beneath the surface of the Web to locate credible, in-depth information. These resources must be accessed using specialized search tools and databases.


2007 ◽  
Vol 9 (1) ◽  
Author(s):  
W. T. Kritzinger ◽  
M. Weideman

The growth of the World Wide Web has spawned a wide variety of new information sources, which has also left users with the daunting task of determining which sources are valid. Many users rely on the Web as an information source because of the low cost of information retrieval. It is also claimed that the Web has evolved into a powerful business tool; examples include highly popular business services such as Amazon.com and Kalahari.net. It is estimated that around 80% of users utilize search engines to locate information on the Internet. This, by implication, places emphasis on the underlying importance of Web pages being listed in search engine indices. Empirical evidence that the placement of keywords in certain areas of the body text influences a Web site's visibility to search engines could not be found in the literature. The results of two experiments indicated that keywords should be concentrated towards the top, and diluted towards the bottom, of a Web page to increase visibility. However, care should be taken with keyword density, to prevent search engine algorithms from raising the spam alarm.
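
The two quantities the experiments turn on, overall keyword density and how concentrated the keyword is toward the top of the body text, are easy to measure. The sketch below is an illustration of those measurements only, not the authors' experimental procedure; the sample text and any threshold you would apply are assumptions.

```python
# Measure keyword density and the keyword's mean relative position
# (0.0 = all occurrences at the very top, 1.0 = at the very bottom).
def keyword_stats(body_text, keyword):
    words = body_text.lower().split()
    hits = [i for i, w in enumerate(words) if w == keyword.lower()]
    density = len(hits) / len(words) if words else 0.0
    mean_position = (sum(hits) / len(hits) / len(words)) if hits else None
    return density, mean_position

text = "seo guide seo basics explained with examples and a closing note on seo"
density, position = keyword_stats(text, "seo")
print(f"density={density:.2%}, mean relative position={position:.2f}")
```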


2013 ◽  
Vol 347-350 ◽  
pp. 2479-2482
Author(s):  
Yao Hui Li ◽  
Li Xia Wang ◽  
Jian Xiong Wang ◽  
Jie Yue ◽  
Ming Zhan Zhao

The Web has become the largest information source, but noise content is an inevitable part of any web page. Noise content reduces the precision of search engines and increases the load on servers. Information extraction technology has been developed to address this, and is mostly based on page segmentation. After analyzing existing page segmentation methods, an approach to web page information extraction is proposed in which block nodes are identified by analyzing the attributes of HTML tags. The algorithm is easy to implement, and experiments demonstrate its good performance.
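
A hedged sketch in the spirit of the approach above: walk the HTML tag stream, treat tags whose class or id attributes look like content containers as candidate block nodes, and keep the candidate with the most text. The attribute hints and tag set are invented heuristics, not the paper's exact block-identification algorithm.

```python
# Identify a content block node by analyzing HTML tag attributes and
# accumulated text length. Assumes well-formed (balanced) markup.
from html.parser import HTMLParser

CONTENT_HINTS = ("content", "article", "main", "post")  # assumed heuristics
BLOCK_TAGS = {"div", "article", "section", "td"}

class BlockScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # frames: [label, is_candidate, text_chars]
        self.best = ("", 0)   # (label, text length) of best block so far
    def handle_starttag(self, tag, attrs):
        attr_text = " ".join(value or "" for _, value in attrs).lower()
        candidate = tag in BLOCK_TAGS and any(h in attr_text for h in CONTENT_HINTS)
        self.stack.append([f"{tag}[{attr_text}]", candidate, 0])
    def handle_data(self, data):
        for frame in self.stack:  # text counts toward every open ancestor
            frame[2] += len(data.strip())
    def handle_endtag(self, tag):
        if self.stack:
            label, candidate, chars = self.stack.pop()
            if candidate and chars > self.best[1]:
                self.best = (label, chars)

page = ('<div class="nav">Home | About</div>'
        '<div class="main-content"><p>The actual article body lives here.</p></div>')
scorer = BlockScorer()
scorer.feed(page)
print("identified block node:", scorer.best)
```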

