CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

Lámpsakos ◽  
2015 ◽  
pp. 39
Author(s):  
Fernando Pech-May ◽  
Alicia Martínez-Rebollar ◽  
Hugo Estrada-Esquivel ◽  
Eduardo Pedroza-Landa

The web is the most widely used information source in academic, scientific, and industrial forums. Its explosive growth has generated billions of pages, whose information may be categorized as the surface web, composed of static pages indexed by search engines, and the hidden web, accessible only through search templates. This paper presents the development of a crawler that allows searching, querying, and analyzing information in both the surface and hidden web within specific domains.
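
The abstract includes no code; as a rough sketch of the two crawling modes it describes, the snippet below follows hyperlinks for surface pages and fills in a GET search template to reach hidden content. The seed URL, endpoint, and form field name are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the two crawling modes described above: link-following
# for the surface web and search-template submission for the hidden web.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags (surface-web traversal)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl_surface(seed, max_pages=10):
    """Breadth-first traversal over static, directly linked pages."""
    seen, frontier, pages = set(), [seed], {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # A real crawler would also filter out mailto:, fragments, etc.
        frontier.extend(urllib.parse.urljoin(url, l) for l in parser.links)
    return pages

def query_hidden(template_url, field, term):
    """Reaches hidden-web content by filling a search template (GET form)."""
    query = urllib.parse.urlencode({field: term})
    return fetch(f"{template_url}?{query}")

# Usage (illustrative endpoints, not from the paper):
# surface_pages = crawl_surface("https://example.org/")
# result_page = query_hidden("https://example.org/search", "q", "multimedia")
```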

Author(s):  
Manuel Álvarez Díaz ◽  
Víctor Manuel Prieto Álvarez ◽  
Fidel Cacheda Seijo
Keyword(s):  

This paper presents an analysis of the most important features of the Web, its evolution, and the implications for the tools that traverse it to index its content for later searching. It is important to remark that some of these features cause a rather large subset of the Web to remain "hidden". The analysis focuses on snapshots of the global Web for six different years, 2009 to 2014. The results for each year are analyzed both independently and together, to facilitate the analysis of the features at any given time and of the changes between the analyzed years. The objective of the analysis is twofold: to characterize the Web and, more importantly, its evolution over time.
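
As a hedged illustration of the kind of year-over-year comparison the paper describes, the sketch below aggregates per-snapshot page features and reports their evolution. The record schema (year, size, links) is an assumption for illustration, not the paper's actual dataset format.

```python
# Aggregate page features per yearly snapshot and compare across years.
from collections import defaultdict
from statistics import mean

def summarize_snapshots(pages):
    """pages: iterable of dicts like {"year": 2009, "size": 14200, "links": 31}."""
    by_year = defaultdict(list)
    for page in pages:
        by_year[page["year"]].append(page)
    summary = {}
    for year in sorted(by_year):
        group = by_year[year]
        summary[year] = {
            "pages": len(group),
            "avg_size": mean(p["size"] for p in group),
            "avg_links": mean(p["links"] for p in group),
        }
    return summary

# Usage with toy records spanning two of the analyzed years:
sample = [
    {"year": 2009, "size": 12000, "links": 25},
    {"year": 2009, "size": 16000, "links": 40},
    {"year": 2014, "size": 48000, "links": 90},
]
for year, stats in summarize_snapshots(sample).items():
    print(year, stats)
```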


2004 ◽  
pp. 227-267
Author(s):  
Wee Keong Ng ◽  
Zehua Liu ◽  
Zhao Li ◽  
Ee Peng Lim

With the explosion of information on the Web, traditional ways of browsing and keyword searching over web pages no longer satisfy the demanding needs of web surfers. Web information extraction has emerged as an important research area that aims to automatically extract information from target web pages and convert it into a structured format for further processing. The main issues involved in the extraction process include: (1) the definition of a suitable extraction language; (2) the definition of a data model representing the web information source; (3) the generation of the data model, given a target source; and (4) the extraction and presentation of information according to a given data model. In this chapter, we discuss the challenges of these issues and the approaches that current research activities have taken to resolve them. We propose several classification schemes to classify existing approaches to information extraction from different perspectives. Among the existing works, we focus on the Wiccap system, a software system that enables ordinary end-users to obtain information of interest in a simple and efficient manner by constructing personalized web views of information sources.
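
A minimal sketch of issues (1) through (4) above: a tiny extraction rule "language" (regular expressions), a data model binding field names to rules, and an extractor that turns a target page into structured records. This is not Wiccap's actual mechanism or extraction language, only a hedged illustration of the general idea; the field names and HTML classes are invented.

```python
# Rule-based wrapper sketch: a data model maps fields to extraction rules,
# and the extractor applies the model to emit structured records.
import re

# (1)+(2): the data model names the fields and binds each to a rule.
BOOK_MODEL = {
    "title": re.compile(r'<h2 class="title">(.*?)</h2>', re.S),
    "price": re.compile(r'<span class="price">(.*?)</span>', re.S),
}

def extract(page_html, model):
    """(3)+(4): apply the model to a source page and emit structured records."""
    columns = {field: rule.findall(page_html) for field, rule in model.items()}
    n = min(len(values) for values in columns.values()) if columns else 0
    return [{field: columns[field][i] for field in columns} for i in range(n)]

sample = """
<h2 class="title">Web Data Management</h2><span class="price">$35</span>
<h2 class="title">Information Extraction</h2><span class="price">$42</span>
"""
print(extract(sample, BOOK_MODEL))
# [{'title': 'Web Data Management', 'price': '$35'},
#  {'title': 'Information Extraction', 'price': '$42'}]
```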


2021 ◽  
pp. 50-71
Author(s):  
Shakeel Ahmed ◽  
Shubham Sharma ◽  
Saneh Lata Yadav

Information retrieval is the task of finding material of an unstructured nature within large collections stored on computers. The surface web consists of indexed content accessible through traditional browsers, whereas deep or hidden web content cannot be found with traditional search engines and requires a password or network permissions. Within the deep web, the dark web is also growing, as new tools make it easier to navigate hidden content accessible only through special software such as Tor. According to a study published in Nature, Google indexes no more than 16% of the surface web and misses all of the deep web; any given search turns up just 0.03% of the information that exists online. The key part of the hidden web therefore remains inaccessible to users. This chapter poses some questions about this research, explains detailed definitions and analogies, discusses related work, and puts forward the advantages and limitations of the existing work proposed by researchers. The chapter identifies the need for a system that will process both surface and hidden web data and return integrated results to the users.
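
The chapter motivates a system that returns integrated results from surface and hidden web sources. The sketch below shows one plausible form of that integration step, a round-robin merge of two ranked result lists with deduplication; the fetchers that would produce these lists (a search API call and a search-template submission) are left out, and the URLs are illustrative.

```python
# Hedged sketch of result integration: interleave a surface-web ranking
# and a hidden-web ranking into one deduplicated list.
def merge_results(surface, hidden):
    """Round-robin interleave two ranked URL lists, dropping duplicates."""
    merged, seen = [], set()
    for i in range(max(len(surface), len(hidden))):
        for ranking in (surface, hidden):
            if i < len(ranking) and ranking[i] not in seen:
                seen.add(ranking[i])
                merged.append(ranking[i])
    return merged

# Usage with illustrative URLs:
surface_hits = ["http://a.example/1", "http://b.example/2"]
hidden_hits = ["http://db.example/record?id=7", "http://a.example/1"]
print(merge_results(surface_hits, hidden_hits))
# ['http://a.example/1', 'http://db.example/record?id=7', 'http://b.example/2']
```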


Author(s):  
Ji-Rong Wen

A Web query log is a file that keeps track of the activities of users of a search engine. Compared to the traditional information retrieval setting, in which documents are the only available information source, query logs are an additional information source in the Web search setting. Based on query logs, a set of Web mining techniques, such as log-based query clustering, log-based query expansion, collaborative filtering, and personalized search, can be employed to improve the performance of Web search.
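
As a hedged sketch of one technique named above, log-based query clustering, the snippet below groups queries that share clicked URLs, a common click-through heuristic. The log format of (query, clicked_url) pairs is an assumption for illustration, not the author's actual method.

```python
# Cluster queries by shared clicked URLs using a small union-find structure.
from collections import defaultdict

def cluster_queries(log):
    """log: iterable of (query, clicked_url) pairs; returns sets of queries."""
    by_url = defaultdict(set)
    for query, url in log:
        by_url[url].add(query)
    parent = {}
    def find(q):
        parent.setdefault(q, q)
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q
    def union(a, b):
        parent[find(a)] = find(b)
    # Queries clicked on the same URL end up in the same cluster.
    for queries in by_url.values():
        queries = list(queries)
        for other in queries[1:]:
            union(queries[0], other)
    clusters = defaultdict(set)
    for query, _ in log:
        clusters[find(query)].add(query)
    return list(clusters.values())

log = [("cheap flights", "http://air.example"),
       ("low cost airfare", "http://air.example"),
       ("python tutorial", "http://py.example")]
print(cluster_queries(log))
# [{'cheap flights', 'low cost airfare'}, {'python tutorial'}]
```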


2012 ◽  
Vol 532-533 ◽  
pp. 767-771 ◽  
Author(s):  
Shu Ming Hsieh ◽  
Ssu An Lo ◽  
Chiun Chieh Hsu ◽  
Da Ren Chen

The management of university web sites is becoming more critical than before due to the rapid growth of the population that depends on the World Wide Web as its most important (if not only) information source. A university can spread its research outcomes and educational achievements through its web site, and consequently gain visibility and influence among the web population. The Webometrics Ranking of World Universities (WR), proposed by the Centre for Scientific Information and Documentation (CINDOC-CSIC), which ranks university web sites, has attracted much attention recently. The WR rankings are well recognized as an important index for universities seeking to promote themselves through internet technology. In this paper, we propose WRES, an early warning system for Webometrics Rankings. WRES gathers the WR indices from the WWW automatically at flexible intervals, and provides useful information in real time to the managers of university web sites. If the WR ranking of an institution is below the position expected from its academic performance, university authorities should reconsider their web policy by promoting substantial increases in the volume and quality of their electronic publications. In addition, web site managers may adopt effective approaches to improve their WR rankings according to the hints given by WRES.
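
A minimal sketch of the early-warning loop WRES embodies: poll ranking indicators on a schedule and alert when an institution falls below its expected position. The fetcher is a stub with a placeholder value; the real system gathers the WR indices from the Web, and the institution name and threshold here are invented for illustration.

```python
# Early-warning watcher: periodically compare observed rank to expectation.
import time

def fetch_current_rank(institution):
    """Stub: a real fetcher would gather the published WR indices."""
    return 480  # illustrative placeholder, not real data

def watch(institution, expected_rank, interval_seconds=86400, cycles=1):
    """Warn whenever the observed rank is worse (numerically larger) than expected."""
    for cycle in range(cycles):
        rank = fetch_current_rank(institution)
        if rank > expected_rank:
            print(f"WARNING: {institution} ranked {rank}; "
                  f"expected {expected_rank} or better")
        if cycle < cycles - 1:
            time.sleep(interval_seconds)

watch("Example University", expected_rank=450)
```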


2001 ◽  
Vol 30 (2) ◽  
pp. 125-135 ◽  
Author(s):  
Randal D. Carlson ◽  
Judi Repman

The future of the information landscape is being shaped by new technologies that store and retrieve information. For many computer users, the Web is their first and last stop in information searching. Searches produce an overwhelming number of returns, but may include few that are "on target." Finding general information may be easy, but depth of information is frequently lacking. This article focuses on describing resources for researchers, collectively called the Invisible Web, that are hidden from the usual search tools, and on contrasting them with the resources available on the surface Web. It then identifies search tools and strategies that can be used to dig beneath the surface of the Web to locate credible, in-depth information. These resources must be accessed using specialized search tools and databases.


2007 ◽  
Vol 9 (1) ◽  
Author(s):  
W. T. Kritzinger ◽  
M. Weideman

The growth of the World Wide Web has spawned a wide variety of new information sources, which has also left users with the daunting task of determining which sources are valid. Many users rely on the Web as an information source because of the low cost of information retrieval. It is also claimed that the Web has evolved into a powerful business tool; examples include highly popular business services such as Amazon.com and Kalahari.net. It is estimated that around 80% of users utilize search engines to locate information on the Internet. This, by implication, places emphasis on the underlying importance of Web pages being listed in search engine indices. Empirical evidence that the placement of keywords in certain areas of the body text influences a Web site's visibility to search engines could not be found in the literature. The results of two experiments indicated that keywords should be concentrated towards the top, and diluted towards the bottom, of a Web page to increase visibility. However, care should be taken with keyword density, to prevent search engine algorithms from raising the spam alarm.
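
The two quantities the experiments turn on, overall keyword density and how concentrated the keyword is toward the top of the body text, are easy to measure. The sketch below is an illustration of those measurements only, not the authors' experimental procedure; the sample text and any threshold you would apply are assumptions.

```python
# Measure keyword density and the keyword's mean relative position
# (0.0 = all occurrences at the very top, 1.0 = at the very bottom).
def keyword_stats(body_text, keyword):
    words = body_text.lower().split()
    hits = [i for i, w in enumerate(words) if w == keyword.lower()]
    density = len(hits) / len(words) if words else 0.0
    mean_position = (sum(hits) / len(hits) / len(words)) if hits else None
    return density, mean_position

text = "seo guide seo basics explained with examples and a closing note on seo"
density, position = keyword_stats(text, "seo")
print(f"density={density:.2%}, mean relative position={position:.2f}")
```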


2013 ◽  
Vol 347-350 ◽  
pp. 2479-2482
Author(s):  
Yao Hui Li ◽  
Li Xia Wang ◽  
Jian Xiong Wang ◽  
Jie Yue ◽  
Ming Zhan Zhao

The Web has become the largest information source, but noise content is an inevitable part of any web page. Noise content reduces the precision of search engines and increases the load on servers. Information extraction technology has been developed to address this, and is mostly based on page segmentation. After analyzing existing page segmentation methods, an approach to web page information extraction is proposed in which block nodes are identified by analyzing the attributes of HTML tags. The algorithm is easy to implement, and experiments demonstrate its good performance.
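
A hedged sketch in the spirit of the approach above: walk the HTML tag stream, treat tags whose class or id attributes look like content containers as candidate block nodes, and keep the candidate with the most text. The attribute hints and tag set are invented heuristics, not the paper's exact block-identification algorithm.

```python
# Identify a content block node by analyzing HTML tag attributes and
# accumulated text length. Assumes well-formed (balanced) markup.
from html.parser import HTMLParser

CONTENT_HINTS = ("content", "article", "main", "post")  # assumed heuristics
BLOCK_TAGS = {"div", "article", "section", "td"}

class BlockScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # frames: [label, is_candidate, text_chars]
        self.best = ("", 0)   # (label, text length) of best block so far
    def handle_starttag(self, tag, attrs):
        attr_text = " ".join(value or "" for _, value in attrs).lower()
        candidate = tag in BLOCK_TAGS and any(h in attr_text for h in CONTENT_HINTS)
        self.stack.append([f"{tag}[{attr_text}]", candidate, 0])
    def handle_data(self, data):
        for frame in self.stack:  # text counts toward every open ancestor
            frame[2] += len(data.strip())
    def handle_endtag(self, tag):
        if self.stack:
            label, candidate, chars = self.stack.pop()
            if candidate and chars > self.best[1]:
                self.best = (label, chars)

page = ('<div class="nav">Home | About</div>'
        '<div class="main-content"><p>The actual article body lives here.</p></div>')
scorer = BlockScorer()
scorer.feed(page)
print("identified block node:", scorer.best)
```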

