A Semantic Focused Web Crawler Based on a Knowledge Representation Schema

2020 ◽  
Vol 10 (11) ◽  
pp. 3837
Author(s):  
Julio Hernandez ◽  
Heidy M. Marin-Castro ◽  
Miguel Morales-Sandoval

The Web has become the main source of information in the digital world, expanding into heterogeneous domains and growing continuously. By means of a search engine, users can systematically search the Web for particular information from a text query; such engines are domain-unaware search tools that maintain near real-time information. One type of web search tool is the semantic focused web crawler (SFWC), which exploits the semantics of the Web, guided by ontology-based heuristics, to determine which web pages belong to the domain defined by a query. An SFWC is highly dependent on its ontological resource, which must be created by human domain experts. This work presents a novel SFWC based on a generic knowledge representation schema that models the crawler's domain, thus reducing the complexity and cost of building the more formal representation required when using ontologies. Furthermore, a similarity measure combining the inverse document frequency (IDF) metric, the standard deviation, and the arithmetic mean is proposed for the SFWC. This measure filters web page content according to the domain of interest during the crawling task. A set of experiments was run over the domains of computer science, politics, and diabetes to validate and evaluate the proposed crawler. The quantitative (harvest ratio) and qualitative (Fleiss' kappa) evaluations demonstrate the suitability of the proposed SFWC for crawling the Web with a knowledge representation schema instead of a domain ontology.
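
The abstract does not give the exact way IDF, the arithmetic mean, and the standard deviation are combined, so the following sketch is only an illustration of how such a filter could look: score a candidate page by the mean IDF of its domain terms, then accept it when that score stays within a standard-deviation band around the mean score of known on-domain pages. All names and the acceptance rule are assumptions, not the paper's formula.

```python
# Illustrative sketch only: the combination of IDF, mean, and standard deviation
# below is an assumption used to make the idea concrete.
import math
from collections import Counter

def idf_table(domain_docs):
    """Compute IDF for every term in a small corpus of on-domain documents."""
    n = len(domain_docs)
    df = Counter()
    for doc in domain_docs:
        df.update(set(doc.split()))
    return {t: math.log(n / df[t]) for t in df}

def page_score(page_text, idf):
    """Score a page as the mean IDF of its terms that appear in the domain vocabulary."""
    weights = [idf[t] for t in page_text.split() if t in idf]
    return sum(weights) / len(weights) if weights else 0.0

def in_domain(page_text, idf, mean_score, std_score, k=1.0):
    """Assumed acceptance rule: keep the page if its score lies within k standard
    deviations of the mean score observed on known on-domain pages."""
    return abs(page_score(page_text, idf) - mean_score) <= k * std_score
```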

Author(s):  
Reinaldo Padilha França ◽  
Ana Carolina Borges Monteiro ◽  
Rangel Arthur ◽  
Yuzo Iano

The Semantic Web concept is an extension of the Web obtained by adding semantics to the current data representation format; it can be thought of as a network of correlated meanings, the result of combining web-based conceptions and technologies with knowledge representation. The internet has passed through several stages in its web versions 1.0, 2.0, and 3.0, the last often called the smart web. Web 3.0 is associated with the Semantic Web because technological advances have allowed the internet to reach beyond the devices that were built expressly to receive a connection, such as computers or smartphones: it embraces reading, writing, and execution off-screen, performed by machines. Therefore, this chapter aims to provide an updated review of the Semantic Web and its technologies, showing its technological origins, tracing its path to success with a concise bibliographic background, and categorizing and synthesizing the potential of its technologies.


Author(s):  
Christopher Walton

At the start of this book we outlined the challenges of automatic, computer-based processing of information on the Web. These numerous challenges are generally referred to as the ‘vision’ of the Semantic Web. From the outset, we have attempted to take a realistic and pragmatic view of this vision. Our opinion is that the vision may never be fully realized, but that it is a useful goal on which to focus. Each step towards the vision has provided new insights on classical problems in knowledge representation, MASs, and Web-based techniques. Thus, we are presently in a significantly better position as a result of these efforts. It is sometimes difficult to see the purpose of the Semantic Web vision behind all of the different technologies and acronyms. However, the fundamental purpose of the Semantic Web is essentially large-scale and automated data integration. The Semantic Web is not just about providing a more intelligent kind of Web search, but also about taking the results of these searches and combining them in interesting and useful ways. As stated in Chapter 1, the possible applications for the Semantic Web include: automated data mining, e-science experiments, e-learning systems, personalized newspapers and journals, and intelligent devices. The current state of progress towards the Semantic Web vision is summarized in Figure 8.1. This figure shows a pyramid with the human-centric Web at the bottom, sometimes termed the Syntactic Web, and the envisioned Semantic Web at the top. Throughout this book, we have been moving upwards on this pyramid, and it should be clear that a great deal of progress has been made towards the goal. This progress is indicated by the various stages of the pyramid, which can be summarized as follows: • The lowest stage of the pyramid is the basic Web that should be familiar to everyone. This Web of information is human-centric and contains very little automation. Nonetheless, the Web provides the basic protocols and technologies on which the Semantic Web is founded. Furthermore, the information which is represented on the Web will ultimately be the source of knowledge for the Semantic Web.


2010 ◽  
Vol 129-131 ◽  
pp. 670-674
Author(s):  
Xu Jing ◽  
Dong Jian He ◽  
Lin Sen Zan ◽  
Jian Liang Li ◽  
Wang Yao

In management-type SaaS, users must be permitted to submit a tenant's business data to the SP's server, and that data may carry embedded web-based malware. In this paper, we propose an automatic method for detecting web-based malware based on behavior analysis, which helps meet the SLA by detecting web-based malware proactively. First, the tenant's update is downloaded to a bastion host by a web crawler. Second, the behavior produced when the tenant's update is opened by IE is observed; to contain any malicious behavior during detection, a monitoring DLL is injected into IE. Finally, if sensitive operations occur, the URL is appended to a malicious-address database and, at the same time, the system administrator is notified by SMS. Test results show that our method can detect web-based malware accurately, helping to improve the service level of management-type SaaS.
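
A minimal orchestration sketch of the described detection loop is given below. Every helper passed in (the downloader, the DLL-instrumented IE runner on the bastion host, the trace classifier, the alert channel) is a hypothetical stand-in, not an API from the paper.

```python
# Hypothetical orchestration of the detection workflow described in the abstract.
def analyze_tenant_update(url, download, open_in_ie, is_malicious_trace,
                          malicious_db, alert):
    """Fetch the tenant's update, open it in the DLL-instrumented IE on the bastion
    host, and record/alert if the behavior trace contains sensitive operations."""
    local_copy = download(url)            # step 1: crawler fetches the update
    trace = open_in_ie(local_copy)        # step 2: open in instrumented IE, record behavior
    if is_malicious_trace(trace):         # step 3: behavior analysis of the trace
        malicious_db.add(url)             # append URL to the malicious-address database
        alert(f"Web-based malware suspected at {url}")   # e.g. via an SMS gateway
        return True
    return False
```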


1999 ◽  
Vol 08 (02) ◽  
pp. 137-156 ◽  
Author(s):  
CHING-CHI HSU ◽  
CHIA-HUI CHANG

This paper describes a Web information search tool called WebYacht. The goal of WebYacht is to solve the problem of imprecise search results in current Web search engines. Due to the incomplete information given by users and the diversified information published on the Web, conventional document ranking based on an automatic assessment of document relevance to the query may not be the best approach when, as in most cases, little information is given. In order to clarify the ambiguity of the short queries given by users, WebYacht adopts a cluster-based browsing model as well as relevance feedback to facilitate Web information search. The idea is to let users give two to three times more feedback in the same amount of time required by conventional feedback mechanisms. With the assistance of the cluster-based representation provided by WebYacht, much browsing labor can be saved. In this paper, we explain the techniques used in the design of WebYacht and compare the performance of the feedback interface designs to that of conventional similarity-ranked search results.
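
The abstract does not state WebYacht's feedback formula; the sketch below uses a standard Rocchio-style update merely to illustrate how feedback on clusters (or documents) can refine a short, ambiguous query vector.

```python
# Not WebYacht's actual method: a generic Rocchio-style relevance-feedback update.
import numpy as np

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector from feedback on document (or cluster) vectors."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant_docs):
        q += beta * np.mean(relevant_docs, axis=0)    # pull towards relevant items
    if len(nonrelevant_docs):
        q -= gamma * np.mean(nonrelevant_docs, axis=0)  # push away from non-relevant items
    return np.clip(q, 0.0, None)                       # keep term weights non-negative
```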


Author(s):  
GAURAV AGARWAL ◽  
SACHI GUPTA ◽  
SAURABH MUKHERJEE

Today, web servers are the key repositories of information, and the internet is the means of obtaining it. The amount of data on the Internet is enormous, and finding the relevant data is a difficult job in which search engines play a vital role. A search engine follows these steps: web crawling by a crawler, indexing by an indexer, and searching by a searcher. The web crawler retrieves information from web pages by following every link on a site; the content is stored by the search engine and then indexed by the indexer, whose main role is to make data retrievable quickly according to user requirements. When a client issues a query, the search engine retrieves the results corresponding to that query. The goal here is to develop a search engine algorithm that returns the most desirable results for the user's requirements. A ranking method is used by the search engine to rank web pages. Various ranking approaches are discussed in the literature, but in this paper a ranking algorithm based on the parent-child relationship is proposed. The proposed algorithm is based on the priority-assignment phase of the Heterogeneous Earliest Finish Time (HEFT) algorithm, which was designed for multiprocessor task scheduling. The proposed algorithm works on three variables: the keyword density, the number of successors of a node, and the age of the web page. Density captures how often the keyword occurs on a particular page; the number of successors is the count of outgoing links from a page; and age measures the freshness of a page, so the most recently modified page has the smallest age, i.e., the largest freshness value. The proposed technique sets the priority of each page from its downward rank value and arranges pages in ascending or descending order of rank. Experiments show that the algorithm is valuable: in a comparison with Google, the proposed algorithm performed better on 70% of the test problems.
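
The exact combination of keyword density, number of successors, and page age is not given in this summary, so the sketch below is only an assumed illustration: freshness is taken as 1/(1 + age), the three variables are weighted equally, and a page's rank also reflects its highest-ranked parent in the spirit of HEFT's downward rank.

```python
# Assumed weighting and propagation; not the paper's exact ranking formula.
def page_rank_value(density, num_successors, age_days, parent_ranks=(), w=(1.0, 1.0, 1.0)):
    freshness = 1.0 / (1.0 + age_days)            # fresher pages get larger values
    local = w[0] * density + w[1] * num_successors + w[2] * freshness
    # downward-rank style: a page's priority also reflects its highest-ranked parent
    return local + (max(parent_ranks) if parent_ranks else 0.0)

def order_results(pages):
    """pages: list of dicts with 'url', 'density', 'successors', 'age', 'parent_ranks'."""
    return sorted(pages,
                  key=lambda p: page_rank_value(p['density'], p['successors'],
                                                p['age'], p.get('parent_ranks', ())),
                  reverse=True)
```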


2021 ◽  
pp. 54-65
Author(s):  
Khlid M. .. ◽  
...  

Most people are related to the web in one way or another by participating in some kind of social networking site. Semantic Web technology plays a crucial role in these sites, as they contain an enormous amount of data about persons, pages, events, places, corporations, etc. This research is a Semantic Web application designed to create a new semantic social community called Socialpedia. It links already existing public social information to newly published information, and this information is further linked with other data on the web to construct a new, immense data container. The resulting data container can be processed with a variety of Semantic Web techniques to produce machine-understandable content. This content shows the promise of using integrated data to improve Web search and Web-scale data analysis, unlike conventional search engines or social ones. Building this community involves obtaining data from traditional users, known as contributors or participants, linking data from existing social networks, extracting structured data as triples using predefined ontologies, and finally querying and inferring over such data to obtain meaningful pieces of information. Socialpedia supports all popular functionalities of social networking websites besides the enhanced features of the Semantic Web, providing an advanced semantic search that acts as a semantic search engine.
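
As a small illustration of querying linked social data as triples (not Socialpedia's actual code), the sketch below builds a toy FOAF-style graph with rdflib and runs a semantic query over it; the vocabulary and example data are assumptions.

```python
# Toy example of triple storage and semantic querying with rdflib.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
alice, bob = URIRef("http://example.org/alice"), URIRef("http://example.org/bob")
g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, bob))
g.add((bob, RDF.type, FOAF.Person))
g.add((bob, FOAF.name, Literal("Bob")))

# Semantic search over the linked data: who does Alice know?
q = """
SELECT ?name WHERE {
  ?a foaf:name "Alice" .
  ?a foaf:knows ?b .
  ?b foaf:name ?name .
}"""
for row in g.query(q, initNs={"foaf": FOAF}):
    print(row.name)   # -> Bob
```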


2017 ◽  
Author(s):  
Don L Jewett

"Publication forms the core structure supporting the development and transmission of scientific knowledge" (Galbraith2015). Yet, with the WorldWideWeb a dominant part of current scientific publication and information-dissemination, internet "publication" is still paper-based in its style and methods. As will become painfully obvious, such a paper-based "publishing model" is NOT adequate for a Web-based world. Consider that in 2011, an estimated 5,000 peer-reviewed scientific articles were published per day (Outsell2013), and that in 2014 the English-language scholarly publications on the Web alone amounted to about 4,900 per day. In 1980, the distinguished scientist Garrett Hardin wrote [Hardin1980]: "Who can keep up with such a torrent? When I was young and foolish I vowed that I would read all the articles in my small field of science. Discovering that this was impossible, I tried to read all the abstracts. That, too, proved too much. Now I know that I cannot even read all the titles." To help reduce scholarly information-overload, this article proposes using Knowledge-Step Forums for the purpose of creating a new type of scholarly publication, Web-based Compendia. Each Compendium is about a very narrow topic and is presented in a MultiLevel Format. When all these features are combined, the scholarly article is called a Knowledge-Step Compendium, and it is posted on the Web by the scholar, either on an institutional server or on one of many web-hosting servers. Web-search engines will be automatically notified about the new posting (and later changes, too). Forum-Compendors need not be senior faculty members (as is the case with traditional literature-reviews), but can be pre-docs, post-docs, and senior medical/surgical residents. These graduate students will be aided by their mentors and online experts to create the Knowledge-Step Compendia. All participants (students and faculty) will be motivated by their own self-interest, and everyone gains from the activity, which self-organizes groups of like-minded scholars. Such groups can be the basis for early reviews of new data, for discovering new ideas, and for finding jobs. Knowledge-Step Forums will speed publication on the Web because they will easily support publication of preprints using the software's automatic collection of online "peer-review" comments. In order for the Internet to be an efficient searchable repository of current and developing knowledge, one additional feature will be needed: ForwardLinks must be available in any given publication to those articles that, in the future, cite the given publication, as fully described in a Supplement to this article. Open-source software for this functionality should be on all Web-servers that contain scholarly articles, so as to make the WWW a distributed web full of linkages, of both ForwardLinks and RetroLinks.

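As a toy illustration of the ForwardLink idea only (the article's Supplement describes the real mechanism), the sketch below keeps an in-memory registry that maps a cited article to the later articles citing it; the store and function names are hypothetical.

```python
# Hypothetical ForwardLink registry: cited article URL -> URLs of later citing articles.
from collections import defaultdict

forward_links = defaultdict(list)

def register_publication(new_url, cited_urls):
    """When a new article is posted, each article it cites gains a ForwardLink to it;
    the RetroLinks (new article -> cited articles) already exist in the article itself."""
    for cited in cited_urls:
        forward_links[cited].append(new_url)

def forward_citations(url):
    """Articles published after `url` that cite it."""
    return forward_links[url]
```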

Author(s):  
Sang Thanh Thi Nguyen ◽  
Tuan Thanh Nguyen

With the rapid advancement of ICT, the World Wide Web (referred to as the Web) has become the biggest information repository, and its volume keeps growing on a daily basis. The challenge is how to find the most wanted information on the Web with minimum effort. This paper presents a novel ontology-based framework for searching for web pages related to a given term within a few given specific websites. With this framework, a web crawler first learns the content of the web pages within the given websites; the topic modeller then finds the relations between web pages and topics via keywords found on the pages, using the Latent Dirichlet Allocation (LDA) technique. After that, the ontology builder establishes an ontology, a semantic network of web pages based on the topic model. Finally, a reasoner can find the web pages related to a given term by making use of the ontology. The framework and the related modelling techniques have been verified using a few test websites, and the results confirm its superiority over existing web search tools.
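
To make the topic-modelling step concrete, the sketch below runs LDA over a few toy pages with gensim, one standard implementation of Latent Dirichlet Allocation; the example pages, the number of topics, and the way the page-topic links would feed the ontology builder are assumptions, not the paper's code.

```python
# Sketch of the topic-modelling step only, using gensim's LDA on toy "pages".
from gensim import corpora, models

pages = [
    "machine learning models for web page classification".split(),
    "crawler downloads web pages and follows links".split(),
    "ontology and semantic reasoning over topics".split(),
]
dictionary = corpora.Dictionary(pages)                      # term <-> id mapping
corpus = [dictionary.doc2bow(tokens) for tokens in pages]   # bag-of-words per page
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# Relate each page to its topics; these page-topic relations would then be
# encoded as the semantic network built by the ontology builder.
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow))
```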


Author(s):  
R. Umagandhi ◽  
A. V. Senthil Kumar

The Web is the largest and most voluminous data source in the world. The inconceivable boom of information available on the web brings with it the challenge of retrieving precise and appropriate information at the time of need, and the unpredictable amount of web information makes ambiguity in web search a constant threat. In this scenario, a search engine retrieves significant information from the web based on the query term given by the user. The search queries given by users are usually short and ambiguous, so they may not produce appropriate results, and the retrieved results may not always be relevant; at times, irrelevant and redundant results are retrieved because of the short and ambiguous query keywords. Query recommendation is a technique that provides alternate queries as substitutes for the input query, helping the user frame queries in the future. A methodology was framed to identify similar queries and cluster them; each cluster contains the similar queries that are used to provide the recommendations.
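
The chapter does not fix a particular clustering algorithm, so the sketch below uses a common baseline, TF-IDF vectors with k-means, purely to illustrate recommending the other queries from the cluster of the input query; the example query log is invented.

```python
# Baseline sketch of query clustering for recommendation (TF-IDF + k-means).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

past_queries = ["cheap flights paris", "paris flight deals", "python list sort",
                "sort a list in python", "flights to paris"]
X = TfidfVectorizer().fit_transform(past_queries)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def recommend(query_index):
    """Recommend the other queries in the same cluster as the input query."""
    return [q for q, lab in zip(past_queries, labels)
            if lab == labels[query_index] and q != past_queries[query_index]]

print(recommend(0))   # alternate queries similar to "cheap flights paris"
```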

