An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
R. Suganya Devi ◽  
D. Manjula ◽  
R. K. Siddharth

Web Crawling has acquired tremendous significance in recent times, and it is aptly associated with the substantial development of the World Wide Web. Web search engines face new challenges due to the availability of vast amounts of web documents, which makes the retrieved results less relevant to the analysers. However, most recent Web Crawling work focuses solely on obtaining the links of the corresponding documents. Today, various algorithms and software exist to crawl links from the web, and these links must then be processed further for future use, thereby increasing the analyser's workload. This paper concentrates on crawling the links and retrieving all information associated with them to facilitate easy processing for other uses. First, the links are crawled from the specified uniform resource locator (URL) using a modified version of the Depth First Search algorithm, which allows complete hierarchical scanning of the corresponding web links. The links are then accessed via the source code, and their metadata, such as title, keywords, and description, are extracted. This content is essential for any analysis to be carried out on the Big Data obtained as a result of Web Crawling.
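Neither the modified Depth First Search nor the extraction step is given in the abstract; the sketch below is only an illustrative reconstruction in Python, using the standard library's html.parser and a caller-supplied fetch function (the names MetaExtractor and crawl, and the in-memory page store in the usage note, are invented for this example):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects <title>, <meta name=...> content, and <a href=...> links."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url, fetch, max_depth=3):
    """Depth-first crawl from start_url; fetch(url) returns HTML or None."""
    seen, records = set(), {}
    def visit(url, depth):
        if depth > max_depth or url in seen:
            return
        seen.add(url)
        html = fetch(url)
        if html is None:
            return
        parser = MetaExtractor()
        parser.feed(html)
        records[url] = {"title": parser.title.strip(), **parser.meta}
        for link in parser.links:  # recurse depth-first into child links
            visit(link, depth + 1)
    visit(start_url, 0)
    return records
```

A real crawler would back `fetch` with HTTP requests, normalise relative URLs, and restrict recursion to the target site; the in-memory variant keeps the depth-first structure visible.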

Author(s):  
R. Subhashini ◽  
V.Jawahar Senthil Kumar

The World Wide Web is a large distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today’s most advanced engines use the keyword-based (“bag of words”) paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user’s quick browsing of them. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach to clustering web search results based on a phrase-based clustering algorithm, as an alternative to the single ordered result list of search engines. The approach presents a list of clusters to the user. Experimental results verify the method’s feasibility and effectiveness.
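The abstract does not reproduce the phrase-based algorithm itself; the sketch below only illustrates the general idea, grouping result snippets by shared word n-grams so that the shared phrase doubles as a readable cluster label (the function names are invented for this example):

```python
from collections import defaultdict

def phrases(text, n=2):
    """All word n-grams of a snippet, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def cluster_snippets(snippets, n=2, min_size=2):
    """Group snippets by shared n-grams; keep phrases shared by at
    least min_size snippets, largest cluster first. The phrase itself
    serves as the human-readable cluster name."""
    by_phrase = defaultdict(list)
    for snippet in snippets:
        for phrase in phrases(snippet, n):
            by_phrase[phrase].append(snippet)
    clusters = {p: docs for p, docs in by_phrase.items() if len(docs) >= min_size}
    return dict(sorted(clusters.items(), key=lambda kv: -len(kv[1])))
```

Production systems such as suffix-tree clustering merge overlapping phrase clusters and score label quality; this sketch keeps only the labelling idea.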


2020 ◽  
pp. 143-158
Author(s):  
Chris Bleakley

Chapter 8 explores the arrival of the World Wide Web, Amazon, and Google. The web allows users to display “pages” of information retrieved from remote computers by means of the Internet. Inventor Tim Berners-Lee released the first web software for free, setting in motion an explosion in Internet usage. Seeing the opportunity of a lifetime, Jeff Bezos set up Amazon as an online bookstore. Amazon’s success was accelerated by a product recommender algorithm that selectively targets advertising at users. By the mid-1990s there were so many web sites that users often couldn’t find what they were looking for. Stanford PhD student Larry Page invented an algorithm for ranking search results based on the importance and relevance of web pages. Page and his fellow student Sergey Brin established a company to bring their search algorithm to the world. Page and Brin, the founders of Google, are now worth US$35-40 billion each.
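The ranking algorithm described here is PageRank. The chapter summary gives no code, but the classic power-iteration formulation can be sketched as follows (a simplified illustration, not Google's production algorithm):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links.
    A page's rank is the chance a 'random surfer' lands on it, following
    links with probability d and jumping to a random page otherwise."""
    pages = set(links) | {p for out in links.values() for p in out}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                share = rank[p] / len(out)
                for q in out:
                    new[q] += d * share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```

Pages with many incoming links from highly ranked pages end up with high rank, which is the "importance" signal the chapter describes.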


Author(s):  
Radek Burget ◽  
Pavel Smrz

Many documents in the World Wide Web present structured information consisting of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed by the visual presentation of the content, which is expected to be interpreted by a human reader. In this paper, the authors propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns. The model describes the visually expressed relationships between individual parts of the contents independently of the document format and the particular way of presentation. Therefore, it can be used as an appropriate document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method of extracting the relationships between content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
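The paper's formal model is not reproduced in the abstract; as a loose illustration of the underlying idea only, the toy sketch below infers "belongs-to" relationships between content fragments purely from presentation attributes such as font size and weight (all names and the prominence heuristic are invented for this example, not the authors' model):

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A visually rendered content fragment (fields are illustrative)."""
    text: str
    font_size: float
    bold: bool
    y: float  # vertical position on the page

def infer_relations(boxes):
    """Attach each box to the most recent preceding box that is more
    visually prominent, yielding (parent, child) 'belongs-to' pairs,
    e.g. a session entry belonging to a day heading."""
    def prominence(box):
        return (box.font_size, box.bold)
    relations, stack = [], []
    for box in sorted(boxes, key=lambda b: b.y):  # reading order
        # pop fragments that are no more prominent than the current one
        while stack and prominence(stack[-1]) <= prominence(box):
            stack.pop()
        if stack:
            relations.append((stack[-1].text, box.text))
        stack.append(box)
    return relations
```

Real visual models also use alignment, spacing, and colour; the stack-based pass above shows only how a hierarchy can be recovered without any explicit markup.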


Author(s):  
Esharenana E. Adomi

The World Wide Web (WWW) has led to the advent of the information age. With increased demand for information from various quarters, the Web has turned out to be a veritable resource. Web surfers in the early days were frustrated by the delay in finding the information they needed. The first major leap for information retrieval came from the deployment of Web search engines such as Lycos, Excite, AltaVista, etc. The rapid growth in the popularity of the Web during the past few years has led to a precipitous pronouncement of death for the online services that preceded the Web in the wired world.


Author(s):  
John E. De Villiers ◽  
André P. Calitz

The usefulness of a uniform resource locator (URL) on the World Wide Web is reliant on the resource being hosted at the same URL in perpetuity. When URLs are altered or removed, this results in the resource, such as an image or document, being inaccessible. While web-archiving projects seek to prevent such a loss of online resources, providing complete backups of the web remains a formidable challenge. This article outlines the initial development and testing of a decentralised application (DApp), provisionally named Repudiation Chain, as a potential tool to help address these challenges presented by shifting URLs and uncertain web-archiving. Repudiation Chain seeks to make use of a blockchain smart contract mechanism in order to allow individual users to contribute to web-archiving. Repudiation Chain aims to offer unalterable assurance that a specific file and its URL existed at a given point in time—by generating a compact, non-reversible representation of the file at the time of its non-repudiation. If widely adopted, such a tool could contribute to decentralisation and democratisation of web-archiving.
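The article's smart-contract code is not shown here; as a sketch of the core idea only, a compact, non-reversible representation of a file and its URL at a point in time can be produced with a cryptographic hash (the record layout below is an assumption for illustration, not Repudiation Chain's actual format):

```python
import hashlib

def repudiation_record(url, content, timestamp):
    """Non-reversible fingerprint of (URL, file bytes, time), of the kind
    that could be committed to a blockchain for later verification.
    Illustrative only; field layout is invented for this sketch."""
    digest = hashlib.sha256()
    for part in (url.encode(), content, str(timestamp).encode()):
        digest.update(len(part).to_bytes(8, "big"))  # length-prefix each field
        digest.update(part)
    return digest.hexdigest()
```

Because the hash is deterministic, anyone holding the original file can recompute the fingerprint and check it against the on-chain record, while the record itself reveals nothing about the file's contents.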


2013 ◽  
Vol 8 (3) ◽  
pp. 913-921 ◽  
Author(s):  
Noryusliza Abdullah ◽  
Rosziati Ibrahim

The Semantic Web approach, with the assistance of ontology, is widely used to build more reliable applications for retrieving information and knowledge. It is capable of discovering content on the World Wide Web (WWW) that is presented in natural-language text. Based on previous research, incorporating categorization with the ontology concept has been proven to give better results. Moreover, hybridising the search engine with a further technique, user profiling, has promising potential for enhancing the searching process. Reduced searching time and relevant results are the contributions of this research. The proposed hybrid technique integrates ontologies, categorization, and the user-profiling concept. In user profiling, a similarity measure is adopted for making comparisons between two different ontologies; WordNet and UTHM Onto are the independent ontologies used in this process. The preliminary experiments have given interesting results in terms of data arrangement and time usage.
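The abstract does not state which similarity measure is adopted; one common choice for comparing concepts across two independent ontologies, sketched here purely as an illustration, is the Jaccard overlap of their ancestor sets (the toy ontologies in the test are invented, not WordNet or UTHM Onto):

```python
def ancestors(concept, parent):
    """Walk a child -> parent mapping up to the root, collecting ancestors."""
    seen = set()
    while concept in parent:
        concept = parent[concept]
        if concept in seen:  # guard against cycles
            break
        seen.add(concept)
    return seen

def concept_similarity(c1, onto1, c2, onto2):
    """Jaccard overlap of the two concepts' labels plus ancestor labels,
    as a stand-in for a measure aligning two independent ontologies."""
    a1 = ancestors(c1, onto1) | {c1}
    a2 = ancestors(c2, onto2) | {c2}
    return len(a1 & a2) / len(a1 | a2)
```

A score of 1.0 means the concepts share an identical lineage; values between 0 and 1 indicate partial agreement between the two hierarchies.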


Author(s):  
Michael Chau ◽  
Yan Lu ◽  
Xiao Fang ◽  
Christopher C. Yang

More non-English contents are now available on the World Wide Web and the number of non-English users on the Web is increasing. While it is important to understand the Web searching behavior of these non-English users, many previous studies on Web query logs have focused on analyzing English search logs and their results may not be directly applied to other languages. In this Chapter we discuss some methods and techniques that can be used to analyze search queries in Chinese. We also show an example of applying our methods on a Chinese Web search engine. Some interesting findings are reported.
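The chapter's specific techniques are not detailed in this summary; one widely used building block for analysing Chinese queries, which carry no spaces between words, is character n-gram counting, sketched here as an illustration (the function names and sample log are invented):

```python
from collections import Counter

def char_ngrams(query, n=2):
    """Character n-grams: a common analysis unit for Chinese text,
    since word boundaries are not marked by whitespace."""
    return [query[i:i + n] for i in range(len(query) - n + 1)]

def top_terms(log, n=2, k=5):
    """Most frequent character n-grams across a query log, a first
    step toward finding popular search terms without word segmentation."""
    counts = Counter()
    for query in log:
        counts.update(char_ngrams(query, n))
    return counts.most_common(k)
```

Full query-log studies would add word segmentation and session analysis; bigram counting alone already surfaces frequent terms that whitespace tokenisation would miss.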


Author(s):  
Stewart T. Fleming

The open source software movement exists as a loose collection of individuals, organizations, and philosophies roughly grouped under the intent of making software source code as widely available as possible (Raymond, 1998). While the movement as such can trace its roots back more than 30 years to the development of academic software, the Internet, the World Wide Web, and so forth, the popularization of the movement grew significantly from the mid-80s (Naughton, 2000).


Antiquity ◽  
2002 ◽  
Vol 76 (293) ◽  
pp. 862-868 ◽  
Author(s):  
Ian Oxley

Introduction

In the past Britain has been a global naval, mercantile and industrial power and, as an island which has benefited from successive waves of settlement, its history is inextricably linked to its surrounding seas (Lavery 2001). High volumes of shipping traffic and a long history of seafaring and warfare have contributed to a density of shipwreck remains in UK territorial waters which is likely to be amongst the highest in the world.

Recently warship wrecks have been given a significantly higher degree of attention in the UK and world-wide, and the recent ‘scheduling’ of the German High Seas Fleet wrecks under the terms of the Ancient Monuments and Archaeological Areas Act 1979 (AMAA 1979) has led to new challenges in heritage management. At the same time as we are becoming aware of the value of these resources, the administrative, legislative, environmental and social frameworks in which they have to be managed are changing rapidly.

