An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling

2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
R. Suganya Devi ◽  
D. Manjula ◽  
R. K. Siddharth

Web Crawling has acquired tremendous significance in recent times, and it is aptly associated with the substantial development of the World Wide Web. Web search engines face new challenges due to the availability of vast amounts of web documents, which makes the retrieved results less relevant to the analysers. However, most recent Web Crawling work focuses solely on obtaining the links of the corresponding documents. Today, various algorithms and software exist to crawl links from the web, and these links must then be processed further for future use, thereby increasing the analyser's workload. This paper concentrates on crawling the links and retrieving all information associated with them to facilitate easy processing for other uses. First, the links are crawled from the specified uniform resource locator (URL) using a modified version of the Depth First Search algorithm, which allows complete hierarchical scanning of the corresponding web links. The links are then accessed via the source code, and their metadata, such as title, keywords, and description, are extracted. This content is essential for any analysis to be carried out on the Big Data obtained as a result of Web Crawling.
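Neither the modified Depth First Search nor the extraction step is given in the abstract; the sketch below is only an illustrative reconstruction in Python, using the standard library's html.parser and a caller-supplied fetch function (the names MetaExtractor and crawl, and the in-memory page store in the usage note, are invented for this example):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects <title>, <meta name=...> content, and <a href=...> links."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url, fetch, max_depth=3):
    """Depth-first crawl from start_url; fetch(url) returns HTML or None."""
    seen, records = set(), {}
    def visit(url, depth):
        if depth > max_depth or url in seen:
            return
        seen.add(url)
        html = fetch(url)
        if html is None:
            return
        parser = MetaExtractor()
        parser.feed(html)
        records[url] = {"title": parser.title.strip(), **parser.meta}
        for link in parser.links:  # recurse depth-first into child links
            visit(link, depth + 1)
    visit(start_url, 0)
    return records
```

A real crawler would back `fetch` with HTTP requests, normalise relative URLs, and restrict recursion to the target site; the in-memory variant keeps the depth-first structure visible.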

Author(s):  
R. Subhashini ◽  
V.Jawahar Senthil Kumar

The World Wide Web is a large distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today’s most advanced engines use the keyword-based (“bag of words”) paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user’s quick browsing of them. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach to clustering web search results based on a phrase-based clustering algorithm, as an alternative to the single ordered result list of search engines. The approach presents a list of clusters to the user. Experimental results verify the method’s feasibility and effectiveness.
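The abstract does not reproduce the phrase-based algorithm itself; the sketch below only illustrates the general idea, grouping result snippets by shared word n-grams so that the shared phrase doubles as a readable cluster label (the function names are invented for this example):

```python
from collections import defaultdict

def phrases(text, n=2):
    """All word n-grams of a snippet, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def cluster_snippets(snippets, n=2, min_size=2):
    """Group snippets by shared n-grams; keep phrases shared by at
    least min_size snippets, largest cluster first. The phrase itself
    serves as the human-readable cluster name."""
    by_phrase = defaultdict(list)
    for snippet in snippets:
        for phrase in phrases(snippet, n):
            by_phrase[phrase].append(snippet)
    clusters = {p: docs for p, docs in by_phrase.items() if len(docs) >= min_size}
    return dict(sorted(clusters.items(), key=lambda kv: -len(kv[1])))
```

Production systems such as suffix-tree clustering merge overlapping phrase clusters and score label quality; this sketch keeps only the labelling idea.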


2020 ◽  
pp. 143-158
Author(s):  
Chris Bleakley

Chapter 8 explores the arrival of the World Wide Web, Amazon, and Google. The web allows users to display “pages” of information retrieved from remote computers by means of the Internet. Inventor Tim Berners-Lee released the first web software for free, setting in motion an explosion in Internet usage. Seeing the opportunity of a lifetime, Jeff Bezos set up Amazon as an online bookstore. Amazon’s success was accelerated by a product recommender algorithm that selectively targets advertising at users. By the mid-1990s there were so many web sites that users often couldn’t find what they were looking for. Stanford PhD student Larry Page invented an algorithm for ranking search results based on the importance and relevance of web pages. Page and his fellow student Sergey Brin established a company to bring their search algorithm to the world. Page and Brin, the founders of Google, are now worth US$35-40 billion each.
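The ranking algorithm described here is PageRank. The chapter summary gives no code, but the classic power-iteration formulation can be sketched as follows (a simplified illustration, not Google's production algorithm):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links.
    A page's rank is the chance a 'random surfer' lands on it, following
    links with probability d and jumping to a random page otherwise."""
    pages = set(links) | {p for out in links.values() for p in out}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                share = rank[p] / len(out)
                for q in out:
                    new[q] += d * share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```

Pages with many incoming links from highly ranked pages end up with high rank, which is the "importance" signal the chapter describes.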


Author(s):  
Radek Burget ◽  
Pavel Smrz

Many documents in the World Wide Web present structured information consisting of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. Instead, they are expressed by the visual presentation of the content, which is expected to be interpreted by a human reader. In this paper, the authors propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns. The model describes the visually expressed relationships between individual parts of the contents independently of the document format and the particular way of presentation. Therefore, it can be used as an appropriate document model in many information retrieval or extraction applications. The authors formally define the model, introduce a method of extracting the relationships between content parts based on visual presentation analysis, and discuss the expected applications. They also present a new dataset consisting of programmes of conferences and other scientific events and discuss its suitability for the task at hand. Finally, they use the dataset to evaluate the results of the implemented system.
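The paper's formal model is not reproduced in the abstract; as a loose illustration of the underlying idea only, the toy sketch below infers "belongs-to" relationships between content fragments purely from presentation attributes such as font size and weight (all names and the prominence heuristic are invented for this example, not the authors' model):

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A visually rendered content fragment (fields are illustrative)."""
    text: str
    font_size: float
    bold: bool
    y: float  # vertical position on the page

def infer_relations(boxes):
    """Attach each box to the most recent preceding box that is more
    visually prominent, yielding (parent, child) 'belongs-to' pairs,
    e.g. a session entry belonging to a day heading."""
    def prominence(box):
        return (box.font_size, box.bold)
    relations, stack = [], []
    for box in sorted(boxes, key=lambda b: b.y):  # reading order
        # pop fragments that are no more prominent than the current one
        while stack and prominence(stack[-1]) <= prominence(box):
            stack.pop()
        if stack:
            relations.append((stack[-1].text, box.text))
        stack.append(box)
    return relations
```

Real visual models also use alignment, spacing, and colour; the stack-based pass above shows only how a hierarchy can be recovered without any explicit markup.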


Author(s):  
Esharenana E. Adomi

The World Wide Web (WWW) has led to the advent of the information age. With increased demand for information from various quarters, the Web has turned out to be a veritable resource. Web surfers in the early days were frustrated by the delay in finding the information they needed. The first major leap for information retrieval came from the deployment of Web search engines such as Lycos, Excite, AltaVista, etc. The rapid growth in the popularity of the Web during the past few years has led to a precipitous pronouncement of death for the online services that preceded the Web in the wired world.


Author(s):  
John E. De Villiers ◽  
André P. Calitz

The usefulness of a uniform resource locator (URL) on the World Wide Web is reliant on the resource being hosted at the same URL in perpetuity. When URLs are altered or removed, this results in the resource, such as an image or document, being inaccessible. While web-archiving projects seek to prevent such a loss of online resources, providing complete backups of the web remains a formidable challenge. This article outlines the initial development and testing of a decentralised application (DApp), provisionally named Repudiation Chain, as a potential tool to help address these challenges presented by shifting URLs and uncertain web-archiving. Repudiation Chain seeks to make use of a blockchain smart contract mechanism in order to allow individual users to contribute to web-archiving. Repudiation Chain aims to offer unalterable assurance that a specific file and its URL existed at a given point in time—by generating a compact, non-reversible representation of the file at the time of its non-repudiation. If widely adopted, such a tool could contribute to decentralisation and democratisation of web-archiving.
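The article's smart-contract code is not shown here; as a sketch of the core idea only, a compact, non-reversible representation of a file and its URL at a point in time can be produced with a cryptographic hash (the record layout below is an assumption for illustration, not Repudiation Chain's actual format):

```python
import hashlib

def repudiation_record(url, content, timestamp):
    """Non-reversible fingerprint of (URL, file bytes, time), of the kind
    that could be committed to a blockchain for later verification.
    Illustrative only; field layout is invented for this sketch."""
    digest = hashlib.sha256()
    for part in (url.encode(), content, str(timestamp).encode()):
        digest.update(len(part).to_bytes(8, "big"))  # length-prefix each field
        digest.update(part)
    return digest.hexdigest()
```

Because the hash is deterministic, anyone holding the original file can recompute the fingerprint and check it against the on-chain record, while the record itself reveals nothing about the file's contents.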


2013 ◽  
Vol 8 (3) ◽  
pp. 913-921 ◽  
Author(s):  
Noryusliza Abdullah ◽  
Rosziati Ibrahim

The Semantic Web approach, with the assistance of ontology, is widely used to build more reliable applications for retrieving information and knowledge. It is capable of discovering content on the World Wide Web (WWW) that is presented in natural-language text. Based on previous research, incorporating categorization with the ontology concept has been proven to give better results. Moreover, hybridising the search engine with a further technique, user profiling, has promising potential for enhancing the searching process. Reduced searching time and relevant results are the contributions of this research. The proposed hybrid technique integrates ontologies, categorization, and the user-profiling concept. In user profiling, a similarity measure is adopted for making comparisons between two different ontologies; WordNet and UTHM Onto are the independent ontologies used in this process. The preliminary experiments have given interesting results in terms of data arrangement and time usage.
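The abstract does not state which similarity measure is adopted; one common choice for comparing concepts across two independent ontologies, sketched here purely as an illustration, is the Jaccard overlap of their ancestor sets (the toy ontologies in the test are invented, not WordNet or UTHM Onto):

```python
def ancestors(concept, parent):
    """Walk a child -> parent mapping up to the root, collecting ancestors."""
    seen = set()
    while concept in parent:
        concept = parent[concept]
        if concept in seen:  # guard against cycles
            break
        seen.add(concept)
    return seen

def concept_similarity(c1, onto1, c2, onto2):
    """Jaccard overlap of the two concepts' labels plus ancestor labels,
    as a stand-in for a measure aligning two independent ontologies."""
    a1 = ancestors(c1, onto1) | {c1}
    a2 = ancestors(c2, onto2) | {c2}
    return len(a1 & a2) / len(a1 | a2)
```

A score of 1.0 means the concepts share an identical lineage; values between 0 and 1 indicate partial agreement between the two hierarchies.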


Author(s):  
Michael Chau ◽  
Yan Lu ◽  
Xiao Fang ◽  
Christopher C. Yang

More non-English contents are now available on the World Wide Web and the number of non-English users on the Web is increasing. While it is important to understand the Web searching behavior of these non-English users, many previous studies on Web query logs have focused on analyzing English search logs and their results may not be directly applied to other languages. In this Chapter we discuss some methods and techniques that can be used to analyze search queries in Chinese. We also show an example of applying our methods on a Chinese Web search engine. Some interesting findings are reported.
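The chapter's specific techniques are not detailed in this summary; one widely used building block for analysing Chinese queries, which carry no spaces between words, is character n-gram counting, sketched here as an illustration (the function names and sample log are invented):

```python
from collections import Counter

def char_ngrams(query, n=2):
    """Character n-grams: a common analysis unit for Chinese text,
    since word boundaries are not marked by whitespace."""
    return [query[i:i + n] for i in range(len(query) - n + 1)]

def top_terms(log, n=2, k=5):
    """Most frequent character n-grams across a query log, a first
    step toward finding popular search terms without word segmentation."""
    counts = Counter()
    for query in log:
        counts.update(char_ngrams(query, n))
    return counts.most_common(k)
```

Full query-log studies would add word segmentation and session analysis; bigram counting alone already surfaces frequent terms that whitespace tokenisation would miss.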


Author(s):  
Stewart T. Fleming

The open source software movement exists as a loose collection of individuals, organizations, and philosophies roughly grouped under the intent of making software source code as widely available as possible (Raymond, 1998). While the movement as such can trace its roots back more than 30 years to the development of academic software, the Internet, the World Wide Web, and so forth, the popularization of the movement grew significantly from the mid-80s (Naughton, 2000).


Antiquity ◽  
2002 ◽  
Vol 76 (293) ◽  
pp. 862-868 ◽  
Author(s):  
Ian Oxley

Introduction

In the past Britain has been a global naval, mercantile and industrial power and, as an island which has benefited from successive waves of settlement, its history is inextricably linked to its surrounding seas (Lavery 2001). High volumes of shipping traffic and a long history of seafaring and warfare have contributed to a density of shipwreck remains in UK territorial waters which is likely to be amongst the highest in the world.

Recently warship wrecks have been given a significantly higher degree of attention in the UK and world-wide, and the recent ‘scheduling’ of the German High Seas Fleet wrecks under the terms of the Ancient Monuments and Archaeological Areas Act 1979 (AMAA 1979) has led to new challenges in heritage management. At the same time as we are becoming aware of the value of these resources, the administrative, legislative, environmental and social frameworks in which they have to be managed are changing rapidly.

