A Semantic Topic Identification System for Document Retrieval on the Web

Author(s):  
Pasquale Capasso ◽  
Carmine Cesarano ◽  
Antonio Picariello ◽  
Lucio Sansone
2020 ◽  
Vol 4 (3) ◽  
pp. 551-557
Author(s):  
Muhammad Zaky Ramadhan ◽  
Kemas Muslim Lhaksmana

Hadith has several levels of authenticity, including weak (dhaif) and fabricated (maudhu) hadiths, which may not originate from the Prophet Muhammad PBUH and thus should not be considered when deriving Islamic law (sharia). However, ordinary Muslims commonly mistake many such hadiths for authentic ones. To make such hadiths easier to identify, this paper proposes a method that checks the authenticity of a hadith by comparing it with a collection of fabricated hadiths in Indonesian. The proposed method applies the vector space model and performs spelling correction using SymSpell to test whether spell checking improves the accuracy of hadith retrieval; this has not been done in previous work, and typos are common in Indonesian-translated hadiths in raw text on the Web and social media. The experimental results show that spell checking improves mean average precision to 81% (from 73%) and recall to 89% (from 80%). This improvement in accuracy makes the hadith retrieval system more feasible and encourages its adoption in future work, because it can correct the typos that are common in raw text on the Internet.
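The vector space model mentioned in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, smoothed TF-IDF weights and cosine similarity, none of which the abstract specifies; the function names and the toy documents in the usage note are hypothetical. Spelling correction would run as a preprocessing step before `tokenize` and is not shown here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization. A real system would also
    # normalize punctuation and correct spelling before this step.
    return text.lower().split()


def build_idf(tokenized_docs):
    # Smoothed inverse document frequency, so every indexed term keeps a
    # positive weight even when it appears in all documents.
    n = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector as a dict; terms unknown to the index are dropped.
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    # Returns (score, document index) pairs, best match first.
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    vectors = [tfidf_vector(t, idf) for t in tokenized]
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

For example, `rank("apple juice", ["red apple pie recipe", "green apple juice", "train timetable for monday"])` places the second document first, because it shares both query terms, and gives the third document a score of zero.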


Author(s):  
Yuan Mao Huang ◽  
Chung-Cheng Liao

This paper presents a student design project covering the procedures and results of conceptual design for identification systems. The sub-function structure of the identification system is created after the requirements are recognized and the specification is established. Physical effects, physical principles and solution principles are found for the sub-functions, and alternatives, or combined solution principles, are generated. The Saaty method with modified normalized values is used to determine the relative importance, or weighting factors, of the standard evaluation items by paired comparisons. The eigenvalues and eigenvectors of the evaluation items and alternatives are determined. The web method is then used to determine the most preferred design among the alternatives, and the best alternative is recommended. The project shows that determining the sub-functions, physical effects, physical principles, solution principles and combined solution principles, together with the surveys and matrices of evaluation items and alternatives, is very difficult, tedious and time consuming.
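The paired-comparison weighting step can be sketched as follows. This is the standard Saaty eigenvector approach, computed by power iteration; the "modified normalized values" the abstract refers to are not described and are not reproduced here. The function names and the example matrix in the usage note are assumptions for illustration.

```python
def ahp_weights(matrix, max_iter=1000, tol=1e-12):
    # matrix[i][j] states how much more important item i is than item j,
    # with matrix[j][i] == 1 / matrix[i][j] and ones on the diagonal.
    n = len(matrix)
    weights = [1.0 / n] * n
    lambda_max = float(n)
    for _ in range(max_iter):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        # Because the weights sum to 1, the sum of A*w estimates the
        # principal eigenvalue once the iteration has converged.
        lambda_max = sum(product)
        updated = [p / lambda_max for p in product]
        converged = max(abs(u - w) for u, w in zip(updated, weights)) < tol
        weights = updated
        if converged:
            break
    return weights, lambda_max


def consistency_index(lambda_max, n):
    # Saaty's consistency index; 0 means perfectly consistent judgments.
    return (lambda_max - n) / (n - 1)
```

With the perfectly consistent matrix `[[1, 2, 4], [1/2, 1, 2], [1/4, 1/2, 1]]`, the weights converge to 4/7, 2/7 and 1/7, the principal eigenvalue is 3, and the consistency index is 0.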


Author(s):  
Louis Massey ◽  
Wilson Wong

This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will always have limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector-space and ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification that deviates radically from current techniques used for document classification, text clustering, and concept discovery is proposed. This approach is inspired by human cognition, which allows ‘meaning’ to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift allows for the exploitation rather than avoidance of dependence between terms to derive meaning without the complexity introduced by conventional natural language processing techniques. Using the unstructured texts in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies that are required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.


2015 ◽  
Vol 13 (1) ◽  
pp. 31-41 ◽  
Author(s):  
Rajendra Prasath ◽  
Vijai Kumar ◽  
Sudeshna Sarkar

2019 ◽  
Vol 9 (3) ◽  
pp. 12-22
Author(s):  
Imtiaz Hussain Khan ◽  
Muazzam Ahmed Siddiqui ◽  
Kamal M. Jambi

This article describes a plagiarism detection system for the Arabic language that combines different similarity-measure techniques to uncover plagiarism in Arabic documents. The proposed system consists of two main components: document retrieval and detailed similarity analysis. The document-retrieval component generates queries from a given suspicious document and uses the Google search API to retrieve candidate source documents from the Web. The similarity-analysis component takes each source document in turn and attempts to identify the plagiarized parts of the suspicious document. The proposed system is thoroughly evaluated using an indigenous corpus. At the document-retrieval level, the system achieved an f-score above 75%, whereas at the detailed similarity-computation level, the f-score is above 70%.
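The abstract does not say which similarity measures the system combines. As one common possibility, the sketch below computes word n-gram containment between a suspicious text and a candidate source, along with the f-score used for evaluation. The function names, the choice of trigrams and the whitespace tokenization are all assumptions, not the authors' method.

```python
def word_ngrams(text, n=3):
    # Set of word n-grams; texts shorter than n words yield an empty set.
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=3):
    # Fraction of the suspicious text's n-grams that also occur in the source.
    suspicious_grams = word_ngrams(suspicious, n)
    if not suspicious_grams:
        return 0.0
    return len(suspicious_grams & word_ngrams(source, n)) / len(suspicious_grams)


def f_score(precision, recall):
    # Harmonic mean of precision and recall (F1).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `containment("a b c d", "x a b c y")` is 0.5, since one of the two trigrams in the suspicious text also appears in the source.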


2016 ◽  
Vol 25 (03) ◽  
pp. 1650018 ◽  
Author(s):  
Andreas Kanavos ◽  
Christos Makris ◽  
Yannis Plegas ◽  
Evangelos Theodoridis

It is widely known that search engines are the dominant tools for finding information on the web. In most cases, these engines return web page references in a global ranking that takes into account either the importance of the web site or the relevance of the web pages to the identified topic. In this paper, we focus on the problem of determining distinct thematic groups in the results that existing web search engines provide. We additionally address the problem of dynamically adapting their ranking according to user selections, incorporating user judgments as implicitly registered in their selection of relevant documents. Our system exploits a state-of-the-art semantic web data mining technique that identifies semantic entities of Wikipedia to group the result set into different topic groups, according to the various meanings of the provided query. Moreover, we propose a novel probabilistic Network scheme that employs the aforementioned topic identification method to modify the ranking of results as users select documents. We evaluated our implemented prototype through extensive experiments on the ClueWeb09 dataset using the TREC 2009, 2010, 2011 and 2012 Web Tracks, where we observed improved retrieval performance compared to current state-of-the-art re-ranking methods.
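The idea of adapting a ranking to user selections can be illustrated with a deliberately simple stand-in. The sketch below is not the probabilistic network scheme the abstract proposes, whose details are not given; it only boosts results whose topic group the user has already clicked. The `rerank` function, the `boost` parameter and the tuple layout are hypothetical.

```python
from collections import Counter


def rerank(results, clicked_topics, boost=0.5):
    # results: list of (doc_id, topic, base_score) in the engine's order.
    # clicked_topics: topic labels of the documents the user has selected.
    clicks = Counter(clicked_topics)
    total = sum(clicks.values())

    def adjusted(item):
        _, topic, base_score = item
        # Share of the user's clicks that fell on this topic group.
        share = clicks[topic] / total if total else 0.0
        return base_score * (1.0 + boost * share)

    # sorted() is stable, so ties keep the engine's original order.
    ordered = sorted(results, key=lambda item: -adjusted(item))
    return [doc_id for doc_id, _, _ in ordered]
```

With results `[("d1", "animal", 0.9), ("d2", "car", 0.85), ("d3", "animal", 0.5)]` and no clicks, the order is unchanged; after a single click on a "car" result, `d2` moves to the top.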

