COMPARISON OF VSM, GVSM, AND LSI IN INFORMATION RETRIEVAL FOR INDONESIAN TEXT

2016 ◽  
Vol 78 (5-6) ◽  
Author(s):  
Jasman Pardede ◽  
Milda Gustiana Husada

The vector space model (VSM) is an information retrieval (IR) model that represents queries and documents as n-dimensional vectors. The generalized vector space model (GVSM) is an extension of VSM that represents documents based on the similarity between the query and the minterm vector space of the document collection. Minterm vectors are defined by the terms in the query, so documents can be retrieved based on the meaning of the words in the query. Conversely, documents may contain the same information semantically. Latent semantic indexing (LSI) is a method used in IR systems to retrieve documents based on the overall meaning of the user’s query rather than on the translation of each individual word. LSI uses a matrix algebra technique called singular value decomposition (SVD). This study discusses the performance of VSM, GVSM, and LSI implemented in an IR system that retrieves Indonesian-language documents in .pdf, .doc, and .docx files, using the Nazief and Adriani stemming algorithm. Each method is implemented both with and without threads. Threads are used in preprocessing, to read each document from the collection and to stem both the query and the documents. Retrieval performance is evaluated by response time, recall, precision, and F-measure. The results show that, for each method, .docx files are processed fastest, followed by .doc and then .pdf. For the same document collection, LSI has the fastest response time, followed by GVSM and then VSM. The average recall for VSM, GVSM, and LSI is 82.86%, 89.68%, and 84.93%, respectively. The average precision for VSM, GVSM, and LSI is 64.08%, 67.51%, and 62.08%, respectively. The average F-measure for VSM, GVSM, and LSI is 71.95%, 76.63%, and 71.02%, respectively. Multithreaded preprocessing improves the average response time of VSM, GVSM, and LSI by about 30.422%, 26.282%, and 31.821%, respectively.
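
As an illustrative sketch of the VSM ranking step described in this abstract (not the authors' implementation), the Python snippet below represents a query and documents as term-frequency vectors and ranks the documents by cosine similarity. The whitespace tokenizer, the raw term-frequency weighting, and the function names `tf_vector`, `cosine`, and `rank` are assumptions made for illustration. The study itself applies Nazief and Adriani stemming, which is omitted here.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map a text to a sparse term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (doc_index, score) pairs sorted by descending similarity to the query."""
    q = tf_vector(query)
    scores = [(i, cosine(q, tf_vector(d))) for i, d in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank("temu kembali informasi", docs)` on a small list of strings returns the document indices ordered from most to least similar. Documents that share no term with the query receive a score of 0.0.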

2014 ◽  
Vol 70 (5) ◽  
Author(s):  
Thabit Sabbah ◽  
Ali Selamat

Thesauri are used in many information retrieval (IR) applications, such as data integration, data warehousing, semantic query processing, and classifiers. They have also been used to solve the schema matching problem. Because many thesauri exist for any given area of knowledge, the quality of schema matching results obtained with different thesauri in the same field is not predictable. In this paper, we propose a methodology to study the performance of a thesaurus in solving schema matching. The paper also presents results of experiments using different thesauri. Precision, recall, F-measure, and similarity average were calculated to show that the quality of matching changed according to the thesaurus used.


2006 ◽  
Vol 05 (02) ◽  
pp. 97-105 ◽  
Author(s):  
S. Srinivas ◽  
Ch. AswaniKumar

Latent Semantic Indexing (LSI) is a well-known Information Retrieval (IR) technique that tries to overcome the problems of lexical matching using conceptual indexing. LSI is a variant of the vector space model and has proved to be 30% more effective. Many studies have reported that good retrieval performance is related to the use of various retrieval heuristics. In this paper, we focus on optimising two LSI retrieval heuristics: term weighting and rank approximation. The results obtained demonstrate that LSI performance improves significantly with the combination of optimised term weighting and rank approximation.
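
For reference, the two heuristics named in this abstract act on the standard LSI formulation, sketched below. The symbols ($A$, $k$, $l_{ij}$, $g_i$, $\hat{q}$) are notation introduced here for illustration and are not taken from the paper, which may use a different weighting scheme or query projection.

```latex
% Term weighting: each entry of the m x n term-document matrix A is a
% local weight l_{ij} (term i in document j) times a global weight g_i.
a_{ij} = l_{ij} \, g_i

% Rank approximation: keep only the k largest singular values of A.
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \quad k \le \mathrm{rank}(A)

% A query vector q is projected into the same k-dimensional space
% and compared with document vectors, e.g. by cosine similarity.
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Optimising term weighting means choosing $l_{ij}$ and $g_i$; optimising rank approximation means choosing $k$.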


2021 ◽  
pp. 016555152199980
Author(s):  
Kyle Andrew Fitzgerald ◽  
Andre Charles de la Harpe ◽  
Corrie Susanna Uys

An information retrieval system (IRS) is used to retrieve documents based on an information need. The IRS makes relevance judgements by attempting to match a query to a document. As IRS capabilities are indexing design dependent, the hybrid indexing method (IRS-H) is introduced. The objectives of this article are to examine IRS-H (as an alternative indexing method that performs exact phrase matching) and IRS-I, regarding retrieval usefulness, identification of relevant documents, and the quality of rejecting irrelevant documents by conducting three experiments and by analysing the related data. Three experiments took place where a collection of 100 research documents and 75 queries were presented to: (1) five participants answering a questionnaire, (2) IRS-I to generate data and (3) IRS-H to generate data. The data generated during the experiments were statistically analysed using the performance measurements of Precision, Recall and Specificity, and one-tailed Student’s t-tests. The results reveal that IRS-H (1) increased the retrieval of relevant documents, (2) reduced incorrect identification of relevant documents and (3) increased the quality of rejecting irrelevant documents. The research found that the hybrid indexing method, using a small closed document collection of a hundred documents, produced the required outputs and that it may be used as an alternative IRS indexing method.
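
The three performance measurements named in this abstract can be computed from sets of document identifiers, as in the minimal Python sketch below. The function name `retrieval_metrics` and its arguments are hypothetical. The sketch assumes that every retrieved and every relevant document belongs to the collection, and it is not the article's analysis code.

```python
def retrieval_metrics(retrieved, relevant, collection):
    """Compute precision, recall and specificity from sets of document ids."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    tp = len(retrieved & relevant)               # relevant and retrieved
    fp = len(retrieved - relevant)               # irrelevant but retrieved
    fn = len(relevant - retrieved)               # relevant but missed
    tn = len(collection - retrieved - relevant)  # irrelevant and correctly rejected

    def ratio(num, den):
        # Guard against empty denominators (e.g. nothing retrieved).
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

Specificity is the measure that captures the quality of rejecting irrelevant documents: it is the share of irrelevant documents that the system did not retrieve.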


2019 ◽  
Author(s):  
Rusda Wajhillah ◽
Agung Wibowo ◽  
Saeful Bahri

The quality of research needs to be directed and classified for improvement. A college roadmap must accord with the interests and expertise of its lecturers. It is therefore the duty of every college to create a strategic plan and pre-eminent research. Faculties at almost all colleges have produced many scientific publications. Published scientific papers are one example of unstructured documents: their content takes the form of a writing style that is mostly determined by the author's language. Generally, only the maximum number of words is specified for the document title. The main objective of an information retrieval system is to determine document keywords, from the query provided by the user, within a group of documents. The TF-IDF (Term Frequency – Inverse Document Frequency) algorithm and the Vector Space Model are among the methods that can be used in the analysis phase of text mining as options for classifying documents based on words that frequently appear in document titles. This paper can help decision makers determine, assess, and adapt a research roadmap for a college. Depicting the long-term roadmap as a tree model makes it easier to read and understand.


Author(s):  
Didit Suhartono ◽  
Khodirun Khodirun

Archives are one example of important documents. They are stored systematically to help and simplify storage and retrieval. Information retrieval is the process of retrieving relevant documents while not retrieving documents that are not relevant. Retrieving the relevant documents requires a method. The Term Frequency-Inverse Document Frequency and Vector Space Model methods can find relevant documents according to their level of closeness or similarity. In addition, applying the Nazief-Adriani stemming algorithm can improve information retrieval performance by transforming the words in a document or text to their base forms. The system then indexes the documents to simplify and speed up the search process. Relevance is determined by calculating similarity values between the existing documents and the query, each represented in a particular form. The system then sorts the retrieved documents by their level of relevance to the query.
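
The weighting step mentioned in this abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' system: it assumes the documents are already tokenized and stemmed, and it uses the common tf × log(N / df) variant, which is only one of several TF-IDF formulas. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw count of each term in this document
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors
```

With this variant, a term that occurs in every document receives weight 0, so it does not influence the similarity ranking, while terms confined to a few documents are weighted more heavily.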


Author(s):  
Hilton H. Mollenhauer

Many factors (e.g., resolution of microscope, type of tissue, and preparation of sample) affect electron microscopical images and alter the amount of information that can be retrieved from a specimen. Of interest in this report are those factors associated with the evaluation of epoxy embedded tissues. In this context, information retrieval is dependent, in part, on the ability to “see” sample detail (e.g., contrast) and, in part, on the quality of sample preservation. Two aspects of this problem will be discussed: 1) epoxy resins and their effect on image contrast, information retrieval, and sample preservation; and 2) the interaction between some stains commonly used for enhancing contrast and information retrieval.


Author(s):  
Anthony Anggrawan ◽  
Azhari

Searching for information based on a user’s query, with the aim of finding documents that meet the user’s needs, is known as information retrieval. This research uses the Vector Space Model method to determine the similarity percentage of each student’s assignment. The application is built with PHP and a MySQL database. The results are presented by ranking documents by their similarity to the query, with a mean average precision of 0.874. This value indicates how closely the application agrees with the examination done by experts; it was obtained from an evaluation with 5 queries compared against 25 sample documents. The time needed to compute similarity increases with the number of assignments submitted.
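
Mean average precision, the measure reported in this abstract, can be computed as in the Python sketch below. The function names and the input format (a ranked list of document ids plus a set of relevant ids per query) are assumptions for illustration. The sketch also assumes that each document id appears at most once in a ranking.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A value close to 1 means that, across queries, the documents judged relevant tend to appear near the top of the ranking.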


2018 ◽  
Vol 9 (2) ◽  
pp. 97-105
Author(s):  
Richard Firdaus Oeyliawan ◽  
Dennis Gunawan

A library is a facility that provides information and knowledge resources and acts as an academic aid for readers seeking information. The huge number of books a library holds usually makes it difficult for readers to find the books they want. Universitas Multimedia Nusantara uses the Senayan Library Management System (SLiMS) as its library catalogue. SLiMS has many features that help readers, but it still has no recommendation feature to help readers find books relevant to a specific book they choose. The application has been developed using the Vector Space Model to represent documents as vectors. Recommendations in this application are based on the similarity of the books' descriptions. In the testing phase, using a one-language sample of relevant books, the F-measure obtained is 55% with a cosine similarity threshold of 0.1. The book descriptions and the variety of languages affect the F-measure obtained. Index Terms: Book Recommendation, Porter Stemmer, SLiMS Universitas Multimedia Nusantara, TF-IDF, Vector Space Model

