Sentence Classification and Information Retrieval for Petroleum Engineering

2019 ◽  
Author(s):  
Thiago Ferraz ◽  
Gabriel Ferreira ◽  
Fábio Cozman ◽  
Ismael Santos

Classifying sentences in industrial, technical or scientific reports can enhance text mining and information retrieval tasks with useful machine-readable metadata. This paper describes a search engine that uses sentence classification to search abstracts of scholarly papers in Petroleum Engineering. Sentences were classified into four classes based on the popular IMRAD categories. We produced a dataset of more than 2,200 manually labeled sentences from 278 scholarly articles in Petroleum Engineering for use as training and testing data. The best-performing classifier was logistic regression, with an accuracy of 86.4%. The information retrieval system built on top of the classification system achieved a mean average precision (mAP) of 0.80.
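For readers unfamiliar with the mAP figure reported above, the following is a minimal sketch of how mean average precision is commonly computed from ranked results. The function names and the toy document IDs are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision for one query: mean of precision@k at each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of per-query average precision over (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical): two queries with their rankings and relevant sets.
toy_runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(toy_runs)
```

Each query contributes equally to the mean, regardless of how many relevant documents it has.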

2016 ◽  
Vol 43 (3) ◽  
pp. 316-327 ◽  
Author(s):  
Mohammad Sadeghi ◽  
Jesús Vegas

Evaluating the performance of an information retrieval system is a decisive part of measuring improvements in search technology. The Google search engine, as a tool for retrieving information on the Web, is used by almost 92% of Iranian users. The purpose of this paper is to study Google’s performance in retrieving relevant information from Persian documents. Retrieval effectiveness is measured by the precision of search results over a website that we built from the documents of a TREC standard corpus. We queried Google with 100 topics available in the corpus and compared the retrieved webpages with the relevant documents. The results indicate that the Google search engine does not fully take the morphological analysis of the Persian language into account. Incorrect text tokenisation, treating stop words as content keywords of a document, and the wrong ‘variants encountered’ of words found by Google are the main factors that reduce the relevance of Persian information retrieval on the Web for this search engine.


A stemmer reduces an inflected or derived word to its stem by removing the suffixes or prefixes attached to it. It can be used in an information retrieval system to improve the overall performance of the retrieval process. Stemming is not equivalent to morphological analysis: it only finds the stem of a word. It also decreases the number of terms in an information retrieval system. Various stemming techniques exist. Here, a new hybrid stemmer named “Mula” has been developed for the Odia language. It combines brute force with an enhanced suffix-stripping approach. The new stemmer is both computationally inexpensive and domain independent. We have integrated it into an existing DSpace installation for an Odia text retrieval system. The results are commendable and suggest that the new stemmer can be used effectively in an Odia search engine. The proposed stemmer also handles over-stemming and under-stemming effectively.
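The hybrid idea described above (a brute-force lookup combined with suffix stripping) can be sketched roughly as follows. This is not the Mula stemmer: the lookup table, the suffix list, and the minimum stem length are hypothetical, and the examples use English words only because the abstract gives no Odia rules.

```python
from typing import Dict, Sequence

# Hypothetical lookup table (the "brute force" part): known word -> stem.
LOOKUP: Dict[str, str] = {"went": "go", "children": "child"}

# Hypothetical suffix list; illustrative English suffixes, not Odia ones.
SUFFIXES: Sequence[str] = ("ations", "ation", "ing", "ed", "es", "s")

# Guard against over-stemming: never leave a stem shorter than this.
MIN_STEM_LENGTH = 3


def stem(
    word: str,
    lookup: Dict[str, str] = LOOKUP,
    suffixes: Sequence[str] = SUFFIXES,
    min_stem_length: int = MIN_STEM_LENGTH,
) -> str:
    """Return the stem of a word: table lookup first, then longest-suffix stripping."""
    token = word.lower()
    if token in lookup:
        return lookup[token]
    # Try the longest suffixes first so "ations" wins over "s".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_length:
            return token[: -len(suffix)]
    return token
```

The lookup handles irregular forms that no suffix rule can reach, while the length guard is one simple way to limit over-stemming of short words.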


2016 ◽  
pp. 713-732
Author(s):  
Asmae Dami ◽  
Mohamed Fakir ◽  
Belaid Bouikhalene

This chapter lies at the intersection of two research themes: Information Retrieval and Knowledge Discovery from texts (Text Mining). Its purpose is two-fold. First, it focuses on Information Retrieval (IR), which aims to implement models and systems for selecting a set of documents that satisfy a user's information need expressed as a query. An information retrieval system is composed mainly of two processes: representation and retrieval. The representation process, called indexing, represents documents and queries by descriptors, or indexes, that reflect the contents of the documents. The retrieval process consists of comparing the document representations with the query representation. Second, the chapter aims to discover the relationships between the terms (keywords) that describe documents in a document database. These correlations between terms are extracted using a text mining technique, mainly association rules.
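The two processes named above, indexing and retrieval, can be illustrated with a small inverted index. This is a minimal sketch under assumptions: whitespace tokenisation, term-overlap scoring, and the toy documents are all chosen for illustration and are not taken from the chapter.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Representation step: map each descriptor (term) to the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def retrieve(query: str, index: Dict[str, Set[str]]) -> List[str]:
    """Retrieval step: rank documents by the number of distinct query terms they share."""
    scores: Dict[str, int] = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy collection (hypothetical document IDs and texts).
toy_docs = {
    "d1": "text mining and retrieval",
    "d2": "information retrieval systems",
    "d3": "petroleum engineering",
}
toy_index = build_index(toy_docs)
toy_results = retrieve("information retrieval", toy_index)
```

The query is represented with the same tokenizer as the documents, which is what makes the comparison between the two representations meaningful.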


2014 ◽  
Vol 7 (4) ◽  
pp. 42-62
Author(s):  
Asmae Dami ◽  
Mohamed Fakir ◽  
Belaid Bouikhalene

This paper lies at the intersection of two research themes: Information Retrieval and Knowledge Discovery from texts (Text Mining). Its purpose is two-fold. First, it focuses on Information Retrieval (IR), which aims to implement models and systems for selecting a set of documents that satisfy a user's information need expressed as a query. An information retrieval system is composed mainly of two processes: representation and retrieval. The representation process, called indexing, represents documents and queries by descriptors, or indexes, that reflect the contents of the documents. The retrieval process consists of comparing the document representations with the query representation. Second, the paper aims to discover the relationships between the terms (keywords) that describe documents in a document database. These correlations between terms are extracted using a text mining technique, mainly association rules.
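To make the notion of association rules between terms concrete, here is a minimal sketch that derives single-term rules (A → B) using the standard support and confidence measures. The thresholds, function name, and toy term sets are assumptions; the abstract does not state which rule-mining algorithm the authors used.

```python
from itertools import permutations
from typing import List, Set, Tuple

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def term_association_rules(
    doc_terms: List[Set[str]], min_support: float, min_confidence: float
) -> List[Rule]:
    """Find rules A -> B between index terms that meet both thresholds."""
    n = len(doc_terms)
    if n == 0:
        return []

    def support(items: Set[str]) -> float:
        # Fraction of documents whose term set contains all the given items.
        return sum(1 for terms in doc_terms if items <= terms) / n

    vocabulary = sorted(set().union(*doc_terms))
    rules: List[Rule] = []
    for a, b in permutations(vocabulary, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        confidence = pair_support / support({a})  # estimate of P(b | a)
        if confidence >= min_confidence:
            rules.append((a, b, pair_support, confidence))
    return rules


# Toy descriptor sets (hypothetical), one set per document.
toy_doc_terms = [
    {"index", "query"},
    {"index", "query", "ranking"},
    {"index"},
    {"ranking"},
]
toy_rules = term_association_rules(toy_doc_terms, min_support=0.5, min_confidence=0.8)
```

In this toy collection every document containing "query" also contains "index", so the rule query → index passes, while the reverse rule falls below the confidence threshold.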


2019 ◽  
Vol 6 (1) ◽  
pp. 41
Author(s):  
Faisal Rahutomo ◽  
Ariadi Retno Tri Hayati Ririd

Text-based information retrieval and text mining systems include an indexing process, in which texts are processed to distill the information they contain. One step in this process is stopword filtering, in which words not worth indexing are ignored based on a stopword list. Several Indonesian stopword lists are freely available. This paper evaluates those lists with respect to Indonesian formal grammar, the way they were compiled, and the habits of internet users. The review shows that all of the available lists were built by frequency analysis of word occurrences in a text corpus, without regard to word type or internet users' habits. The paper also offers several recommendations for researchers in text mining and text-based information retrieval who need an Indonesian stopword list, namely to develop a stoplist that accounts for word type and for the habits of internet users as observed through the available browsing engines.
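The frequency-based approach that the paper finds behind the existing lists can be sketched as follows: take the most frequent terms in a corpus as the stoplist, then drop them during indexing. The function names, the cut-off, and the toy Indonesian corpus are assumptions for illustration, and the sketch deliberately ignores word type, which is the limitation the paper points out.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_stoplist(corpus: Iterable[str], top_n: int) -> Set[str]:
    """Pick the top_n most frequent terms in the corpus as stopwords."""
    counts = Counter(token for doc in corpus for token in doc.lower().split())
    return {term for term, _ in counts.most_common(top_n)}


def filter_stopwords(text: str, stoplist: Set[str]) -> List[str]:
    """Return the tokens of text that are not in the stoplist."""
    return [token for token in text.lower().split() if token not in stoplist]


# Toy corpus (hypothetical): "yang" and "di" are frequent function words.
toy_corpus = ["yang di dan", "yang di minyak", "yang sumur"]
toy_stoplist = build_stoplist(toy_corpus, top_n=2)
toy_tokens = filter_stopwords("minyak yang di sumur", toy_stoplist)
```

A pure frequency cut-off like this can also sweep up frequent content words in a domain-specific corpus, which is one reason to take word type into account.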


Author(s):  
Haidar Moukdad ◽  
Andrew Large

When information seekers use an information retrieval system, their strategy is based, at least in part, on the mental model they have constructed of this environment. A random sample of more than 2000 actual search queries submitted by users to one web search engine, WebCrawler, was gathered in two separate capture sessions. The results suggest that a high proportion of users do not employ advanced search features...

