Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase

Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Valerio Arnaboldi ◽  
Daniela Raciti ◽  
Kimberly Van Auken ◽  
Juancarlos N Chan ◽  
Hans-Michael Müller ◽  
...  

Abstract Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than entering it de novo as the previous form required. With this new system, we lessen the burden on authors while at the same time receiving valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with curation. In the five months following the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received have greatly improved.
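A minimal sketch of the kind of data-type classification step described above: a TF-IDF representation feeding a linear SVM that flags whether a paper likely contains a given data type. The corpus, labels, and threshold here are illustrative assumptions, not WormBase's actual pipeline or data.

```python
# Hypothetical sketch of SVM-based data-type flagging for paper triage;
# the training texts and labels below are placeholders, not WB data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

papers = [
    "GFP reporter expression was observed in the intestine and pharynx",
    "We mapped the mutation to chromosome II using SNP-based crosses",
]
has_expression_data = [1, 0]  # curator-assigned flags for training

flagger = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
flagger.fit(papers, has_expression_data)

# Papers with a positive decision score could be routed to the AFP form
# with the "expression pattern" data type pre-selected for author validation.
print(flagger.decision_function(["expression of unc-54 in body wall muscle"]))
```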

2012 ◽  
Vol 35 (1) ◽  
pp. 87-109 ◽  
Author(s):  
César de Pablo-Sánchez ◽  
Isabel Segura-Bedmar ◽  
Paloma Martínez ◽  
Ana Iglesias-Maqueda

2020 ◽  
Vol 12 (7) ◽  
pp. 1218
Author(s):  
Laura Tuşa ◽  
Mahdi Khodadadzadeh ◽  
Cecilia Contreras ◽  
Kasra Rafiezadeh Shahi ◽  
Margret Fuchs ◽  
...  

Due to the extensive drilling performed every year in exploration campaigns for the discovery and evaluation of ore deposits, drill-core mapping is becoming an essential step. While valuable mineralogical information is extracted during core logging by on-site geologists, the process is time-consuming and dependent on the observer and their individual background. Hyperspectral short-wave infrared (SWIR) data are used in the mining industry as a tool to complement traditional logging techniques and to provide a rapid and non-invasive analytical method for mineralogical characterization. Additionally, Scanning Electron Microscopy-based image analyses using a Mineral Liberation Analyser (SEM-MLA) provide exhaustive high-resolution mineralogical maps, but can only be performed on small areas of the drill-cores. We propose to use machine learning algorithms to combine the two data types and upscale the quantitative SEM-MLA mineralogical data to drill-core scale. This way, quasi-quantitative maps over entire drill-core samples are obtained. Our upscaling approach increases result transparency and reproducibility by combining physics-based data acquisition (hyperspectral imaging) with mathematical models (machine learning). The procedure is tested on five drill-core samples with varying training data using random forests, support vector machines and neural network regression models. The obtained mineral abundance maps are further used for the extraction of mineralogical parameters such as mineral association.
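A schematic sketch of the upscaling idea under stated assumptions: per-pixel SWIR spectra co-registered with SEM-MLA mineral abundances train a multi-output random forest regressor, which is then applied to every pixel of the full core scan. The array shapes, band count, and random data are illustrative placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative shapes: 200 co-registered training pixels, 256 SWIR bands,
# and abundances for 5 minerals derived from the SEM-MLA maps.
rng = np.random.default_rng(0)
swir_train = rng.random((200, 256))                    # reflectance spectra
abundance_train = rng.dirichlet(np.ones(5), size=200)  # fractions sum to 1

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(swir_train, abundance_train)  # multi-output regression

# Upscale: predict quasi-quantitative mineral maps over the whole core scan.
full_core_pixels = rng.random((10_000, 256))  # every pixel of the drill-core
mineral_maps = model.predict(full_core_pixels)  # shape (10_000, 5)
```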


2021 ◽  
Vol 40 (1) ◽  
pp. 1481-1494
Author(s):  
Geng Deng ◽  
Yaoguo Xie ◽  
Xindong Wang ◽  
Qiang Fu

Many classification problems contain shape information from input features, such as monotonic, convex, and concave relationships. In this research, we propose a new classifier, called the Shape-Restricted Support Vector Machine (SR-SVM), which takes the component-wise shape information into account to enhance classification accuracy. There exists a vast research literature on monotonic classification covering monotonic or ordinal shapes. Our proposed classifier extends this to handle convex and concave types of features, as well as combinations of these types. While the standard SVM uses linear separating hyperplanes, our novel SR-SVM essentially constructs non-parametric and nonlinear separating planes subject to component-wise shape restrictions. We formulate the SR-SVM classifier as a convex optimization problem and solve it using an active-set algorithm. The approach applies basis function expansions to the input and effectively utilizes the standard SVM solver. We illustrate our methodology using simulated and real-world examples, and show that SR-SVM improves classification performance with the additional shape information of the input.
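The paper's active-set solver is not reproduced here, but a minimal convex-programming sketch conveys the core idea: an SVM whose basis-expanded coefficients are constrained nonnegative, which forces the decision function to be monotonically increasing in the restricted feature. The cvxpy formulation and the hinge-basis expansion are my own illustrative choices, not the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def hinge_basis(x, knots):
    """Expand one feature into max(0, x - t) terms; nonnegative weights
    on these bases yield a monotone increasing, piecewise-linear effect."""
    return np.maximum(0.0, x[:, None] - knots[None, :])

def fit_shape_restricted_svm(X, y, knots, C=1.0):
    # X: (n, 1), a single monotone-restricted feature; y in {-1, +1}
    B = hinge_basis(X[:, 0], knots)
    n, k = B.shape
    w, b, xi = cp.Variable(k), cp.Variable(), cp.Variable(n)
    constraints = [
        cp.multiply(y, B @ w + b) >= 1 - xi,  # soft-margin SVM constraints
        xi >= 0,
        w >= 0,  # the component-wise shape restriction (monotonicity)
    ]
    prob = cp.Problem(
        cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
    prob.solve()
    return w.value, b.value
```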


2020 ◽  
Vol 11 (2) ◽  
pp. 107-111
Author(s):  
Christevan Destitus ◽  
Wella Wella ◽  
Suryasari Suryasari

This study aims to classify tweets on Twitter using the Support Vector Machine and Information Gain methods. The classification itself aims to find a hyperplane that separates the negative and positive classes. The research pipeline consists of text mining and text preprocessing, with stages of tokenizing, filtering, stemming, and term weighting. Feature selection is then performed with information gain, which calculates the entropy value of each word. Finally, tweets are classified based on the selected features, and the output identifies whether a tweet constitutes bullying or not. The results of this study show that the combination of Support Vector Machine and Information Gain achieves satisfactory results.
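A compact sketch of the pipeline described above, assuming scikit-learn; mutual information stands in for the entropy-based information gain score, and the example tweets and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

tweets = [
    "you are so stupid",          # placeholder data, not the study's corpus
    "what a great game tonight",
    "nobody likes you loser",
    "lovely weather today",
]
is_bully = [1, 0, 1, 0]

pipeline = make_pipeline(
    TfidfVectorizer(),                      # tokenizing + term weighting
    SelectKBest(mutual_info_classif, k=5),  # information-gain-style selection
    SVC(kernel="linear"),                   # hyperplane separating the classes
)
pipeline.fit(tweets, is_bully)
print(pipeline.predict(["you are such a loser"]))
```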


2014 ◽  
Vol 136 (11) ◽  
Author(s):  
Michael W. Glier ◽  
Daniel A. McAdams ◽  
Julie S. Linsey

Bioinspired design is the adaptation of methods, strategies, or principles found in nature to solve engineering problems. One formalized approach to bioinspired solution seeking is the abstraction of the engineering problem into a functional need, followed by a search for solutions to this function using a keyword-type search method on text-based biological knowledge. These function keyword search approaches have shown potential for success, but as with many text-based search methods, they produce a large number of results, many of little relevance to the problem in question. In this paper, we develop a method to train a computer to identify text passages more likely to suggest a solution to a human designer. The work presented examines the possibility of filtering biological keyword search results by using text mining algorithms to automatically identify which results are likely to be useful to a designer. The text mining algorithms are trained on a pair of surveys administered to human subjects to empirically identify a large number of sentences that are, or are not, helpful for idea generation. We develop and evaluate three text classification algorithms, namely, a Naïve Bayes (NB) classifier, a k-nearest neighbors (kNN) classifier, and a support vector machine (SVM) classifier. Of these methods, the NB classifier generally had the best performance. Based on the analysis of 60 word stems, an NB classifier's precision is 0.87, recall is 0.52, and F score is 0.65. We find that word stem features that describe a physical action or process are correlated with helpful sentences. Similarly, we find that biological jargon feature words are correlated with unhelpful sentences.
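A sketch of the survey-trained classification step, assuming scikit-learn and NLTK, with a Porter-stemmed bag of word stems as the feature unit; the sentences and labels are illustrative stand-ins for the survey data.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()

def stem_analyzer(text):
    # Reduce tokens to word stems, the feature unit used in the paper.
    return [stemmer.stem(tok) for tok in text.lower().split()]

sentences = [
    "the gecko adheres to surfaces using microscopic setae",   # helpful
    "burrs hook onto fur and disperse the seeds",              # helpful
    "phylogenetic analysis of the clade remains contentious",  # unhelpful
    "taxonomic revision of the genus was published recently",  # unhelpful
]
helpful = [1, 1, 0, 0]  # placeholder survey labels

vec = CountVectorizer(analyzer=stem_analyzer)
X = vec.fit_transform(sentences)
clf = MultinomialNB().fit(X, helpful)

pred = clf.predict(X)  # resubstitution only; real use needs held-out folds
p, r, f, _ = precision_recall_fscore_support(helpful, pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```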


2020 ◽  
Author(s):  
Zhanyou Xu ◽  
Andreomar Kurek ◽  
Steven B. Cannon ◽  
Williams D. Beavis

Abstract Selection of markers linked to alleles at quantitative trait loci (QTL) for tolerance to Iron Deficiency Chlorosis (IDC) has not been successful. Genomic selection has been advocated for continuous numeric traits such as yield and plant height. For ordinal data types such as IDC, genomic prediction models have not been systematically compared. The objectives of the research reported in this manuscript were to evaluate the most commonly used genomic prediction method, ridge regression, and its equivalent logistic ridge regression method, against algorithmic modeling methods including random forest, gradient boosting, support vector machine, K-nearest neighbors, Naïve Bayes, and artificial neural network, using the usual comparator metric of prediction accuracy. In addition, we compared the methods using metrics of greater importance for decisions about selecting and culling lines for use in variety development and genetic improvement projects. These metrics include specificity, sensitivity, precision, decision accuracy, and area under the receiver operating characteristic curve. We found that the support vector machine provided the best specificity for culling IDC-susceptible lines, while random forest GP models provided the best combined set of decision metrics for retaining IDC-tolerant and culling IDC-susceptible lines.
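A sketch of the decision-metric comparison, assuming scikit-learn and synthetic genotype/phenotype data; specificity and sensitivity are derived from the confusion matrix, matching the decision framing above (retain tolerant lines, cull susceptible ones).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
markers = rng.integers(0, 3, size=(300, 50))  # synthetic SNP dosages (0/1/2)
idc_tolerant = rng.integers(0, 2, size=300)   # synthetic binary IDC classes

Xtr, Xte, ytr, yte = train_test_split(markers, idc_tolerant, random_state=1)

for name, model in [("SVM", SVC(probability=True)),
                    ("RF", RandomForestClassifier(random_state=1))]:
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    tn, fp, fn, tp = confusion_matrix(yte, pred).ravel()
    auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
    print(name,
          "sensitivity:", tp / (tp + fn),  # retaining tolerant lines
          "specificity:", tn / (tn + fp),  # culling susceptible lines
          "AUC:", round(auc, 3))
```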


2021 ◽  
Vol 22 (16) ◽  
pp. 8958
Author(s):  
Phasit Charoenkwan ◽  
Chanin Nantasenamat ◽  
Md. Mehedi Hasan ◽  
Mohammad Ali Moni ◽  
Pietro Lio’ ◽  
...  

Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches, providing a good avenue for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performances could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we have integrated a variety of feature encoding schemes to provide sufficient information from different aspects, consisting of compositional information and physicochemical properties. To enhance the predictive performance, a customized genetic algorithm utilizing a self-assessment report (GA-SAR) was employed for identifying informative features, followed by inputting the optimal ones into a support vector machine (SVM)-based classifier to develop the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that iBitter-Fuse was able to achieve more accurate performance as compared to state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
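A sketch of one compositional feature encoding, amino acid composition (AAC), feeding an SVM; the GA-SAR feature selection is omitted, and the peptide sequences and labels below are placeholders, not the benchmark dataset.

```python
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(peptide):
    """Amino acid composition: the fraction of each of the 20 residues."""
    return [peptide.count(a) / len(peptide) for a in AMINO_ACIDS]

peptides = ["RPFF", "GPFPIIV", "AGDDAPR", "KTTKS"]  # placeholder sequences
bitter = [1, 1, 0, 0]                               # placeholder labels

X = [aac(p) for p in peptides]
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, bitter)
print(clf.predict([aac("RGPFPIIV")]))
```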


Equilibrium ◽  
2017 ◽  
Vol 12 (4) ◽  
pp. 753-773
Author(s):  
Tomasz Berent ◽  
Bogusław Bławat ◽  
Marek Dietl ◽  
Przemysław Krzyk ◽  
Radosław Rejman

Research background: The bankruptcy literature is populated with scores of (econometric) models, ranging from Altman's Z-score, Ohlson's O-score and Zmijewski's probit model to k-nearest neighbors, classification trees, support vector machines, mathematical programming, evolutionary algorithms and neural networks, all designed to predict financial distress with the highest precision. We believe corporate default is an important research topic that should not be identified with prediction accuracy alone. Despite the wealth of modelling effort, a unified theory of default is yet to be proposed. Purpose of the article: Due to the disagreement both on the definition, and hence the timing, of default, as well as on the measurement of prediction accuracy, the comparison (of predictive power) of various models can be seriously misleading. The purpose of the article is to argue for a shift in research focus from maximizing accuracy to the analysis of the information capacity of predictors. By doing this, we may yet come closer to understanding default itself. Methods: We critically appraise the bankruptcy research literature for its methodological variety and empirical findings. Default definitions, sampling procedures, in- and out-of-sample testing and accuracy measurement are all scrutinized. In the empirical part, we use a doubly stochastic Poisson process with a multi-period prediction horizon and a comprehensive database of some 15,000 Polish non-listed companies to illustrate the merits of our new approach to default modelling. Findings & Value added: In the theoretical part, we call for the construction of a single unified default forecasting platform, estimated on the largest dataset of firms possible, to allow testing the utility of various sources of micro, mezzo, and macro information. Our preliminary empirical evidence is encouraging. The accuracy ratio amounts to 0.92 for t = 0 and drops to 0.81 two years ahead of default. We point to the pivotal role played by the information on a firm's liquidity (alternatively, on its profitability) and, in contrast to Altman's tradition, hardly any contribution to predictive power from other financial ratios. Macro data are shown to be critical: they add, on average, more than 10 p.p. to the accuracy ratio. In the future, we hope to integrate listed and non-listed firms' data into one model, ideally at a higher frequency than annual, and to include information on a firm's competitive position.
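For reference, the accuracy ratio quoted above is related to the area under the ROC curve by AR = 2 * AUC - 1, so it can be computed directly from model scores; a short sketch with placeholder default probabilities:

```python
from sklearn.metrics import roc_auc_score

defaulted = [1, 0, 0, 1, 0, 0, 0, 1]                   # placeholder outcomes
pd_scores = [0.9, 0.2, 0.3, 0.7, 0.1, 0.4, 0.2, 0.8]   # model-estimated PDs

auc = roc_auc_score(defaulted, pd_scores)
accuracy_ratio = 2 * auc - 1  # CAP-based accuracy ratio from the AUC
print(f"AUC={auc:.2f}, AR={accuracy_ratio:.2f}")
```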


2020 ◽  
Author(s):  
Lei Li ◽  
Yanjie Chao

Abstract Small proteins shorter than 50 amino acids have long been overlooked. A number of small proteins have been identified in several model bacteria using experimental approaches and assigned important functions in diverse cellular processes. The recent development of ribosome profiling technologies has allowed genome-wide identification of small proteins and small ORFs (smORFs), but our incomplete understanding of small proteins hinders de novo computational prediction of smORFs in non-model bacterial species. Here, we have identified several sequence features of smORFs by a systematic analysis of all the known small proteins in E. coli, among which the translation initiation rate is the strongest determinant. By integrating these features into a support vector machine learning model, we have developed a novel sPepFinder algorithm that can predict conserved smORFs in bacterial genomes with a high accuracy of 92.8%. De novo prediction in E. coli has revealed several novel smORFs with evidence of translation supported by ribosome profiling. Further application of sPepFinder to 549 bacterial species has led to the identification of >100,000 novel smORFs, many of which are conserved at the amino acid and nucleotide levels under purifying selection. Overall, we have established sPepFinder as a valuable tool to identify novel smORFs in both model and non-model bacterial organisms, and provided a large resource of small proteins for functional characterization.
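A schematic sketch of the feature-plus-SVM idea; the features below (scaled length, GC content, a crude Shine-Dalgarno match as a proxy for translation-initiation strength) are illustrative assumptions of my own, not sPepFinder's actual feature set, and the sequences and labels are placeholders.

```python
from sklearn.svm import SVC

def orf_features(seq, upstream):
    """Toy features for a candidate smORF and its upstream region."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    sd_like = int("AGGAGG" in upstream)  # crude Shine-Dalgarno proxy
    return [len(seq) / 150.0, gc, sd_like]

candidates = [("ATGAAACGCTAA", "TTAGGAGGTTTT"),  # placeholder smORFs
              ("ATGCCCGGGTAA", "TTTTTTTTTTTT"),
              ("ATGAAAGAATGA", "AAAGGAGGAAAA"),
              ("ATGGGCGGCTGA", "CCCCCCCCCCCC")]
codes_small_protein = [1, 0, 1, 0]               # placeholder labels

X = [orf_features(s, u) for s, u in candidates]
clf = SVC(kernel="rbf", probability=True).fit(X, codes_small_protein)
print(clf.predict_proba([orf_features("ATGAAAGGCTAA", "AAGGAGGAAAAA")]))
```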


Author(s):  
Mohamad Izzuddin Rahman ◽  
Noor Azah Samsudin ◽  
Aida Mustapha ◽  
Adeleke Abdullahi

In Islam, the Quran is the holy book that was revealed to the Prophet Muhammad and functions as a complete code of life for Muslims. It contains the words of Allah, more than 77,000 in total, passed down through the Prophet Muhammad to mankind over 23 years beginning in 610 CE. The Quran is divided into 114 chapters, and its original text is in Arabic. Muslims across the world therefore need translations to understand its content; indeed, understanding the Quran is of interest not only to Muslims but also to millions of people of other faiths. Over the generations, a great deal of content related to the Quran has been published by Muslim scholars in the form of tafsirs, translations, and books of hadith. The problem at present is that many Muslims in Malaysia do not understand the sentences of the Quran due to the language barrier. The purpose of this research is to classify the topic of each verse of the Quran based on its specific theme. It builds on text mining, which draws on linguistic and domain information. The use of a corpus helps to perform various data mining tasks, including information extraction, text categorization, concept relationships, association discovery, and pattern evaluation. This research project aims to create a computational environment for text mining the Quran and to help users understand every verse in Juz' Baqarah. The classification experiment uses the Support Vector Machine to find themes in Juz' Baqarah; SVM performance is then compared against other classification algorithms such as Naive Bayes, the J48 decision tree, and k-nearest neighbours.
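A sketch of the comparison experiment, assuming scikit-learn; the verse texts and theme labels are placeholders, and scikit-learn's DecisionTreeClassifier stands in for WEKA's J48.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder verse texts and theme labels, not the actual Juz' Baqarah data.
verses = ["placeholder verse about prayer", "placeholder verse about fasting",
          "placeholder verse about charity", "placeholder verse on prayer rules",
          "placeholder verse on fasting days", "placeholder verse about alms"]
themes = ["prayer", "fasting", "charity", "prayer", "fasting", "charity"]

X = TfidfVectorizer().fit_transform(verses)
for name, clf in [("SVM", LinearSVC()),
                  ("Naive Bayes", MultinomialNB()),
                  ("J48-like tree", DecisionTreeClassifier()),
                  ("kNN", KNeighborsClassifier(n_neighbors=3))]:
    scores = cross_val_score(clf, X, themes, cv=2)  # 2-fold cross-validation
    print(name, scores.mean())
```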

