Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets

Seid Muhie Yimam; Abinew Ali Ayele; Gopalakrishnan Venkatesh; Ibrahim Gashaw; Chris Biemann

doi:10.3390/fi13110275

Future Internet ◽

10.3390/fi13110275 ◽

2021 ◽

Vol 13 (11) ◽

pp. 275

Author(s):

Seid Muhie Yimam ◽

Abinew Ali Ayele ◽

Gopalakrishnan Venkatesh ◽

Ibrahim Gashaw ◽

Chris Biemann

Keyword(s):

Machine Learning ◽

Sentiment Classification ◽

Network Embedding ◽

Low Resource ◽

Pos Tagging ◽

Semitic Language ◽

Semantic Models ◽

Benchmark Datasets ◽

Fine Tune ◽

Better Than

The availability of different pre-trained semantic models has enabled the quick development of machine learning components for downstream applications. However, even if texts are abundant for low-resource languages, there are very few semantic models publicly available. Most of the publicly available pre-trained models are built as multilingual versions of semantic models that do not fit the needs of low-resource languages well. We introduce different semantic models for Amharic, a morphologically complex Ethio-Semitic language. After investigating the publicly available pre-trained semantic models, we fine-tune two pre-trained models and train seven new models. The models include Word2Vec embeddings, distributional thesaurus (DT), BERT-like contextual embeddings, and DT embeddings obtained via network embedding algorithms. Moreover, we employ these models for different NLP tasks and study their impact. We find that newly trained models perform better than pre-trained multilingual models. Furthermore, models based on contextual embeddings from FLAIR and RoBERTa perform better than Word2Vec models for the NER and POS tagging tasks. DT-based network embeddings are suitable for the sentiment classification task. We publicly release all the semantic models, machine learning components, and several benchmark datasets for NER, POS tagging, and sentiment classification, as well as Amharic versions of WordSim353 and SimLex999.

A Neural Joint Model with BERT for Burmese Syllable Segmentation, Word Segmentation, and POS Tagging

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3436818 ◽

2021 ◽

Vol 20 (4) ◽

pp. 1-23

Author(s):

Cunli Mao ◽

Zhibo Man ◽

Zhengtao Yu ◽

Shengxiang Gao ◽

Zhenhan Wang ◽

...

Keyword(s):

Error Propagation ◽

Joint Model ◽

Word Segmentation ◽

Joint Learning ◽

Pos Tagging ◽

Part Of Speech ◽

Proposed Model ◽

Syllable Segmentation ◽

Benchmark Datasets ◽

Fine Tune

The smallest semantic unit of the Burmese language is the syllable. This study proposes the first neural joint learning model for Burmese syllable segmentation, word segmentation, and part-of-speech (POS) tagging with BERT. The proposed model alleviates the error propagation problem of syllable segmentation. More specifically, it extends the neural joint model for Vietnamese word segmentation, POS tagging, and dependency parsing [28] with a pre-training method for Burmese character, syllable, and word embeddings with BiLSTM-CRF-based neural layers. To evaluate the proposed model, experiments are carried out on Burmese benchmark datasets, and we fine-tune multilingual BERT. The results show that the proposed joint model achieves excellent performance.

Component Thermodynamical Selection Based Gene Expression Programming for Function Finding

Mathematical Problems in Engineering ◽

10.1155/2014/915058 ◽

2014 ◽

Vol 2014 ◽

pp. 1-16 ◽

Cited By ~ 3

Author(s):

Zhaolu Guo ◽

Zhijian Wu ◽

Xiaojian Dong ◽

Kejun Zhang ◽

Shenwen Wang ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Selective Pressure ◽

Gene Expression Programming ◽

Population Diversity ◽

Minimal Free Energy ◽

Slow Convergence ◽

Benchmark Datasets ◽

Function Finding ◽

Better Than

Gene expression programming (GEP), an improved form of genetic programming (GP), has become a popular tool for data mining. However, like other evolutionary algorithms, it tends to suffer from premature convergence and a slow convergence rate when solving complex problems. In this paper, we propose an enhanced GEP algorithm, called CTSGEP, which is inspired by the principle of minimal free energy in thermodynamics. CTSGEP employs a component thermodynamical selection (CTS) operator to quantitatively balance selective pressure and population diversity during the evolution process. Experiments are conducted on several benchmark datasets from the UCI machine learning repository. The results show that CTSGEP performs better than conventional GEP and some GEP variations.

Exploring the Use of Machine Learning to Automate the Qualitative Coding of Church-related Tweets

Fieldwork in Religion ◽

10.1558/firn.40610 ◽

2020 ◽

Vol 14 (2) ◽

pp. 140-159

Author(s):

Anthony-Paul Cooper ◽

Emmanuel Awuni Kolog ◽

Erkki Sutinen

Keyword(s):

Machine Learning ◽

Online Community ◽

High Volume ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Social Media Data ◽

Twitter Data ◽

Resource Intensity ◽

Media Data ◽

Better Than

This article builds on previous research exploring the content of church-related tweets. It does so by exploring whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each algorithm is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve-Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource-intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
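For readers unfamiliar with the three metrics named in this abstract, the following is a minimal sketch of how Precision, Recall and F-measure are commonly computed for one class from human-coded and predicted labels. The function name `precision_recall_f1` and the toy labels are illustrative assumptions; they are not taken from the article, which does not describe its implementation.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: coded as the class and predicted as the class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted as the class but coded otherwise.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: coded as the class but predicted otherwise.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 1 = tweet belongs to the theme, 0 = it does not.
human_codes = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(human_codes, predicted, positive=1)
```

With several themes, the same computation would typically be repeated per theme and then averaged; which averaging the article uses is not stated in the abstract.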

Identification of Anti-cancer Peptides Based on Multi-classifier System

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666191203141102 ◽

2020 ◽

Vol 22 (10) ◽

pp. 694-704 ◽

Cited By ~ 2

Author(s):

Wanben Zhong ◽

Bineng Zhong ◽

Hongbo Zhang ◽

Ziyi Chen ◽

Yan Chen

Keyword(s):

Machine Learning ◽

Side Effect ◽

Learning Models ◽

Normal Cells ◽

Classifier System ◽

Prediction Rate ◽

Anti Cancer ◽

Feature Information ◽

Machine Learning Models ◽

Better Than

Aim and Objective: Cancer is one of the deadliest diseases, taking the lives of millions every year. Traditional methods of treating cancer are expensive and toxic to normal cells. Fortunately, anti-cancer peptides (ACPs) can eliminate this side effect. However, the identification and development of new anti Materials and Methods: In our study, a multi-classifier system was used, combined with multiple machine learning models, to predict anti-cancer peptides. The individual learners are built from different feature information and algorithms, and form a multi-classifier system by voting. Results and Conclusion: The experiments show that the overall prediction rate of each individual learner is above 80% and that the overall accuracy of the multi-classifier system for anti-cancer peptide prediction can reach 95.93%, which is better than the existing prediction model.
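The abstract says only that the individual learners are combined "by voting". As a minimal sketch, assuming plain hard majority voting over the labels predicted by each learner (the abstract does not say whether the votes are weighted), the combination step could look like this. The function name `majority_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-learner label lists by hard majority voting.

    predictions: one list of labels per learner, all of the same length.
    Assumes an odd number of learners and binary labels, so no ties occur.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        # Count the labels that the learners assigned to sample i.
        votes = Counter(learner[i] for learner in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Toy example: 1 = anti-cancer peptide, 0 = not; three hypothetical learners.
learner_outputs = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
ensemble = majority_vote(learner_outputs)
```

An ensemble of this kind can be more accurate than any single learner when the learners make different errors, which fits the reported gap between the individual rates and the combined accuracy.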

Computing Possible Futures

10.1093/oso/9780198846420.001.0001 ◽

2019 ◽

Cited By ~ 2

Author(s):

William B. Rouse

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Computational Modeling ◽

Mental Model ◽

Data Analytics ◽

Computational Models ◽

Senior Managers ◽

Interactive Visualizations ◽

Use Of Models ◽

Better Than

This book discusses the use of models and interactive visualizations to explore designs of systems and policies in determining whether such designs would be effective. Executives and senior managers are very interested in what “data analytics” can do for them and, quite recently, what the prospects are for artificial intelligence and machine learning. They want to understand and then invest wisely. They are reasonably skeptical, having experienced overselling and under-delivery. They ask about reasonable and realistic expectations. Their concern is with the futurity of decisions they are currently entertaining. They cannot fully address this concern empirically. Thus, they need some way to make predictions. The problem is that one rarely can predict exactly what will happen, only what might happen. To overcome this limitation, executives can be provided predictions of possible futures and the conditions under which each scenario is likely to emerge. Models can help them to understand these possible futures. Most executives find such candor refreshing, perhaps even liberating. Their job becomes one of imagining and designing a portfolio of possible futures, assisted by interactive computational models. Understanding and managing uncertainty is central to their job. Indeed, doing this better than competitors is a hallmark of success. This book is intended to help them understand what fundamentally needs to be done, why it needs to be done, and how to do it. The hope is that readers will discuss this book and develop a “shared mental model” of computational modeling in the process, which will greatly enhance their chances of success.

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is currently mandatory, due to the large amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques have to be developed that are as accurate as possible for these scenarios. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. For its evaluation, this research uses two benchmark datasets, namely UGR16 and UNSW-NB15, and one of the most used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied to different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics and, finally, a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. Preprocessing a specific group of characteristics yields greater accuracy, allowing the machine learning algorithm to correctly classify the parameters related to possible attacks.
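The abstract does not name the specific scaling and normalization functions that were evaluated. As an illustration only, the sketch below shows two widely used choices, min-max scaling and z-score standardization, applied to one numeric traffic feature. The function names and the sample values are assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column, e.g. bytes sent per connection.
bytes_sent = [2, 4, 6]
scaled = min_max_scale(bytes_sent)
standardized = z_score(bytes_sent)
```

Applying a different function to each feature group, as the categorization in the abstract suggests, would amount to calling the chosen function on the columns of each group separately.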

NLOS Multipath Classification of GNSS Signal Correlation Output Using Machine Learning

Sensors ◽

10.3390/s21072503 ◽

2021 ◽

Vol 21 (7) ◽

pp. 2503

Author(s):

Taro Suzuki ◽

Yoshiharu Amano

Keyword(s):

Machine Learning ◽

Satellite System ◽

Training Data ◽

Support Vector ◽

Positioning Errors ◽

Automated Method ◽

Global Navigation Satellite ◽

Better Than ◽

Signal Correlation

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath, and features of this shape are used to discriminate the NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that NN was better than SVM, and 97.7% of NLOS signals were correctly discriminated.

Machine Learning for the Dynamic Positioning of UAVs for Extended Connectivity

Sensors ◽

10.3390/s21134618 ◽

2021 ◽

Vol 21 (13) ◽

pp. 4618

Author(s):

Francisco Oliveira ◽

Miguel Luís ◽

Susana Sargento

Keyword(s):

Machine Learning ◽

Cellular Networks ◽

Real Data ◽

Emerging Technology ◽

Machine Learning Algorithms ◽

Base Stations ◽

Aerial Vehicle ◽

Positioning Algorithm ◽

The Military ◽

Better Than

Unmanned Aerial Vehicle (UAV) networks are an emerging technology, useful not only for the military, but also for public and civil purposes. Their versatility provides advantages in situations where an existing network cannot support all requirements of its users, either because of an exceptionally large number of users, or because of the failure of one or more ground base stations. Networks of UAVs can reinforce these cellular networks where needed, redirecting the traffic to available ground stations. Using machine learning algorithms to predict overloaded traffic areas, we propose a UAV positioning algorithm responsible for determining suitable positions for the UAVs, with the objective of a more balanced redistribution of traffic, to avoid saturated base stations and decrease the number of users without a connection. The tests performed with real data of user connections through base stations show that, in less restrictive network conditions, the algorithm to dynamically place the UAVs performs significantly better than in more restrictive conditions, significantly reducing the number of users without a connection. We also conclude that the accuracy of the prediction is a very important factor, not only in the reduction of users without a connection, but also in the number of UAVs deployed.

A Survey on Causal Inference

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3444944 ◽

2021 ◽

Vol 15 (5) ◽

pp. 1-46

Author(s):

Liuyi Yao ◽

Zhixuan Chu ◽

Sheng Li ◽

Yaliang Li ◽

Jing Gao ◽

...

Keyword(s):

Machine Learning ◽

Causal Inference ◽

Observational Data ◽

Causal Effect ◽

Research Direction ◽

Estimation Methods ◽

Potential Outcome ◽

Outcome Framework ◽

Benchmark Datasets ◽

Inference Methods

Causal inference has been a critical research topic for decades across many domains, such as statistics, computer science, education, public policy, and economics. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and the low budget requirement compared with randomized controlled trials. With the rapid development of machine learning, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well-known causal inference frameworks. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine, and so on. Moreover, the commonly used benchmark datasets and open-source codes are summarized, which helps researchers and practitioners explore, evaluate and apply the causal inference methods.

Algorithmic and human prediction of success in human collaboration from visual features

Scientific Reports ◽

10.1038/s41598-021-81145-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Martin Saveski ◽

Edmond Awad ◽

Iyad Rahwan ◽

Manuel Cebrian

Keyword(s):

Machine Learning ◽

Visual Cues ◽

Success Factors ◽

Group Performance ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Adventure Game ◽

Group Success ◽

The Relationship ◽

Better Than

As groups are increasingly taking over individual experts in many tasks, it is ever more important to understand the determinants of group success. In this paper, we study the patterns of group success in Escape The Room, a physical adventure game in which a group is tasked with escaping a maze by collectively solving a series of puzzles. We investigate (1) the characteristics of successful groups, and (2) how accurately humans and machines can spot them from a group photo. The relationship between these two questions is based on the hypothesis that the characteristics of successful groups are encoded by features that can be spotted in their photo. We analyze >43K group photos (one photo per group) taken after groups have completed the game, from which all explicit performance-signaling information has been removed. First, we find that groups that are larger, older and more gender-diverse but less age-diverse are significantly more likely to escape. Second, we compare humans and off-the-shelf machine learning algorithms at predicting whether a group escaped or not based on the completion photo. We find that individual guesses by humans achieve 58.3% accuracy, better than random, but worse than machines, which display 71.6% accuracy. When humans are trained to guess by observing only four labeled photos, their accuracy increases to 64%. However, training humans on more labeled examples (eight or twelve) leads to a slight, but statistically insignificant improvement in accuracy (67.4%). Humans in the best training condition perform on par with two, but worse than three out of the five machine learning algorithms we evaluated. Our work illustrates the potential and the limitations of machine learning systems in evaluating group performance and identifying success factors based on sparse visual cues.
