Evaluation of domain adaptation approaches for robust classification of heterogeneous biological data sets

2019 ◽  
Author(s):  
Michael Schneider ◽  
Lichao Wang ◽  
Carsten Marr

Abstract Most machine learning algorithms require that training data are identically distributed to ensure effective learning. In biological studies, however, even small variations in the experimental setup can lead to substantial deviations. Domain adaptation offers tools to deal with this problem. It is particularly useful when only a small amount of training data is available in the domain of interest, while a large amount is available in a different but relevant domain. We investigated to what extent domain adaptation can improve prediction accuracy for complex biological data. To that end, we used simulated data and time-lapse movies of differentiating blood stem cells in different cell cycle stages from multiple experiments, and compared three commonly used domain adaptation approaches. EasyAdapt, a simple technique of structured pooling of related data sets, improved accuracy when classifying both the simulated data and cell cycle stages from microscopic images. At the same time, the technique proved robust to the negative impact on classification accuracy that is common in other techniques that build models from heterogeneous data. Despite its implementation simplicity, EasyAdapt consistently produced more accurate predictions than conventional techniques. Domain adaptation can therefore substantially reduce the amount of annotated training data that must be created in the domain of interest whenever the domain changes even slightly, which is common not only in biological experiments but in almost all data collection routines.
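The structured pooling that EasyAdapt performs can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: following Daumé III's original feature-augmentation scheme, each sample's features are copied into a shared block plus one block per domain, with all other domain blocks zeroed.

```python
import numpy as np

def easyadapt_augment(X, domain, n_domains=2):
    """EasyAdapt feature augmentation (Daume III, 2007):
    each sample keeps a shared copy of its features plus a
    domain-specific copy; all other domain slots stay zero."""
    n, d = X.shape
    out = np.zeros((n, d * (n_domains + 1)))
    out[:, :d] = X                      # shared block
    for i, dom in enumerate(domain):
        start = d * (dom + 1)
        out[i, start:start + d] = X[i]  # domain-specific block
    return out

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Xa = easyadapt_augment(X, domain=[0, 1])
# sample 0 (domain 0): [1, 2, 1, 2, 0, 0]
# sample 1 (domain 1): [3, 4, 0, 0, 3, 4]
```

A standard classifier trained on the augmented features can then learn which regularities are shared across domains (shared block) and which are domain-specific, which is what makes the pooling "structured".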

Author(s):  
Fabian Schmich ◽  
Jack Kuipers ◽  
Gunter Merdes ◽  
Niko Beerenwinkel

Abstract In the post-genomic era of big data in biology, computational approaches to integrate multiple heterogeneous data sets become increasingly important. Despite the availability of large amounts of omics data, the prioritisation of genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene–gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example, by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.
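netprioR itself is a probabilistic generative model; the toy sketch below only illustrates the underlying idea of combining a gene–gene similarity network with gene-based covariates, here reduced to a single smoothing step that blends each gene's phenotype score with its network neighbours. The network, scores, and the `prioritise` helper are all invented for illustration.

```python
import numpy as np

# Toy symmetric gene-gene similarity network and per-gene
# phenotype scores; all values are illustrative only.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
phenotype = np.array([0.9, 0.1, 0.7, 0.2])

def prioritise(A, scores, alpha=0.5):
    """Blend each gene's own phenotype score with the mean
    score of its network neighbours (one smoothing step)."""
    deg = A.sum(axis=1)
    neighbour_mean = A @ scores / np.maximum(deg, 1)
    return alpha * scores + (1 - alpha) * neighbour_mean

prior = prioritise(A, phenotype)
ranking = np.argsort(-prior)  # gene indices ordered by priority
```

Gene 2's moderate phenotype is boosted by its strong neighbour (gene 0), so it ranks above gene 1, illustrating how network evidence reorders a purely covariate-based ranking.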


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Chunxiao Sun ◽  
Hongwei Huo ◽  
Qiang Yu ◽  
Haitao Guo ◽  
Zhigang Sun

The planted (l, d) motif search (PMS) problem is one of the fundamental problems in bioinformatics, playing an important role in locating transcription factor binding sites (TFBSs) in DNA sequences. Identifying weak motifs and reducing the effect of local optima remain important but challenging tasks for motif discovery. To address these tasks, we propose a new algorithm, APMotif, which first applies Affinity Propagation (AP) clustering to DNA sequences to produce informative candidate motifs and then employs Expectation Maximization (EM) refinement to obtain the optimal motifs from the candidates. Experimental results on both simulated and real biological data sets show that APMotif usually outperforms four other widely used algorithms in terms of prediction accuracy.
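The refinement stage can be illustrated with a minimal hard-EM sketch. This is not APMotif itself: the AP clustering stage is replaced by a hand-picked set of noisy candidate l-mers, the full EM is reduced to a hard assignment step, and the sequences and planted motif are invented.

```python
import numpy as np

BASES = "ACGT"

def pwm_from(kmers, pseudo=0.5):
    """Position weight matrix (with pseudocounts) from equal-length k-mers."""
    L = len(kmers[0])
    counts = np.full((L, 4), pseudo)
    for km in kmers:
        for j, b in enumerate(km):
            counts[j, BASES.index(b)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def best_site(seq, pwm):
    """Highest-scoring l-mer in seq under the PWM (hard E-step)."""
    L = pwm.shape[0]
    scores = [np.prod([pwm[j, BASES.index(seq[i + j])] for j in range(L)])
              for i in range(len(seq) - L + 1)]
    i = int(np.argmax(scores))
    return seq[i:i + L]

# Toy sequences sharing the planted motif "ACGT"
seqs = ["TTACGTGA", "GACGTTTC", "CCTACGTA"]
pwm = pwm_from(["ACGT", "ACGT", "ACGA"])   # noisy candidates, e.g. from clustering
for _ in range(3):                          # hard-EM refinement passes
    sites = [best_site(s, pwm) for s in seqs]
    pwm = pwm_from(sites)
consensus = "".join(BASES[j] for j in pwm.argmax(axis=1))
```

Even though one candidate ends in the wrong base, the refinement passes recover the planted consensus from the sequences themselves.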


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 825 ◽  
Author(s):  
Fadi Al Machot ◽  
Mohammed R. Elkobaisi ◽  
Kyandoghere Kyamakya

Due to significant advances in sensor technology, studies towards activity recognition have gained interest and maturity in the last few years. Existing machine learning algorithms have demonstrated promising results by classifying activities whose instances have already been seen during training. Activity recognition methods based on real-life settings should cover a growing number of activities in various domains, whereby a significant part of instances will not be present in the training data set. However, covering all possible activities in advance is a complex and expensive task. Concretely, we need a method that can extend the learning model to detect unseen activities without prior knowledge of sensor readings for those activities. In this paper, we introduce an approach to leverage sensor data in discovering new activities that were not present in the training set. We show that sensor readings can lead to promising results for zero-shot learning, whereby the necessary knowledge can be transferred from seen to unseen activities using semantic similarity. The evaluation conducted on two data sets extracted from the well-known CASAS datasets shows that the proposed zero-shot learning approach achieves high performance in recognizing new activities not present in the training data set.
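The transfer via semantic similarity can be sketched as follows. This is an illustrative stand-in, not the paper's model: the activity names, attribute vectors, and sensor readings are invented, a least-squares map replaces the learned model, and semantic similarity is realised as cosine similarity in the attribute space.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy sensor readings for two *seen* activities (rows) and made-up
# semantic attribute vectors for all activities, seen and unseen.
X_seen = np.array([[1.0, 0.0, 0.2],    # e.g. "walking"
                   [0.0, 1.0, 0.8]])   # e.g. "sleeping"
attrs = {"walking":  np.array([1.0, 0.0]),
         "sleeping": np.array([0.0, 1.0]),
         "napping":  np.array([0.1, 0.9])}  # unseen at training time
A_seen = np.vstack([attrs["walking"], attrs["sleeping"]])

# Least-squares map from sensor space into the semantic attribute space
W, *_ = np.linalg.lstsq(X_seen, A_seen, rcond=None)

def zero_shot_predict(x):
    """Project a reading into attribute space; return the most similar activity."""
    a = x @ W
    return max(attrs, key=lambda name: cosine(a, attrs[name]))

pred = zero_shot_predict(np.array([0.1, 0.9, 0.74]))
```

A reading that resembles neither seen activity exactly lands closest to the unseen class prototype, which is the essence of zero-shot recognition.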


2019 ◽  
Vol 33 (1) ◽  
pp. 3-12 ◽  
Author(s):  
Sean Kanuck

Abstract The growing adoption of artificial intelligence (AI) raises questions about what comparative advantage, if any, human beings will have over machines in the future. This essay explores what it means to be human and how those unique characteristics relate to the digital age. Humor and ethics both rely upon higher-level cognition that accounts for unstructured and unrelated data. That capability is also vital to decision-making processes—such as jurisprudence and voting systems. Since machine learning algorithms lack the ability to understand context or nuance, reliance on them could lead to undesired results for society. By way of example, two case studies are used to illustrate the legal and moral considerations regarding the software algorithms used by driverless cars and lethal autonomous weapons systems. Social values must be encoded or introduced into training data sets if AI applications are to be expected to produce results similar to a “human in the loop.” There is a choice to be made, then, about whether we impose limitations on these new technologies in favor of maintaining human control, or whether we seek to replicate ethical reasoning and lateral thinking in the systems we create. The answer will have profound effects not only on how we interact with AI but also on how we interact with one another and perceive ourselves.


2019 ◽  
Vol 25 (5) ◽  
pp. 651-674 ◽  
Author(s):  
Katja Zupan ◽  
Nikola Ljubešić ◽  
Tomaž Erjavec

Abstract Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation, where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.
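The quantity the first method targets, the tagger's out-of-vocabulary (OOV) rate, is easy to make concrete. The toy English tokens and lexicon below are illustrative stand-ins for the Slovene data, not examples from the paper.

```python
def oov_rate(tokens, lexicon):
    """Share of tokens the tagger's lexicon has never seen."""
    return sum(t not in lexicon for t in tokens) / len(tokens)

lexicon = {"the", "old", "house", "is", "very", "nice"}
raw        = ["teh", "old", "houze", "iz", "very", "nice"]   # user-generated
normalised = ["the", "old", "house", "is", "very", "nice"]   # after normalisation

before, after = oov_rate(raw, lexicon), oov_rate(normalised, lexicon)
```

Halving or eliminating the OOV rate in this way is exactly how normalisation helps a standard-language tagger, without retraining the tagger itself.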


Author(s):  
Antonella Plaia ◽  
Simona Buscemi ◽  
Johannes Fürnkranz ◽  
Eneldo Loza Mencía

Abstract Decision tree learning is among the most popular and most traditional families of machine learning algorithms. While these techniques excel in being quite intuitive and interpretable, they also suffer from instability: small perturbations in the training data may result in big changes in the predictions. The so-called ensemble methods combine the output of multiple trees, which makes the decision more reliable and stable. They have primarily been applied to numeric prediction problems and to classification tasks. In recent years, some attempts to extend ensemble methods to ordinal data can be found in the literature, but no concrete methodology has been provided for preference data. In this paper, we extend decision trees, and subsequently ensemble methods, to ranking data. In particular, we propose a theoretical and computational definition of bagging and boosting, two of the best known ensemble methods. In an experimental study using simulated data and real-world datasets, our results confirm that known results from classification, such as that boosting outperforms bagging, carry over to the ranking case.
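The aggregation step at the heart of such ensembles can be sketched as Borda-style rank averaging. This is a generic illustration rather than the paper's exact bagging or boosting procedure, and the per-tree rankings are invented.

```python
import numpy as np

def consensus_ranking(rankings):
    """Aggregate rankings from an ensemble: average each item's rank
    position across trees (Borda-style), then re-rank by that mean."""
    R = np.array(rankings, dtype=float)       # rows: trees, cols: items
    mean_rank = R.mean(axis=0)
    return np.argsort(np.argsort(mean_rank))  # dense ranks, 0 = best

# Three hypothetical trees ranking four items (0 = most preferred)
tree_rankings = [[0, 1, 2, 3],
                 [2, 0, 1, 3],
                 [1, 0, 2, 3]]
combined = consensus_ranking(tree_rankings)
```

In a bagged ensemble, each row would come from a tree trained on a bootstrap resample of the preference data; averaging smooths out the instability of any single tree.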


Sentiment analysis, also known as opinion mining, is one of the most active research topics in social networking today. Here, we use Twitter, one of the biggest web destinations for people to communicate with each other, to perform sentiment analysis and opinion mining by extracting tweets from various users. Users can post brief text updates on Twitter, as it allows only 140 characters per message. Hashtags help to search for tweets dealing with a specified subject. Previous research on binary classification usually relies on sentiment polarity (positive, negative and neutral); the advantage is that multiple meanings of the same word may carry different polarities, so they can be easily identified. In multiclass classification, many tweets of one class are classified as if they belong to others, and the neutral class has shown the lowest precision in prior research in this area. In this work, a set of tweets containing text and emoticon data will be classified into 13 classes. From each tweet, we extract different sets of features using a one-hot encoding algorithm and use machine learning algorithms to perform classification. The tweets will be divided into training and testing data sets. The training data set will be pre-processed and classified using various artificial neural network algorithms, such as recurrent and convolutional neural networks; the same procedure will be followed for the text and the emoticon data. The developed model will then be tested using the testing data set. More precise accuracy can be obtained using this multiclass classification of text and emoticons. Four key performance indicators will be used to evaluate the effectiveness of the approach.
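The feature-extraction step described above can be sketched as follows. This is a toy reconstruction: a nearest-centroid rule stands in for the recurrent/convolutional networks, and two invented classes stand in for the 13-class scheme, but the binary one-hot (bag-of-words) encoding over tweet tokens including emoticons is as described.

```python
import numpy as np

def one_hot(tokens, vocab):
    """Binary one-hot bag-of-words vector over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] = 1.0
    return v

train = [("love this :)".split(), "joy"),
         ("hate this :(".split(), "anger"),
         ("love it :)".split(), "joy")]
vocab = {w: i for i, w in enumerate(sorted({w for toks, _ in train for w in toks}))}

# Nearest-centroid stand-in for the neural classifiers
centroids = {}
for label in {y for _, y in train}:
    vecs = [one_hot(toks, vocab) for toks, y in train if y == label]
    centroids[label] = np.mean(vecs, axis=0)

def classify(tokens):
    v = one_hot(tokens, vocab)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

pred = classify("love :)".split())
```

Note that the emoticons are treated simply as extra vocabulary items, which is what lets the same encoding cover both text and emoticon features.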


2018 ◽  
Author(s):  
Lucas Bezerra Maia ◽  
Alan Carlos Lima ◽  
Pedro Thiago Cutrim Santos ◽  
Nigel da Silva Lima ◽  
João Dallyson Sousa De Almeida ◽  
...  

Melanoma is the most lethal type of skin cancer, but patients have high recovery rates if the disease is discovered in its early stages. Several approaches to automatic detection and diagnosis have been explored by different authors. Training models on the existing data sets has been difficult due to the problem of imbalanced data. This work aims to evaluate the performance of machine learning algorithms combined with imbalanced learning techniques on the task of melanoma diagnosis. Preliminary results show that features extracted with a ResNet convolutional neural network, combined with a Random Forest classifier, achieved an improvement in sensitivity of approximately 21% after balancing the training data with the Synthetic Minority Oversampling Technique (SMOTE) and the Edited Nearest Neighbour (ENN) rule.
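SMOTE's core interpolation step can be sketched in a few lines. This is a minimal sketch on invented data; library implementations such as imbalanced-learn add proper k-NN bookkeeping and the ENN cleaning step mentioned above.

```python
import numpy as np

def smote(X_min, n_new, k=1, seed=0):
    """Minimal SMOTE sketch: synthesise minority-class samples by
    interpolating between a random minority sample and one of its
    k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # skip the sample itself
        j = rng.choice(nbrs)
        lam = rng.random()                     # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Toy minority-class feature vectors (e.g. melanoma cases)
minority = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
synthetic = smote(minority, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling stays within the minority region instead of duplicating points, which is why it pairs well with an ENN cleaning pass that removes borderline samples afterwards.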


Author(s):  
Sotiris Kotsiantis ◽  
Dimitris Kanellopoulos ◽  
Panayotis Pintelas

In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples (see Table 1). Formally, the problem can be stated as follows: given training data {(x1, y1), …, (xn, yn)}, produce a classifier h: X → Y that maps an object x ∈ X to its classification label y ∈ Y. A large number of classification techniques have been developed based on artificial intelligence (logic-based techniques, perceptron-based techniques) and statistics (Bayesian networks, instance-based techniques). No single learning algorithm can uniformly outperform other algorithms over all data sets. The concept of combining classifiers is proposed as a new direction for improving the performance of individual machine learning algorithms. Numerous methods have been suggested for the creation of ensembles of classifiers (Dietterich, 2000). Although, or perhaps because, many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best.
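One of the simplest combination schemes, majority voting, can be sketched as follows; the toy classifiers h1–h3 are invented stand-ins for trained models h: X → Y.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine classifiers by simple majority vote over their
    predicted labels; ties go to the first-seen prediction."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy stand-in classifiers h: X -> Y
h1 = lambda x: "spam" if "win" in x else "ham"
h2 = lambda x: "spam" if "prize" in x else "ham"
h3 = lambda x: "ham"

pred = majority_vote([h1, h2, h3], ["win", "a", "prize"])
```

The ensemble can correct an individual classifier's mistake (here h3's) as long as a majority of members are right, which is the basic argument for combining classifiers.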

