scholarly journals Targeting Prospective Customers: Robustness of Machine-Learning Methods to Typical Data Challenges

2020 ◽  
Vol 66 (6) ◽  
pp. 2495-2522 ◽  
Author(s):  
Duncan Simester ◽  
Artem Timoshenko ◽  
Spyros I. Zoumpoulis

We investigate how firms can use the results of field experiments to optimize the targeting of promotions when prospecting for new customers. We evaluate seven widely used machine-learning methods using a series of two large-scale field experiments. The first field experiment generates a common pool of training data for each of the seven methods. We then validate the seven optimized policies provided by each method together with uniform benchmark policies in a second field experiment. The findings not only compare the performance of the targeting methods, but also demonstrate how well the methods address common data challenges. Our results reveal that when the training data are ideal, model-driven methods perform better than distance-driven methods and classification methods. However, the performance advantage vanishes in the presence of challenges that affect the quality of the training data, including the extent to which the training data captures details of the implementation setting. The challenges we study are covariate shift, concept shift, information loss through aggregation, and imbalanced data. Intuitively, the model-driven methods make better use of the information available in the training data, but the performance of these methods is more sensitive to deterioration in the quality of this information. The classification methods we tested performed relatively poorly. We explain the poor performance of the classification methods in our setting and describe how the performance of these methods could be improved. This paper was accepted by Matthew Shum, marketing.

Logistics ◽  
2020 ◽  
Vol 4 (4) ◽  
pp. 35
Author(s):  
Sidharth Sankhye ◽  
Guiping Hu

The rising popularity of smart factories and Industry 4.0 has made it possible to collect large amounts of data from production stages. Thus, supervised machine learning methods such as classification can viably predict product compliance quality using manufacturing data collected during production. Elimination of uncertainty via accurate prediction provides significant benefits at any stage in a supply chain. Thus, early knowledge of product batch quality can save costs associated with recalls, packaging, and transportation. While there has been thorough research on predicting the quality of specific manufacturing processes, the adoption of classification methods to predict the overall compliance of production batches has not been extensively investigated. This paper aims to design machine learning based classification methods for quality compliance and validate the models via case study of a multi-model appliance production line. The proposed classification model could achieve an accuracy of 0.99 and Cohen’s Kappa of 0.91 for the compliance quality of unit batches. Thus, the proposed method would enable implementation of a predictive model for compliance quality. The case study also highlights the importance of feature construction and dataset knowledge in training classification models.


Animals ◽  
2020 ◽  
Vol 10 (5) ◽  
pp. 771
Author(s):  
Toshiya Arakawa

Mammalian behavior is typically monitored by observation. However, direct observation requires a substantial amount of effort and time, if the number of mammals to be observed is sufficiently large or if the observation is conducted for a prolonged period. In this study, machine learning methods as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks, were applied to detect and estimate whether a goat is in estrus based on the goat’s behavior; thus, the adequacy of the method was verified. Goat’s tracking data was obtained using a video tracking system and used to estimate whether they, which are in “estrus” or “non-estrus”, were in either states: “approaching the male”, or “standing near the male”. Totally, the PC of random forest seems to be the highest. However, The percentage concordance (PC) value besides the goats whose data were used for training data sets is relatively low. It is suggested that random forest tend to over-fit to training data. Besides random forest, the PC of HMMs and SVMs is high. However, considering the calculation time and HMM’s advantage in that it is a time series model, HMM is better method. The PC of neural network is totally low, however, if the more goat’s data were acquired, neural network would be an adequate method for estimation.


2020 ◽  
Author(s):  
Abdur Rahman M. A. Basher ◽  
Steven J. Hallam

AbstractMachine learning methods show great promise in predicting metabolic pathways at different levels of biological organization. However, several complications remain that can degrade prediction performance including inadequately labeled training data, missing feature information, and inherent imbalances in the distribution of enzymes and pathways within a dataset. This class imbalance problem is commonly encountered by the machine learning community when the proportion of instances over class labels within a dataset are uneven, resulting in poor predictive performance for underrepresented classes. Here, we present leADS, multi-label learning based on active dataset subsampling, that leverages the idea of subsampling points from a pool of data to reduce the negative impact of training loss due to class imbalance. Specifically, leADS performs an iterative process to: (i)-construct an acquisition model in an ensemble framework; (ii) select informative points using an appropriate acquisition function; and (iii) train on selected samples. Multiple base learners are implemented in parallel where each is assigned a portion of labeled training data to learn pathways. We benchmark leADS using a corpora of 10 experimental datasets manifesting diverse multi-label properties used in previous pathway prediction studies, including manually curated organismal genomes, synthetic microbial communities and low complexity microbial communities. Resulting performance metrics equaled or exceeded previously reported machine learning methods for both organismal and multi-organismal genomes while establishing an extensible framework for navigating class imbalances across diverse real world datasets.Availability and implementationThe software package, and installation instructions are published on github.com/[email protected]


Author(s):  
Du Zhang

Software engineering research and practice thus far are primarily conducted in a value-neutral setting where each artifact in software development such as requirement, use case, test case, and defect, is treated as equally important during a software system development process. There are a number of shortcomings of such value-neutral software engineering. Value-based software engineering is to integrate value considerations into the full range of existing and emerging software engineering principles and practices. Machine learning has been playing an increasingly important role in helping develop and maintain large and complex software systems. However, machine learning applications to software engineering have been largely confined to the value-neutral software engineering setting. In this paper, the general message to be conveyed is to apply machine learning methods and algorithms to value-based software engineering. The training data or the background knowledge or domain theory or heuristics or bias used by machine learning methods in generating target models or functions should be aligned with stakeholders’ value propositions. An initial research agenda is proposed for machine learning in value-based software engineering.


2019 ◽  
Vol 35 (14) ◽  
pp. i31-i40 ◽  
Author(s):  
Erfan Sayyari ◽  
Ban Kawas ◽  
Siavash Mirarab

Abstract Motivation Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant; leading to a high-dimensional low-sample-size under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data; with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from microbiome. Data augmentation consists of building synthetic samples and adding them to the training data and is a technique that has proved helpful for many machine learning tasks. Results In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low-sample-size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation TADA is available at https://github.com/tada-alg/TADA. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 8 (4) ◽  
pp. 7818-7823

Programming testing is a fundamental and essential advance of the existence cycle of programming improvement to recognize and defects in programming and afterward fix the deficiencies. The reliability of the data transmission or the quality of proper processing ,maintenance and retrieval of information to a server can be tested for some systems. Accuracy is also one factor that is usually used to the Joint Interoperability Test Command as a criterion for accessing interoperability. This is the main investigation of PC flaw forecast and exactness as per our examination, which spotlights on the utilization of PROMISE database dataset. Some PROMISE database dataset tests are compared between pseudo code (PYTHON) and actual software (WEKA),which in computer fault prediction and accuracy measurement are effective software metrics and machine learning methods.


2021 ◽  
Author(s):  
Qifei Zhao ◽  
Xiaojun Li ◽  
Yunning Cao ◽  
Zhikun Li ◽  
Jixin Fan

Abstract Collapsibility of loess is a significant factor affecting engineering construction in loess area, and testing the collapsibility of loess is costly. In this study, A total of 4,256 loess samples are collected from the north, east, west and middle regions of Xining. 70% of the samples are used to generate training data set, and the rest are used to generate verification data set, so as to construct and validate the machine learning models. The most important six factors are selected from thirteen factors by using Grey Relational analysis and multicollinearity analysis: burial depth、water content、specific gravity of soil particles、void rate、geostatic stress and plasticity limit. In order to predict the collapsibility of loess, four machine learning methods: Support Vector Machine (SVM), Random Subspace Based Support Vector Machine (RSSVM), Random Forest (RF) and Naïve Bayes Tree (NBTree), are studied and compared. The receiver operating characteristic (ROC) curve indicators, standard error (SD) and 95% confidence interval (CI) are used to verify and compare the models in different research areas. The results show that: RF model is the most efficient in predicting the collapsibility of loess in Xining, and its AUC average is above 80%, which can be used in engineering practice.


2020 ◽  
Vol 242 ◽  
pp. 05003
Author(s):  
A.E. Lovell ◽  
A.T. Mohan ◽  
P. Talou ◽  
M. Chertkov

As machine learning methods gain traction in the nuclear physics community, especially those methods that aim to propagate uncertainties to unmeasured quantities, it is important to understand how the uncertainty in the training data coming either from theory or experiment propagates to the uncertainty in the predicted values. Gaussian Processes and Bayesian Neural Networks are being more and more widely used, in particular to extrapolate beyond measured data. However, studies are typically not performed on the impact of the experimental errors on these extrapolated values. In this work, we focus on understanding how uncertainties propagate from input to prediction when using machine learning methods. We use a Mixture Density Network (MDN) to incorporate experimental error into the training of the network and construct uncertainties for the associated predicted quantities. Systematically, we study the effect of the size of the experimental error, both on the reproduced training data and extrapolated predictions for fission yields of actinides.


Sign in / Sign up

Export Citation Format

Share Document