Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies

Computational and Mathematical Methods in Medicine ◽

10.1155/2017/7847531 ◽

2017 ◽

Vol 2017 ◽

pp. 1-18 ◽

Cited By ~ 4

Author(s):

Norbert Krautenbacher ◽

Fabian J. Theis ◽

Christiane Fuchs

Keyword(s):

Machine Learning ◽

Random Forest ◽

Selection Bias ◽

Sample Selection ◽

Case Control ◽

Machine Learning Techniques ◽

Sample Selection Bias ◽

Case Control Studies ◽

Two Phase ◽

Inverse Probability

Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when classifiers are applied to nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits only from the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reasons for the different behaviors of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
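The general idea behind inverse-probability correction can be illustrated with a minimal sketch. This is plain inverse-probability weighted resampling, not the stochastic oversampling or parametric bagging methods proposed in the paper, and it does not use the sambia package. The function name `inverse_probability_resample`, the strata labels, and the selection probabilities are all hypothetical and assumed known from the study design.

```python
import random


def inverse_probability_resample(sample, selection_prob, n_out, seed=0):
    """Resample (record, stratum) pairs with weights 1 / P(selected | stratum)."""
    rng = random.Random(seed)
    # Units from under-sampled strata receive proportionally larger weights.
    weights = [1.0 / selection_prob[stratum] for _, stratum in sample]
    return rng.choices(sample, weights=weights, k=n_out)


# Hypothetical stratified sample: cases enriched, controls under-sampled.
sample = [(i, "case") for i in range(100)]
sample += [(i, "control") for i in range(100, 200)]

# Assumed selection probabilities per stratum (illustrative values only).
selection_prob = {"case": 0.9, "control": 0.1}

resampled = inverse_probability_resample(sample, selection_prob, n_out=10_000, seed=1)
control_share = sum(1 for _, s in resampled if s == "control") / len(resampled)
print(f"control share after resampling: {control_share:.3f}")
```

With these assumed probabilities, the expected share of controls in the resample is 0.9, compared with 0.5 in the stratified sample, so a classifier trained on the resample sees class proportions closer to those of the unstratified population.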

Prediction of Short-Distance Aerial Movement of Phakopsora pachyrhizi Urediniospores Using Machine Learning

Phytopathology ◽

10.1094/phyto-04-17-0138-fi ◽

2017 ◽

Vol 107 (10) ◽

pp. 1187-1198 ◽

Cited By ~ 7

Author(s):

L. Wen ◽

C. R. Bowen ◽

G. L. Hartman

Keyword(s):

Machine Learning ◽

Random Forest ◽

Short Distance ◽

Soybean Rust ◽

Machine Learning Techniques ◽

Phakopsora Pachyrhizi ◽

Primary Means ◽

Soybean Plants ◽

Selection Operator ◽

Active Trap

Dispersal of urediniospores by wind is the primary means of spread for Phakopsora pachyrhizi, the cause of soybean rust. Our research focused on the short-distance movement of urediniospores from within the soybean canopy and up to 61 m from field-grown rust-infected soybean plants. Environmental variables were used to develop and compare models including the least absolute shrinkage and selection operator regression, zero-inflated Poisson/regular Poisson regression, random forest, and neural network to describe deposition of urediniospores collected in passive and active traps. All four models identified distance of trap from source, humidity, temperature, wind direction, and wind speed as the five most important variables influencing short-distance movement of urediniospores. The random forest model provided the best predictions, explaining 76.1 and 86.8% of the total variation in the passive- and active-trap datasets, respectively. The prediction accuracies, measured as the correlation coefficient (r) between predicted and true values, were 0.83 (P < 0.0001) and 0.94 (P < 0.0001) for the passive- and active-trap datasets, respectively. Overall, multiple machine learning techniques identified the most important variables to make the most accurate predictions of short-distance movement of P. pachyrhizi urediniospores.

Sample selection bias in an international DNA panel: Does Native American haplogroup Q-M3 has the b2/b3 deletion?

Genomics ◽

10.1016/j.ygeno.2015.02.004 ◽

2015 ◽

Vol 105 (5-6) ◽

pp. 273-274

Author(s):

Evguenia Alechine ◽

Daniel Corach

Keyword(s):

Native American ◽

Selection Bias ◽

Sample Selection ◽

Sample Selection Bias ◽

Haplogroup Q

Sample Selection Bias, Return Moments, and the Performance of Optimal versus Naive Diversification

SSRN Electronic Journal ◽

10.2139/ssrn.2825885 ◽

2016 ◽

Cited By ~ 1

Author(s):

Bowei Li

Keyword(s):

Selection Bias ◽

Sample Selection ◽

Sample Selection Bias ◽

Naïve Diversification

Correcting Sample Selection Bias by Unlabeled Data

Advances in Neural Information Processing Systems 19 ◽

10.7551/mitpress/7503.003.0080 ◽

2007 ◽

Cited By ~ 1

Keyword(s):

Selection Bias ◽

Sample Selection ◽

Unlabeled Data ◽

Sample Selection Bias

Learning from Imbalanced Educational Data Using Ensemble Machine Learning Algorithms

Webology ◽

10.14704/web/v18si01/web18053 ◽

2021 ◽

Vol 18 (Special Issue 01) ◽

pp. 183-195

Author(s):

Thingbaijam Lenin ◽

N. Chandrasekaran

Keyword(s):

Machine Learning ◽

Random Forest ◽

Missing Values ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Adaptive Boosting ◽

Stochastic Gradient Boosting ◽

Ensemble Machine Learning ◽

Learning Techniques ◽

Student’s Performance

Students’ academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of the course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting the performance of students at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are found to be highly imbalanced and also contain missing values. We employ the k-nearest neighbor (knn) data imputation technique to tackle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using the precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best result is provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
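The contrast between accuracy and the imbalance-aware metrics named in this abstract can be worked through with a short sketch. The confusion-matrix counts below are hypothetical: the source does not report its confusion matrix, and the counts were chosen only because they reproduce the quoted percentages. The function name `classification_metrics` is likewise an illustrative assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    recall = tp / (tp + fn)          # sensitivity, true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for a rare positive class (6 positives among 98 samples).
metrics = classification_metrics(tp=3, fp=0, tn=92, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumed counts, half of the positives are missed, yet accuracy stays near 97% because the negative class dominates. Balanced accuracy (75%) and F-score (66.67%) expose the missed positives, which is the reason the abstract gives for preferring them on imbalanced data.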

Heart Disease Prediction using Machine Learning Techniques

International Journal of Scientific Research in Science and Technology ◽

10.32628/ijsrst2183218 ◽

2021 ◽

pp. 42-47

Author(s):

Ramesh Ponnala ◽

K. Sai Sowjanya

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Random Forest ◽

Linear Model ◽

Machine Learning Techniques ◽

Disease Prediction ◽

Huge Amount ◽

Healthcare Enterprise ◽

Learning Techniques ◽

Accuracy Level

Prediction of cardiovascular disease is an important task in the area of clinical data analysis. Machine learning has been proven to be effective in helping to make decisions and predictions from the huge amount of data produced by the healthcare industry. In this paper, we propose a novel method that aims at finding significant features by applying ML techniques, resulting in improved accuracy in the prediction of heart disease. The severity of heart disease is classified using various methods such as KNN, decision trees, and so on. The prediction model is introduced with different combinations of features and several known classification techniques. We achieve an enhanced performance level, with an accuracy of 100%, through the prediction model for heart disease with the hybrid random forest with a linear model (HRFLM).

Human Activity Recognition using Machine Learning

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.35694 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 3553-3556

Author(s):

Chaudhari Shraddha

Keyword(s):

Machine Learning ◽

Random Forest ◽

Activity Recognition ◽

Human Activity ◽

Recognition System ◽

Human Activity Recognition ◽

Machine Learning Techniques ◽

K Nearest Neighbors ◽

Random Forest Classification ◽

Medical Health Care

Human activity recognition is an active challenge with applications in numerous fields such as medical health care, military, manufacturing, assistive techniques, and gaming. Due to advancements in technology, the use of smartphones in human lives has become inevitable. The sensors in smartphones help us to measure essential vital parameters. These measured parameters enable us to monitor the activities of humans, which we call human activity recognition. We have applied machine learning techniques, namely the K-Nearest Neighbors and Random Forest classification algorithms, to a publicly available dataset. In this paper, we have designed and implemented an automatic human activity recognition system that independently recognizes human actions. This system is able to recognize activities such as Laying, Sitting, Standing, Walking, Walking downstairs, and Walking upstairs. The results show that the KNN and Random Forest algorithms give overall accuracies of 90.22% and 92.70%, respectively, in detecting the activities.

Heterogeneous Causal Effects and Sample Selection Bias

Sociological Science ◽

10.15195/v2.a17 ◽

2015 ◽

Vol 2 ◽

pp. 351-369 ◽

Cited By ~ 28

Author(s):

Richard Breen ◽

Seungsoo Choi ◽

Anders Holm

Keyword(s):

Selection Bias ◽

Sample Selection ◽

Causal Effects ◽

Sample Selection Bias

Preliminary Screening of COVID-19 Infection Employing Machine Learning Techniques From Simple Blood Profile

International Journal of Quantitative Structure-Property Relationships ◽

10.4018/ijqspr.2021070103 ◽

2021 ◽

Vol 6 (3) ◽

pp. 35-47

Author(s):

Anirudh Reddy Cingireddy ◽

Robin Ghosh ◽

Supratik Kar ◽

Venkata Melapu ◽

Sravanthi Joginipeli ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Albert Einstein ◽

Naïve Bayes ◽

Machine Learning Techniques ◽

Support Vector ◽

Blood Profile ◽

Molecular Tests ◽

Large Populations

Frequent testing of the entire population would help to identify individuals with active COVID-19 and allow us to identify concealed carriers. Molecular tests, antigen tests, and antibody tests are being widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test take a minimum of 3 hours and a maximum of 4 days to return results. To overcome this issue, the authors suggest using machine learning and data mining tools to filter large populations at a preliminary level. The ML tools could reduce the testing population size by 20 to 30%. In this study, they have used a subset of features from full blood profiles drawn from patients at the Israelita Albert Einstein hospital in Brazil. They used classification models, namely KNN, logistic regression, XGBoost, naive Bayes, decision tree, random forest, support vector machine, and multilayer perceptron, with k-fold cross-validation to validate the models. Naive Bayes, KNN, and random forest stand out as the most predictive ones, with 88% accuracy each.

Machine Learning Enhancing Adaptivity of Multimodal Mobile Systems

Machine Learning ◽

10.4018/978-1-60960-818-7.ch414 ◽

2012 ◽

pp. 969-985

Author(s):

Floriana Esposito ◽

Teresa M.A. Basile ◽

Nicola Di Mauro ◽

Stefano Ferilli

Keyword(s):

Machine Learning ◽

Mobile Device ◽

Pattern Discovery ◽

User Interaction ◽

User Model ◽

Mobile Systems ◽

Machine Learning Techniques ◽

Two Phase ◽

Learning Techniques ◽

First Time

One of the most important features of a mobile device concerns its flexibility and capability to adapt the functionality it provides to its users. However, the main problems of the systems present in the literature are their inability to identify user needs and, more importantly, the insufficient mapping of those needs to available resources/services. In this paper, we present a two-phase construction of the user model. First, an initial static user model is built for a user connecting to the system for the first time. Then, the model is revised/adjusted by considering the information collected in the logs of the user's interaction with the device/context, in order to make the model more adequate to the user’s evolving interests/preferences/behaviour. The initial model is built by exploiting the stereotype concept; its adjustment is performed by exploiting machine learning techniques, in particular sequence mining and pattern discovery strategies.
