Predicting Math Student Success in the Initial Phase of College With Sparse Information Using Approaches From Statistical Learning

2020 ◽  
Vol 5 ◽  
Author(s):  
Pascal Kilian ◽  
Frank Loose ◽  
Augustin Kelava

In math teacher education, dropout research relies mostly on frameworks that require extensive variable collection, which limits their practical applicability. We investigate the completion of a first-semester course as a dropout indicator and thereby not only provide good predictions but also generate interpretable and practicable results together with easy-to-understand recommendations. As a proof of concept, a sparse feature space is used together with machine learning methods to predict dropout, which requires identifying the most predictive features. Interpretability is achieved by introducing risk groups for the students. Implications for interventions are discussed.

2020 ◽  
Author(s):  
Trang T. Le ◽  
Jason H. Moore

Abstract Summary: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree’s leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students’ understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. Availability: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.


2019 ◽  
Author(s):  
Leila Mirsadeghi ◽  
Ali Mohammad Banaei-Moghaddam ◽  
Seyed Reza Beh-Afarin ◽  
Reza Haji Hosseini ◽  
Kaveh Kavousi

Abstract Background: Ensemble methods are supervised learning approaches that integrate different types of data or multiple individual classifiers. It has been shown that these methods can improve professional performance. Methods: This study is an attempt to provide an in-depth review of the 45 most relevant articles and aims to introduce 42 ensemble classifier (EC) machine learning methods used for the detection of 18 different types of cancer. Compared to other types of cancer, breast cancer, and the 22 ensemble methods introduced for its identification, is extensively investigated. The purpose of this study is to identify, map, and analyze the current academic discourse on EC machine learning methods in order to: 1. identify overarching themes emerging from empirical studies regarding EC methods, 2. determine their input data and decision-making strategies, and 3. evaluate relevant statistical procedures. Results: By comparing various approaches, we introduce a Relevance Vector Machine (RVM)-based ensemble learning method that can provide optimal solutions for problems such as the curse of dimensionality and the high dimensionality of the feature space without missing data values. Conclusions: To obtain robust performance and achieve better results, it is tactfully suggested to use multi-omics data integration, which has been shown to identify cancers and their subtypes more efficiently.


2019 ◽  
Author(s):  
Kaveh Kavousi ◽  
Leila Mirsadeghi ◽  
Reza Haji Hosseini ◽  
Ali Mohammad Banaei-Moghaddam ◽  
Seyed Reza Beh-Afarin

Abstract Background: Ensemble methods are supervised learning approaches that integrate different types of data or multiple individual classifiers. It has been shown that these methods can improve professional performance. Methods: This study is an attempt to provide an in-depth review of the 45 most relevant articles and aims to introduce 42 ensemble classifier (EC) machine learning methods used for the detection of 18 different types of cancer. Compared to other types of cancer, breast cancer, and the 22 ensemble methods introduced for its identification, is extensively investigated. The purpose of this study was to identify, map, and analyze the current academic discourse on EC machine learning methods in order to: 1. identify overarching themes emerging from empirical studies regarding EC methods, 2. determine their input data and decision-making strategies, and 3. evaluate relevant statistical procedures. Results: By comparing various approaches, we introduce a Relevance Vector Machine (RVM)-based ensemble learning method that can provide optimal solutions for problems such as the curse of dimensionality and the high dimensionality of the feature space without missing data values. Conclusions: To obtain robust performance and achieve better results, it is tactfully suggested to use multi-omics data integration, which has been shown to identify cancers and their subtypes more efficiently.


2021 ◽  
Vol 2042 (1) ◽  
pp. 012027
Author(s):  
Lorenzo Salmina ◽  
Roberto Castello ◽  
Justine Stoll ◽  
Jean-Louis Scartezzini

Abstract A timely identification of anomalous functioning in the energy system of an industrial building would increase the efficiency and resilience of the energy infrastructure, besides reducing economic waste. This work was inspired by the need to identify, for a series of supermarket buildings in Switzerland, the failures occurring in their heating systems over the years in an unsupervised fashion that is easy for building managers to visualize. The lack of any a priori label differentiating typical from anomalous behavior calls for unsupervised machine learning methods to extract the features relevant for describing the system operations, to reduce the dimension of the feature space, and to cluster similar patterns of operation together. The method is validated on a standard supermarket building, where it successfully discriminates winter and summer operations from periods of refurbishment or malfunctioning of the heating system.


Author(s):  
Trang T Le ◽  
Jason H Moore

Abstract Summary treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree’s leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students’ understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. Availability and implementation The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.


Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2344 ◽  
Author(s):  
Federico Pittino ◽  
Michael Puggl ◽  
Thomas Moldaschl ◽  
Christina Hirschl

Anomaly detection is becoming increasingly important to enhance reliability and resiliency in the Industry 4.0 framework. In this work, we investigate different methods for anomaly detection on in-production manufacturing machines taking into account their variability, both in operation and in wear conditions. We demonstrate how the nature of the available data, featuring any anomaly or not, is of importance for the algorithmic choice, discussing both statistical machine learning methods and control charts. We finally develop methods for automatic anomaly detection, which obtain a recall close to one on our data. Our developed methods are designed not to rely on a continuous recalibration and hand-tuning by the machine user, thereby allowing their deployment in an in-production environment robustly and efficiently.
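
The control-chart idea mentioned in the abstract above can be sketched roughly as follows. This is a minimal illustration of a generic Shewhart-style chart, not the authors' method: the function names, the 3-sigma multiplier `k`, and all readings are hypothetical values chosen for the example. Limits are estimated from a baseline assumed to be anomaly-free, and later readings outside those limits are flagged.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits as mean -/+ k sample standard deviations.

    `baseline` is assumed to contain only normal (anomaly-free) readings.
    """
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation (n - 1 denominator)
    return center - k * spread, center + k * spread


def flag_anomalies(readings, limits):
    """Return the indices of readings that fall outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(readings) if x < lower or x > upper]


# Hypothetical sensor values, made up purely for illustration.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
limits = control_limits(baseline)

new_readings = [10.1, 9.9, 12.5, 10.0, 7.0]
flagged = flag_anomalies(new_readings, limits)
print(flagged)  # indices of out-of-limit readings
```

Because the limits come only from normal data, a sketch like this needs no labelled anomalies, which is the situation the abstract describes when the available data contain no anomaly.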


2021 ◽  
Author(s):  
Rosa Lavelle-Hill ◽  
Anjali Mazumder ◽  
James Goulding ◽  
Gavin Smith ◽  
Todd Landman

Abstract 40 million people are estimated to be in some form of modern slavery across the globe. Understanding the factors that make any particular individual or geographical region vulnerable to such abuse is essential for the development of effective interventions and policy. Efforts to isolate and assess the importance of individual drivers statistically are impeded by two key challenges: data scarcity and high dimensionality. The hidden nature of modern slavery restricts available datapoints, and the large number of candidate variables that are potentially predictive of slavery inflates the feature space exponentially. The result is a highly problematic "small-n, large-p" setting, where overfitting and multi-collinearity can render more traditional statistical approaches inapplicable. Recent advances in non-parametric computational methods, however, offer scope to overcome such challenges. We present an approach that combines non-linear machine learning models and strict cross-validation methods with novel variable importance techniques, emphasising the importance of stability of model explanations via Rashomon-set analysis. This approach is used to model the prevalence of slavery in 48 countries, with results bringing to light important predictive factors - such as a country's capacity to protect the physical security of women, which has previously been under-emphasized in the literature. Out-of-sample estimates of slavery prevalence are then made for countries where no survey data currently exists.


2020 ◽  
Vol 2 (1) ◽  
pp. 54
Author(s):  
Rok Novak ◽  
David Kocman ◽  
Johanna Amalia Robinson ◽  
Tjaša Kanduč ◽  
Denis Sarigiannis ◽  
...  

Merging new sensing technologies with machine learning methods can serve as a tool to recognize complex activities. A wearable particulate matter (PM) sensor, in combination with a motion tracker, was provided to 97 individuals for 7 days in two seasons. These data sets were used in three different models, constructed by the classification of activity. Using algorithms IBk, J48 and RandomForest for hourly (minute) values, an accuracy of 31.0 (23.1)%, 28.6 (22.0)% and 35.7 (23.0)%, respectively, was achieved. Most misclassified instances concern vaguely defined activities. Low accuracy can also be explained by the differences in time scales. The accuracy could be improved by more clearly defining the activities and collecting per-minute data.

