Large-Scale Malicious Software Classification with Fuzzified Features and Boosted Fuzzy Random Forest

Data imputation algorithms for mixed variable types in large scale educational assessment: a comparison of random forest, multivariate imputation using chained equations, and MICE with recursive partitioning

International Journal of Quantitative Research in Education ◽

10.1504/ijqre.2016.077803 ◽

2016 ◽

Vol 3 (3) ◽

pp. 129 ◽

Cited By ~ 2

Author(s):

W. Holmes Finch ◽

Maria E. Hernandez Finch ◽

Melissa Singh

Keyword(s):

Random Forest ◽

Large Scale ◽

Recursive Partitioning ◽

Educational Assessment ◽

Data Imputation ◽

Mixed Variable

Download Full-text

A MapReduce-Based Parallel Random Forest Approach for Predicting Large-Scale Protein-Protein Interactions

Intelligent Computing Methodologies - Lecture Notes in Computer Science ◽

10.1007/978-3-030-60796-8_34 ◽

2020 ◽

pp. 400-407

Author(s):

Bo-Ya Ji ◽

Zhu-Hong You ◽

Long Yang ◽

Ji-Ren Zhou ◽

Peng-Wei Hu

Keyword(s):

Random Forest ◽

Protein Interactions ◽

Large Scale ◽

Protein Protein Interactions

Download Full-text

A Hybrid Fuzzy Random Forest Algorithm Using Harmonic Search Algorithm for Parameter Tuning

10.1109/aeeca52519.2021.9574159 ◽

2021 ◽

Author(s):

Ying Yu ◽

Feng Yu ◽

Yanru He ◽

Wenwen Yang

Keyword(s):

Random Forest ◽

Search Algorithm ◽

Parameter Tuning ◽

Random Forest Algorithm ◽

Fuzzy Random

Download Full-text

Downscaling of GRACE-Derived Groundwater Storage Based on the Random Forest Model

Remote Sensing ◽

10.3390/rs11242979 ◽

2019 ◽

Vol 11 (24) ◽

pp. 2979 ◽

Cited By ~ 3

Author(s):

Li Chen ◽

Qisheng He ◽

Kun Liu ◽

Jinyang Li ◽

Chenlin Jing

Keyword(s):

Random Forest ◽

Spatial Resolution ◽

Large Scale ◽

Snow Water Equivalent ◽

Water Storage ◽

Research Area ◽

Groundwater Storage ◽

Coarse Spatial Resolution ◽

Long Time ◽

Local Water

Groundwater is an important part of water storage and one of the important sources of agricultural irrigation, urban living, and industrial water use. The recent launch of Gravity Recovery and Climate Experiment (GRACE) Satellite has provided a new way for studying large-scale water storage. The application of GRACE in local water resources has been greatly limited because of the coarse spatial resolution, and low temporal resolution. Therefore, it is of great significance to improve the spatial resolution of groundwater storage for regional water management. Based on the method of random forest (RF), this study combined six hydrological variables, including precipitation, evapotranspiration, runoff, soil moisture, snow water equivalent, and canopy water to conduct downscaling study, aiming at downscaling the resolution of the total water storage and groundwater storage from 1° (110 km) and to 0.25° (approximately 25 km). The results showed that, from the perspective of long time series, the prediction results of the RF model are ideal in the whole research area and the observations wells area. From the perspective of space, the detailed changes of water storage could be captured in greater detail after downscaling. The verification results show that, on the monthly scale and annual scale, the correlation between the downscaling results and the observation wells is 0.78 and 0.94, respectively, and they both reach the confidence level of 0.01. Therefore, the RF downscaling model has great potential for predicting groundwater storage.

Download Full-text

Modeling Posidonia oceanica shoot density and rhizome primary production

Scientific Reports ◽

10.1038/s41598-020-73722-9 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Elena Catucci ◽

Michele Scardi

Keyword(s):

Random Forest ◽

Primary Production ◽

Large Scale ◽

Posidonia Oceanica ◽

Shoot Density ◽

Large Scale Assessment ◽

Goods And Services ◽

Predictive Variables ◽

Management Perspective ◽

Predicted Values

Abstract Posidonia oceanica meadows rank among the most important and most productive ecosystems in the Mediterranean basin, due to their ecological role and to the goods and services they provide. Estimations of crucial ecological process such as meadows productivity could play a major role in an environmental management perspective and in the assessment of P. oceanica ecosystem services. In this study, a Machine Learning approach, i.e. Random Forest, was aimed at modeling P. oceanica shoot density and rhizome primary production using as predictive variables only environmental factors retrieved from indirect measurements, such as maps. Our predictive models showed a good level of accuracy in modeling both shoot density and rhizome productivity (R2 = 0.761 and R2 = 0.736, respectively). Furthermore, as shoot density is an essential parameter in the estimation of P. oceanica productivity, we proposed a cascaded approach aimed at estimating the latter using predicted values of shoot density rather than observed measurements. In spite of the complexity of the problem, the cascaded Random Forest performed quite well (R2 = 0.637). While direct measurements will always play a fundamental role, our estimates could support large scale assessment of the expected condition of P. oceanica meadows, providing valuable information about the way this crucial ecosystem works.

Download Full-text

Comparison of Machine Learning Methods for Android Malicious Software Classification based on System Call

2019 International Conference on Sustainable Information Engineering and Technology (SIET) ◽

10.1109/siet48054.2019.8985998 ◽

2019 ◽

Author(s):

Mochammad Anshori ◽

Farhanna Mar'i ◽

Fitra A. Bachtiar

Keyword(s):

Machine Learning ◽

System Call ◽

Learning Methods ◽

Malicious Software ◽

Machine Learning Methods ◽

Software Classification

Download Full-text

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data

Journal of Clinical and Translational Science ◽

10.1017/cts.2019.8 ◽

2019 ◽

Vol 3 (s1) ◽

pp. 2-2

Author(s):

Megan C Hollister ◽

Jeffrey D. Blume

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Random Forest ◽

Gene Expression Data ◽

Large Scale ◽

Second Generation ◽

A Priori ◽

Expression Data ◽

P Values ◽

Machine Learning Methods

OBJECTIVES/SPECIFIC AIMS: To examine and compare the claims in Bzdok, Altman, and Brzywinski under a broader set of conditions by using unbiased methods of comparison. To explore how to accurately use various machine learning and traditional statistical methods in large-scale translational research by estimating their accuracy statistics. Then we will identify the methods with the best performance characteristics. METHODS/STUDY POPULATION: We conducted a simulation study with a microarray of gene expression data. We maintained the original structure proposed by Bzdok, Altman, and Brzywinski. The structure for gene expression data includes a total of 40 genes from 20 people, in which 10 people are phenotype positive and 10 are phenotype negative. In order to find a statistical difference 25% of the genes were set to be dysregulated across phenotype. This dysregulation forced the positive and negative phenotypes to have different mean population expressions. Additional variance was included to simulate genetic variation across the population. We also allowed for within person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values. RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels via a machine learning algorithm outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics and they depend heavily on the a priori chosen number of dysregulated genes. DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this and their additional complexity, machine learning approaches would not be preferable. Of all the approaches the second-generation p-value appears to offer significant benefit for the cost of a priori defining a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.

Download Full-text

Tracking the Land Use/Land Cover Change in an Area with Underground Mining and Reforestation via Continuous Landsat Classification

Remote Sensing ◽

10.3390/rs11141719 ◽

2019 ◽

Vol 11 (14) ◽

pp. 1719 ◽

Cited By ~ 7

Author(s):

Jiaxin Mi ◽

Yongjun Yang ◽

Shaoliang Zhang ◽

Shi An ◽

Huping Hou ◽

...

Keyword(s):

Land Use ◽

Random Forest ◽

Land Cover ◽

Large Scale ◽

Underground Mining ◽

Mining Area ◽

Random Forest Classifier ◽

Land Use Land Cover ◽

Lulc Change ◽

Mining Areas

Understanding the changes in a land use/land cover (LULC) is important for environmental assessment and land management. However, tracking the dynamic of LULC has proved difficult, especially in large-scale underground mining areas with extensive LULC heterogeneity and a history of multiple disturbances. Additional research related to the methods in this field is still needed. In this study, we tracked the LULC change in the Nanjiao mining area, Shanxi Province, China between 1987 and 2017 via random forest classifier and continuous Landsat imagery, where years of underground mining and reforestation projects have occurred. We applied a Savitzky–Golay filter and a normalized difference vegetation index (NDVI)-based approach to detect the temporal and spatial change, respectively. The accuracy assessment shows that the random forest classifier has a good performance in this heterogeneous area, with an accuracy ranging from 81.92% to 86.6%, which is also higher than that via support vector machine (SVM), neural network (NN), and maximum likelihood (ML) algorithm. LULC classification results reveal that cultivated forest in the mining area increased significantly after 2004, while the spatial extent of natural forest, buildings, and farmland decreased significantly after 2007. The areas where vegetation was significantly reduced were mainly because of the transformation from natural forest and shrubs into grasslands and bare lands, respectively, whereas the areas with an obvious increase in NDVI were mainly because of the conversion from grasslands and buildings into cultivated forest, especially when villages were abandoned after mining subsidence. A partial correlation analysis demonstrated that the extent of LULC change was significantly related to coal production and reforestation, which indicated the effects of underground mining and reforestation projects on LULC changes. This study suggests that continuous Landsat classification via random forest classifier could be effective in monitoring the long-term dynamics of LULC changes, and provide crucial information and data for the understanding of the driving forces of LULC change, environmental impact assessment, and ecological protection planning in large-scale mining areas.

Download Full-text

Statistical Downscaling of daily extreme Sea Level with Random Forest: Examples from South-East Asia and the Baltic Sea

10.5194/egusphere-egu2020-10082 ◽

2020 ◽

Author(s):

Svenja Bierstedt ◽

Eduardo Zorita ◽

Birgit Hünicke

Keyword(s):

Random Forest ◽

Baltic Sea ◽

East Asia ◽

Sea Level ◽

Large Scale ◽

South East Asia ◽

The Baltic Sea ◽

Sea Levels ◽

Extreme Sea Levels ◽

The Baltic

<p>The coastlines of the Baltic Sea and Indonesia are both relatively complex, so that the estimation of extreme sea levels caused by the atmospheric forcing becomes complex with conventional methods. Here, we explore whether Machine Learning methods can provide a model surrogate to compute more rapidly daily extremes in sea level from large-scale atmosphere-ocean fields. We investigate the connections between the atmospheric and ocean drivers of local extreme sea level in South East Asia and along the Baltic Sea based on statistical analysis by Random Forest Models, driven by large-scale meteorological predictors and daily extreme sea level measured by tide-gauge records over the last few decades.</p><p>First results show that in some Indonesian areas extremes are driven by large-scale climate fields; in other areas they are incoherently driven by local processes. An area where random forest predicted extremes show good correspondence to observed extremes is found to be the Malaysian coastline. For the Indonesian coasts, the Random Forest Algorithm was unable to predict extreme sea levels in line with observations. Along the Baltic Sea, in contrast, the Random Forest model is able to produce reasonable estimations of extreme sea levels based on the large-scale atmospheric fields. An analysis of the interrelations of extreme sea levels in the South Asia regions suggests that either the data quality may be compromised in some regions or that other forcing factors, distinct from the large-scale atmospheric fields, may also be involved.</p>

Download Full-text

An Efficient Retinal Vessels Biometric Recognition System by Using Multi-Scale Local Binary Pattern Descriptor

Journal of Medical Imaging and Health Informatics ◽

10.1166/jmihi.2020.3180 ◽

2020 ◽

Vol 10 (10) ◽

pp. 2481-2489

Author(s):

Muhammad Sheraz Arshad Malik ◽

Qoseen Zahra ◽

Imran Ullah Khan ◽

Muhammad Awais ◽

Gang Qiao

Keyword(s):

Feature Extraction ◽

Random Forest ◽

Efficient Method ◽

Large Scale ◽

Local Binary Pattern ◽

Evaluation Criteria ◽

Recognition System ◽

Suggested Approach ◽

Multi Scale ◽

Biometric Systems

Biometric systems are technically used for human recognition by identifying the unique features of an individual. Many security issues are found related to biometric systems such as voice, fingerprints, face, iris, signatures, etc., but the retina is a unique and efficient method to identify valid one. The aim of this paper is provided with an efficient method to recognize someone based on unique retina features. A proposed system based on retinal blood vessel pattern by using multi-scale local binary pattern (MSLBP) and random forest (Bagging tree) as feature extraction and classification. MSLBP is an efficient method to extracted features at six scales perpixel level, earlier work found the deficiency based on simple binary pattern with coverage of small areas and per-pixel level in the surrounding. MSLBP and random forest classifier suggested approach use for improving usability, perceivability, and sensitivity on large scale areas. It is the fastest method to get features accurately in an efficient way at every level of pixels. This method based on deep learning evaluation (criteria) parameter selection that provides more significant influence with sharp feature extraction on large scale areas based on seconds and improves the efficiency of images. MSLBP overcomes the problem of image sizing, pixel levels and efficiently provide accurate results.

Download Full-text