Distributed Nonparametric and Semiparametric Regression on SPARK for Big Data Forecasting

2017 ◽  
Vol 2017 ◽  
pp. 1-13
Author(s):  
Jelena Fiosina ◽  
Maksims Fiosins

Forecasting with big datasets is a common but complicated task that cannot be handled by the well-known parametric linear regression model. Nonparametric and semiparametric methods, which enable forecasting by building nonlinear data models, are computationally intensive and lack the scalability to process big datasets in a reasonable time. We present distributed parallel versions of some nonparametric and semiparametric regression models. We use the MapReduce paradigm and describe the algorithms in terms of SPARK data structures to parallelize the calculations. The forecasting accuracy of the proposed algorithms is compared with that of the linear regression model, which is currently the only forecasting model with a parallel distributed realization in the SPARK framework for big data problems. The advantages of parallelizing the algorithms are also discussed. We validate our models with various numerical experiments: evaluating goodness of fit, analyzing how increasing dataset size influences computation time, and analyzing how computation time varies with the degree of parallelism (number of workers) in the distributed realization.
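The Nadaraya-Watson kernel estimator is a standard nonparametric regression model whose sums parallelize naturally in MapReduce style: each worker computes partial numerator and denominator sums over its partition, and a reduce step combines them. The stdlib-only sketch below illustrates the idea; the partition layout, bandwidth, and toy data are assumptions for illustration, not the paper's actual SPARK implementation.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2)

def partial_sums(partition, x0, h):
    """Map step: per-partition numerator/denominator of the
    Nadaraya-Watson estimator at query point x0."""
    num = sum(gaussian_kernel(x0 - x, h) * y for x, y in partition)
    den = sum(gaussian_kernel(x0 - x, h) for x, _ in partition)
    return num, den

def nw_estimate(partitions, x0, h):
    """Reduce step: sum the partial sums, then divide once."""
    parts = [partial_sums(p, x0, h) for p in partitions]   # map
    num = sum(n for n, _ in parts)                         # reduce
    den = sum(d for _, d in parts)
    return num / den

# Toy data: y = 2x sampled on [0, 2.9], split across three "workers".
data = [(x / 10.0, 2 * x / 10.0) for x in range(30)]
partitions = [data[0:10], data[10:20], data[20:30]]
print(round(nw_estimate(partitions, 1.5, 0.2), 2))   # → 3.0
```

Because the kernel sums are associative, the same combine step works for any number of partitions, which is what makes the estimator a good fit for SPARK-style aggregation.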

2017 ◽  
Vol 20 (K2) ◽  
pp. 117-125
Author(s):  
Tuan Hoang Le ◽  
Dung Anh To

Flood forecasting is a very important research topic in disaster prevention and reduction. Flood behaviour involves rather complex system dynamics under the influence of different meteorological factors, including both linear and non-linear patterns. Many novel forecasting methods have recently been proposed to improve forecasting accuracy. This paper explores the potential of semiparametric regression for modelling flood water levels and forecasting inundation of the Mekong Delta in Vietnam. Semiparametric regression combines a parametric regression approach with a non-parametric regression concept. In the model-building process, three alternative linear regression models are applied for the parametric component: stepwise multiple linear regression, partial least squares, and a multirecursive regression method. They capture the flood's linear characteristics. The nonparametric part is solved by a modified estimation of a smooth function. In addition, nonlinear regression models based on artificial neural networks capture the flood's non-linear characteristics and allow the model's non-parametric constituent to be smoothed easily and quickly. The last element is the model's error. The semiparametric regressions are then combined into an ensemble model based on principal component analysis. Flood water-level forecasts, with a lead time of one or more days, are made using a selected sequence of past water-level values and relevant factors observed at a specific location; time-series analysis is used to build the model. The empirical results indicate that predictions from the amended semiparametric regression ensemble model are generally better than those of the other models presented in this study under the same evaluation measures.
Our findings show that the estimation power of the modern statistical model is reliable and promising. The proposed model can serve as an alternative forecasting tool for floods, achieving better forecasting accuracy and further improving prediction quality.


2021 ◽  
Vol 28 (1) ◽  
pp. 78-83
Author(s):  
O. O. ONI ◽  
N. I. DIM ◽  
B. Y. ABUBAKAR ◽  
O. E. ASIRIBO

Data on the monthly egg production of a strain of Rhode Island chickens (500 breeder hens) were used to test the goodness of fit of six mathematical models: exponential, parabolic exponential, Wood's gamma type, the modified gamma type of McNally, inverse polynomial, and linear regression. Egg production was summarized for each hen into 28-d periods, starting from the day of first egg. The hens were classified into different production cycle lengths based on the number of 28-d periods. The models were fitted to the mean results obtained for periods within groups of hens. The effect of cycle length on goodness of fit was also examined separately for the 'best' three models with the highest R2 values. The egg production cycle (i.e. number of 28-d periods) varied from 9 to 15 periods, and the coefficients of determination (R2) from fitting the models to mean egg production data for groups of hens varied from 0.16 to 0.95. The results suggest that the 'best' three models fitted the 52-week laying records quite well, judging from their respective R2 values, which were higher for McNally (0.95) and parabolic exponential (0.93) than for Wood (0.75). Based on the goodness of fit to the 52-week production record, the McNally model gave the best results. However, its suitability for predicting full-year production from part-year records needs further investigation.
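Wood's gamma-type curve, y(t) = a·t^b·e^(−ct), is one of the models named above. It is commonly fitted after log-linearization, ln y = ln a + b ln t − c t, which reduces to a two-predictor least-squares problem. The stdlib sketch below recovers made-up parameters from exact synthetic data; the values a=50, b=0.3, c=0.05 are illustrative assumptions, not the paper's estimates.

```python
import math

def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    x = [0.0] * 3
    for i in reversed(range(3)):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def fit_wood(t, y):
    """Least-squares fit of ln y = ln a + b ln t - c t via normal equations."""
    X = [[1.0, math.log(ti), -ti] for ti in t]
    z = [math.log(yi) for yi in y]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xtz = [sum(r[i] * zi for r, zi in zip(X, z)) for i in range(3)]
    ln_a, b, c = solve3(XtX, Xtz)
    return math.exp(ln_a), b, c

# Recover known parameters from exact data over thirteen 28-d periods.
t = list(range(1, 14))
y = [50 * ti ** 0.3 * math.exp(-0.05 * ti) for ti in t]
a, b, c = fit_wood(t, y)
```

With noisy production data the same code gives the least-squares estimates on the log scale, from which an R2 against the observed means can then be computed.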


2021 ◽  
Vol 46 (1) ◽  
Author(s):  
C. E. Chigbundu ◽  
K. O. Adebowale

Dyes are complex and sensitive organic chemicals that expose microbial populations, aquatic life, and other living organisms to toxic effects if their presence in water bodies or industrial effluents is not properly handled. This work therefore comparatively studied the adsorption efficiencies of a natural raw kaolinite (NRK) clay adsorbent and a dimethyl sulphoxide (DMSO) facilely intercalated kaolinite clay (DIK) adsorbent for the batch adsorption of Basic Red 2 (BR2) dye. The impact of varying contact time, temperature, and other operating variables on adsorption was also considered. The two adsorbents were characterized using SEM images, FTIR spectra, and XRD patterns. Linear and non-linear regression analyses of different isotherm and kinetic models were used to identify the appropriate fits to the experimental data, and error analysis equations were used to measure goodness of fit. The Langmuir isotherm model best described the adsorption as monolayer on homogeneous surfaces, while kinetic studies showed that the Elovich model provides the best fit to the experimental data. The adsorption capacities of the NRK and DIK adsorbents for the uptake of BR2 were 16.30 mg/g and 32.81 mg/g, respectively (linear regression), and 19.30 mg/g and 30.81 mg/g, respectively (non-linear regression). The thermodynamic parameter ΔG showed that BR2 dye adsorption onto the adsorbents was spontaneous. The DIK adsorbent was twice as efficient as NRK for the uptake of BR2 dye.


2019 ◽  
Author(s):  
Soumya Banerjee

Modelling and forecasting port throughput enables stakeholders to make efficient decisions ranging from the management of port development to infrastructure investment, operational restructuring, and tariff policy. Accurate forecasting of port throughput is also critical for long-term resource allocation and short-term strategic planning. In turn, efficient decision-making enhances the competitiveness of a port. However, in the era of big data we face the enviable dilemma of having too much information. We pose the question: is more information always better for forecasting? We suggest that more information comes at the cost of more parameters of the forecasting model that need to be estimated. We compare multiple forecasting models of varying degrees of complexity and quantify the effect of the amount of data on model forecasting accuracy. Our methodology serves as a guideline for practitioners in this field. We also urge caution: even in the era of big data, more information may not always be better. Analysts are advised to weigh the cost of adding more data; the ultimate decision depends on the problem, the amount of data, and the kind of models being used.


Author(s):  
Sudaryanto Sudaryanto ◽  
Jery Courvisanos ◽  
Alif Puji Rahayu

Objective - The purpose of this study is to identify the influence of similarity, reputation, perceived risk, and innovation, as brand-extension dimensions of smartphones developed by Samsung, on brand equity. Methodology/Technique - This study uses explanatory research. The population consists of consumers who have used Samsung Galaxy mobiles for at least one month. Questionnaires, consisting of closed statements measured on a Likert scale, were delivered to the respondents after passing validity and reliability tests. The data were then analysed using multiple linear regression, and the classical assumption tests were conducted to determine the goodness of fit of the model. Findings - The results show that similarity, reputation, perceived risk, and innovation have a significant effect on the brand equity of Samsung Galaxy mobiles. Type of Paper: Empirical. Keywords: Brand Extension; Brand Equity; Similarity; Reputation; Perceived Risk; Innovation; Explanatory Research. JEL Classification: M3, M30, M39.


Web Services ◽  
2019 ◽  
pp. 314-331 ◽  
Author(s):  
Sema A. Kalaian ◽  
Rafa M. Kasim ◽  
Nabeel R. Kasim

Data analytics and modeling are powerful analytical tools for knowledge discovery: they examine and capture the complex, hidden relationships and patterns among quantitative variables in massive structured Big Data in order to predict future enterprise performance. The main purpose of this chapter is to present a conceptual and practical overview of some basic and advanced analytical tools for analyzing structured Big Data. The chapter covers both descriptive and predictive analytical methods. Descriptive tools such as the mean, median, mode, variance, standard deviation, and data visualization methods (e.g., histograms, line charts) are covered, as are predictive tools for analyzing Big Data such as correlation and simple and multiple linear regression.


Author(s):  
Saranya N. ◽  
Saravana Selvam

After an era of struggling with data collection, the issue has now become how to process these vast amounts of information. Scientists and researchers consider Big Data to be probably the most essential topic in computing science today. Big Data describes huge volumes of data that can exist in any structure, which makes it difficult for standard approaches to mine the right information from such large data sets. Classification in Big Data is a procedure for summarizing data sets based on various patterns, and there are distinct classification frameworks that help us classify data collections. Among the methods discussed in the chapter are Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The goal of this chapter is to provide a comprehensive evaluation of the classification methods that are commonly utilized.
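As one example of the listed methods, a minimal k-nearest-neighbours (KNN) classifier fits in a few lines of stdlib Python. The toy 2-D data set below is invented for illustration:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """k-nearest-neighbour majority vote using Euclidean distance."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated classes in the plane.
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]
print(knn_classify(train, (1.1, 1.0)))   # → A
```

At Big Data scale the sort over all training points becomes the bottleneck, which is why the chapter's frameworks rely on indexing or distributed partitioning rather than this brute-force form.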

