Big-But-Biased Data Analytics for Air Quality

Laura Borrajo; Ricardo Cao

doi:10.3390/electronics9091551

Big-But-Biased Data Analytics for Air Quality

Electronics ◽

10.3390/electronics9091551 ◽

2020 ◽

Vol 9 (9) ◽

pp. 1551

Author(s):

Laura Borrajo ◽

Ricardo Cao

Keyword(s):

Air Quality ◽

Data Analytics ◽

Mean Squared Error ◽

Smart Cities ◽

Real Data ◽

Cumulative Distribution ◽

Simple Random Sample ◽

Urban Air ◽

Data Set ◽

The Mean

Air pollution is one of the major concerns for smart cities. This paper studies how big data analytics can be applied under sampling bias in the context of urban air quality. A nonparametric estimator that incorporates kernel density estimation is used. When the biasing weight function is unknown, a small simple random sample of the real population is assumed to be additionally observed. The general parameter considered is the mean of a transformation of the random variable of interest. A new bootstrap algorithm is used to approximate the mean squared error of the new estimator, and minimizing this error leads to an automatic bandwidth selector. The method is applied to a real data set concerning the levels of different pollutants in the urban air of the city of A Coruña (Galicia, NW Spain). Estimates of the mean and the cumulative distribution function of the levels of ozone and nitrogen dioxide when the temperature is greater than or equal to 30 °C are obtained from 15 years of biased data.
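The idea of correcting a large biased sample with a small simple random sample can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it reweights each biased observation by the ratio of two Gaussian kernel density estimates (population density from the small sample, biased density from the large one), which is one plausible form of such an estimator. The function names gaussian_kde, rule_of_thumb and debiased_mean are hypothetical, the simulated normal samples stand in for real pollutant data, and a rule-of-thumb bandwidth replaces the bootstrap selector mentioned in the abstract.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate built from `sample`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


def rule_of_thumb(sample):
    """Rule-of-thumb bandwidth (assumption: stands in for a data-driven selector)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def debiased_mean(biased, srs, u=lambda x: x):
    """Estimate E[u(X)] from a large biased sample and a small simple random sample."""
    f_hat = gaussian_kde(srs, rule_of_thumb(srs))        # population density estimate
    g_hat = gaussian_kde(biased, rule_of_thumb(biased))  # biased-sample density estimate
    # The ratio f_hat / g_hat is proportional to an estimate of 1 / weight function.
    ratios = [f_hat(y) / g_hat(y) for y in biased]
    return sum(r * u(y) for r, y in zip(ratios, biased)) / sum(ratios)


# Simulated setting (assumption): the population is N(0, 1) and the weight
# function is proportional to exp(x), so the biased sample follows N(1, 1).
rng = random.Random(0)
srs = [rng.gauss(0.0, 1.0) for _ in range(200)]
biased = [rng.gauss(1.0, 1.0) for _ in range(1000)]

naive = statistics.mean(biased)         # ignores the bias, close to 1
corrected = debiased_mean(biased, srs)  # targets the population mean 0
print(f"naive mean: {naive:.3f}, corrected mean: {corrected:.3f}")
```

Passing a different u, for example an indicator such as lambda x: 1.0 if x <= t else 0.0 for a fixed threshold t, turns the same function into an estimate of the cumulative distribution function at t.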

Evaluation for estimating of the PDF and the CDF of Generalized Inverted Exponential Distribution with Application in Industry

Advances in Mathematics: Scientific Journal ◽

10.37418/amsj.9.1.39 ◽

2020 ◽

pp. 507-522

Author(s):

Parisa Torkaman

Keyword(s):

Least Squares ◽

Exponential Distribution ◽

Mean Squared Error ◽

Weighted Least Squares ◽

Real Data ◽

Minimum Variance ◽

Cumulative Distribution ◽

Estimation Methods ◽

Data Set ◽

Better Than

The generalized inverted exponential distribution is introduced as a lifetime model with good statistical properties. In this paper, the estimation of the probability density function and the cumulative distribution function of this distribution is considered using five different methods: uniformly minimum variance unbiased (UMVU), maximum likelihood (ML), least squares (LS), weighted least squares (WLS) and percentile (PC) estimators. The performance of these estimation procedures is compared by numerical simulation, based on the mean squared error (MSE). The simulation studies show that the UMVU estimator performs better than the others and that, when the sample size is large enough, the ML and UMVU estimators are almost equivalent and more efficient than LS, WLS and PC. Finally, the results for a real data set are analyzed.
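For reference, the two functions being estimated can be written out. The abstract does not state the parameterization, so the form below, with shape parameter \alpha and scale parameter \lambda, is an assumption taken from the usual definition of the generalized inverted exponential distribution.

```latex
F(x;\alpha,\lambda) = 1 - \left(1 - e^{-\lambda/x}\right)^{\alpha},
\qquad
f(x;\alpha,\lambda) = \frac{\alpha\lambda}{x^{2}}\, e^{-\lambda/x}
  \left(1 - e^{-\lambda/x}\right)^{\alpha-1},
\qquad x > 0,\ \alpha > 0,\ \lambda > 0.
```

Under this form, a plug-in estimator such as ML replaces \alpha and \lambda in F and f with their estimates.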

Estimation of density and distribution functions of a Burr X distribution

10.47302/jsr.2018520103 ◽

2018 ◽

Vol 52 (1) ◽

pp. 43-59

Author(s):

AMULYA KUMAR MAHTO ◽

YOGESH MANI TRIPATH ◽

SANKU DEY

Keyword(s):

Mean Squared Error ◽

Real Data ◽

Distribution Functions ◽

Least Square ◽

Efficient Estimation ◽

Cumulative Distribution ◽

Estimation Methods ◽

Unbiased Estimation ◽

Least Square Estimation ◽

Data Set

The Burr type X distribution is a member of the Burr family, which was originally derived by Burr (1942), and can be used quite effectively in modelling strength data as well as general lifetime data. In this article, we consider efficient estimation of the probability density function (PDF) and cumulative distribution function (CDF) of the Burr X distribution. Eight different estimation methods are considered, namely maximum likelihood estimation, uniformly minimum variance unbiased estimation, least square estimation, weighted least square estimation, percentile estimation, maximum product estimation, Cramér-von Mises estimation and Anderson-Darling estimation. Analytic expressions for bias and mean squared error are derived. Monte Carlo simulations are performed to compare the performance of the proposed estimation methods for both small and large samples. Finally, a real data set is analyzed for illustrative purposes.

New Lindley Half Cauchy Distribution: Theory and Applications

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d4734.119420 ◽

2020 ◽

Vol 9 (4) ◽

pp. 1-7

Keyword(s):

Maximum Likelihood ◽

Real Data ◽

Cauchy Distribution ◽

Least Square ◽

Cumulative Distribution ◽

Likelihood Method ◽

Estimation Methods ◽

Model Parameters ◽

Data Set ◽

Von Mises

In this paper, we define a new two-parameter distribution, the new Lindley half Cauchy (NLHC) distribution, using the Lindley-G family of distributions, which accommodates increasing, decreasing and a variety of monotone failure rates. The statistical properties of the proposed distribution, such as the probability density function, cumulative distribution function, quantiles, and measures of skewness and kurtosis, are presented. We briefly describe three well-known estimation methods, namely maximum likelihood estimation (MLE), least-squares estimation (LSE) and the Cramér-von Mises (CVM) method. All computations are performed in R. Using the maximum likelihood method, we construct asymptotic confidence intervals for the model parameters. We verify empirically the potential of the new distribution in modeling a real data set.

Inaccuracies with Multimodel Postprocessing Methods Involving Weighted, Regression-Corrected Forecasts

Monthly Weather Review ◽

10.1175/mwr-d-15-0204.1 ◽

2016 ◽

Vol 144 (4) ◽

pp. 1649-1668 ◽

Cited By ~ 13

Author(s):

Daniel Hodyss ◽

Elizabeth Satterfield ◽

Justin McLay ◽

Thomas M. Hamill ◽

Michael Scheuerer

Keyword(s):

Bayesian Model Averaging ◽

Mean Squared Error ◽

Low Cost ◽

Model Averaging ◽

Real Data ◽

Direct Application ◽

Weighted Regression ◽

Ensemble Forecasts ◽

Wave Heights ◽

The Mean

Ensemble postprocessing is frequently applied to correct biases and deficiencies in the spread of ensemble forecasts. Methods involving weighted, regression-corrected forecasts address the typical biases and underdispersion of ensembles through a regression correction of ensemble members followed by the generation of a probability density function (PDF) from the weighted sum of kernels fit around each corrected member. The weighting step accounts for the situation where the ensemble is constructed from different model forecasts or generated in some way that creates ensemble members that do not represent equally likely states. In the present work, it is shown that an overweighting of climatology in weighted, regression-corrected forecasts can occur when one first performs a regression-based correction before weighting each member. This overweighting of climatology results in an increase in the mean-squared error of the mean of the predicted PDF. The overweighting of climatology is illustrated in a simulation study and a real-data study, where the reference is generated through a direct application of Bayes’s rule. The real-data example is a comparison of a particular method referred to as Bayesian model averaging (BMA) and a direct application of Bayes’s rule for ocean wave heights using U.S. Navy and National Weather Service global deterministic forecasts. This direct application of Bayes’s rule is shown to not overweight climatology and may be a low-cost replacement for the generally more expensive weighted, regression-correction methods.
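The construction of a predictive PDF from weighted, regression-corrected members can be illustrated with a short sketch. It assumes Gaussian kernels with a common spread and a linear correction shared by all members; the forecast values, weights, regression coefficients and function names are hypothetical and are not taken from the paper.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def corrected_members(forecasts, intercept, slope):
    """Apply a linear regression correction to each raw ensemble member."""
    return [intercept + slope * f for f in forecasts]


def predictive_pdf(y, centers, weights, sd):
    """Weighted sum of Gaussian kernels centred on the corrected members."""
    return sum(w * normal_pdf(y, c, sd) for w, c in zip(weights, centers))


def predictive_mean(centers, weights):
    """Mean of the mixture: the weighted average of the kernel centres."""
    return sum(w * c for w, c in zip(weights, centers))


forecasts = [2.0, 3.0, 4.0]  # hypothetical raw forecasts of wave height (m)
weights = [0.5, 0.3, 0.2]    # hypothetical member weights, summing to 1
centers = corrected_members(forecasts, intercept=0.5, slope=0.8)
mean = predictive_mean(centers, weights)
print(f"predictive mean: {mean:.2f} m")
```

The mean of the predicted PDF, whose mean-squared error the abstract discusses, is simply the weighted average of the corrected members, so it depends on both the regression step and the weighting step.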

Estimation of Population Mean in Chain Ratio-Type Estimator under Systematic Sampling

Journal of Probability and Statistics ◽

10.1155/2015/248374 ◽

2015 ◽

Vol 2015 ◽

pp. 1-5 ◽

Cited By ~ 4

Author(s):

Mursala Khan ◽

Rajesh Singh

Keyword(s):

Finite Population ◽

Real Data ◽

Survey Sampling ◽

Systematic Sampling ◽

Auxiliary Variables ◽

Data Set ◽

Population Mean ◽

First Order ◽

The Mean ◽

A Chain

A chain ratio-type estimator is proposed for estimating the finite population mean under a systematic sampling scheme using two auxiliary variables. The mean square error of the proposed estimator is derived up to the first order of approximation and is compared with that of other relevant existing estimators. To illustrate the performance of the different estimators in comparison with the usual simple estimator, we use a real data set from the survey sampling literature.

Modified Ridge Regression Estimator with the Application of Peanut Production in Pakistan

Asian Journal of Advanced Research and Reports ◽

10.9734/ajarr/2019/v7i230172 ◽

2019 ◽

pp. 1-8

Author(s):

Asifa Mubeen ◽

Nasir Jamal ◽

Muhammad Hanif ◽

Usman Shahzad

Keyword(s):

Ridge Regression ◽

Real Data ◽

Production Data ◽

Regression Estimator ◽

Mean Square ◽

Data Set ◽

Regression Estimators ◽

Peanut Production ◽

Ridge Regression Estimator ◽

The Mean

The main objective of the present study was to develop a new ridge regression estimator and to fit the ridge regression model to peanut production data from Pakistan. The data cover peanut production and the growth rate of Pakistan. The properties of the proposed estimator are discussed, and its mean square error is compared with that of some existing ridge regression estimators. The real peanut production data set is used to assess the performance of the proposed and existing estimators. Numerical results on the real data set show that the proposed ridge regression estimator gives the best results among the estimators reviewed.

Regression Estimator Using Double Ranked Set Sampling

Sultan Qaboos University Journal for Science [SQUJS] ◽

10.24200/squjs.vol7iss2pp311-322 ◽

2002 ◽

Vol 7 (2) ◽

pp. 311

Author(s):

Hani M. Samawi ◽

Eman M. Tawalbeh

Keyword(s):

Correlation Coefficient ◽

Real Data ◽

Simple Random Sampling ◽

Auxiliary Variable ◽

Ranked Set Sampling ◽

Regression Estimator ◽

Primary Analysis ◽

Data Set ◽

Population Mean ◽

The Mean

The performance of a regression estimator based on the double ranked set sample (DRSS) scheme, introduced by Al-Saleh and Al-Kadiri (2000), is investigated when the mean of the auxiliary variable X is unknown. Our primary analysis and simulation indicate that using the DRSS regression estimator to estimate the population mean substantially increases relative efficiency compared to the regression estimator based on simple random sampling (SRS) or the ranked set sampling (RSS) regression estimator (Yu and Lam, 1997). Moreover, the regression estimator using DRSS is also more efficient than the naïve estimators of the population mean using SRS, RSS (when the correlation coefficient is at least 0.4) and DRSS for a high correlation coefficient (at least 0.91). The theory is illustrated using a real data set of trees.

Closed form expressions for moments of the beta Weibull distribution

Anais da Academia Brasileira de Ciências ◽

10.1590/s0001-37652011000200002 ◽

2011 ◽

Vol 83 (2) ◽

pp. 357-373 ◽

Cited By ~ 19

Author(s):

Gauss M Cordeiro ◽

Alexandre B Simas ◽

Borko D Stošic

Keyword(s):

Weibull Distribution ◽

Closed Form ◽

Cumulative Distribution Function ◽

Information Matrix ◽

Likelihood Estimation ◽

Real Data ◽

Cumulative Distribution ◽

Data Set ◽

Expected Information ◽

Beta Weibull Distribution

The beta Weibull distribution was first introduced by Famoye et al. (2005) and studied by these authors and Lee et al. (2007). However, they do not give explicit expressions for the moments. In this article, we derive explicit closed form expressions for the moments of this distribution, which generalize results available in the literature for some sub-models. We also obtain expansions for the cumulative distribution function and Rényi entropy. Further, we discuss maximum likelihood estimation and provide formulae for the elements of the expected information matrix. We also demonstrate the usefulness of this distribution on a real data set.

Stacking Regression Algorithms to Predict PM2.5 in the Smart City Using Internet of Things

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200628094351 ◽

2020 ◽

Vol 13 ◽

Cited By ~ 1

Author(s):

Alisha Banga ◽

Ravinder Ahuja ◽

Subhash Chander Sharma

Keyword(s):

Air Quality ◽

Internet Of Things ◽

Urban Areas ◽

Smart Cities ◽

Absolute Error ◽

Dew Point ◽

Beijing City ◽

Data Set ◽

Regression Algorithms ◽

Ensemble Algorithms

Background and Objective: As populations in urban areas grow, pollution increases as well. Air pollution is one of the challenging environmental issues in smart cities. Real-time monitoring of air quality can help the administration take appropriate decisions on time. The development of Internet of Things based sensors has changed the way air quality is monitored. Methods: In this paper, we apply a two-stage regression. In the first stage, ten regression algorithms (Decision Tree, Random Forest, Elastic Net, AdaBoost, Extra Tree, Linear Regression, Lasso, XGBoost, Light GBM, and Multi-Layer Perceptron) are applied, and in the second stage the best four algorithms are picked and a stacking ensemble is applied, using Python, to predict PM2.5 pollutants in the air. Data sets from five Chinese cities (Beijing, Chengdu, Guangzhou, Shanghai, and Shenyang) are considered and compared based on MAE (Mean Absolute Error), RMSE (Root Mean Square Error), and R2. Results and Conclusion: We observed that, of the ten regression algorithms applied, the extra tree algorithm gives the highest performance on all five data sets, and stacking further improves the performance. Feature importance for Shenyang and Beijing is computed using three regression algorithms, and we found that the four most important features are humidity, wind speed, wind direction, and dew point.

The complementary Poisson-Lindley class of distributions

International Journal of Advanced Statistics and Probability ◽

10.14419/ijasp.v3i2.4624 ◽

2015 ◽

Vol 3 (2) ◽

pp. 146 ◽

Cited By ~ 1

Author(s):

Amal Hassan ◽

Salwa Assar ◽

Kareem Ali

Keyword(s):

Information Matrix ◽

Real Data ◽

Formal Proof ◽

Cumulative Distribution ◽

Likelihood Method ◽

Unknown Parameters ◽

Data Set ◽

Lifetime Distributions ◽

New Class ◽

Rate Functions

This paper proposes a new general class of continuous lifetime distributions, which is complementary to the Poisson-Lindley family proposed by Asgharzadeh et al. [3]. The new class is derived by compounding the maximum of a random number of independent and identically distributed continuous random variables with the Poisson-Lindley distribution. Several properties of the proposed class are discussed, including a formal proof of the probability density, cumulative distribution, reliability and hazard rate functions. The unknown parameters are estimated by the maximum likelihood method, and the elements of the Fisher information matrix are determined. Some sub-models of this class are investigated and studied in some detail. Finally, a real data set is analyzed to illustrate the performance of the new distributions.
