Local Processing of Massive Databases with R: A National Analysis of a Brazilian Social Programme

Stats ◽  
2020 ◽  
Vol 3 (4) ◽  
pp. 444-464
Author(s):  
Hellen Paz ◽  
Mateus Maia ◽  
Fernando Moraes ◽  
Ricardo Lustosa ◽  
Lilia Costa ◽  
...  

The analysis of massive databases is a key issue for most applications today, and parallel computing is one of the suitable approaches for it. Apache Spark is a widely employed tool in this context, aimed at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools. Despite its growth in recent years, it still has limitations for processing large volumes of data on single local machines. In general, the data analysis community has difficulty handling massive amounts of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is to combine both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP, a conditional cash transfer programme), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social programme acts in different cities, as well as to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed using random forest to predict the utilization rate of BFP. Variable selection was performed through a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model presented high predictive performance with 17 selected variables, and indicated high importance of some variables for the observed utilization rate in income, education, job informality, and inactive youth, namely: family income, education, occupation and density of people in the homes. In this work, using a local machine, we highlighted the potential of combining Spark and R for the analysis of a large database of 111.6 GB. This can serve as a proof of concept or reference for other similar works within the Statistics community, and our case study can provide important evidence for further analysis of this important social support programme.

2021 ◽  
Vol 63 (3 May-Jun) ◽  
pp. 371-381
Author(s):  
Vanessa De la Cruz-Góngora ◽  
Teresa Shamah-Levy ◽  
Salvador Villalpando ◽  
Ignacio Méndez-Gómez Humarán ◽  
Rosario Rebollar-Campos ◽  
...  

Objective. To describe trends in zinc deficiency (ZD) prevalence among preschool-age Mexican children, and explore differences in this trend among beneficiaries of the conditional cash transfer program Progresa/Oportunidades/Prospera (CCT-POP). Materials and methods. The serum zinc information of children aged 1-4 who participated in the ENN 1999, Ensanut 2006 and Ensanut 2018-19 was analyzed. ZD was categorized according to IZiNCG cutoff values. Logistic regression models were used to identify personal participant characteristics associated with ZD trends, and tests for interactions between survey and CCT-POP beneficiaries were applied. Results. ZD decreased by 22.3 percentage points (pp) between ENN 1999 and Ensanut 2018-19; among CCT-POP beneficiaries, the decrease was 58.6 pp. Overweight was associated with higher odds of ZD (OR=2.18, p=0.023). Conclusions. In the last 19 years, ZD declined significantly among preschool-age Mexican children. Child beneficiaries of the social program CCT-POP showed the largest reduction in ZD.


2019 ◽  
Vol 35 (6) ◽  
Author(s):  
Claudia Helena Soares de Morais Freitas ◽  
Franklin Delano Soares Forte ◽  
Maria Helena Rodrigues Galvão ◽  
Ardigleusa Alves Coelho ◽  
Angelo Giuseppe Roncalli ◽  
...  

This study aims to evaluate the social determinants of access to HIV and VDRL tests during pregnancy in Brazil. The dependent variables were based on prenatal care access: prenatal care appointments, no HIV and syphilis tests. The independent variables at the first level were formal education level, age, race, work and participation in the Family Income conditional cash transfer program. The city-level variables were the human development index (HDI), Gini index, and indicators related to health services. An exploratory analysis was performed, assessing the effect of each level through the calculation of prevalence ratios (PR). A multilevel mixed-effect Poisson regression model was constructed for all outcomes to verify the effect of the individual level alone and of the individual and contextual levels together. Regarding prenatal appointments, the main implicated factors were related to individual socioeconomic position (education level and participation in the Family Income conditional cash transfer program); however, only HDI maintained significance for the city-level context. The city-level variance dropped from 0.049 to 0.042, indicating an important between-city effect. Regarding the outcomes on performing tests in prenatal care, the worst conditions such as contextual (HDI > 0.694, p < 0.001; Gini index ≥ 0.521, p < 0.001) and individual (> 8 years of schooling, p < 0.001) showed a risk effect in the final model. Variables related to health services did not show significant effects. The outcomes were associated with individual socioeconomic position and a city-level contextual effect. These findings indicate the importance of strengthening HIV and syphilis infection control programs during pregnancy.


2020 ◽  
pp. 026101832092964
Author(s):  
Taly Reininger ◽  
Borja Castro-Serrano

Utilizing Foucault’s insights on neoliberalism, his notion of governmentality in relation to the State, and his insights on the processes of subjectivity (2007; 2006), the following article seeks to critically examine Chile’s Ethical Family Income (IEF), a conditional cash transfer program that was implemented in the country from 2011 to 2016. Utilizing interview excerpts with women who participated in the program, the article analyzes the manner in which the program operated as a contemporary form of governmentality by installing a particular production of subjectivity in which meritorious recipients of state aid are shaped as productive, responsible, independent citizens who actively invest in accumulating human capital in order to transform themselves and their children into entrepreneurial individuals. The article concludes by discussing possibilities of resistance to neoliberal rationality processes of subjectivation in poverty eradication policies and programs.


Processes ◽  
2019 ◽  
Vol 7 (6) ◽  
pp. 337 ◽  
Author(s):  
Xin Wu ◽  
Yuchen Gao ◽  
Dian Jiao

Non-intrusive load monitoring (NILM) is an effective method to optimize energy consumption patterns. Since the concept of NILM was proposed, extensive research has focused on energy disaggregation or load identification. The traditional method is to disaggregate mixed signals, and then identify the independent load. This paper proposes a multi-label classification method using Random Forest (RF) as a learning algorithm for non-intrusive load identification. Multi-label classification can be used to determine which categories data belong to. This classification can help to identify the operation states of independent loads from mixed signals without disaggregation. The experiments are conducted in a real environment and on a public data set. Several basic electrical features are selected as classification features to build the classification model. These features are also compared using feature importance parameters to select the most suitable ones for classification. The classification accuracy and F-score of the proposed method can reach 0.97 and 0.98, respectively.


Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.


2020 ◽  
Vol 38 (4A) ◽  
pp. 510-514
Author(s):  
Tay H. Shihab ◽  
Amjed N. Al-Hameedawi ◽  
Ammar M. Hamza

In this paper, to make use of complementary potential in land use/land cover (LULC) mapping, spatial data were acquired from Landsat 8 OLI sensor images taken in 2019. The images were rectified, enhanced and then classified using random forest (RF) and artificial neural network (ANN) methods. Optical remote sensing images were used to obtain information on the status of LULC classification and to extract details. The classification of both satellite image types is used to extract features and to analyse the LULC of the study area. The classification results showed that the artificial neural network method outperforms the random forest method. The image processing required for the optical remote sensing data to be used in LULC mapping included geometric correction and image enhancement. With the ANN method, the overall accuracy was 0.91 and the kappa accuracy was 0.89 for the training data set, while the overall accuracy and the kappa accuracy for the test data set were 0.89 and 0.87, respectively.
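The overall accuracy and kappa figures quoted above are both derived from a classification confusion matrix. As a minimal sketch (the function name and the 3-class matrix are hypothetical and are not taken from the paper), the two metrics can be computed as follows:

```python
def overall_accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples whose reference class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    # Undefined when expected == 1 (all samples in a single class).
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical 3-class LULC confusion matrix, for illustration only.
example = [
    [50, 3, 2],
    [4, 45, 1],
    [1, 2, 42],
]
accuracy, kappa = overall_accuracy_and_kappa(example)
print(f"overall accuracy = {accuracy:.3f}, kappa = {kappa:.3f}")
```

Kappa is lower than overall accuracy whenever some agreement would be expected by chance, which is consistent with the pairs of values reported in the abstract.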


2021 ◽  
Author(s):  
Christian Thiele ◽  
Gerrit Hirschfeld ◽  
Ruth von Brachel

Registries of clinical trials are a potential source for scientometric analysis of medical research and serve important functions for the research community and the public at large. Clinical trials that recruit patients in Germany are usually registered in the German Clinical Trials Register (DRKS) or in international registries such as ClinicalTrials.gov. Furthermore, the International Clinical Trials Registry Platform (ICTRP) aggregates trials from multiple primary registries. We queried the DRKS, ClinicalTrials.gov, and the ICTRP for trials with a recruiting location in Germany. Trials that were registered in multiple registries were linked using the primary and secondary identifiers and a Random Forest model based on various similarity metrics. We identified 35,912 trials that were conducted in Germany. The majority of the trials were registered in multiple databases. 32,106 trials were linked using primary IDs, 26 were linked using a Random Forest model, and 10,537 internal duplicates on ICTRP were identified using the Random Forest model after finding pairs with matching primary or secondary IDs. In cross-validation, the Random Forest increased the F1-score from 96.4% to 97.1% compared to a linkage based solely on secondary IDs on a manually labelled data set. 28% of all trials were registered in the German DRKS. 54% of the trials on ClinicalTrials.gov, 43% of the trials on the DRKS and 56% of the trials on the ICTRP were pre-registered. The ratio of pre-registered studies and the ratio of studies that are registered in the DRKS increased over time.
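For readers unfamiliar with the F1-score used to compare the two linkage approaches, the following minimal sketch shows how it is obtained from linkage counts. The function name and the counts are hypothetical illustrations, not the study's data; a "positive" here is assumed to mean a pair of records judged to describe the same trial.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a set of predicted links."""
    if true_positives == 0:
        # Convention assumed here: no correct links gives an F1 of 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts from a manually labelled set of record pairs.
score = f1_score(true_positives=90, false_positives=10, false_negatives=10)
print(f"F1 = {score:.3f}")
```

Because F1 combines precision and recall, it penalizes both wrongly linked pairs and missed duplicates, which is why it suits a linkage task where true matches are rare.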


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addressed some advanced techniques to deal with missing values in a data set measuring air quality using a multiple imputation (MI) approach. MCAR, MAR, and NMAR missing data techniques are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait, aggregated to a daily basis. Logarithm transformation was carried out for all pollutant data, in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR technique had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the other imputation methods and, thus, can be considered to be appropriate for analyzing air quality data.
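The RMSE and MAE used to compare imputation quality can be computed by masking known values, imputing them, and comparing the imputed values with the originals. The sketch below illustrates only the error calculation; it does not use missForest, and the function name and numbers are hypothetical.

```python
import math


def imputation_errors(true_values, imputed_values, masked_indices):
    """Return (RMSE, MAE) over the positions that were masked and then imputed."""
    diffs = [imputed_values[i] - true_values[i] for i in masked_indices]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


# Hypothetical log-transformed pollutant readings; positions 1 and 3 were masked.
true_values = [1.0, 2.0, 3.0, 4.0]
imputed_values = [1.0, 2.5, 3.0, 3.0]
rmse, mae = imputation_errors(true_values, imputed_values, masked_indices=[1, 3])
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```

RMSE weights large errors more heavily than MAE, so reporting both, as the abstract does, gives a sense of whether imputation errors are driven by a few large misses.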


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Anelise Andrade de Souza ◽  
Sueli Aparecida Mingoti ◽  
Rômulo Paes-Sousa ◽  
Leo Heller

Abstract Background This study aims to assess the interactive effects of Brazilian public interventions, environmental health programs (access to water, sanitation and solid waste collection) and a Conditional Cash Transfer Program (PBF), on the reduction of mortality due to diarrhea and malnutrition among children under 5 years old. Methods The study design is ecological, with longitudinal analysis in a balanced panel. The period covered is 2006 to 2016, including 3467 municipalities from all regions of the country, which resulted in 38,137 observations. Generalized linear models were fitted considering the Negative Binomial (NB) distribution for the number of deaths due to malnutrition and diarrhea, with fixed effects. NB models with and without zero-inflation were assessed. Subsequent interaction models were applied to assess the combined effects of the two public policies. Results In relation to the decline of mortality rates due to diarrhea in the municipalities, positive effect modification was observed in the presence of: high coverage of the target population by the PBF and access to water, 0.54 (0.28–1.04) / 0.55 (0.29–1.04); high coverage of the total population by the PBF and access to water, 0.97 (0.95–1.00); and high coverage of the total population by the PBF and access to sanitation, 0.98 (0.97–1.00). A decline in diarrhea mortality was also observed in the joint presence of high coverage of solid waste collection and access to water, categories 1 (> 60% ≤ 85%): 0.98 (0.96–1.00), 0.98 (0.97–1.00) and 2 (> 85% ≤ 100%): 0.97 (0.95–0.98), 0.97 (0.95–0.99). Negative effect modification was observed for mortality due to malnutrition in the presence of simultaneous high coverage of the total population by the PBF and access to sanitation, categories 1 (≥ 20 < 50%): 1.0061 (0.9991–1.0132) and 2 (≥ 50 < 100%): 1.0073 (1.0002–1.0145), and high coverage of the total population by the PBF and solid waste collection, 1.0004 (1.0002–1.0005), resulting in an increase in malnutrition mortality rates. Conclusion Implementation of environmental health services and the expansion of PBF coverage may enhance the prevention of early deaths in children under 5 years old due to diarrhea, a poverty-related disease.

