Development and Evaluation of a Big Data Framework for Performance Management in Mobile Networks

Efficient processing of complex XSD using Hive and Spark

PeerJ Computer Science ◽

10.7717/peerj-cs.652 ◽

2021 ◽

Vol 7 ◽

pp. e652

Author(s):

Diana Martinez-Mosquera ◽

Rosa Navarrete ◽

Sergio Luján-Mora

Keyword(s):

Big Data ◽

Performance Management ◽

Mobile Networks ◽

Real Life ◽

Real Data ◽

XML Schema ◽

Apache Spark ◽

Data Sets ◽

Apache Hive

eXtensible Markup Language (XML) files are widely used in industry because of their flexibility in representing many kinds of data. Applications such as financial records, social networks, and mobile networks use complex XML schemas with nested types, contents, and/or extension bases on existing complex elements or large real-world files. A great number of these files are generated each day, which has influenced the development of Big Data tools for parsing and reporting on them, such as Apache Hive and Apache Spark. For these reasons, multiple studies have proposed new techniques and evaluated the processing of XML files with Big Data systems. However, such works usually consider only the simplest XML schemas, even though real data sets are composed of complex schemas. Therefore, to shed light on complex XML schema processing for real-life applications with Big Data tools, we present an approach that combines three techniques for parsing XML files: cataloging, deserialization, and positional explode. For cataloging, the elements of the XML schema are mapped into root, arrays, structures, values, and attributes. Based on these elements, deserialization and positional explode are straightforward to implement. To demonstrate the validity of our proposal, we develop a case study by implementing a test environment that illustrates the methods using real performance management data sets from two mobile network vendors. Our main results confirm the validity of the proposed method for different versions of Apache Hive and Apache Spark, report the query execution times for Apache Hive internal and external tables and Apache Spark data frames, and compare the query performance of Apache Hive with that of Apache Spark. A further contribution is a case study in which a novel solution is proposed for data analysis in the performance management systems of mobile networks.
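The three techniques named in this abstract can be illustrated with a minimal, plain-Python sketch. It is not the authors' Hive or Spark implementation: the sample document, its tag and attribute names, and the helper functions `catalog` and `posexplode` are all hypothetical, and the standard-library parser stands in for the deserialization step.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: tag and attribute names are invented for illustration.
SAMPLE = """
<measData vendor="A">
  <header><sender>node-7</sender></header>
  <measInfo id="cell-1"><period>900</period><r>10</r><r>20</r></measInfo>
  <measInfo id="cell-2"><period>900</period><r>30</r></measInfo>
</measData>
"""


def catalog(root):
    """Cataloging: label each tag as root, array, structure, or value,
    and record which tags carry attributes."""
    kinds = {root.tag: "root"}
    attributes = {}

    def visit(node):
        if node.attrib:
            attributes.setdefault(node.tag, set()).update(node.attrib)
        counts = Counter(child.tag for child in node)
        for child in node:
            if counts[child.tag] > 1:
                # A tag repeated under one parent is treated as an array.
                kinds[child.tag] = "array"
            elif kinds.get(child.tag) != "array":
                kinds[child.tag] = "structure" if len(child) else "value"
            visit(child)

    visit(root)
    return kinds, attributes


def posexplode(root, array_tag, item_tag, key_attr):
    """Positional explode: one row per array item, keeping its position."""
    rows = []
    for parent in root.iter(array_tag):
        for pos, item in enumerate(parent.findall(item_tag)):
            rows.append((parent.get(key_attr), pos, item.text))
    return rows


# Deserialization stand-in: parse the text into an element tree.
root = ET.fromstring(SAMPLE.strip())
kinds, attributes = catalog(root)
rows = posexplode(root, "measInfo", "r", "id")
print(kinds)
print(rows)
```

Keeping the position next to each exploded value is what lets a flattened row be matched back to the other fields of the same record.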

The Innovation of College and University Performance Management in Big Data Era

2018 International Conference on Social Sciences, Education and Management (SOCSEM 2018) ◽

10.25236/socsem.2018.252 ◽

2018 ◽

Keyword(s):

Big Data ◽

Performance Management ◽

College And University ◽

University Performance

BDF-SDN: A Big Data Framework for DDoS Attack Detection in Large-Scale SDN-Based Cloud

2021 IEEE Conference on Dependable and Secure Computing (DSC) ◽

10.1109/dsc49826.2021.9346269 ◽

2021 ◽

Author(s):

Phuc Trinh Dinh ◽

Minho Park

Keyword(s):

Big Data ◽

Large Scale ◽

Attack Detection ◽

DDoS Attack ◽

Data Framework ◽

DDoS Attack Detection

Big data framework for national E-governance plan

2013 Eleventh International Conference on ICT and Knowledge Engineering ◽

10.1109/ictke.2013.6756283 ◽

2013 ◽

Cited By ~ 11

Author(s):

M. R Rajagopalan ◽

Solaimurugan Vellaipandiyan

Keyword(s):

Big Data ◽

Data Framework

Predictors of outpatients’ no-show: big data analytics using apache spark

Journal of Big Data ◽

10.1186/s40537-020-00384-9 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Tahani Daghistani ◽

Huda AlGhamdi ◽

Riyad Alshammari ◽

Raed H. AlHazme

Keyword(s):

Machine Learning ◽

Big Data ◽

Negative Impact ◽

Big Data Analytics ◽

Quality Of Healthcare ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Healthcare Organizations ◽

Data Framework ◽

Learning Techniques

Outpatients who fail to attend their appointments have a negative impact on healthcare outcomes. Healthcare organizations therefore face new opportunities, one of which is to improve the quality of healthcare. The main challenge is predictive analysis using techniques capable of handling the huge volume of data generated. We propose a big data framework for identifying outpatients' no-shows via feature engineering and machine learning (MLlib) on the Spark platform. This study evaluates the performance of five machine learning techniques using data from 2,011,813 outpatient visits. Across several experiments and different validation methods, Gradient Boosting (GB) performed best, increasing accuracy and ROC to 79% and 81%, respectively. In addition, we show that exploring and evaluating the performance of machine learning models using various evaluation methods is critical, as prediction accuracy can differ significantly. The aim of this paper is to explore factors that affect the no-show rate and can be used to formulate predictions using big data machine learning techniques.
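The point that different evaluation methods can give very different pictures of the same model can be seen in a small plain-Python sketch. The labels and scores below are invented toy values, not the study's data, and the helpers `accuracy` and `roc_auc` are hypothetical names rather than MLlib functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, imbalanced example (assumed values): 1 = no-show, 0 = attended.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.35, 0.45]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off predicts no no-shows

print(accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

In this toy case the thresholded predictions never flag a no-show, yet accuracy still looks respectable because most visits are attended, while the ranking-based measure reflects how well the scores separate the two classes.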

EMPOWERING, a Smart Big Data Framework for Sustainable Electricity Suppliers

IEEE Access ◽

10.1109/access.2018.2881413 ◽

2018 ◽

Vol 6 ◽

pp. 71132-71142

Author(s):

Gerard Mor ◽

Jordi Vilaplana ◽

Stoyan Danov ◽

Jordi Cipriano ◽

Francesc Solsona ◽

...

Keyword(s):

Big Data ◽

Data Framework ◽

Sustainable Electricity

Major Big Data Challenges in Most Industries and Innovative Solutions

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1143.1292s219 ◽

2019 ◽

Vol 9 (2S2) ◽

pp. 424-428

Keyword(s):

Big Data ◽

Mobile Networks ◽

Spatial Data ◽

High Volume ◽

High Definition ◽

Management Tools ◽

Social Media Networks ◽

Massive Growth ◽

Multiple Challenges ◽

High Definition TV

The term “Big data” refers to “the high volume of data sets that are relatively complex in nature and having challenges in processing and analyzing the data using conventional database management tools”. In the digital universe, the volume and variety of data that we deal with today have grown massively, drawing on sources such as business informatics, social media networks, images from high-definition TV, data from mobile networks, banking data from ATMs, genomics, GPS trails, telemetry from automobiles, meteorology, and financial market data. Data scientists confirm that 80% of the data gathered today is in unstructured format, i.e. in the form of images, pixel data, videos, geospatial data, PDF files, etc. Because of the massive growth of data and its different formats, organizations face multiple challenges in capturing, storing, mining, analyzing, and visualizing Big data. This paper aims to exemplify the key challenges faced by most organizations and the significance of implementing emerging Big data techniques for effective extraction of business intelligence to make better and faster decisions.

SIDELOADING – INGESTION OF LARGE POINT CLOUDS INTO THE APACHE SPARK BIG DATA ENGINE

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xli-b2-343-2016 ◽

2016 ◽

Vol XLI-B2 ◽

pp. 343-348

Author(s):

J. Boehm ◽

K. Liu ◽

C. Alis

Keyword(s):

Big Data ◽

Point Cloud ◽

Point Clouds ◽

Geospatial Data ◽

Apache Spark ◽

Cloud Data ◽

Binary File ◽

Data Framework ◽

File Formats ◽

Data Ingestion

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore naturally lucrative to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capability of the proposed method to load billions of points into a commodity hardware compute cluster, and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
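The map-based ingestion idea can be sketched on a single machine, assuming a made-up binary layout. Here a thread pool stands in for the cluster, and `read_points` stands in for an existing single-CPU file format library; the record layout, file names, and all function names are hypothetical. On Spark, the same shape would presumably be expressed by parallelizing the list of file paths and applying a flat-map with the reader function.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
from pathlib import Path

# Hypothetical record layout: x, y, z as little-endian 32-bit floats.
POINT = struct.Struct("<3f")


def write_points(path, points):
    """Write points to a binary file (only used to build demo input)."""
    with open(path, "wb") as fh:
        for point in points:
            fh.write(POINT.pack(*point))


def read_points(path):
    """Single-file reader: the part a format library would normally provide."""
    data = Path(path).read_bytes()
    return [POINT.unpack_from(data, off) for off in range(0, len(data), POINT.size)]


def ingest(paths, workers=2):
    """Map the reader over the file list, then flatten the per-file results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(chain.from_iterable(pool.map(read_points, paths)))


def demo():
    """Build two small files in a temporary directory and ingest them."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "tile_a.bin")
        second = os.path.join(tmp, "tile_b.bin")
        write_points(first, [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
        write_points(second, [(0.5, 0.25, 8.0)])
        return ingest([first, second])


print(demo())
```

The unit of work is a whole file rather than a single point, so each worker can call the unmodified reader and only the list of paths has to be distributed.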