Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Algorithms, 2020, Vol 13 (3), pp. 71
Author(s): Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, Gerasimos Vonitsanos

At the dawn of the 10V, or big data, era, there are a considerable number of sources such as smartphones, IoT devices, social media, smart city sensors, as well as the healthcare system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques have been developed tailored to these platforms. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a large number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of the accuracy, recall, and F1 metrics. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments, based on the same Spark cluster, indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
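
A minimal PySpark sketch of the two-step idea described in the abstract: compute a truncated SVD of the feature matrix with the RDD-based MLlib API, project each observation onto the leading right singular vectors, and train a standard MLlib classifier on the reduced representation. The input path, the number of retained components k, and the choice of logistic regression are illustrative assumptions, not the authors' exact configuration.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("svd-two-step").getOrCreate()

# Hypothetical input layout: one numeric "label" column and a dense "features" vector.
df = spark.read.parquet("hdfs:///data/higgs.parquet")  # assumed path

# Step 1: truncated SVD of the data matrix (RDD-based MLlib API).
rows = df.select("features").rdd.map(lambda r: r["features"].toArray())
mat = RowMatrix(rows)
k = 10  # number of retained singular directions (illustrative)
svd = mat.computeSVD(k, computeU=False)
V = spark.sparkContext.broadcast(svd.V.toArray())  # d x k right singular vectors

# Project every observation onto the top-k singular directions.
def project(row):
    x = row["features"].toArray()
    return (float(row["label"]), Vectors.dense(x @ V.value))

reduced = df.rdd.map(project).toDF(["label", "features"])

# Step 2: train any MLlib classifier on the transformed attributes.
train, test = reduced.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(maxIter=50).fit(train)
pred = model.transform(test)
f1 = MulticlassClassificationEvaluator(metricName="f1").evaluate(pred)
print(f"F1 on the SVD-reduced features: {f1:.3f}")
```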

Electronics, 2021, Vol 10 (23), pp. 2910
Author(s): Andreas Andreou, Constandinos X. Mavromoustakis, George Mastorakis, Jordi Mongay Batalla, Evangelos Pallis

Various research approaches to COVID-19 are currently being developed using machine learning (ML) techniques and edge computing, either to identify virus molecules or to anticipate the risk of the spread of COVID-19. Consequently, these efforts rely on datasets that derive either from the WHO, through its website and research portals, or from data generated in real time by the healthcare system. Data analysis, modelling, and prediction are carried out through multiple algorithmic techniques. The inability of these techniques to generate accurate predictions motivates this research study, which builds on an existing machine learning technique and, through modification, achieves valuable forecasts. More specifically, this study modifies the Levenberg–Marquardt algorithm, which is commonly used to approximate solutions to nonlinear least squares problems, supports the acquisition of data from IoT devices, and analyses these data via cloud computing to generate forecasts about the progress of the outbreak in real-time environments. In this way, we improve the optimization of the trend line that fits these data. We introduce this framework in conjunction with a novel encryption process that we propose for the datasets and with the implementation of mortality predictions.
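
The core numerical step, fitting a nonlinear trend line to case counts with the Levenberg–Marquardt algorithm, can be sketched with SciPy's curve_fit, which uses LM for unconstrained problems. The logistic growth model and the synthetic data below are illustrative assumptions; the paper's actual model, data sources, and encryption layer are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative trend model: logistic growth of cumulative cases over time.
def logistic(t, K, r, t0):
    return K / (1.0 + np.exp(-r * (t - t0)))

# Synthetic stand-in for cumulative counts streamed from IoT/health sources.
t = np.arange(60, dtype=float)
rng = np.random.default_rng(0)
y = logistic(t, 10000, 0.25, 30) + rng.normal(0, 150, t.size)

# method="lm" selects the Levenberg-Marquardt solver (unconstrained least squares).
params, cov = curve_fit(logistic, t, y, p0=[5000, 0.1, 20], method="lm")
K, r, t0 = params
print(f"fitted capacity={K:.0f}, growth rate={r:.3f}, inflection day={t0:.1f}")

# Short-horizon forecast from the fitted trend line.
future = np.arange(60, 75, dtype=float)
print(logistic(future, *params))
```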


2015, Vol 2 (1)
Author(s): Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, Tawfiq Hasanin

Electronics, 2020, Vol 9 (9), pp. 1434
Author(s): Yustus Eko Oktian, Sang-Gon Lee, Byung-Gook Lee

The state-of-the-art centralized Internet of Things (IoT) data flow pipeline has started to age, since it cannot cope with the vast number of newly connected IoT devices. As a result, the community has begun the transition to a decentralized pipeline to encourage data and resource sharing. However, the move is not trivial. With many instances allocating data or services arbitrarily, how can we guarantee the correctness of the IoT data or processes that other parties offer? Furthermore, in case of dispute, how can the IoT data assist in determining which party is guilty of faulty behavior? Finally, the number of Service Level Agreements (SLAs) increases as sharing grows. The problem then becomes how to provide SLA generation and verification that can be automated instead of going through a manual and tedious legalization process with a trusted third party. In this paper, we explore blockchain solutions to answer those issues and propose continued data integrity services for IoT big data management. Specifically, we design five integrity protocols across three phases of IoT operations: during the transmission of IoT data (data in transit), when the data are physically stored in the database (data at rest), and at the time of data processing (data in process). In each phase, we first lay out our motivations and survey the related blockchain solutions from the literature. We then use curated papers from our surveys as building blocks in designing the protocols. Using our proposal, we augment the overall value of the IoT data and commands generated in the IoT system, as they become tamper-proof, verifiable, non-repudiable, and more robust.
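
A minimal sketch of the "data at rest" idea: keep the raw IoT records off-chain, anchor only a compact digest (here a Merkle root) on-chain, and later verify that the database contents still match the anchored digest. The hashing scheme, record layout, and function names are illustrative assumptions rather than the five protocols specified in the paper, and the on-chain anchoring step is stubbed out.

```python
import hashlib
import json

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records: list) -> bytes:
    """Hash each record, then fold pairwise hashes into a single root digest."""
    level = [sha256(json.dumps(r, sort_keys=True).encode()) for r in records]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical IoT readings stored in an off-chain database.
readings = [
    {"device": "sensor-1", "ts": 1700000000, "temp": 21.4},
    {"device": "sensor-2", "ts": 1700000005, "temp": 19.8},
]

anchored = merkle_root(readings)         # digest that would be written to the chain
# ... later, an auditor recomputes the root from the database contents ...
assert merkle_root(readings) == anchored, "data-at-rest integrity check failed"
print("stored IoT data matches the anchored digest:", anchored.hex())
```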


2019, Vol 45 (8), pp. 556-558
Author(s): Ezio Di Nucci

I analyse an argument, recently put forward by Rosalind McDougall in the Journal of Medical Ethics, according to which medical artificial intelligence (AI) represents a threat to patient autonomy. The argument takes the case of IBM Watson for Oncology to argue that such technologies risk disregarding the individual values and wishes of patients. I find three problems with this argument: (1) it confuses AI with machine learning; (2) it misses machine learning's potential for personalised medicine through big data; (3) it fails to distinguish between evidence-based advice and decision-making within healthcare. I conclude that how much, and which, tasks we should delegate to machine learning and other technologies within healthcare and beyond is indeed a crucial question of our time, but in order to answer it, we must be careful to analyse and properly distinguish between the different systems and the different delegated tasks.


2022, pp. 1458-1476
Author(s): Amine Rghioui, Jaime Lloret, Abedlmajid Oumnad

Every single day, a massive amount of data is generated by different medical data sources. Processing this wealth of data is a daunting task, and it forces us to adopt smart and scalable computational strategies, including machine intelligence, big data analytics, and data classification. Big Data analysis can be used for effective decision making in the healthcare domain by applying existing machine learning algorithms with some modifications. The fundamental purpose of this article is to summarize the role of Big Data analysis in healthcare and to provide a comprehensive analysis of the various techniques involved in mining big data. This article provides an overview of Big Data, its applicability in healthcare, some of the work in progress, and future work. Finally, the use of machine learning techniques is proposed for real-time analysis of diabetic patient data from IoT devices and gateways.


2020, Vol 19, pp. 160940692094933
Author(s): Renáta Németh, Domonkos Sik, Fanni Máté

Social scientists engaged in mixed-methods research have traditionally used human annotators to classify texts according to some predefined knowledge. The “big data” revolution, that is, the fast growth of digitized texts in recent years, brings new opportunities but also new challenges. In our research project, we aim to examine the potential of natural language processing (NLP) techniques to understand the individual framing of depression in online forums. In this paper, we introduce a part of this project that experiments with an NLP classification (supervised machine learning) method capable of classifying large digital corpora according to various discourses on depression. Our question was whether an automated method can be applied to sociological problems outside the scope of hermeneutically more trivial business applications. The present article traces our learning path from the difficulties of human annotation to the hermeneutic limitations of algorithmic NLP methods. We faced our first failure when we experienced significant inter-annotator disagreement. In response to this failure, we moved to the strategy of intersubjective hermeneutics (interpretation through consensus). The second failure arose because we expected the machine to learn effectively from the human-annotated sample despite its hermeneutic limitations. The machine learning seemed to work appropriately in predicting biomedical and psychological framing, but it failed in the case of sociological framing. These results suggest that the sociological discourse about depression is not as well founded as the biomedical and psychological discourses, a conclusion which requires further empirical study in the future. An increasing share of machine learning solutions is based on human annotation of semantic interpretation tasks, and such human-machine interactions will probably define many more applications in the future. Our paper shows the hermeneutic limitations of “big data” text analytics in the social sciences and highlights the need for a better understanding of the use of annotated textual data and of the annotation process itself.
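
Two of the computational steps mentioned above, measuring inter-annotator agreement and training a supervised classifier on the agreed labels, can be sketched with scikit-learn. The toy posts, the TF-IDF plus logistic regression pipeline, and Cohen's kappa as the agreement statistic are illustrative assumptions, not the project's actual corpus, labels, or model.

```python
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy forum posts with framing labels from two annotators (illustrative only).
posts = [
    "My doctor adjusted the dosage and the symptoms eased.",
    "I think my low mood comes from losing my job and being isolated.",
    "Therapy helped me recognise my negative thought patterns.",
    "Medication alone did nothing until we changed my work situation.",
]
annotator_a = ["biomedical", "sociological", "psychological", "sociological"]
annotator_b = ["biomedical", "sociological", "psychological", "biomedical"]

# Agreement between the two human coders (1.0 = perfect, ~0 = chance level).
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Baseline supervised classifier trained on the consensus labels.
consensus = ["biomedical", "sociological", "psychological", "sociological"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(posts, consensus)
print(model.predict(["The new antidepressant finally stabilised my sleep."]))
```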


Electronics, 2021, Vol 10 (19), pp. 2350
Author(s): Mazhar Javed Awan, Syed Arbaz Haider Gilani, Hamza Ramzan, Haitham Nobanee, Awais Yasin, ...

Cricket is one of the most popular, widely played, and exciting sports today, and it stands to benefit from machine learning and artificial intelligence (AI) to attain more accurate predictions. As the number of matches grows over time, the data related to cricket matches and individual players are increasing rapidly. Moreover, the need for big data analytics, and the opportunities to use this big data effectively in many beneficial ways, are also increasing, for example in selecting players for a team, predicting the winner of a match, and many other predictions based on machine learning models or big data techniques. We applied a linear regression model to predict team scores both without big data and with the big data framework Spark ML. The experimental results after applying linear regression in Spark ML are measured through accuracy, root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE): 95%, 30.2, 1350.34, and 28.2, respectively. Furthermore, our approach can be applied to other sports.
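
A minimal PySpark sketch of the evaluation pipeline described above: fit a linear regression to team-score data in Spark ML and report RMSE, MSE, and MAE with the built-in regression evaluator. The input file, feature columns, and hyperparameters are illustrative assumptions, not the authors' dataset or settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("cricket-score-lr").getOrCreate()

# Hypothetical match data: overs bowled, wickets lost, current run rate, final score.
df = spark.read.csv("matches.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["overs", "wickets", "run_rate"],
                            outputCol="features")
data = (assembler.transform(df)
        .withColumnRenamed("final_score", "label")
        .select("features", "label"))

train, test = data.randomSplit([0.8, 0.2], seed=7)
model = LinearRegression(maxIter=100, regParam=0.0).fit(train)
pred = model.transform(test)

# Report the same error metrics used in the paper.
for metric in ("rmse", "mse", "mae"):
    value = RegressionEvaluator(metricName=metric).evaluate(pred)
    print(f"{metric.upper()}: {value:.2f}")
```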

