Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Algorithms, 2020, Vol 13 (3), pp. 71
Author(s): Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas, Gerasimos Vonitsanos

At the dawn of the 10V, or big data, era, there are a considerable number of sources such as smartphones, IoT devices, social media, smart city sensors, as well as the healthcare system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques have been developed tailored to these platforms. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a large number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar, if not better, level of the accuracy, recall, and F1 metrics. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments, based on the same Spark cluster, indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
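
A minimal PySpark sketch of the two-step idea described in the abstract: compute a truncated SVD of the feature matrix with the RDD-based MLlib API, project each observation onto the leading right singular vectors, and train a standard MLlib classifier on the reduced representation. The input path, the number of retained components k, and the choice of logistic regression are illustrative assumptions, not the authors' exact configuration.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("svd-two-step").getOrCreate()

# Hypothetical input layout: one numeric "label" column and a dense "features" vector.
df = spark.read.parquet("hdfs:///data/higgs.parquet")  # assumed path

# Step 1: truncated SVD of the data matrix (RDD-based MLlib API).
rows = df.select("features").rdd.map(lambda r: r["features"].toArray())
mat = RowMatrix(rows)
k = 10  # number of retained singular directions (illustrative)
svd = mat.computeSVD(k, computeU=False)
V = spark.sparkContext.broadcast(svd.V.toArray())  # d x k right singular vectors

# Project every observation onto the top-k singular directions.
def project(row):
    x = row["features"].toArray()
    return (float(row["label"]), Vectors.dense(x @ V.value))

reduced = df.rdd.map(project).toDF(["label", "features"])

# Step 2: train any MLlib classifier on the transformed attributes.
train, test = reduced.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(maxIter=50).fit(train)
pred = model.transform(test)
f1 = MulticlassClassificationEvaluator(metricName="f1").evaluate(pred)
print(f"F1 on the SVD-reduced features: {f1:.3f}")
```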

Electronics, 2021, Vol 10 (23), pp. 2910
Author(s): Andreas Andreou, Constandinos X. Mavromoustakis, George Mastorakis, Jordi Mongay Batalla, Evangelos Pallis

Various research approaches to COVID-19 are currently being developed using machine learning (ML) techniques and edge computing, either to identify virus molecules or to anticipate the risk of the spread of COVID-19. Consequently, these efforts rely on datasets that derive either from the WHO, through its website and research portals, or from data generated in real time by the healthcare system. Data analysis, modelling, and prediction are carried out through multiple algorithmic techniques. The inability of these techniques to generate accurate predictions motivates this research study, which builds on an existing machine learning technique and, through modification, achieves valuable forecasts. More specifically, this study modifies the Levenberg–Marquardt algorithm, which is commonly used to approximate solutions to nonlinear least squares problems, supports the acquisition of data from IoT devices, and analyses these data via cloud computing to generate forecasts about the progress of the outbreak in real-time environments. In this way, we improve the optimization of the trend line that fits these data. We introduce this framework in conjunction with a novel encryption process that we propose for the datasets and with the implementation of mortality predictions.
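
The core numerical step, fitting a nonlinear trend line to case counts with the Levenberg–Marquardt algorithm, can be sketched with SciPy's curve_fit, which uses LM for unconstrained problems. The logistic growth model and the synthetic data below are illustrative assumptions; the paper's actual model, data sources, and encryption layer are not reproduced here.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative trend model: logistic growth of cumulative cases over time.
def logistic(t, K, r, t0):
    return K / (1.0 + np.exp(-r * (t - t0)))

# Synthetic stand-in for cumulative counts streamed from IoT/health sources.
t = np.arange(60, dtype=float)
rng = np.random.default_rng(0)
y = logistic(t, 10000, 0.25, 30) + rng.normal(0, 150, t.size)

# method="lm" selects the Levenberg-Marquardt solver (unconstrained least squares).
params, cov = curve_fit(logistic, t, y, p0=[5000, 0.1, 20], method="lm")
K, r, t0 = params
print(f"fitted capacity={K:.0f}, growth rate={r:.3f}, inflection day={t0:.1f}")

# Short-horizon forecast from the fitted trend line.
future = np.arange(60, 75, dtype=float)
print(logistic(future, *params))
```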


2015, Vol 2 (1)
Author(s): Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, Tawfiq Hasanin

Electronics, 2020, Vol 9 (9), pp. 1434
Author(s): Yustus Eko Oktian, Sang-Gon Lee, Byung-Gook Lee

The state-of-the-art centralized Internet of Things (IoT) data flow pipeline has started to age, since it cannot cope with the vast number of newly connected IoT devices. As a result, the community has begun the transition to a decentralized pipeline to encourage data and resource sharing. However, the move is not trivial. With many instances allocating data or services arbitrarily, how can we guarantee the correctness of the IoT data or processes that other parties offer? Furthermore, in case of dispute, how can the IoT data assist in determining which party is guilty of faulty behavior? Finally, the number of Service Level Agreements (SLAs) increases as sharing grows. The problem then becomes how to provide SLA generation and verification that can be automated instead of going through a manual and tedious legalization process with a trusted third party. In this paper, we explore blockchain solutions to answer those issues and propose continued data integrity services for IoT big data management. Specifically, we design five integrity protocols across three phases of IoT operations: during the transmission of IoT data (data in transit), when the data are physically stored in the database (data at rest), and at the time of data processing (data in process). In each phase, we first lay out our motivations and survey the related blockchain solutions from the literature. We then use curated papers from our surveys as building blocks in designing the protocols. Using our proposal, we augment the overall value of the IoT data and commands generated in the IoT system, as they become tamper-proof, verifiable, non-repudiable, and more robust.
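
A minimal sketch of the "data at rest" idea: keep the raw IoT records off-chain, anchor only a compact digest (here a Merkle root) on-chain, and later verify that the database contents still match the anchored digest. The hashing scheme, record layout, and function names are illustrative assumptions rather than the five protocols specified in the paper, and the on-chain anchoring step is stubbed out.

```python
import hashlib
import json

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records: list) -> bytes:
    """Hash each record, then fold pairwise hashes into a single root digest."""
    level = [sha256(json.dumps(r, sort_keys=True).encode()) for r in records]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical IoT readings stored in an off-chain database.
readings = [
    {"device": "sensor-1", "ts": 1700000000, "temp": 21.4},
    {"device": "sensor-2", "ts": 1700000005, "temp": 19.8},
]

anchored = merkle_root(readings)         # digest that would be written to the chain
# ... later, an auditor recomputes the root from the database contents ...
assert merkle_root(readings) == anchored, "data-at-rest integrity check failed"
print("stored IoT data matches the anchored digest:", anchored.hex())
```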


2019, Vol 45 (8), pp. 556-558
Author(s): Ezio Di Nucci

I analyse an argument, recently put forward by Rosalind McDougall in the Journal of Medical Ethics, according to which medical artificial intelligence (AI) represents a threat to patient autonomy. The argument takes the case of IBM Watson for Oncology to argue that such technologies risk disregarding the individual values and wishes of patients. I find three problems with this argument: (1) it confuses AI with machine learning; (2) it misses machine learning's potential for personalised medicine through big data; (3) it fails to distinguish between evidence-based advice and decision-making within healthcare. I conclude that how much, and which, tasks we should delegate to machine learning and other technologies within healthcare and beyond is indeed a crucial question of our time, but in order to answer it, we must be careful to analyse and properly distinguish between the different systems and the different delegated tasks.


2022, pp. 1458-1476
Author(s): Amine Rghioui, Jaime Lloret, Abedlmajid Oumnad

Every single day, a massive amount of data is generated by different medical data sources. Processing this wealth of data is a daunting task, and it forces us to adopt smart and scalable computational strategies, including machine intelligence, big data analytics, and data classification. Big Data analysis can be used for effective decision making in the healthcare domain by applying existing machine learning algorithms with some modifications. The fundamental purpose of this article is to summarize the role of Big Data analysis in healthcare and to provide a comprehensive analysis of the various techniques involved in mining big data. This article provides an overview of Big Data, its applicability in healthcare, some of the work in progress, and future work. Finally, the use of machine learning techniques is proposed for real-time analysis of diabetic patient data from IoT devices and gateways.


2020, Vol 19, pp. 160940692094933
Author(s): Renáta Németh, Domonkos Sik, Fanni Máté

Social scientists engaged in mixed-methods research have traditionally used human annotators to classify texts according to some predefined knowledge. The “big data” revolution, that is, the fast growth of digitized texts in recent years, brings new opportunities but also new challenges. In our research project, we aim to examine the potential of natural language processing (NLP) techniques to understand the individual framing of depression in online forums. In this paper, we introduce a part of this project that experiments with an NLP classification (supervised machine learning) method capable of classifying large digital corpora according to various discourses on depression. Our question was whether an automated method can be applied to sociological problems outside the scope of hermeneutically more trivial business applications. The present article traces our learning path from the difficulties of human annotation to the hermeneutic limitations of algorithmic NLP methods. We faced our first failure when we experienced significant inter-annotator disagreement. In response to this failure, we moved to the strategy of intersubjective hermeneutics (interpretation through consensus). The second failure arose because we expected the machine to learn effectively from the human-annotated sample despite its hermeneutic limitations. The machine learning seemed to work appropriately in predicting biomedical and psychological framing, but it failed in the case of sociological framing. These results suggest that the sociological discourse about depression is not as well founded as the biomedical and psychological discourses, a conclusion which requires further empirical study in the future. An increasing share of machine learning solutions is based on human annotation of semantic interpretation tasks, and such human-machine interactions will probably define many more applications in the future. Our paper shows the hermeneutic limitations of “big data” text analytics in the social sciences and highlights the need for a better understanding of the use of annotated textual data and of the annotation process itself.
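
Two of the computational steps mentioned above, measuring inter-annotator agreement and training a supervised classifier on the agreed labels, can be sketched with scikit-learn. The toy posts, the TF-IDF plus logistic regression pipeline, and Cohen's kappa as the agreement statistic are illustrative assumptions, not the project's actual corpus, labels, or model.

```python
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy forum posts with framing labels from two annotators (illustrative only).
posts = [
    "My doctor adjusted the dosage and the symptoms eased.",
    "I think my low mood comes from losing my job and being isolated.",
    "Therapy helped me recognise my negative thought patterns.",
    "Medication alone did nothing until we changed my work situation.",
]
annotator_a = ["biomedical", "sociological", "psychological", "sociological"]
annotator_b = ["biomedical", "sociological", "psychological", "biomedical"]

# Agreement between the two human coders (1.0 = perfect, ~0 = chance level).
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Baseline supervised classifier trained on the consensus labels.
consensus = ["biomedical", "sociological", "psychological", "sociological"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(posts, consensus)
print(model.predict(["The new antidepressant finally stabilised my sleep."]))
```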


Electronics, 2021, Vol 10 (19), pp. 2350
Author(s): Mazhar Javed Awan, Syed Arbaz Haider Gilani, Hamza Ramzan, Haitham Nobanee, Awais Yasin, ...

Cricket is one of the most popular, widely played, and exciting sports today, and it stands to benefit from machine learning and artificial intelligence (AI) to attain more accurate predictions. As the number of matches grows over time, the data related to cricket matches and individual players are increasing rapidly. Moreover, the need for big data analytics, and the opportunities to use this big data effectively in many beneficial ways, are also increasing, for example in selecting players for a team, predicting the winner of a match, and many other predictions based on machine learning models or big data techniques. We applied a linear regression model to predict team scores both without big data and with the big data framework Spark ML. The experimental results after applying linear regression in Spark ML are measured through accuracy, root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE): 95%, 30.2, 1350.34, and 28.2, respectively. Furthermore, our approach can be applied to other sports.
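
A minimal PySpark sketch of the evaluation pipeline described above: fit a linear regression to team-score data in Spark ML and report RMSE, MSE, and MAE with the built-in regression evaluator. The input file, feature columns, and hyperparameters are illustrative assumptions, not the authors' dataset or settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("cricket-score-lr").getOrCreate()

# Hypothetical match data: overs bowled, wickets lost, current run rate, final score.
df = spark.read.csv("matches.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["overs", "wickets", "run_rate"],
                            outputCol="features")
data = (assembler.transform(df)
        .withColumnRenamed("final_score", "label")
        .select("features", "label"))

train, test = data.randomSplit([0.8, 0.2], seed=7)
model = LinearRegression(maxIter=100, regParam=0.0).fit(train)
pred = model.transform(test)

# Report the same error metrics used in the paper.
for metric in ("rmse", "mse", "mae"):
    value = RegressionEvaluator(metricName=metric).evaluate(pred)
    print(f"{metric.upper()}: {value:.2f}")
```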

