The Northeast Big Data Innovation Hub: "Enabling Seamless Data Sharing in Industry and Academia" Workshop Report

2017
Author(s):
Jane Greenberg
Samantha Grabus
Florence Hudson
Tim Kraska
...  

Increasingly, both industry and academia, in fields ranging from biology and social sciences to computing and engineering, are driven by data (Provost & Fawcett, 2013; Wixom et al., 2014); both commercial success and academic impact depend on access to data. Many organizations collecting data lack the expertise required to process it (Hazen et al., 2014) and thus pursue data sharing with researchers who can extract more value from the data they own. For example, a biosciences company may benefit from a specific analysis technique a researcher has developed. At the same time, researchers are always searching for real-world data sets to demonstrate the effectiveness of their methods. Unfortunately, many data sharing attempts fail, for reasons ranging from legal restrictions on how data can be used, to privacy policies, differing cultural norms, and technological barriers. In fact, many data sharing partnerships that are vital to addressing pressing societal challenges in cities, health, energy, and the environment are not being pursued because of such obstacles. Addressing these data sharing challenges requires open, supportive dialogue across many sectors, including technology, policy, industry, and academia. Further, there is a crucial need for well-defined agreements that can be shared among key stakeholders, including researchers, technologists, legal representatives, and technology transfer officers. The Northeast Big Data Innovation Hub (NEBDIH) took an important step in this area with the recent "Enabling Seamless Data Sharing in Industry and Academia" workshop, held at Drexel University September 29-30, 2016. The workshop brought together representatives from these critical stakeholder communities to launch a national dialogue on the challenges and opportunities in this complex space.

2016
pp. 1220-1243
Author(s):
Ilias K. Savvas
Georgia N. Sofianidou
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computation on large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique demonstrate its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
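The abstract does not include code; as a rough illustration of the idea, the sketch below (plain Python rather than Hadoop) mimics one MapReduce round of K-means: the map step assigns each point to its nearest centroid, and the reduce step averages the points in each cluster to produce new centroids. The function names and toy data are illustrative assumptions, not taken from the study.

```python
import numpy as np

def kmeans_map(point, centroids):
    """Map step: emit (nearest-centroid index, point) for one data point."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances)), point

def kmeans_reduce(assignments, k, dim):
    """Reduce step: average the points assigned to each centroid."""
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for idx, point in assignments:
        sums[idx] += point
        counts[idx] += 1
    # Guard against empty clusters; handling them properly is out of scope here.
    return sums / np.maximum(counts, 1)[:, None]

# Iterate over a toy data set. In Hadoop, the points would be partitioned
# across nodes and the shuffle phase would group pairs by centroid index.
points = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
for _ in range(10):  # a real run would iterate until convergence
    assignments = [kmeans_map(p, centroids) for p in points]
    centroids = kmeans_reduce(assignments, k=2, dim=2)
print(centroids)  # approaches [[1.1, 0.9], [7.9, 8.1]]
```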


Author(s):  
Salim Raza Qureshi

With the advancement of smart devices and cloud computing, more and more public health data can be collected from various sources and analyzed in unprecedented ways. The enormous social and academic impact of this development has led to a global buzz around big data. Moreover, given the massive scale of data sources, the security of big data in the cloud is becoming an important issue. Various issues have arisen in the field of big data security, such as infrastructure security, data confidentiality, data management, and data integrity. In this paper, we propose a novel technique based on an Artificial Neural Network and the Particle Swarm Optimization algorithm (ANN-PSO) for enabling a highly secured framework. The ANN-PSO method was created to predict health status from a database, and its features were selected from these data sets. The particle swarm optimization algorithm tunes the ANN for better results by reducing error. The results show the potential of the ANN-PSO-based methodology for producing satisfactory health predictions. The proposed approach will be tested using large medical data sets in a Hadoop environment, with the implementation carried out on the Java platform.
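The abstract does not specify the network architecture or the PSO variant; the minimal sketch below (Python with NumPy, standing in for the paper's Java implementation) shows the general pattern of training a tiny neural network with PSO: each particle is a flattened weight vector, its fitness is the network's prediction error, and the standard swarm velocity update searches weight space in place of backpropagation. All sizes, constants, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary "health status" data: 6 input features -> 1 output label.
X = rng.normal(size=(64, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

N_HIDDEN = 5
N_W = 6 * N_HIDDEN + N_HIDDEN  # weights of a 6-5-1 network, biases omitted

def predict(w, X):
    """Forward pass of a tiny 6-5-1 network from a flat weight vector."""
    W1 = w[:6 * N_HIDDEN].reshape(6, N_HIDDEN)
    W2 = w[6 * N_HIDDEN:].reshape(N_HIDDEN, 1)
    h = np.tanh(X @ W1)
    return 1 / (1 + np.exp(-(h @ W2)))  # sigmoid output

def fitness(w):
    """Mean squared prediction error: the quantity PSO minimizes."""
    return float(np.mean((predict(w, X).ravel() - y) ** 2))

# Standard PSO over the weight space.
n_particles, iters = 30, 200
pos = rng.normal(size=(n_particles, N_W))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[np.argmin(pbest_f)].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, N_W))
    # Inertia plus attraction toward personal and global bests.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    f = np.array([fitness(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[np.argmin(pbest_f)].copy()

print("final training error:", fitness(gbest))
```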


2022
pp. 22-53
Author(s):
Richard S. Segall
Gao Niu

Big Data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. This chapter discusses what Big Data is and its characteristics, how this information revolution of Big Data is transforming our lives, and the new technologies and methodologies that have been developed to process data of such huge dimensionality. The chapter covers the components of the Big Data stack interface, categories of Big Data analytics software and platforms, and descriptions of the top 20 Big Data analytics software packages. Big Data visualization techniques are discussed with real data from the Fatality Analysis Reporting System (FARS) managed by the National Highway Traffic Safety Administration (NHTSA) of the United States Department of Transportation. Big Data web-based visualization software, both JavaScript-based and user-interface-based, is discussed. The chapter also presents the challenges and opportunities of using Big Data, along with a flow diagram of the 30 chapters within this handbook.


2021
Vol 39 (15_suppl)
pp. e18725-e18725
Author(s):
Ravit Geva
Barliz Waissengrin
Dan Mirelman
Felix Bokstein
Deborah T. Blumenthal
...  

e18725 Background: Healthcare data sharing is important for the creation of diverse and large data sets, supporting clinical decision making, and accelerating efficient research to improve patient outcomes. This is especially vital in the case of real-world data analysis. However, stakeholders are reluctant to share their data without assurance of patients' privacy, proper protection of their data sets, and control over the ways they are used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so that analytics results are obtained without revealing the raw data. The aim of this study is to demonstrate the accuracy of analytics results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients' survival data following two different treatment interventions, including 623 patients and 24 variables, amounting to 14,952 items of data, was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared against an accuracy goal of two decimal places. Results: For all variables analyzed, the difference between the raw data results and the homomorphically encrypted data results was within the pre-determined accuracy goal; the practical efficiency of the encrypted computation, measured by run time, is presented in the table. Conclusions: This study demonstrates that data encrypted with homomorphic encryption can be statistically analyzed with a precision of at least two decimal places, allowing safe drawing of clinical conclusions while preserving patients' privacy and protecting data owners' data assets. Homomorphic encryption allows efficient computation on encrypted data non-interactively and without requiring decryption during computation. Utilizing the technology will empower large-scale cross-institution and cross-stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]
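The study used the PALISADE C++ library; for readers who want to try the core idea, the sketch below uses the TenSEAL Python library instead (an assumption, not the study's stack) to compute a simple descriptive statistic, a mean, on CKKS-encrypted values without decrypting the individual data points. The encryption parameters and data are illustrative only.

```python
import tenseal as ts

# CKKS context for approximate arithmetic on real numbers; these
# parameters are illustrative, not the study's PALISADE configuration.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()  # rotations needed by the encrypted sum

# Hypothetical survival times in months (stand-ins for the real variables).
survival_months = [14.2, 9.7, 22.5, 31.0, 6.3]
enc = ts.ckks_vector(context, survival_months)

# The analyst computes on ciphertexts only and never sees the raw values.
enc_sum = enc.sum()

# Only the secret-key holder decrypts the aggregate result.
mean = enc_sum.decrypt()[0] / len(survival_months)
print(f"mean survival: {mean:.2f} months")  # ~16.74, accurate to two decimals
```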


2017
Vol 25 (3)
pp. 150-157
Author(s):
Lasse Metso
Mirka Kans

Big Data and the Internet of Things will greatly increase the amount of data in asset management. Data sharing with an increased number of partners in the area of asset management is important when developing business opportunities and new ecosystems. An asset management ecosystem is a complex set of relationships between parties taking part in asset management actions. In this paper, the current barriers to and benefits of data sharing are identified based on the results of an interview study. The main benefits are transparency, access to data, and reuse of data; new services can be created by taking advantage of data sharing. The main barriers to sharing data are an unclear view of the data sharing process and difficulty in recognizing the benefits of data sharing. To overcome these barriers, this paper applies the ecosystem perspective to asset management information. The approach is explained using the Swedish railway industry as an example.


Author(s):  
Abhishek Bajpai
Dr. Sanjiv Sharma

As the volume of data produced in our society increases day by day, the exploration of big data in healthcare is growing at an unprecedented rate. Big data has become a popular concept across many areas, and this paper makes the case that even the healthcare industry is stepping into the big data pool to take advantage of its various advanced tools and technologies. The paper reviews research across healthcare disciplines that uses big data approaches and methodologies. Big data methodologies can be applied to healthcare data analytics (characterized by the four V's) to support better decisions that accelerate business profit and customer satisfaction, to acquire a better understanding of market behaviours and trends, and to provide e-health services using Digital Imaging and Communications in Medicine (DICOM). Big data techniques such as MapReduce and machine learning can be applied to develop systems for early diagnosis of disease, i.e., analysis of chronic diseases such as heart disease, diabetes, and stroke. The analysis of the data is performed using the big data analytics framework Hadoop, which is used to process large data sets. Further, the paper presents various big data tools, challenges, opportunities, and hurdles, followed by the conclusion.
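The paper discusses such systems at a survey level; as a minimal, hypothetical illustration of the machine-learning side (not the paper's own system), the sketch below trains a logistic-regression classifier on synthetic "chronic disease" features using scikit-learn. In a Hadoop deployment, the same model would typically be fitted with a distributed library rather than in memory; the feature names and label rule here are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Synthetic stand-ins for clinical features: age, blood pressure,
# cholesterol, glucose. A real system would ingest these from EHR/DICOM data.
n = 500
X = np.column_stack([
    rng.normal(55, 10, n),   # age
    rng.normal(130, 15, n),  # systolic blood pressure
    rng.normal(200, 30, n),  # cholesterol
    rng.normal(100, 20, n),  # fasting glucose
])
# Toy label: disease risk rises with blood pressure and glucose.
risk = 0.03 * (X[:, 1] - 130) + 0.02 * (X[:, 3] - 100) + rng.normal(0, 1, n)
y = (risk > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```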



Author(s):  
Kun Zhang
Biwei Huang
Jiji Zhang
Clark Glymour
Bernhard Schölkopf

It is commonplace to encounter nonstationary or heterogeneous data, in which the underlying generating process changes over time or across data sets (the data sets may have different experimental or data collection conditions). Such distribution shift presents both challenges and opportunities for causal discovery. In this paper, we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from Nonstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and to recover the skeleton of the causal structure over the observed variables. Second, we present a way to determine causal orientations by exploiting independence changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. Experimental results on various synthetic and real-world data sets demonstrate the efficacy of our methods.
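CD-NOD's first step can be approximated with off-the-shelf tools: augment the data with a surrogate time/domain index C and run a constraint-based search; variables found dependent on C are those whose local mechanisms change. The sketch below uses the causal-learn package's PC implementation as a stand-in (an assumption; causal-learn also provides a dedicated CD-NOD routine, and the paper itself uses kernel-based independence tests that detect more general changes than the linear Fisher-z test here) on synthetic data with a drifting mechanism.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(1)
n = 1000
t = np.linspace(0, 1, n)  # surrogate time/domain index C

# X1 -> X2, where X2's mechanism drifts over time (time-varying intercept).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 2.0 * t + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)   # stationary and independent of everything

data = np.column_stack([x1, x2, x3, t])  # last column is the index C

# PC skeleton search with the default Fisher-z test; an edge between a
# variable and the C column flags a changing local mechanism.
cg = pc(data, alpha=0.05)
print(cg.G)  # expect X2 adjacent to C; X1 and X3 not
```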


Data sizes have been growing exponentially within many companies. This data is often meta-tagged piecemeal, produced in real time, and arrives in continuous streams from multiple sources, and analyzing it to spot patterns and extract useful information is harder still. The challenges include the ever-changing landscape of data and its associated characteristics, evolving data analysis paradigms, computational infrastructure, data quality, complexity, and protection, in addition to data sharing and access, and, crucially, our ability to integrate data sets and their analysis toward an improved understanding. In this context, this second chapter covers the issues and challenges hiding behind the 3Vs phenomenon. It builds on the first chapter and proceeds to the different big data issues and challenges and how to tackle them in dynamic processes.



