The Northeast Big Data Innovation Hub: "Enabling Seamless Data Sharing in Industry and Academia" Workshop Report

2017
Author(s):
Jane Greenberg
Samantha Grabus
Florence Hudson
Tim Kraska
...  

Increasingly, both industry and academia, in fields ranging from biology and social sciences to computing and engineering, are driven by data (Provost & Fawcett, 2013; Wixom et al., 2014); both commercial success and academic impact depend on access to data. Many organizations collecting data lack the expertise required to process it (Hazen et al., 2014) and thus pursue data sharing with researchers who can extract more value from the data they own. For example, a biosciences company may benefit from a specific analysis technique a researcher has developed. At the same time, researchers are always searching for real-world data sets to demonstrate the effectiveness of their methods. Unfortunately, many data sharing attempts fail, for reasons ranging from legal restrictions on how data can be used, to privacy policies, differing cultural norms, and technological barriers. In fact, many data sharing partnerships that are vital to addressing pressing societal challenges in cities, health, energy, and the environment are not being pursued because of such obstacles. Addressing these data sharing challenges requires open, supportive dialogue across many sectors, including technology, policy, industry, and academia. Further, there is a crucial need for well-defined agreements that can be shared among key stakeholders, including researchers, technologists, legal representatives, and technology transfer officers. The Northeast Big Data Innovation Hub (NEBDIH) took an important step in this area with the recent "Enabling Seamless Data Sharing in Industry and Academia" workshop, held at Drexel University September 29-30, 2016. The workshop brought together representatives from these critical stakeholder communities to launch a national dialogue on the challenges and opportunities in this complex space.

2016
pp. 1220-1243
Author(s):
Ilias K. Savvas
Georgia N. Sofianidou
M-Tahar Kechadi

Big data refers to data sets whose size is beyond the capabilities of most current hardware and software technologies. The Apache Hadoop software library is a framework for distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput access for data-driven applications, and MapReduce is a software framework for distributed computation on large data sets. Huge collections of raw data require fast and accurate mining processes in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this study, the authors develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results of the technique demonstrate its efficiency; thus, HDFS and MapReduce can be applied to big data with very promising results.
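The abstract does not include code; as a rough illustration of the idea, the sketch below (plain Python rather than Hadoop) mimics one MapReduce round of K-means: the map step assigns each point to its nearest centroid, and the reduce step averages the points in each cluster to produce new centroids. The function names and toy data are illustrative assumptions, not taken from the study.

```python
import numpy as np

def kmeans_map(point, centroids):
    """Map step: emit (nearest-centroid index, point) for one data point."""
    distances = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(distances)), point

def kmeans_reduce(assignments, k, dim):
    """Reduce step: average the points assigned to each centroid."""
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for idx, point in assignments:
        sums[idx] += point
        counts[idx] += 1
    # Guard against empty clusters; handling them properly is out of scope here.
    return sums / np.maximum(counts, 1)[:, None]

# Iterate over a toy data set. In Hadoop, the points would be partitioned
# across nodes and the shuffle phase would group pairs by centroid index.
points = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
for _ in range(10):  # a real run would iterate until convergence
    assignments = [kmeans_map(p, centroids) for p in points]
    centroids = kmeans_reduce(assignments, k=2, dim=2)
print(centroids)  # approaches [[1.1, 0.9], [7.9, 8.1]]
```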


Author(s):  
Salim Raza Qureshi

With the advancement of smart devices and cloud computing, more and more public health data can be collected from various sources and analyzed in unprecedented ways. The enormous social and academic impact of this development has led to a global buzz around big data. Moreover, given the massive scale of data sources, the security of big data in the cloud is becoming an important issue. Various issues have arisen in the field of big data security, such as infrastructure security, data confidentiality, data management, and data integrity. In this paper, we propose a novel technique based on an Artificial Neural Network and the Particle Swarm Optimization algorithm (ANN-PSO) for enabling a highly secured framework. The ANN-PSO method was created to predict health status from a database, and its features were selected from these data sets. The particle swarm optimization algorithm tunes the ANN for better results by reducing error. The results show the potential of the ANN-PSO-based methodology for producing satisfactory health predictions. The proposed approach will be tested using large medical data sets in a Hadoop environment, with the implementation carried out on the Java platform.
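The abstract does not specify the network architecture or the PSO variant; the minimal sketch below (Python with NumPy, standing in for the paper's Java implementation) shows the general pattern of training a tiny neural network with PSO: each particle is a flattened weight vector, its fitness is the network's prediction error, and the standard swarm velocity update searches weight space in place of backpropagation. All sizes, constants, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary "health status" data: 6 input features -> 1 output label.
X = rng.normal(size=(64, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

N_HIDDEN = 5
N_W = 6 * N_HIDDEN + N_HIDDEN  # weights of a 6-5-1 network, biases omitted

def predict(w, X):
    """Forward pass of a tiny 6-5-1 network from a flat weight vector."""
    W1 = w[:6 * N_HIDDEN].reshape(6, N_HIDDEN)
    W2 = w[6 * N_HIDDEN:].reshape(N_HIDDEN, 1)
    h = np.tanh(X @ W1)
    return 1 / (1 + np.exp(-(h @ W2)))  # sigmoid output

def fitness(w):
    """Mean squared prediction error: the quantity PSO minimizes."""
    return float(np.mean((predict(w, X).ravel() - y) ** 2))

# Standard PSO over the weight space.
n_particles, iters = 30, 200
pos = rng.normal(size=(n_particles, N_W))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[np.argmin(pbest_f)].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, N_W))
    # Inertia plus attraction toward personal and global bests.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    f = np.array([fitness(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[np.argmin(pbest_f)].copy()

print("final training error:", fitness(gbest))
```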


2022
pp. 22-53
Author(s):
Richard S. Segall
Gao Niu

Big Data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. This chapter discusses what Big Data is and its characteristics, how this information revolution of Big Data is transforming our lives, and the new technologies and methodologies that have been developed to process data of such huge dimensionality. The chapter covers the components of the Big Data stack interface, categories of Big Data analytics software and platforms, and descriptions of the top 20 Big Data analytics software packages. Big Data visualization techniques are discussed with real data from the Fatality Analysis Reporting System (FARS) managed by the National Highway Traffic Safety Administration (NHTSA) of the United States Department of Transportation. Big Data web-based visualization software, both JavaScript-based and user-interface-based, is discussed. The chapter also presents the challenges and opportunities of using Big Data, along with a flow diagram of the 30 chapters within this handbook.


2021
Vol 39 (15_suppl)
pp. e18725-e18725
Author(s):
Ravit Geva
Barliz Waissengrin
Dan Mirelman
Felix Bokstein
Deborah T. Blumenthal
...  

e18725 Background: Healthcare data sharing is important for the creation of diverse and large data sets, supporting clinical decision making, and accelerating efficient research to improve patient outcomes. This is especially vital in the case of real-world data analysis. However, stakeholders are reluctant to share their data without assurance of patients' privacy, proper protection of their data sets, and control over the ways they are used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so that analytics results are obtained without revealing the raw data. The aim of this study is to demonstrate the accuracy of analytics results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients' survival data following two different treatment interventions, including 623 patients and 24 variables, amounting to 14,952 items of data, was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared against an accuracy goal of two decimal places. Results: For all variables analyzed, the difference between the raw data results and the homomorphically encrypted data results was within the pre-determined accuracy goal; the practical efficiency of the encrypted computation, measured by run time, is presented in the table. Conclusions: This study demonstrates that data encrypted with homomorphic encryption can be statistically analyzed with a precision of at least two decimal places, allowing safe drawing of clinical conclusions while preserving patients' privacy and protecting data owners' data assets. Homomorphic encryption allows efficient computation on encrypted data non-interactively and without requiring decryption during computation. Utilizing the technology will empower large-scale cross-institution and cross-stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]
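The study used the PALISADE C++ library; for readers who want to try the core idea, the sketch below uses the TenSEAL Python library instead (an assumption, not the study's stack) to compute a simple descriptive statistic, a mean, on CKKS-encrypted values without decrypting the individual data points. The encryption parameters and data are illustrative only.

```python
import tenseal as ts

# CKKS context for approximate arithmetic on real numbers; these
# parameters are illustrative, not the study's PALISADE configuration.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()  # rotations needed by the encrypted sum

# Hypothetical survival times in months (stand-ins for the real variables).
survival_months = [14.2, 9.7, 22.5, 31.0, 6.3]
enc = ts.ckks_vector(context, survival_months)

# The analyst computes on ciphertexts only and never sees the raw values.
enc_sum = enc.sum()

# Only the secret-key holder decrypts the aggregate result.
mean = enc_sum.decrypt()[0] / len(survival_months)
print(f"mean survival: {mean:.2f} months")  # ~16.74, accurate to two decimals
```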


2017
Vol 25 (3)
pp. 150-157
Author(s):
Lasse Metso
Mirka Kans

Big Data and the Internet of Things will greatly increase the amount of data in asset management. Data sharing with an increased number of partners in the area of asset management is important when developing business opportunities and new ecosystems. An asset management ecosystem is a complex set of relationships between parties taking part in asset management actions. In this paper, the current barriers to and benefits of data sharing are identified based on the results of an interview study. The main benefits are transparency, access to data, and reuse of data; new services can be created by taking advantage of data sharing. The main barriers to sharing data are an unclear view of the data sharing process and difficulty in recognizing the benefits of data sharing. To overcome these barriers, this paper applies the ecosystem perspective to asset management information. The approach is explained using the Swedish railway industry as an example.


Author(s):  
Abhishek Bajpai
Dr. Sanjiv Sharma

As the volume of data produced in our society increases day by day, the exploration of big data in healthcare is growing at an unprecedented rate. Big data has become a popular concept across many areas, and this paper makes the case that even the healthcare industry is stepping into the big data pool to take advantage of its various advanced tools and technologies. The paper reviews research across healthcare disciplines that uses big data approaches and methodologies. Big data methodologies can be applied to healthcare data analytics (characterized by the four V's) to support better decisions that accelerate business profit and customer satisfaction, to acquire a better understanding of market behaviours and trends, and to provide e-health services using Digital Imaging and Communications in Medicine (DICOM). Big data techniques such as MapReduce and machine learning can be applied to develop systems for early diagnosis of disease, i.e., analysis of chronic diseases such as heart disease, diabetes, and stroke. The analysis of the data is performed using the big data analytics framework Hadoop, which is used to process large data sets. Further, the paper presents various big data tools, challenges, opportunities, and hurdles, followed by the conclusion.
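The paper discusses such systems at a survey level; as a minimal, hypothetical illustration of the machine-learning side (not the paper's own system), the sketch below trains a logistic-regression classifier on synthetic "chronic disease" features using scikit-learn. In a Hadoop deployment, the same model would typically be fitted with a distributed library rather than in memory; the feature names and label rule here are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Synthetic stand-ins for clinical features: age, blood pressure,
# cholesterol, glucose. A real system would ingest these from EHR/DICOM data.
n = 500
X = np.column_stack([
    rng.normal(55, 10, n),   # age
    rng.normal(130, 15, n),  # systolic blood pressure
    rng.normal(200, 30, n),  # cholesterol
    rng.normal(100, 20, n),  # fasting glucose
])
# Toy label: disease risk rises with blood pressure and glucose.
risk = 0.03 * (X[:, 1] - 130) + 0.02 * (X[:, 3] - 100) + rng.normal(0, 1, n)
y = (risk > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```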



Author(s):  
Kun Zhang
Biwei Huang
Jiji Zhang
Clark Glymour
Bernhard Schölkopf

It is commonplace to encounter nonstationary or heterogeneous data, in which the underlying generating process changes over time or across data sets (the data sets may have different experimental or data collection conditions). Such distribution shift presents both challenges and opportunities for causal discovery. In this paper, we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from Nonstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and to recover the skeleton of the causal structure over the observed variables. Second, we present a way to determine causal orientations by exploiting independence changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. Experimental results on various synthetic and real-world data sets demonstrate the efficacy of our methods.
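CD-NOD's first step can be approximated with off-the-shelf tools: augment the data with a surrogate time/domain index C and run a constraint-based search; variables found dependent on C are those whose local mechanisms change. The sketch below uses the causal-learn package's PC implementation as a stand-in (an assumption; causal-learn also provides a dedicated CD-NOD routine, and the paper itself uses kernel-based independence tests that detect more general changes than the linear Fisher-z test here) on synthetic data with a drifting mechanism.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

rng = np.random.default_rng(1)
n = 1000
t = np.linspace(0, 1, n)  # surrogate time/domain index C

# X1 -> X2, where X2's mechanism drifts over time (time-varying intercept).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 2.0 * t + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)   # stationary and independent of everything

data = np.column_stack([x1, x2, x3, t])  # last column is the index C

# PC skeleton search with the default Fisher-z test; an edge between a
# variable and the C column flags a changing local mechanism.
cg = pc(data, alpha=0.05)
print(cg.G)  # expect X2 adjacent to C; X1 and X3 not
```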


Data sizes have been growing exponentially within many companies. This data is often meta-tagged piecemeal, produced in real time, and arrives in continuous streams from multiple sources, and analyzing it to spot patterns and extract useful information is harder still. The challenges include the ever-changing landscape of data and its associated characteristics, evolving data analysis paradigms, computational infrastructure, data quality, complexity, and protection, in addition to data sharing and access, and, crucially, our ability to integrate data sets and their analysis toward an improved understanding. In this context, this second chapter covers the issues and challenges hiding behind the 3Vs phenomenon. It builds on the first chapter and proceeds to the different big data issues and challenges and how to tackle them in dynamic processes.



