Occam’s Razor for Big Data? On Detecting Quality in Large Unstructured Datasets

2019 ◽  
Vol 9 (15) ◽  
pp. 3065 ◽  
Author(s):  
Dresp-Langley ◽  
Ekseth ◽  
Fesl ◽  
Gohshi ◽  
Kurz ◽  
...  

Detecting quality in large unstructured datasets requires capacities far beyond the limits of human perception and communicability and, as a result, there is an emerging trend towards increasingly complex analytic solutions in data science to cope with this problem. This new trend towards analytic complexity represents a severe challenge for the principle of parsimony (Occam’s razor) in science. This review article combines insight from domains such as physics, computational science, data engineering, and cognitive science to review the specific properties of big data. Problems in detecting data quality without losing the principle of parsimony are then highlighted on the basis of specific examples. Computational building-block approaches for data clustering can help to deal with large unstructured datasets in minimal computation time, and meaning can be extracted rapidly and parsimoniously from large sets of unstructured image or video data through relatively simple unsupervised machine learning algorithms. The review then examines why we still largely lack the expertise to exploit big data wisely, whether to extract relevant information for specific tasks, to recognize patterns and generate new information, or simply to store and further process large amounts of sensor data, and brings forward examples illustrating why we need subjective views and pragmatic methods to analyze big data contents. The review concludes by considering how cultural differences between East and West are likely to affect the course of big data analytics and the development of increasingly autonomous artificial intelligence (AI) aimed at coping with the big data deluge in the near future.
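Where the review points to relatively simple unsupervised learning as a parsimonious route into large image collections, a minimal sketch of that idea, assuming scikit-learn and synthetic image patches in place of real data, could look as follows:

```python
# A minimal sketch of a parsimonious, unsupervised approach to unstructured
# image data: clustering raw patches with mini-batch k-means. The data and
# parameter values here are placeholders, not the authors' actual setup.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for a large set of unstructured image data: 10,000 8x8 grayscale patches.
patches = rng.random((10_000, 8 * 8))

# A single, simple unsupervised learner keeps the analysis parsimonious.
kmeans = MiniBatchKMeans(n_clusters=16, batch_size=1_024, random_state=0)
labels = kmeans.fit_predict(patches)

# Cluster sizes give a rough, interpretable summary of structure in the data.
print(np.bincount(labels, minlength=16))
```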

2018 ◽  
Vol 101 (3) ◽  
pp. 261-272 ◽  
Author(s):  
Hugo A. Van Den Berg

The principle of parsimony, also known as ‘Occam's razor’, is a heuristic dictum that is thoroughly familiar to virtually all practitioners of science: Aristotle, Newton, and many others have enunciated it in some form or other. Even though the principle is not difficult to comprehend as a general heuristic guideline, it has proved surprisingly resistant to being put on a rigorous footing – a difficulty that has become more pressing and topical with the ‘big data’ explosion. We review the significance of Occam's razor in the philosophical and theological writings of William of Ockham, and survey modern developments of parsimony in data science.


2021 ◽  
Vol 8 (2) ◽  
pp. 205395172110401
Author(s):  
Anna Sapienza ◽  
Sune Lehmann

For better and worse, our world has been transformed by Big Data. To understand digital traces generated by individuals, we need to design multidisciplinary approaches that combine social and data science. Data and social scientists face the challenge of effectively building upon each other’s approaches to overcome the limitations inherent in each side. Here, we offer a “data science perspective” on the challenges that arise when working to establish this interdisciplinary environment. We discuss how we perceive the differences and commonalities of the questions we ask to understand digital behaviors (including how we answer them), and how our methods may complement each other. Finally, we describe what a path toward common ground between these fields looks like when viewed from data science.


2019 ◽  
Vol 22 (1) ◽  
pp. 297-323 ◽  
Author(s):  
Henry E. Brady

Big data and data science are transforming the world in ways that spawn new concerns for social scientists, such as the impacts of the internet on citizens and the media, the repercussions of smart cities, the possibilities of cyber-warfare and cyber-terrorism, the implications of precision medicine, and the consequences of artificial intelligence and automation. Along with these changes in society, powerful new data science methods support research using administrative, internet, textual, and sensor-audio-video data. Burgeoning data and innovative methods facilitate answering previously hard-to-tackle questions about society by offering new ways to form concepts from data, to do descriptive inference, to make causal inferences, and to generate predictions. They also pose challenges as social scientists must grasp the meaning of concepts and predictions generated by convoluted algorithms, weigh the relative value of prediction versus causal inference, and cope with ethical challenges as their methods, such as algorithms for mobilizing voters or determining bail, are adopted by policy makers.


Complexity ◽  
2019 ◽  
Vol 2019 ◽  
pp. 1-10 ◽  
Author(s):  
Jian Yang ◽  
Chongchong Zhao ◽  
Chunxiao Xing

In recent years, data has become a special kind of information commodity whose distribution has promoted the development of the information commodity economy. With the growth of big data, data markets have emerged and made data transactions more convenient. However, the issues of optimal pricing and data quality allocation in the big data market have not yet been fully studied. In this paper, we propose a big data market pricing model based on data quality. We first analyze the dimensional indicators that affect data quality and establish a linear evaluation model. Then, from the perspective of data science, we analyze the impact of quality level on big data analysis (i.e., machine learning algorithms) and define a utility function of data quality. Experimental results on real data sets show the applicability of the proposed quality utility function. In addition, we formulate the profit maximization problem and give a theoretical analysis. Finally, we illustrate with numerical examples how the data market can maximize profits through the proposed model.
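The ingredients the abstract names, a linear quality evaluation over dimensional indicators, a utility function of quality, and profit maximization, can be pictured with a toy sketch like the one below; the weights, utility form, and demand model are illustrative assumptions rather than the paper's actual model:

```python
# A hedged sketch of the pricing ingredients described in the abstract: a
# linear evaluation of quality from dimensional indicators, a concave utility
# of quality, and a simple search for the profit-maximizing price.
import numpy as np

# Dimensional indicators of a data product (e.g., accuracy, completeness,
# timeliness), each scaled to [0, 1]; weights are assumed, not from the paper.
indicators = np.array([0.9, 0.7, 0.8])
weights = np.array([0.5, 0.3, 0.2])
quality = float(weights @ indicators)         # linear quality evaluation

def utility(q: float) -> float:
    """Assumed diminishing-returns utility of data quality."""
    return np.log1p(5.0 * q)

def demand(price: float, q: float) -> float:
    """Assumed demand: buyers purchase when utility net of price is positive."""
    return max(utility(q) - price, 0.0)

cost = 0.2                                    # assumed marginal provisioning cost
prices = np.linspace(0.0, 3.0, 301)
profits = [(p - cost) * demand(p, quality) for p in prices]
best = prices[int(np.argmax(profits))]
print(f"quality={quality:.2f}, profit-maximizing price ~ {best:.2f}")
```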


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Lena Maier-Hein ◽  
Martin Wagner ◽  
Tobias Ross ◽  
Annika Reinke ◽  
Sebastian Bodenstedt ◽  
...  

Image-based tracking of medical instruments is an integral part of surgical data science applications. Previous research has addressed the tasks of detecting, segmenting and tracking medical instruments based on laparoscopic video data. However, the proposed methods still tend to fail when applied to challenging images and do not generalize well to data they have not been trained on. This paper introduces the Heidelberg Colorectal (HeiCo) data set, the first publicly available data set enabling comprehensive benchmarking of medical instrument detection and segmentation algorithms with a specific emphasis on method robustness and generalization capabilities. Our data set comprises 30 laparoscopic videos and corresponding sensor data from medical devices in the operating room for three different types of laparoscopic surgery. Annotations include surgical phase labels for all video frames as well as information on instrument presence and corresponding instance-wise segmentation masks for surgical instruments (if any) in more than 10,000 individual frames. The data has successfully been used to organize international competitions within the Endoscopic Vision Challenges 2017 and 2019.
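As a point of reference for how detection and segmentation methods are commonly scored against such instance-wise masks, a minimal intersection-over-union (IoU) computation is sketched below; the masks are synthetic stand-ins, and the challenge's official metrics may differ:

```python
# A minimal sketch of scoring instance-wise segmentation masks with IoU.
# The masks below are synthetic stand-ins, not data from the released set.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

gt_mask = np.zeros((480, 640), dtype=bool)
gt_mask[100:200, 150:300] = True              # ground-truth instrument instance
pred_mask = np.zeros_like(gt_mask)
pred_mask[110:210, 160:310] = True            # hypothetical prediction

print(f"instance IoU = {iou(pred_mask, gt_mask):.3f}")
```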


2021 ◽  
Vol 8 ◽  
Author(s):  
Yoshihiko Raita ◽  
Carlos A. Camargo ◽  
Liming Liang ◽  
Kohei Hasegawa

Clinicians handle a growing amount of clinical, biometric, and biomarker data. In this “big data” era, there is an emerging faith that the answer to all clinical and scientific questions resides in “big data” and that data will transform medicine into precision medicine. However, data by themselves are useless. It is the algorithms encoding causal reasoning and domain (e.g., clinical and biological) knowledge that prove transformative. The recent introduction of (health) data science presents an opportunity to re-think this data-centric view. For example, while precision medicine seeks to provide the right prevention and treatment strategy to the right patients at the right time, its realization cannot be achieved by algorithms that operate exclusively in data-driven prediction modes, as most machine learning algorithms do. A better understanding of data science and its tasks is vital to interpret findings and translate new discoveries into clinical practice. In this review, we first discuss the principles and major tasks of data science by organizing it into three defining tasks: (1) association and prediction, (2) intervention, and (3) counterfactual causal inference. Second, we review commonly used data science tools with examples in the medical literature. Lastly, we outline current challenges and future directions in the field of medicine, elaborating on how data science can enhance clinical effectiveness and inform medical practice. As machine learning algorithms become ubiquitous tools to handle quantitatively “big data,” their integration with causal reasoning and domain knowledge is instrumental to qualitatively transform medicine, which will, in turn, improve health outcomes of patients.
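To make the contrast between task (1) and task (2) concrete, a small simulated sketch is given below; the variable names, confounding structure, and effect size are assumptions for illustration, not results from the review:

```python
# A hedged sketch contrasting pure prediction with a simple adjustment-based
# estimate of an intervention's effect. The data-generating process is an
# illustrative assumption, not a clinical analysis.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5_000
severity = rng.normal(size=n)                       # confounder
treated = (rng.random(n) < 1 / (1 + np.exp(-severity))).astype(float)
outcome = 2.0 * severity - 1.0 * treated + rng.normal(size=n)  # true effect: -1.0

# Task 1: association/prediction -- regress outcome on treatment alone.
naive = LinearRegression().fit(treated.reshape(-1, 1), outcome)
print("naive (predictive) coefficient:", round(naive.coef_[0], 2))

# Task 2: intervention -- adjust for the confounder to estimate the causal effect.
X = np.column_stack([treated, severity])
adjusted = LinearRegression().fit(X, outcome)
print("adjusted (causal) coefficient:", round(adjusted.coef_[0], 2))
```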


2020 ◽  
Vol 8 (6) ◽  
pp. 4684-4688

According to statistics from the BBC, the toll varies for every earthquake recorded to date. Typically, up to thousands are killed, about 50,000 are injured, around 1-3 million are displaced, and a significant number go missing or are left homeless, with structural damage approaching 100%. Economic losses range from 10 to 16 million dollars. Earthquakes of magnitude 5 and above are classified as the deadliest. The most devastating earthquake recorded to date took place in Indonesia, where about 3 million were reported dead, 1-2 million were injured, and structural damage reached 100%. The consequences of an earthquake are therefore devastating and are not limited to the loss of and damage to living beings and property; they also cause significant changes to surroundings, lifestyles, and the economy. Each of these consequences underscores the need for earthquake forecasting. With even a couple of minutes' notice, individuals can act to protect themselves from injury and death, damage and economic losses can be reduced, and property and natural resources can be secured. In this work, an accurate forecasting system is designed and developed to detect early signs of an earthquake using machine learning algorithms. The system follows the basic steps of building learning systems along with the data science life cycle. Data sets for the Indian subcontinent and for the rest of the world are collected from government sources. Pre-processing of the data is followed by construction of a stacking model that combines Random Forest and Support Vector Machine algorithms. The algorithms build this mathematical model from a training data set; the model looks for patterns that precede a catastrophe and adapts to them during training, so that it can make decisions and forecasts without being explicitly programmed for the task. After a forecast is made, the message is broadcast to government officials and across various platforms. The key information to be obtained is represented by three factors: time, locality, and magnitude.
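A minimal sketch of such a stacking model, assuming scikit-learn and synthetic features standing in for the collected catalogue data, might look like this:

```python
# A hedged sketch of the stacking step described above: a Random Forest and a
# Support Vector Machine combined by a meta-learner. The synthetic features
# stand in for the pre-processed catalogue attributes; this is not the
# authors' actual pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # assumed meta-learner
)
stack.fit(X_train, y_train)
print("held-out accuracy:", round(stack.score(X_test, y_test), 3))
```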


Author(s):  
Sri Venkat Gunturi Subrahmanya ◽  
Dasharathraj K. Shetty ◽  
Vathsala Patil ◽  
B. M. Zeeshan Hameed ◽  
Rahul Paul ◽  
...  

Data science is an interdisciplinary field that extracts knowledge and insights from structured and unstructured data using scientific methods, data mining techniques, machine-learning algorithms, and big data. The healthcare industry generates large datasets of useful information on patient demography, treatment plans, results of medical examinations, insurance, etc. The data collected from Internet of Things (IoT) devices also attract the attention of data scientists. Data science provides aid to process, manage, analyze, and assimilate the large quantities of fragmented, structured, and unstructured data created by healthcare systems. This data requires effective management and analysis to yield factual results. The processes of data cleansing, data mining, data preparation, and data analysis used in healthcare applications are reviewed and discussed in the article. The article provides an insight into the status and prospects of big data analytics in healthcare, highlights the advantages, describes the frameworks and techniques used, outlines the challenges currently faced, and discusses viable solutions. Data science and big data analytics can provide practical insights and aid strategic decision-making for the health system. They help build a comprehensive view of patients, consumers, and clinicians. Data-driven decision-making opens up new possibilities to boost healthcare quality.
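As one illustration of the cleansing, preparation, and analysis steps discussed in the article, a deliberately small sketch with made-up columns and values might look as follows:

```python
# A hedged sketch of cleansing -> preparation -> analysis on a tiny made-up
# patient table. Column names and values are illustrative assumptions, not a
# real healthcare data set.
import pandas as pd

records = pd.DataFrame({
    "age": [34, None, 51, 29, 62],
    "systolic_bp": [120, 135, None, 118, 150],
    "readmitted": [0, 1, 1, 0, 1],
})

# Data cleansing: impute missing values with column medians.
clean = records.fillna(records.median(numeric_only=True))

# Data preparation: derive an analysis-ready feature.
clean["hypertensive"] = (clean["systolic_bp"] >= 130).astype(int)

# Data analysis: a simple descriptive aggregation supporting decision-making.
print(clean.groupby("hypertensive")["readmitted"].mean())
```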


1981 ◽  
Vol 59 (1) ◽  
pp. 144-146 ◽  
Author(s):  
Kent E. Holsinger

The principle of parsimony is a useful methodological tool in the choice between competing hypotheses if the hypotheses are of equal explanatory power. Its use is defended by the discussion of several examples, and a recent objection to its use is shown to be the result of a misinterpretation of the principle.


Author(s):  
E. B. Priyanka ◽  
S. Thangavel ◽  
D. Venkatesa Prabu

Big data and analytics may be new to some industries, but the oil and gas industry has long dealt with large quantities of data to make technical decisions. Oil producers can capture more detailed data in real time at lower costs and from previously inaccessible areas to improve oilfield and plant performance. Stream computing is a new way of analyzing high-frequency data for real-time complex event processing and for scoring data against a physics-based or empirical model for predictive analytics, without having to store the data. Hadoop Map/Reduce and other NoSQL approaches are a new way of analyzing massive volumes of data used to support reservoir, production, and facilities engineering. Hence, this chapter describes the routing organization of IoT with smart applications that aggregate real-time oil pipeline sensor data as big data, which is then subjected to machine learning algorithms on the Hadoop platform.
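A minimal sketch of the Map/Reduce style of aggregation described here, framed as a Hadoop Streaming job over an assumed CSV of sensor readings (sensor_id, timestamp, pressure), is shown below; the file format and field names are illustrative only:

```python
# A hedged sketch of the Map/Reduce pattern mentioned above, written as a
# single-file Hadoop Streaming job in Python: the mapper emits (sensor_id,
# pressure) pairs from raw pipeline readings and the reducer averages them.
# The input format (CSV of sensor_id,timestamp,pressure) is an assumption.
import sys
from collections import defaultdict

def mapper(lines):
    for line in lines:
        sensor_id, _timestamp, pressure = line.strip().split(",")
        print(f"{sensor_id}\t{pressure}")

def reducer(lines):
    totals, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        sensor_id, pressure = line.strip().split("\t")
        totals[sensor_id] += float(pressure)
        counts[sensor_id] += 1
    for sensor_id in sorted(totals):
        print(f"{sensor_id}\t{totals[sensor_id] / counts[sensor_id]:.2f}")

if __name__ == "__main__":
    # Hadoop Streaming would invoke this script as the mapper or the reducer;
    # here the role is selected with a command-line argument ("map" or "reduce").
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```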

