Choosing Among Regularized Estimators in Empirical Economics: The Risk of Machine Learning

2019 ◽  
Vol 101 (5) ◽  
pp. 743-762 ◽  
Author(s):  
Alberto Abadie ◽  
Maximilian Kasy

Many settings in empirical economics involve estimation of a large number of parameters. In such settings, methods that combine regularized estimation and data-driven choices of regularization parameters are useful. We provide guidance to applied researchers on the choice between regularized estimators and data-driven selection of regularization parameters. We characterize the risk and relative performance of regularized estimators as a function of the data-generating process and show that data-driven choices of regularization parameters yield estimators with risk uniformly close to the risk attained under the optimal (unfeasible) choice of regularization parameters. We illustrate using examples from empirical economics.
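To make the setting concrete, here is a minimal sketch (not the authors' code) of data-driven regularization in the canonical many-means model: each parameter gets one noisy observation, a ridge-type shrinker is applied, and the shrinkage level is chosen by Stein's unbiased risk estimate (SURE), one of the data-driven rules analyzed in this literature. The data-generating process below is an assumption for illustration.

```python
import numpy as np

# Many-means model: x_i ~ N(theta_i, 1), one observation per parameter.
# Ridge shrinkage theta_hat_i = x_i / (1 + lam); lam chosen by SURE,
# a data-driven stand-in for the infeasible oracle choice.
rng = np.random.default_rng(0)
n = 1000
theta = rng.normal(0, 0.5, n)        # true parameters (dense, small signals)
x = theta + rng.normal(0, 1, n)      # noisy observations

def sure_ridge(lam, x):
    # SURE for the linear shrinker c*x with c = 1/(1+lam), unit noise variance:
    # estimated risk = mean((x - c*x)^2) + 2c - 1
    c = 1.0 / (1.0 + lam)
    return np.mean((x - c * x) ** 2) + 2 * c - 1

grid = np.linspace(0.0, 10.0, 501)
lam_hat = grid[np.argmin([sure_ridge(l, x) for l in grid])]
theta_hat = x / (1 + lam_hat)
print(f"SURE-chosen lambda: {lam_hat:.2f}, empirical MSE: {np.mean((theta_hat - theta) ** 2):.3f}")
```

Because SURE estimates the risk function uniformly well as the number of parameters grows, the selected shrinkage level tracks the risk attained under the optimal (unfeasible) choice, which is the sense in which data-driven tuning is safe in this setting.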

Author(s):  
Fabiola Fernández-Gutiérrez ◽  
Jonathan Kennedy ◽  
Roxanne Cooksey ◽  
Mark Atkinson ◽  
Ernest Choy ◽  
...  

Abstract
Objectives (1) To develop a fully data-driven framework for automatically identifying patients with a condition from routine electronic primary care records; (2) to identify informative codes (risk factors) for arthropathy conditions in primary care records that can accurately predict a diagnosis of those conditions in secondary care records.
Approach This study linked routine primary and secondary care records in Wales, UK, held in the SAIL (Secure Anonymised Information Linkage) databank, with the secondary care records used as the gold standard. We proposed machine learning techniques to extract patient information and identify cohorts with a condition from the large, high-dimensional linked dataset in three phases: data preparation for the machine-learning setting; pre-selection of initial features, ranking and reducing them to a meaningful subset with feature selection methods; and development of an identification algorithm incorporating mechanisms for tackling the imbalanced nature of the data. This data-driven framework was then validated on an independent dataset and compared with an existing algorithm developed from expert clinical knowledge of arthropathy conditions.
Results Rheumatoid arthritis (RA) and ankylosing spondylitis (AS) were used to demonstrate the feasibility of the framework. Linking primary care records with the secondary care rheumatology clinical system, we collected 9,657 patients, of whom 1,484 had RA and 204 had AS. The framework identified compact subsets of informative features (risk factors) from 43,100 potential Read codes. Applied to independent test data, it achieved classification accuracy and positive predictive values (PPVs) of 86.19% and 88.46% respectively for RA, and 99.23% and 97.75% respectively for AS, comparable with the clinical knowledge-based method (accuracy 85.85% and PPV 85.28% for RA; accuracy 97.86% and PPV 95.65% for AS).
Conclusion The proposed data-driven framework provides a rapid and cost-effective way of reliably identifying patients with a medical condition from primary care data, and it performed as well as the clinically derived algorithm. It is not intended to substitute for clinical expertise; rather, it provides a decision-support tool for clinicians, in particular for selecting patients for clinical trials.
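The three phases described above might look roughly as follows in scikit-learn. This is a hypothetical sketch with synthetic stand-in data (the SAIL records are not public), using chi-squared feature ranking and a class-weighted classifier as one of several ways to handle the imbalance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

# Synthetic stand-in: rows are patients, columns are binary indicators for
# primary-care Read codes; labels come from the secondary-care gold standard.
rng = np.random.default_rng(1)
X = (rng.random((5000, 2000)) < 0.02).astype(int)   # sparse code occurrences
y = (rng.random(5000) < 0.05).astype(int)           # imbalanced outcome (~5% cases)

# Rank codes by chi-squared association, keep a compact subset, then fit a
# class-weighted classifier to offset the imbalanced class distribution.
clf = Pipeline([
    ("select", SelectKBest(chi2, k=50)),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred),
      "PPV:", precision_score(y_te, pred, zero_division=0))
```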


2020 ◽  
Vol 22 (1) ◽  
pp. 127-153
Author(s):  
John A. Onofrey ◽  
Lawrence H. Staib ◽  
Xiaojie Huang ◽  
Fan Zhang ◽  
Xenophon Papademetris ◽  
...  

Sparsity is a powerful concept to exploit for high-dimensional machine learning and associated representational and computational efficiency. Sparsity is well suited for medical image segmentation. We present a selection of techniques that incorporate sparsity, including strategies based on dictionary learning and deep learning, that are aimed at medical image segmentation and related quantification.
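As a hedged illustration of the dictionary-learning strategy, the sketch below learns a sparse patch dictionary with scikit-learn; segmentation pipelines of this kind then classify each patch from its sparse code. The image here is synthetic, and the patch size and dictionary size are assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

# Learn a sparse dictionary over image patches; each patch is then
# represented by a few dictionary atoms (its sparse code).
rng = np.random.default_rng(0)
image = rng.random((64, 64))                     # stand-in for a medical image
patches = extract_patches_2d(image, (8, 8), max_patches=500, random_state=0)
patches = patches.reshape(len(patches), -1)
patches -= patches.mean(axis=1, keepdims=True)   # per-patch centering

dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0, random_state=0)
codes = dico.fit_transform(patches)              # sparse codes, one row per patch
print("dictionary shape:", dico.components_.shape)
print("mean nonzeros per code:", (codes != 0).sum(axis=1).mean())
```

The computational payoff is exactly the one named above: each patch is summarized by a handful of nonzero coefficients rather than all 64 pixel values.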


2020 ◽  
Vol 15 ◽  
Author(s):  
Deeksha Saxena ◽  
Mohammed Haris Siddiqui ◽  
Rajnish Kumar

Background: Deep learning (DL) is an artificial-neural-network-driven framework with multiple levels of representation, in which non-linear modules are combined so that the representation is transformed from a lower level to a progressively more abstract one. Though DL is used widely in almost every field, it has brought a particular breakthrough in the biological sciences, where it is used in disease diagnosis and clinical trials. DL can be combined with machine learning, but at times each is used on its own. DL tends to be a better platform than conventional machine learning, as it does not require an intermediate feature-extraction step and works well with larger datasets. DL is currently one of the most discussed approaches among scientists and researchers for diagnosing and solving various biological problems. However, deep learning models need further improvement and experimental validation to become more productive. Objective: To review the available DL models and datasets that are used in disease diagnosis. Methods: Available DL models and their applications in disease diagnosis were reviewed, discussed and tabulated. Types of datasets and some of the popular disease-related data sources for DL were highlighted. Results: We analyzed the frequently used DL methods and data types, and discuss some of the recent deep learning models used for solving different biological problems. Conclusion: The review presents useful insights about DL methods, data types, and the selection of DL models for disease diagnosis.
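To illustrate the claim that DL needs no intermediate feature-extraction step, here is a minimal PyTorch sketch (not drawn from the review) that trains a small feed-forward network end to end on synthetic diagnostic data; the hidden layers learn the representations themselves.

```python
import torch
from torch import nn

# End-to-end learning: raw inputs map straight to a diagnosis, with no
# hand-engineered feature-extraction stage. Data here are synthetic.
torch.manual_seed(0)
X = torch.randn(512, 30)                     # e.g., raw clinical measurements
y = (X[:, :3].sum(dim=1) > 0).long()         # synthetic binary label

model = nn.Sequential(
    nn.Linear(30, 64), nn.ReLU(),            # hidden layers learn the features
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),                        # two output classes
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print("final training loss:", loss.item())
```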


Author(s):  
Laure Fournier ◽  
Lena Costaridou ◽  
Luc Bidaut ◽  
Nicolas Michoux ◽  
Frederic E. Lecouvet ◽  
...  

Abstract Existing quantitative imaging biomarkers (QIBs) are associated with known biological tissue characteristics and follow a well-understood path of technical, biological and clinical validation before incorporation into clinical trials. In radiomics, novel data-driven processes extract numerous visually imperceptible statistical features from the imaging data with no a priori assumptions on their correlation with biological processes. The selection of relevant features (the radiomic signature) and their incorporation into clinical trials therefore require additional considerations to ensure meaningful imaging endpoints. Moreover, the number of radiomic features tested means that power calculations would result in sample sizes impossible to achieve within clinical trials. This article examines how the process of standardising and validating data-driven imaging biomarkers differs from that for biomarkers based on biological associations. Radiomic signatures are best developed initially on datasets that represent diversity of acquisition protocols as well as diversity of disease and of normal findings, rather than within clinical trials with standardised and optimised protocols, as the latter would risk the selected radiomic features being linked to the imaging process rather than the pathology. Normalisation through discretisation and feature harmonisation are essential pre-processing steps. Biological correlation may be performed after the technical and clinical validity of a radiomic signature is established, but is not mandatory. Feature selection may be part of discovery within a radiomics-specific trial or represent exploratory endpoints within an established trial; a previously validated radiomic signature may even be used as a primary/secondary endpoint, particularly if associations are demonstrated with specific biological processes and pathways being targeted within clinical trials.
Key Points
• Data-driven processes like radiomics risk false discoveries due to the high dimensionality of the dataset compared to the sample size, making adequate diversity of the data, cross-validation and external validation essential to mitigate the risks of spurious associations and overfitting.
• Use of radiomic signatures within clinical trials requires multistep standardisation of image acquisition, image analysis and data-mining processes.
• Biological correlation may be established after clinical validation but is not mandatory.
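As an illustration of the discretisation step named above as essential pre-processing, the sketch below bins region-of-interest intensities with a fixed bin width before any texture feature is computed; the synthetic ROI and the bin width of 25 are assumptions, not values from the article.

```python
import numpy as np

# Grey-level discretisation: map each voxel in the region of interest (ROI)
# to a discrete grey level before computing radiomic features.
rng = np.random.default_rng(0)
roi = rng.normal(100.0, 20.0, size=(32, 32, 8))   # synthetic ROI intensities

def discretise_fixed_bin_width(roi, bin_width=25.0):
    # Grey-level index 1..N, anchored at the ROI minimum.
    return np.floor((roi - roi.min()) / bin_width).astype(int) + 1

levels = discretise_fixed_bin_width(roi)
print("grey levels used:", levels.min(), "to", levels.max())

# A first-order feature on the discretised volume, e.g. Shannon entropy:
_, counts = np.unique(levels, return_counts=True)
p = counts / counts.sum()
print("intensity entropy:", -(p * np.log2(p)).sum())
```

Fixing the binning convention across sites is one concrete piece of the multistep standardisation the Key Points call for: the same ROI discretised differently yields different feature values.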


2021 ◽  
pp. 0887302X2199594
Author(s):  
Ahyoung Han ◽  
Jihoon Kim ◽  
Jaehong Ahn

Fashion color trends are an essential marketing element that directly affect brand sales. Organizations such as Pantone hold global authority over professional color standards by annually forecasting color palettes. However, the question remains whether fashion designers apply these colors in the fashion shows that guide seasonal fashion trends. This study analyzed image data from fashion collections through machine learning to obtain measurable results: web-scraping catwalk images, separating body and clothing elements via machine learning, deriving a selection of color chips using the k-means algorithm, and analyzing the similarity between the Pantone color palette (16 colors) and the extracted color chips. The gap between the Pantone trends and the colors used in fashion collections was quantitatively analyzed and found to be significant. This study indicates the potential of machine learning within the fashion industry to guide production and suggests that further research expand on other design variables.
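A minimal sketch of the color-chip step, with synthetic pixels and a synthetic 16-color palette standing in for the segmented catwalk images and the Pantone set:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the RGB pixels of a (pre-segmented) clothing region with k-means,
# take the cluster centres as color chips, then score each chip against a
# reference palette by nearest-neighbour distance.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(10_000, 3)).astype(float)   # clothing pixels
palette = rng.integers(0, 256, size=(16, 3)).astype(float)      # 16 reference colors

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
chips = kmeans.cluster_centers_                                 # 8 color chips

# Distance from each chip to its closest palette color (smaller = closer match).
dists = np.linalg.norm(chips[:, None, :] - palette[None, :, :], axis=2)
print("per-chip distance to nearest palette color:", dists.min(axis=1).round(1))
```

The distances here are plain RGB Euclidean distances for simplicity; a production analysis would likely convert to a perceptual color space such as CIELAB before measuring similarity.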


2021 ◽  
Vol 23 (4) ◽  
pp. 2742-2752
Author(s):  
Tamar L. Greaves ◽  
Karin S. Schaffarczyk McHale ◽  
Raphael F. Burkart-Radke ◽  
Jason B. Harper ◽  
Tu C. Le

Machine learning models were developed for an organic reaction in ionic liquids and validated on a selection of ionic liquids.
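The abstract gives no pipeline details, but the generic recipe such studies follow can be sketched as descriptor-based regression with held-out validation; everything below (the descriptors, the model, the data) is an assumption for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Learn a map from ionic-liquid descriptors to a reaction outcome
# (e.g., a rate constant), then check how well it generalises by
# cross-validating over held-out ionic liquids.
rng = np.random.default_rng(0)
X = rng.random((120, 6))                              # hypothetical IL descriptors
y = X @ rng.random(6) + 0.1 * rng.normal(size=120)    # synthetic rate constants

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.round(2))
```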


Author(s):  
Ekaterina Kochmar ◽  
Dung Do Vu ◽  
Robert Belfer ◽  
Varun Gupta ◽  
Iulian Vlad Serban ◽  
...  

Abstract Intelligent tutoring systems (ITS) have been shown to be highly effective at promoting learning compared to other computer-based instructional approaches. However, many ITS rely heavily on expert design and hand-crafted rules. This makes them difficult to build and transfer across domains and limits their potential efficacy. In this paper, we investigate how feedback in a large-scale ITS can be automatically generated in a data-driven way, and more specifically how personalization of feedback can lead to improvements in student performance outcomes. First, we propose a machine learning approach to generate personalized feedback in an automated way, which takes the individual needs of students into account while alleviating the need for expert intervention and the design of hand-crafted rules. We leverage state-of-the-art machine learning and natural language processing techniques to provide students with personalized feedback using hints and Wikipedia-based explanations. Second, we demonstrate that personalized feedback leads to improved success rates at solving exercises in practice: our personalized feedback model is used in a large-scale dialogue-based ITS with around 20,000 students, launched in 2019. We present the results of experiments with students and show that the automated, data-driven, personalized feedback leads to a significant overall improvement of 22.95% in student performance outcomes and substantial improvements in the subjective evaluation of the feedback.
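One simple way to realise such personalization, sketched here as an assumption rather than the authors' architecture: fit a model of success probability given student features and feedback type from past interaction logs, then serve each student the feedback type with the highest predicted success.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Learn P(success | student features, feedback type) from interaction logs,
# then pick the feedback type with the highest predicted success probability.
rng = np.random.default_rng(0)
n = 4000
student = rng.random((n, 4))                 # e.g., past accuracy, attempts, ...
fb_type = rng.integers(0, 3, n)              # 0=hint, 1=explanation, 2=example
onehot = np.eye(3)[fb_type]
X = np.hstack([student, onehot])
# Synthetic outcomes in which the best feedback type depends on the student.
y = (student[:, 0] + 0.3 * onehot[np.arange(n), 0] * student[:, 1]
     + 0.1 * rng.normal(size=n) > 0.7).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

def choose_feedback(features):
    # Score every feedback type for this student and return the argmax.
    cand = np.hstack([np.tile(features, (3, 1)), np.eye(3)])
    return int(np.argmax(model.predict_proba(cand)[:, 1]))

print("chosen feedback type:", choose_feedback(rng.random(4)))
```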

